Tsurgeon v4.2.0 - 2020-11-17
----------------------------------------------

Copyright (c) 2003-2020 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.

Original core Tregex code by Roger Levy and Galen Andrew.
Original core Tsurgeon code by Roger Levy.
GUI by Anna Rafferty.
Support code, additional features, etc. by Chris Manning.
This release prepared by John Bauer.

This package contains Tregex and Tsurgeon.

Tregex is a Tgrep2-style utility for matching patterns in trees. It can
be run in a graphical user interface, from the command line using the
TregexPattern main method, or used programmatically in Java code via the
TregexPattern, TregexMatcher and TregexPatternCompiler classes.

As of version 1.2, the Tsurgeon tree-transformation utility is bundled
together with Tregex. See the file README.tsurgeon for details.

Java version 1.8+ is required to use the current version of Tregex.

TSURGEON
----------------------------------------------
Tsurgeon is a tool for modifying trees that match a particular Tregex
pattern. Further documentation for Tregex and the Tregex GUI can be found
in README-tregex.txt and README-gui.txt, respectively.
----------------------------------------------

Brief description:

Tsurgeon takes some trees, tries to match one or more Tregex expressions
against each tree, and for each successful match applies some surgical
operations to the tree. It pretty-prints each resulting tree (after all
successful match/operation sets have been applied) to standard output.

A simple example:

  ./tsurgeon.csh -treeFile atree exciseNP renameVerb

-----------------------------------------
RUNNING TSURGEON
-----------------------------------------

Program Command Line Options:

-treeFile <filename>
    Specifies the name of the file that has the trees you want to
    transform.

-po <matchPattern> <operation>
    Applies a single operation to every tree using the specified match
    pattern and the specified operation.

-s
    Prints the output trees one per line, instead of pretty-printed.

The arguments are then Tsurgeon scripts. Each argument should be the name
of a transformation file that contains a list of pattern and
transformation operation list pairs. That is, it is a sequence of pairs
of a TregexPattern pattern on one or more lines, then a blank line (empty
or whitespace), then a list of transformation operations one per line (as
specified by the Tsurgeon syntax below) to apply when the pattern is
matched, and then another blank line (empty or whitespace). Note the need
for blank lines: the code crashes if they are not present as separators
(although the blank line at the end of the file can be omitted). The
script file can include comments, either whole comment lines or trailing
comments, introduced by %, which extend to the end of the line. A needed
percent mark in patterns or operations can be escaped by a preceding
backslash.

-----------------------------------------
TSURGEON SYNTAX
-----------------------------------------

Legal operation syntax and semantics (see Examples section for further
detail):

delete <name_1> <name_2> ... <name_m>
  For each name_i, deletes the node it names and everything below it.

prune <name_1> <name_2> ... <name_m>
  For each name_i, prunes out the node it names. Pruning differs from
  deletion in that if pruning a node causes its parent to have no
  children, then the parent is in turn pruned too.

excise <name1> <name2>
  The name1 node should either dominate or be the same as the name2 node.
  This excises out everything from name1 to name2. All the children of
  name2 go into the parent of name1, where name1 was.

relabel <name> <new-label>
  Relabels the node to have the new label. There are three possible forms
  for the new-label:
    relabel nodeX VP
      for changing a node label to an alphanumeric string,
    relabel nodeX /''/
      for relabeling a node to something that isn't a valid identifier
      without quoting, and
    relabel nodeX /^VB(.*)$/verb\/$1/
      for regular expression based relabeling.
  In the last case, all matches of the regular expression against the
  node label are replaced with the replacement String. This has the
  semantics of Java/Perl's replaceAll: you may use capturing groups and
  put them in replacements with $n. Also, as in the example, you can
  escape a slash or a backslash in the middle of the second and third
  forms as \/ and \\, respectively. This last version lets you make a new
  label that is an arbitrary String function of the original label and
  additional characters that you supply.

insert <name> <position>
insert <tree> <position>
  Inserts the named node, or a manually specified tree (see below for
  syntax), into the position specified. Right now the only ways to
  specify position are:
    $+ <name>    the left sister of the named node
    $- <name>    the right sister of the named node
    >i <name>    the i_th daughter of the named node
    >-i <name>   the i_th daughter, counting from the right, of the
                 named node

move <name> <position>
  Moves the named node into the specified position. To be precise, it
  deletes (*NOT* prunes) the node from the tree, and re-inserts it into
  the specified position. See above for how to specify position.

replace <name1> <name2>
  Deletes name1 and inserts a copy of name2 in its place.

adjoin <auxiliary_tree> <name>
  Adjoins the specified auxiliary tree (see below for syntax) into the
  target node specified. The daughters of the target node will become the
  daughters of the foot of the auxiliary tree.

adjoinH <auxiliary_tree> <name>
  Similar to adjoin, but preserves the target node and makes it the root
  of <tree>. (It is still accessible as name. The root of the auxiliary
  tree is ignored.)

adjoinF <auxiliary_tree> <name>
  Similar to adjoin, but preserves the target node and makes it the foot
  of <tree>. (It is still accessible as name, and retains its status as
  parent of its children. The foot of the auxiliary tree is ignored.)

coindex <name_1> <name_2> ... <name_m>
  Puts a (Penn Treebank style) coindexation suffix of the form "-N" on
  each of nodes name_1 through name_m. The value of N will be
  automatically generated in reference to the existing coindexations in
  the tree, so that there is never an accidental clash of indices across
  things that are not meant to be coindexed.

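The difference between delete, prune and excise is easiest to see on a
small tree. The following is a minimal Python sketch of those three
semantics as described above. It is not Tsurgeon's implementation: the
Node class and the helper functions delete, prune, excise and
example_tree are hypothetical names introduced only for illustration.

```python
class Node:
    """Hypothetical minimal tree node (not a Tsurgeon class)."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def __str__(self):
        if not self.children:
            return self.label
        parts = [self.label] + [str(c) for c in self.children]
        return "(" + " ".join(parts) + ")"


def delete(node):
    """Remove node and everything below it; an emptied parent stays."""
    node.parent.children.remove(node)
    node.parent = None


def prune(node):
    """Remove node, then remove each ancestor left with no children."""
    parent = node.parent
    delete(node)
    while parent.parent is not None and not parent.children:
        grandparent = parent.parent
        delete(parent)
        parent = grandparent


def excise(top, bottom):
    """Splice bottom's children into top's parent, where top was."""
    parent = top.parent
    index = parent.children.index(top)
    parent.children[index:index + 1] = bottom.children
    for child in bottom.children:
        child.parent = parent


def example_tree():
    """Build the tree used in the Examples section of this README."""
    nnp = Node("NNP", Node("Maria_Eugenia_Ochoa_Garcia"))
    np = Node("NP", nnp)
    pp = Node("PP", Node("IN", Node("in")),
              Node("NP", Node("NNP", Node("May"))))
    inner_vp = Node("VP", Node("VBN", Node("arrested")), pp)
    vp = Node("VP", Node("VBD", Node("was")), inner_vp)
    root = Node("ROOT", Node("S", np, vp, Node(".", Node("."))))
    return root, nnp, np, pp, inner_vp


# delete leaves the emptied NP in place; prune removes it as well.
root, nnp, np, pp, inner_vp = example_tree()
delete(nnp)
print(np.parent is not None)   # the NP is still attached to S

root, nnp, np, pp, inner_vp = example_tree()
prune(nnp)
print(np.parent is None)       # the NP was pruned along with the NNP

# excise from the inner VP down to the PP: both nodes disappear and the
# PP's children take the inner VP's old position.
root, nnp, np, pp, inner_vp = example_tree()
excise(inner_vp, pp)
print(root)
```

In this sketch prune stops at the first ancestor that still has other
children, which is why only the NP is removed in the second case.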
-----------------------------------------
Syntax for trees to be inserted or adjoined:

A tree to be inserted or adjoined can be specified with LISP-like
parenthetical-bracketing tree syntax such as that used for the Penn
Treebank. For example, for the NP "the dog" to be inserted you might use
the syntax

  (NP (Det the) (N dog))

That's all that there is for a tree to be inserted. Auxiliary trees (a la
Tree Adjoining Grammar) must also have exactly one frontier node ending
in the character "@", which marks it as the "foot" node for adjunction.
Final instances of the character "@" in terminal node labels will be
removed from the actual label of the tree.

For example, if you wanted to adjoin the adverb "breathlessly" into a VP,
you might specify the following auxiliary tree:

  (VP (Adv breathlessly) VP@ )

All other instances of "@" in terminal nodes must be escaped (i.e.,
appear as \@); this escaping will be removed by tsurgeon.

In addition, any node of a tree can be named (the same way as in tregex),
by appending =<name> to the node label. That name can be referred to by
subsequent tsurgeon operations triggered by the same match. All other
instances of "=" in node labels must be escaped (i.e., appear as \=);
this escaping will be removed by tsurgeon. For example, if you want to
insert an NP trace somewhere and coindex it with a node named
"antecedent" you might say

  insert (NP (-NONE- *T*=trace)) <position>
  coindex trace antecedent

-----------------------------------------
Examples of Tsurgeon operations:

Tree (used in all examples):
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))

Apply delete:
  VP < PP=prep

  delete prep
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)))
    (. .)))
The PP node directly dominated by a VP is removed, as is everything under
it.

Apply prune:
  S < (NP < NNP=noun)

  prune noun
Result:
(ROOT
  (S
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
The NNP node is removed, and since this leaves the NP above it with no
children, the NP node is removed as well.
Note: This is different from delete, in which the NP above the NNP would
remain.

Apply excise:
  VP < PP=prep

  excise prep prep
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested) (IN in)
        (NP (NNP May))))
    (. .)))
The PP node is removed, and all of its children are added in the place
where it was previously located. Excise removes all the nodes from the
first named node to the second named node, and the children of the second
node are added as children of the parent of the first node. Thus, for
another example:
  VP=verb < PP=prep

  excise verb prep
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was) (IN in)
      (NP (NNP May)))
    (. .)))

Apply relabel:
  VP=v < PP=prep

  relabel prep verbPrep
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)
        (verbPrep (IN in)
          (NP (NNP May)))))
    (. .)))
The label for the node called prep (PP) is changed to verbPrep.
The other form of relabel uses regular expressions; consider the
following operation:
  /^VB.+/=v

  relabel v /^VB(.*)$/$1/
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (D was)
      (VP (N arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
The Tregex pattern matches all nodes whose labels begin with "VB" and
have at least one more character. The Tsurgeon operation then matches the
node label against the regular expression "^VB(.*)$" and captures the
part of the label matched by the group (.*), which is everything after
the "VB". The node is then relabeled with that captured text, causing,
for example, "VBD" to be relabeled "D". The "$1" in the replacement
specifies that the new label should be the first capturing group of the
regex.

Apply insert (shown here with inserting a node, but could also be a
tree):
  S < (NP < (NNP=name !$- DET))

  insert (DET Ms.) $+ name
Result:
(ROOT
  (S
    (NP (DET Ms.) (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
The pattern matches the NNP node that is directly dominated by an NP
(which is directly dominated by an S) and is not a direct right sister of
a DET. Thus, the (DET Ms.) node is inserted immediately to the left of
that NNP node, as specified by "$+ name". "$+" is the location and "name"
describes what node the location is with respect to.
Note: Tsurgeon will re-search for matches after each run of the script;
thus, cycles may occur, causing the program to not terminate. The key is
to write patterns that match prior to the changes you would like to make
but that do not match afterwards. If the clause "!$- DET" had been left
out in this example, Tsurgeon would have matched the pattern after every
insert operation, causing an infinite number of DETs to be added.

Apply move:
  VP=verb < PP=prep

  move prep $- verb
Result:
(ROOT
  (S
    (NP (NNP Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)))
    (PP (IN in)
      (NP (NNP May)))
    (. .)))
The PP is moved out of the VP that dominates it and added as a direct
right sister of that VP. As for insert, "$-" specifies the location for
prep while "verb" specifies what that location is relative to. Because
Tsurgeon re-searches for matches, the operation applies twice here: the
PP first becomes the right sister of the inner VP, at which point the
outer VP directly dominates it and the pattern matches again, so the PP
is moved once more and ends up as the right sister of the outer VP.
Note: "move" is a macro operation that deletes the given node and then
inserts it. "move" does not use prune, and thus any branches that now
lack terminals will remain rather than being removed.

Apply replace:
  S < (NP=name < NNP)

  replace name (NP (DET A) (NN woman))
Result:
(ROOT
  (S
    (NP (DET A) (NN woman))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
"name" is matched to an NP that is dominated by an S and dominates an
NNP, and a new subtree ("(NP (DET A) (NN woman))") is added in the place
where "name" was.
Note: This operation is vulnerable to falling into an infinite loop. See
the note concerning the "insert" operation and how patterns are matched.

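The termination issue described in the notes above can be illustrated
with a small Python sketch. It models only a flat list of sibling labels
and a hand-written match step, so it is an analogy for the
match/operate/re-match cycle rather than Tsurgeon's actual matching
code. The function names insert_det and run_until_stable and the
max_steps safety cap are assumptions made for this illustration.

```python
def insert_det(children, guarded):
    """One match-and-operate step on a list of sibling labels.

    Mimics 'insert (DET Ms.) $+ name' for the first matching NNP. With
    guarded=True the match also requires '!$- DET': the NNP must not
    already have a DET immediately to its left. Returns True if the
    list was changed, False if nothing matched.
    """
    for i, label in enumerate(children):
        if label != "NNP":
            continue
        if guarded and i > 0 and children[i - 1] == "DET":
            continue
        children.insert(i, "DET")
        return True
    return False


def run_until_stable(children, guarded, max_steps=10):
    """Re-match after every change; stop when nothing matches.

    max_steps is a safety cap for this sketch only, so that the
    unguarded case can be demonstrated without looping forever.
    """
    steps = 0
    while steps < max_steps and insert_det(children, guarded):
        steps += 1
    return steps


guarded_np = ["NNP"]
print(run_until_stable(guarded_np, guarded=True), guarded_np)

unguarded_np = ["NNP"]
print(run_until_stable(unguarded_np, guarded=False), len(unguarded_np))
```

With the guard, the pattern stops matching after one insertion. Without
it, every step finds the same NNP again, and only the cap ends the loop.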
Apply adjoin:
  S < (NP=name < NNP)

  adjoin (NP (DET A) (NN woman) NP@) name
Result:
(ROOT
  (S
    (NP (DET A) (NN woman)
      (NP (NNP Maria_Eugenia_Ochoa_Garcia)))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
First, "name" is matched to the NP dominating the NNP tag. Then, the
specified tree ("(NP (DET A) (NN woman) NP@)") is placed in that
location. The "@" symbol specifies that the children of the original NP
node ("name") are to be placed as children of a new NP node that is
directly to the right of (NN woman). If the specified tree were
"(NP (DET A) (NN woman) VP@)" then the child
(NNP Maria_Eugenia_Ochoa_Garcia) would appear under a VP. Exactly one "@"
node must appear in the specified tree in order to indicate where to
place the node from the original tree.

Apply adjoinH:
  S < (NP=name < NNP)

  adjoinH ((NP (DET A) (NN woman) NP@)) name
Result:
(ROOT
  (S
    (NP
      (NP (DET A) (NN woman)
        (NP (NNP Maria_Eugenia_Ochoa_Garcia))))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
This operation differs from adjoin in that it retains the named node (in
this case, "name"). The named node is made the root of the specified
tree, resulting in two NP nodes dominating the DET in this example
whereas only one was present in the previous example. Note that the
specified tree is wrapped in an extra pair of parentheses in order to
show the syntax for retaining the named node. If the extra parentheses
were not there and the specified tree was, for example,
(VP (DET A) (NN woman) NP@), the VP would be ignored in order to retain
an NP as the root. Thus, in this case,
"adjoinH (VP (DET A) (NN woman) NP@) name" and
"adjoinH ((DET A) (NN woman) NP@) name" both produce the same tree:
(ROOT
  (S
    (NP (DET A) (NN woman)
      (NP (NNP Maria_Eugenia_Ochoa_Garcia)))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))

Apply adjoinF:
  S < (NP=name < NNP)

  adjoinF (NP(DET A) (NN woman) @) name
Result:
(ROOT
  (S
    (NP (DET A) (NN woman)
      (NP (NNP Maria_Eugenia_Ochoa_Garcia)))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP (NNP May)))))
    (. .)))
This operation is very similar to adjoin and adjoinH, but this time the
original named node ("name" in this case) is kept as the foot of the tree
that is adjoined. Thus, no node label needs to be given in front of the
"@", and if one is given, it will be ignored. For instance,
"adjoinF (NP(DET A) (NN woman) VP@) name" would still produce the same
tree as above, despite the VP preceding the @.

Apply coindex:
  NP=node < NNP=name

  coindex node name
Result:
(ROOT
  (S
    (NP-1 (NNP-1 Maria_Eugenia_Ochoa_Garcia))
    (VP (VBD was)
      (VP (VBN arrested)
        (PP (IN in)
          (NP-2 (NNP-2 May)))))
    (. .)))
This causes the named nodes to be numbered such that all nodes that are
part of the same match have the same number and each match gets a
distinct number. We had two instances of an NP dominating an NNP in this
example, and they were renamed such that NP-i < NNP-i for each match,
with 1 <= i <= number of matches.

-----------------------------------------
TSURGEON SCRIPTS
-----------------------------------------

Script format:

Tsurgeon scripts are a combination of a Tregex pattern to match and a
series of Tsurgeon operations to perform on that match. The first line of
a Tsurgeon script should be the Tregex pattern. This should be followed
by a blank line, and then each subsequent line may contain one Tsurgeon
operation. Tsurgeon operations should not be separated by blank lines.
The following is an example of a correctly formatted script:

S < NP=node < NNP=name

relabel node NP_NAME
coindex node name

Comments:

The character % introduces a comment that extends to the end of the line.
All other intended uses of % must be escaped as \% .

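The script layout described above (a pattern, a blank line, one operation
per line, a blank line, with % comments) can be made concrete with a
short Python sketch that splits a script into (pattern, operations)
pairs. This is not the parser Tsurgeon itself uses. The function names
strip_comment and parse_script are hypothetical, and the choice to skip
comment-only lines instead of treating them as blank separators is an
assumption of this sketch.

```python
import re


def strip_comment(line):
    """Drop an unescaped % comment, then turn \\% into a literal %."""
    line = re.sub(r"(?<!\\)%.*", "", line)
    return line.replace("\\%", "%").rstrip()


def parse_script(text):
    """Split script text into a list of (pattern, [operations]) pairs.

    Blocks are separated by blank lines (empty or whitespace only).
    Odd-numbered blocks are patterns, possibly spanning several lines;
    even-numbered blocks are operation lists, one operation per line.
    """
    blocks, current = [], []
    for raw in text.splitlines():
        if not raw.strip():
            # A blank line closes the block being collected, if any.
            if current:
                blocks.append(current)
                current = []
            continue
        line = strip_comment(raw).strip()
        if line:
            # Assumption: comment-only lines are skipped entirely.
            current.append(line)
    if current:
        blocks.append(current)
    if len(blocks) % 2:
        raise ValueError("pattern block without an operation block")
    return [(" ".join(blocks[i]), blocks[i + 1])
            for i in range(0, len(blocks), 2)]


script = """S < NP=node < NNP=name   % find named NPs

relabel node NP_NAME
coindex node name
"""
for pattern, operations in parse_script(script):
    print(pattern)
    for operation in operations:
        print("  " + operation)
```

Because the final blank line may be omitted, the sketch closes the last
block at end of input as well as at a blank line.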
-----------------------------------------
CONTACT
-----------------------------------------

For questions about this distribution, please contact Stanford's JavaNLP
group at parser-support@lists.stanford.edu. We provide assistance on a
best-effort basis.

-----------------------------------------
LICENSE
-----------------------------------------

Tregex, Tsurgeon, and Interactive Tregex
Copyright (c) 2003-2011 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/ .

For more information, bug reports, fixes, contact:
    Christopher Manning
    Dept of Computer Science, Gates 2A
    Stanford CA 94305-9020
    USA
    parser-support@lists.stanford.edu
    http://nlp.stanford.edu/software/tregex.html