Tsurgeon v4.2.0 - 2020-11-17
----------------------------------------------
Copyright (c) 2003-2020 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.
Original core Tregex code by Roger Levy and Galen Andrew.
Original core Tsurgeon code by Roger Levy.
GUI by Anna Rafferty.
Support code, additional features, etc. by Chris Manning.
This release prepared by John Bauer.
This package contains Tregex and Tsurgeon.
Tregex is a Tgrep2-style utility for matching patterns in trees. It can
be run in a graphical user interface, from the command line using the
TregexPattern main method, or used programmatically in Java code via the
TregexPattern, TregexMatcher and TregexPatternCompiler classes.
As of version 1.2, the Tsurgeon tree-transformation utility is bundled
together with Tregex. See the file README-tsurgeon.txt for details.
Java version 1.8+ is required to use the current version of Tregex.
TSURGEON
----------------------------------------------
Tsurgeon is a tool for modifying trees that match a particular Tregex
pattern. Further documentation for Tregex and Tregex GUI can be found in
README-tregex.txt and README-gui.txt, respectively.
----------------------------------------------
Brief description:
Takes some trees, tries to match one or more tregex expressions to
each tree, and for each successful match applies some surgical
operations to the tree. Pretty-prints each resulting tree (after all
successful match/operation sets have applied) to standard output.
A simple example:
./tsurgeon.csh -treeFile atree exciseNP renameVerb
-----------------------------------------
RUNNING TSURGEON
-----------------------------------------
Program Command Line Options:
-treeFile <filename>
Specify the name of the file that has the trees you want to transform.
-po <matchPattern> <operation>
Apply a single operation to every tree using the specified match
pattern and the specified operation.
-s
Prints the output trees one per line, instead of pretty-printed.
The remaining arguments are Tsurgeon scripts.
Each argument should be the name of a transformation file that contains a
sequence of pattern and operation-list pairs. Each pair consists of a
TregexPattern pattern on one or more lines, then a blank line (empty or
whitespace), then a list of transformation operations, one per line (as
specified by the Tsurgeon syntax below), to apply when the pattern is
matched, and then another blank line (empty or whitespace).
Note the need for blank lines: the code crashes if they are not present as
separators (although the blank line at the end of the file can be omitted).
The script file can include comment lines, either whole comment lines or
trailing comments introduced by %, which extend to the end of line. A needed percent
mark in patterns or operations can be escaped by a preceding backslash.
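
The script layout described above can be sketched in a few lines of Python.
This is only an illustration of the file format, not the parser that Tsurgeon
itself uses; the function names strip_comment and parse_script are
hypothetical, and the choice to let a whole-line comment NOT act as a blank
separator is an assumption.

```python
import re

def strip_comment(line):
    """Drop an unescaped % comment, then turn an escaped percent into a literal %."""
    # Cut at the first % that is not preceded by a backslash.
    line = re.split(r"(?<!\\)%", line, maxsplit=1)[0]
    return line.replace("\\%", "%").rstrip()

def parse_script(text):
    """Return a list of (pattern, [operations]) pairs."""
    blocks, current = [], []
    for raw in text.splitlines():
        line = strip_comment(raw).strip()
        if line:
            current.append(line)
        elif not raw.strip() and current:
            # Only a truly blank line closes the current block (assumption:
            # a line holding nothing but a comment is simply skipped).
            blocks.append(current)
            current = []
    if current:  # the blank line at the end of the file may be omitted
        blocks.append(current)
    if len(blocks) % 2:
        raise ValueError("pattern without an operation list")
    # Blocks alternate: pattern lines, then operation lines.
    return [(" ".join(blocks[i]), blocks[i + 1])
            for i in range(0, len(blocks), 2)]

script = """\
% remove PPs under a VP
VP < PP=prep

delete prep

/^VB.+/=v   % any verb tag

relabel v VERB
"""
pairs = parse_script(script)
for pattern, operations in pairs:
    print(pattern, "=>", operations)
```

Each pair holds one pattern and the operations that follow it, which is the
unit that Tsurgeon applies to every tree.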
-----------------------------------------
TSURGEON SYNTAX
-----------------------------------------
Legal operation syntax and semantics (see Examples section for further detail):
delete <name_1> <name_2> ... <name_m>
For each name_i, deletes the node it names and everything below it.
prune <name_1> <name_2> ... <name_m>
For each name_i, prunes out the node it names. Pruning differs from
deletion in that if pruning a node causes its parent to have no
children, then the parent is in turn pruned too.
excise <name1> <name2>
The name1 node should either dominate or be the same as the name2
node. This excises out everything from name1 to name2. All the
children of name2 go into the parent of name1, where name1 was.
relabel <name> <new-label>
Relabels the node to have the new label. There are three possible forms
for the new-label:
relabel nodeX VP - for changing a node label to an alphanumeric
string, relabel nodeX /''/ - for relabeling a node to something that
isn't a valid identifier without quoting, and relabel nodeX
/^VB(.*)$/verb\/$1/ - for regular expression based relabeling. In the
last case, all matches of the regular expression against the node
label are replaced with the replacement String. This has the semantics
of Java/Perl's replaceAll: you may use capturing groups and put them
in replacements with $n. Also, as in the example, you can escape a
slash in the middle of the second and third forms with \/ and \\.
This last version lets you make a new label that is an arbitrary
String function of the original label and additional characters that
you supply.
insert <name> <position>
insert <tree> <position>
inserts the named node, or a manually specified tree (see below for
syntax), into the position specified. Right now the only ways to
specify position are:
$+ <name> the left sister of the named node
$- <name> the right sister of the named node
>i <name> the i_th daughter of the named node.
>-i <name> the i_th daughter, counting from the right, of the named node.
move <name> <position>
moves the named node into the specified position. To be precise, it
deletes (*NOT* prunes) the node from the tree, and re-inserts it
into the specified position. See above for how to specify position.
replace <name1> <name2>
deletes name1 and inserts a copy of name2 in its place.
adjoin <tree> <target-node>
adjoins the specified auxiliary tree (see below for syntax) into the
target node specified. The daughters of the target node will become
the daughters of the foot of the auxiliary tree.
adjoinH <tree> <target-node>
similar to adjoin, but preserves the target node and makes it the root
of <tree>. (It is still accessible by its name. The root of
the auxiliary tree is ignored.)
adjoinF <tree> <target-node>
similar to adjoin, but preserves the target node and makes it the foot
of <tree>. (It is still accessible by its name, and retains
its status as parent of its children. The foot of the auxiliary tree
is ignored.)
coindex <name_1> <name_2> ... <name_m>
Puts a (Penn Treebank style) coindexation suffix of the form "-N" on
each of nodes name_1 through name_m. The value of N will be
automatically generated in reference to the existing coindexations
in the tree, so that there is never an accidental clash of
indices across things that are not meant to be coindexed.
-----------------------------------------
Syntax for trees to be inserted or adjoined:
A tree to be inserted or adjoined can be specified with LISP-like
parenthetical-bracketing tree syntax such as that used for the Penn
Treebank. For example, for the NP "the dog" to be inserted you might
use the syntax
(NP (Det the) (N dog))
That's all that there is for a tree to be inserted. Auxiliary trees
(a la Tree Adjoining Grammar) must also have exactly one frontier node
ending in the character "@", which marks it as the "foot" node for
adjunction. Final instances of the character "@" in terminal node labels
will be removed from the actual label of the tree.
For example, if you wanted to adjoin the adverb "breathlessly" into a
VP, you might specify the following auxiliary tree:
(VP (Adv breathlessly) VP@ )
All other instances of "@" in terminal nodes must be escaped (i.e.,
appear as \@); this escaping will be removed by tsurgeon.
In addition, any node of a tree can be named (the same way as in
tregex), by appending =<name> to the node label. That name can be
referred to by subsequent tsurgeon operations triggered by the same
match. All other instances of "=" in node labels must be escaped
(i.e., appear as \=); this escaping will be removed by tsurgeon. For
example, if you want to insert an NP trace somewhere and coindex it
with a node named "antecedent" you might say
insert (NP (-NONE- *T*=trace)) <node-location>
coindex trace antecedent $
-----------------------------------------
Examples of Tsurgeon operations:
Tree (used in all examples):
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
Apply delete:
VP < PP=prep
delete prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)))
(. .)))
The PP node directly dominated by a VP is removed, as is
everything under it.
Apply prune:
S < (NP < NNP=noun)
prune noun
Result:
(ROOT
(S
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The NNP node is removed, and since this results in the NP above it
having no terminal children, the NP node is deleted as well.
Note: This is different from delete in which the NP above the NNP
would remain.
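
The difference between delete and prune can be sketched as follows. This is a
toy model in Python, not Tsurgeon's implementation: trees are nested lists of
the form [label, child, ...], a word is a one-element list, and the helper
names (make_tree, find_path, delete, prune) are hypothetical.

```python
def make_tree():
    """Build a cut-down version of the example tree; return (S, NP, NNP)."""
    nnp = ["NNP", ["Maria_Eugenia_Ochoa_Garcia"]]
    np = ["NP", nnp]
    vp = ["VP", ["VBD", ["was"]], ["VP", ["VBN", ["arrested"]]]]
    return ["S", np, vp, [".", ["."]]], np, nnp

def find_path(tree, node):
    """Return the list of nodes from tree down to node, or None."""
    if tree is node:
        return [tree]
    for child in tree[1:]:
        sub = find_path(child, node)
        if sub:
            return [tree] + sub
    return None

def delete(parent, node):
    """Remove node (and everything below it) from parent's children."""
    parent[1:] = [c for c in parent[1:] if c is not node]

def prune(root, node):
    """Remove node, then remove each ancestor that is left childless."""
    path = find_path(root, node)  # [root, ..., parent, node]
    for parent, child in reversed(list(zip(path, path[1:]))):
        delete(parent, child)
        if len(parent) > 1:  # parent still has children: stop climbing
            break

s, np, nnp = make_tree()
delete(np, nnp)
print([child[0] for child in s[1:]])  # the empty NP is still a child of S

s, np, nnp = make_tree()
prune(s, nnp)
print([child[0] for child in s[1:]])  # the NP has been removed as well
```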
Apply excise:
VP < PP=prep
excise prep prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(IN in)
(NP (NNP May))))
(. .)))
The PP node is removed, and all of its children are added in the
place it was previously located. Excise removes all the nodes from
the first named node to the second named node, and the children of
the second node are added as children of the parent of the first node.
Thus, for another example:
VP=verb < PP=prep
excise verb prep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(IN in)
(NP (NNP May)))
(. .)))
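
Both excise examples reduce to one splice: the children of the second node
take the place of the first node in its parent. The Python sketch below uses a
hypothetical nested-list tree ([label, child, ...]) and a hypothetical excise
helper that is handed the parent explicitly; it illustrates the semantics only
and is not how Tsurgeon is implemented.

```python
def make_vp():
    """Build the outer VP of the example tree; return (outer, inner, pp)."""
    pp = ["PP", ["IN", ["in"]], ["NP", ["NNP", ["May"]]]]
    inner = ["VP", ["VBN", ["arrested"]], pp]
    outer = ["VP", ["VBD", ["was"]], inner]
    return outer, inner, pp

def excise(parent, top, bottom):
    """Replace top among parent's children with the children of bottom."""
    # Index 0 holds the label, so only positions >= 1 are children.
    i = next(k for k, c in enumerate(parent) if k > 0 and c is top)
    parent[i:i + 1] = bottom[1:]

outer, inner, pp = make_vp()
excise(inner, pp, pp)  # like "excise prep prep"
print([child[0] for child in inner[1:]])

outer, inner, pp = make_vp()
excise(outer, inner, pp)  # like "excise verb prep"
print([child[0] for child in outer[1:]])
```

In the second call everything between the inner VP and the PP, including
(VBN arrested), is discarded, which matches the second result tree above.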
Apply relabel:
VP=v < PP=prep
relabel prep verbPrep
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(verbPrep (IN in)
(NP (NNP May)))))
(. .)))
The label for the node called prep (PP) is changed to verbPrep.
The other form of relabel uses regular expressions; consider the following
operation:
/^VB.+/=v
relabel v /^VB(.*)$/$1/
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (D was)
(VP (N arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The Tregex pattern matches all nodes whose labels begin with "VB" and have at
least one more character. The Tsurgeon operation then matches the node label
against the regular expression "^VB(.*)$" and replaces the match with the text
captured by the first group. In this case, that is the part matching (.*),
which is all of the characters after the VB. The node is then relabeled with
that text, causing, for example, "VBD" to be relabeled "D". The "$1" in the
replacement refers to the first capturing group in the regex, following the
/regex/replacement/ form described in the syntax section.
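
The regular-expression form of relabel has the semantics of Java's replaceAll.
As a rough illustration, the same substitution can be tried out in Python,
where group references are written \1 rather than $1. The relabel function
below is a hypothetical helper for experimenting with label rewrites; it is
not part of Tsurgeon.

```python
import re

def relabel(label, regex, replacement):
    """Apply a /regex/replacement/ style relabel to a single node label."""
    # Java writes group references as $1; Python's re.sub expects \1.
    py_replacement = re.sub(r"\$(\d+)", r"\\\1", replacement)
    return re.sub(regex, py_replacement, label)

for tag in ["VBD", "VBN", "NNP"]:
    print(tag, "->", relabel(tag, r"^VB(.*)$", "$1"))

# The replacement may mix literal text with captured groups.
print(relabel("VBD", r"^VB(.*)$", "verb/$1"))
```

Labels that do not match the regular expression, such as NNP here, come back
unchanged.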
Apply insert (shown here with inserting a node, but could also be a tree):
S < (NP < (NNP=name !$- DET))
insert (DET Ms.) $+ name
Result:
(ROOT
(S
(NP (DET Ms.)
(NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
The pattern matches the NNP node that is directly dominated by an NP
(which is directly dominated by an S) and is not a direct right sister
of a DET. Thus, the (DET Ms.) node is inserted immediately to the left
of that NNP node, as specified by "$+ name". "$+" is the location and
"name" describes what node the location is with respect to.
Note: Tsurgeon will re-search for matches after each run of the script;
thus, cycles may occur, causing the program to not terminate. The key
is to write patterns that match prior to the changes you would like to
make but that do not match afterwards. If the clause "!$- DET" had been
left out in this example, Tsurgeon would have matched the pattern after
every insert operation, causing an infinite number of DETs to be added.
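
The termination issue can be seen in a small simulation. The Python sketch
below models only the daughters of the NP as a flat list of labels and
re-searches after every insert, with and without the "!$- DET" guard. The
names find_match and apply_until_stable, and the max_rounds cap that stands in
for "runs forever", are assumptions made for the illustration.

```python
def find_match(children, guarded):
    """Return the index of the first NNP that the pattern matches, or None."""
    for i, label in enumerate(children):
        preceded_by_det = i > 0 and children[i - 1] == "DET"
        if label == "NNP" and not (guarded and preceded_by_det):
            return i
    return None

def apply_until_stable(children, guarded, max_rounds=5):
    """Insert DET to the left of the match until nothing matches any more."""
    rounds = 0
    while rounds < max_rounds:
        i = find_match(children, guarded)
        if i is None:
            break
        children.insert(i, "DET")  # "insert (DET Ms.) $+ name"
        rounds += 1
    return rounds

with_guard = ["NNP"]
print(apply_until_stable(with_guard, guarded=True), with_guard)

without_guard = ["NNP"]
print(apply_until_stable(without_guard, guarded=False), without_guard)
```

With the guard the pattern stops matching after one insert; without it, only
the artificial round limit ends the loop.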
Apply move:
VP=verb < PP=prep
move prep $- verb
Result:
(ROOT
(S
(NP (NNP Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested))
(PP (IN in)
(NP (NNP May))))
(. .)))
The PP is moved out of the VP that dominates it and added as a direct right
sister of the VP. As for insert, "$-" specifies the location for prep while
"verb" specifies what that location is relative to.
Note: "move" is a macro operation that deletes the given node and then inserts
it. "move" does not use prune, and thus any branches that now lack terminals will
remain rather than being removed.
Apply replace:
S < (NP=name < NNP)
replace name (NP (DET A) (NN woman))
Result:
(ROOT
(S
(NP (DET A)
(NN woman))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
"name" is matched to an NP that is dominated by an S and dominates an NNP, and
a new subtree ("(NP (DET A) (NN woman))") is added in the place where "name" was.
Note: This operation is vulnerable to falling into an infinite loop. See the note
concerning the "insert" operation and how patterns are matched.
Apply adjoin:
S < (NP=name < NNP)
adjoin (NP (DET A) (NN woman) NP@) name
Result:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
First, the NP is matched to the NP dominating the NNP tag. Then, the specified
tree ("(NP (DET A) (NN woman) NP@)") is placed in that location. The "@" symbol
specifies that the children of the original NP node ("name") are to be placed
as children of a new NP node that is directly to the right of (NN woman). If
the specified tree were "(NP (DET A) (NN woman) VP@)" then the child
(NNP Maria_Eugenia_Ochoa_Garcia) would appear under a VP. Exactly one "@" node
must appear in the specified tree in order to indicate where to place the node
from the original tree.
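
The adjoin step can be sketched as follows, again with a hypothetical
nested-list tree ([label, child, ...]) in Python. The helpers find_foot and
adjoin are illustrative only. The sketch assumes a single foot node and does
not handle escaped "\@" characters in terminal labels.

```python
import copy

def find_foot(tree):
    """Return the frontier node whose label ends in '@', or None."""
    if len(tree) == 1:
        return tree if tree[0].endswith("@") else None
    for child in tree[1:]:
        foot = find_foot(child)
        if foot is not None:
            return foot
    return None

def adjoin(aux, target):
    """Rewrite target in place: aux takes its position, and the old
    daughters of target become the daughters of the foot of aux."""
    new = copy.deepcopy(aux)
    foot = find_foot(new)
    if foot is None:
        raise ValueError("auxiliary tree has no foot node marked with @")
    foot[0] = foot[0][:-1]   # drop the trailing '@' from the foot label
    foot.extend(target[1:])  # old daughters now hang off the foot
    target[:] = new          # the auxiliary root replaces the target

aux = ["NP", ["DET", ["A"]], ["NN", ["woman"]], ["NP@"]]
np = ["NP", ["NNP", ["Maria_Eugenia_Ochoa_Garcia"]]]
adjoin(aux, np)
print(np)
```

Changing the foot to ["VP@"] in this sketch puts the NNP under a VP instead,
as described above.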
Apply adjoinH:
S < (NP=name < NNP)
adjoinH ((NP (DET A) (NN woman) NP@)) name
Result:
(ROOT
(S
(NP (NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia))))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
This operation differs from adjoin in that it retains the named node (in this
case, "name"). The named node is made the root of the specified tree, resulting
in two NP nodes dominating the DET in this example whereas only one was present
in the previous example. Note that the specified tree is wrapped in an extra
pair of parentheses in order to show the syntax for retaining the named node.
If the extra parentheses were not there and the specified tree was, for example,
(VP (DET A) (NN woman) NP@), the VP would be ignored in order to retain an NP as
the root. Thus, in this case, "adjoinH (VP (DET A) (NN woman) NP@) name" and
"adjoinH ((DET A) (NN woman) NP@) name" both produce the same tree:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
Apply adjoinF:
S < (NP=name < NNP)
adjoinF (NP (DET A) (NN woman) @) name
Result:
(ROOT
(S
(NP (DET A)
(NN woman)
(NP (NNP Maria_Eugenia_Ochoa_Garcia)))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP (NNP May)))))
(. .)))
This operation is very similar to adjoin and adjoinH, but this time the original
named node ("name" in this case) is kept and becomes the foot of the tree that
is adjoined, retaining its children. Thus, no node label needs to be given in
front of the "@" and if one is given, it will be ignored. For instance,
"adjoinF (NP (DET A) (NN woman) VP@) name" would still produce the same tree as
above, despite the VP preceding the @.
Apply coindex:
NP=node < NNP=name
coindex node name
Result:
(ROOT
(S
(NP-1 (NNP-1 Maria_Eugenia_Ochoa_Garcia))
(VP (VBD was)
(VP (VBN arrested)
(PP (IN in)
(NP-2 (NNP-2 May)))))
(. .)))
This causes the named nodes to be numbered such that all nodes that are part
of the same match have the same number and all matches have distinct new names.
We had two instances of an NP dominating an NNP in this example, and they were
renamed such that NP-i < NNP-i for each match, with 1 <= i <= number of matches.
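
One way to picture the index generation is the Python sketch below. It assumes
that a fresh index is one more than the largest "-N" suffix already present in
the tree, which is a guess at the behavior for illustration rather than a
description of Tsurgeon's code. The tree is a hypothetical nested list
([label, child, ...]), and in this simple form a word ending in "-N" would
also be counted.

```python
import re

def labels(tree):
    """Yield every label in the tree, top-down."""
    yield tree[0]
    for child in tree[1:]:
        yield from labels(child)

def coindex(root, nodes):
    """Append the same fresh '-N' suffix to every node in nodes."""
    used = []
    for label in labels(root):
        match = re.search(r"-(\d+)$", label)
        if match:
            used.append(int(match.group(1)))
    n = max(used, default=0) + 1  # never clashes with an existing index
    for node in nodes:
        node[0] += f"-{n}"
    return n

tree = ["S",
        ["NP", ["NNP", ["Maria_Eugenia_Ochoa_Garcia"]]],
        ["VP", ["NP", ["NNP", ["May"]]]]]
first, second = tree[1], tree[2][1]

coindex(tree, [first, first[1]])    # first match of NP=node < NNP=name
coindex(tree, [second, second[1]])  # second match gets a new index
print(first[0], first[1][0], second[0], second[1][0])
```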
-----------------------------------------
TSURGEON SCRIPTS
-----------------------------------------
Script format:
Tsurgeon scripts are a combination of a Tregex pattern to match and a series
of Tsurgeon operations to perform on that match. The first line of a Tsurgeon
script should be the Tregex pattern. This should be followed by a blank line,
and then each subsequent line may contain one Tsurgeon operation. Tsurgeon
operations should not be separated by blank lines. The following is an example
of correctly formatted script:
S < NP=node < NNP=name

relabel node NP_NAME
coindex node name
Comments:
The character % introduces a comment that extends to the end of the
line. All other intended uses of % must be escaped as \% .
-----------------------------------------
CONTACT
-----------------------------------------
For questions about this distribution, please contact Stanford's JavaNLP group at
parser-support@lists.stanford.edu. We provide assistance on a best-effort basis.
-----------------------------------------
LICENSE
-----------------------------------------
Tregex, Tsurgeon, and Interactive Tregex
Copyright (c) 2003-2011 The Board of Trustees of
The Leland Stanford Junior University. All Rights Reserved.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/ .
For more information, bug reports, fixes, contact:
Christopher Manning
Dept of Computer Science, Gates 2A
Stanford CA 94305-9020
USA
parser-support@lists.stanford.edu
http://nlp.stanford.edu/software/tregex.html