Stanford Chinese Segmenter - v4.2.0 - 2020-11-17
--------------------------------------------
(c) 2003-2020 The Board of Trustees of The Leland Stanford Junior University.
All Rights Reserved.
Chinese segmenter by Pi-Chuan Chang, Huihsin Tseng, and Galen Andrew
CRF code by Jenny Finkel
Support code by Stanford JavaNLP members
The system requires Java 6 (JDK1.6) or higher.

USAGE

Unix:
> segment.sh [-k] [ctb|pku] <filename> <encoding> <size>
  ctb : Chinese Treebank
  pku : Peking University

  filename: The file you want to segment. Each line is a sentence.
  encoding: UTF-8, GB18030, etc.
  (This must be a character encoding name known by Java)
  size: size of the n-best list (just put '0' to print the best hypothesis
  without probabilities).
  -k: keep all white spaces in the input

* Sample usage: segment.sh ctb test.simp.utf8 UTF-8 0
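
The argument order above can also be assembled programmatically. The following
is a minimal Java sketch, not part of the distribution: the class name
SegmentCommand and the "./segment.sh" path are assumptions made for
illustration. It checks the encoding name with the standard
java.nio.charset.Charset API before building the argument list.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not shipped with the segmenter). */
final class SegmentCommand {

    /**
     * Builds the argument list in the order of the usage line:
     * segment.sh [-k] [ctb|pku] <filename> <encoding> <size>
     */
    static List<String> build(boolean keepWhitespace, String model,
                              String filename, String encoding, int nBest) {
        if (!model.equals("ctb") && !model.equals("pku")) {
            throw new IllegalArgumentException("model must be ctb or pku: " + model);
        }
        // The encoding must be a name known to the local Java runtime.
        if (!Charset.isSupported(encoding)) {
            throw new IllegalArgumentException("unknown encoding: " + encoding);
        }
        if (nBest < 0) {
            throw new IllegalArgumentException("n-best size must be >= 0: " + nBest);
        }
        List<String> cmd = new ArrayList<>();
        // Assumption: the working directory is the segmenter installation.
        cmd.add("./segment.sh");
        if (keepWhitespace) {
            cmd.add("-k");
        }
        cmd.add(model);
        cmd.add(filename);
        cmd.add(encoding);
        cmd.add(Integer.toString(nBest));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(false, "ctb", "test.simp.utf8", "UTF-8", 0);
        System.out.println(String.join(" ", cmd));
    }
}
```

The returned list can be handed to java.lang.ProcessBuilder to launch the
script from the installation directory.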

* Note: Large test files require a lot of memory. To process large
  data files, you may need to raise the Java heap limit (e.g., to use
  8 GB of memory, change "-mx2g" to "-mx8g" inside segment.sh).
  Alternatively, split the test file into smaller files to reduce
  memory usage.
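
Splitting the input can be done with any tool. As one possible approach, the
Java sketch below writes consecutive chunks of at most N lines next to the
input file, which keeps sentences intact because each line is one sentence.
The class name SplitInput and the ".partN" file suffix are assumptions made
for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not shipped with the segmenter). */
final class SplitInput {

    /** Writes chunks of at most linesPerChunk lines; returns the chunk paths in order. */
    static List<Path> split(Path input, Charset charset, int linesPerChunk)
            throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, charset)) {
            BufferedWriter out = null;
            int linesInChunk = 0;
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (out == null) {
                        // Start a new chunk file, e.g. test.simp.utf8.part0
                        Path chunk = input.resolveSibling(
                                input.getFileName() + ".part" + chunks.size());
                        chunks.add(chunk);
                        out = Files.newBufferedWriter(chunk, charset);
                    }
                    out.write(line);
                    out.newLine();
                    if (++linesInChunk == linesPerChunk) {
                        out.close();
                        out = null;
                        linesInChunk = 0;
                    }
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return chunks;
    }

    /** Arguments: input file, encoding name, lines per chunk. */
    public static void main(String[] args) throws IOException {
        List<Path> chunks = split(Paths.get(args[0]), Charset.forName(args[1]),
                Integer.parseInt(args[2]));
        for (Path chunk : chunks) {
            System.out.println(chunk);
        }
    }
}
```

Each chunk can then be segmented separately and the outputs concatenated in
the same order.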

* In addition to the command line scripts, there is a Java class
  "SegDemo" which shows how to call the segmenter in Java code.
  Usage:

    java -mx2g -cp "*:." SegDemo test.simp.utf8

  SegDemo as supplied assumes that it is running in the home directory of the
  installation; to run it anywhere else, you need to set the path to the
  dictionaries.
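
As a rough illustration of setting the dictionary path, the Java sketch below
builds the kind of Properties object the demo passes to the segmenter, with
the data directory supplied by the caller. The property names and the
dictionary file name are assumptions recalled from the SegDemo source, so
check them against the SegDemo.java in your copy. The class name
SegmenterProps is hypothetical, and the segmenter calls appear only as
comments because they need the segmenter jar on the classpath.

```java
import java.util.Properties;

/** Hypothetical helper (not shipped with the segmenter). */
final class SegmenterProps {

    /**
     * Builds segmenter properties for a given data directory.
     * Assumption: property names and the dictionary file name follow
     * SegDemo.java; verify them against your distribution.
     */
    static Properties forDataDir(String dataDir, String encoding) {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", dataDir);
        props.setProperty("serDictionary", dataDir + "/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", encoding);
        props.setProperty("sighanPostProcessing", "true");
        return props;
    }

    public static void main(String[] args) {
        // Default to "data", the directory the README's model paths use.
        String dataDir = args.length > 0 ? args[0] : "data";
        Properties props = forDataDir(dataDir, "UTF-8");
        props.list(System.out);

        // With the segmenter jar on the classpath, the demo then does
        // roughly the following (assumed from SegDemo, not compiled here):
        //   CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        //   segmenter.loadClassifierNoExceptions(dataDir + "/ctb.gz", props);
        //   List<String> words = segmenter.segmentString(sentence);
    }
}
```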

SEGMENTATION MODELS

Two segmentation models are provided. The "ctb" model was trained with Chinese
Treebank (CTB) segmentation, and the "pku" model was trained with Peking
University's (PKU) segmentation. PKU models yield smaller vocabulary
sizes and lower OOV rates on test data than CTB models.

For both CTB and PKU, we provide two models representing slightly different
feature sets:

Models "ctb" and "pku" incorporate lexicon features to increase consistency in
segmentation.
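
To make the vocabulary size and OOV rate comparison in this section concrete,
here is a small Java sketch that computes both from whitespace-separated
segmenter output. It assumes the common token-based definition (the fraction
of test tokens that never occur in the training vocabulary), which may differ
from the exact measure behind the statement above. The class name OovRate is
hypothetical, and the sample lines use ASCII placeholder words rather than
real segmenter output.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not shipped with the segmenter). */
final class OovRate {

    /** Collects the distinct words in whitespace-separated segmented lines. */
    static Set<String> vocabulary(List<String> segmentedLines) {
        Set<String> vocab = new HashSet<>();
        for (String line : segmentedLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    vocab.add(word);
                }
            }
        }
        return vocab;
    }

    /** Fraction of test tokens that are absent from the training vocabulary. */
    static double oovRate(Set<String> trainVocab, List<String> testLines) {
        int total = 0;
        int oov = 0;
        for (String line : testLines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                total++;
                if (!trainVocab.contains(word)) {
                    oov++;
                }
            }
        }
        // An empty test set has no out-of-vocabulary tokens by convention here.
        return total == 0 ? 0.0 : (double) oov / total;
    }

    public static void main(String[] args) {
        // Placeholder words standing in for segmented Chinese text.
        List<String> train = Arrays.asList("wo zhu zai meiguo .");
        List<String> test = Arrays.asList("wo zhu zai zhongguo .");
        Set<String> vocab = vocabulary(train);
        System.out.println("vocabulary size: " + vocab.size());
        System.out.println("OOV rate: " + oovRate(vocab, test));
    }
}
```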

DATA

[Segmentation standard]

(Chinese Penn Treebank)
The "ctb" model segments according to Chinese Penn Treebank
segmentation conventions. For more information, see:
http://www.cis.upenn.edu/~chinese/segguide.3rd.ch.pdf

(Peking University)
The "pku" model segments according to the Peking University standard.
For more information, see:
http://sighan.cs.uchicago.edu/bakeoff2005/data/pku_spec.pdf

[Training data]

(Chinese Penn Treebank)
"data/ctb.gz" is trained with the training data in the LDC Chinese Treebank 7.

(Peking University)
"data/pku.gz" is trained with the data provided by Peking University
for the Second International Chinese Word Segmentation Bakeoff.
See:
http://sighan.cs.uchicago.edu/bakeoff2005/

MORE INFORMATION

The details of the segmenter can be found in this paper:

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky
and Christopher Manning.
"A Conditional Random Field Word Segmenter."
In Fourth SIGHAN Workshop on Chinese Language Processing. 2005.
http://nlp.stanford.edu/pubs/sighan2005.pdf

(Notice that the training data, features and normalizations
used in this distribution are not exactly the same as the systems
described in the paper.)

The description of the lexicon features can be found in:

Pi-Chuan Chang, Michel Galley and Christopher Manning.
"Optimizing Chinese Word Segmentation for Machine Translation Performance."
In ACL 2008 Third Workshop on Statistical Machine Translation.
http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf

For more information, look in the included Javadoc, starting with the
edu.stanford.nlp.ie.crf.CRFClassifier class documentation.

Send any questions or feedback to java-nlp-support@lists.stanford.edu.