train.4_5M.truecase.txt is 4.5M sentences extracted from Wikipedia
train.1_5M.truecase.txt is 1.5M sentences extracted from Wikipedia (a subset of the above)
test.truecase.txt is 500K sentences for testing
truecasing.4_5M.fast.qn.ser.gz is a model trained on the 4.5M sentence set
truecasing.4_5M.fast.caseless.qn.ser.gz is another model trained on the same data, but it works on caseless text
The 1.5M...ser.gz files are smaller models, if those are needed
MixDisambiguation.list came from... somewhere.
Most likely it was generated with
projects/core/src/edu/stanford/nlp/truecaser/MixDisambiguation.java,
but it is not clear what data was used.
The "fast" models are ones that were retrained after feature pruning (a CRF training option);
the full model does not use pruning and is therefore huge.
Prop files, the makefile, this readme, etc. are in git.
train.orig.txt is a copy of /scr/nlp/data/gale/NIST09/truecaser/crf/noUN.input
test.orig.txt is a copy of /scr/nlp/data/gale/AE-MT-eval-data/mt06/cased/ref0
The WikiExtractor tool comes from
https://github.com/attardi/wikiextractor
The Wikipedia data is at:
/u/scr/nlp/data/Wikipedia/enwiki-20190920-pages-articles-multistream.xml.bz2
Steps to rebuild the truecase data:
1) run wikiextractor on the data
nohup python3 wikiextractor/WikiExtractor.py wiki/enwiki-20190920-pages-articles-multistream.xml > text.out 2>&1 &
2) run pick_text.py to extract a certain number of sentences, such as 2M.
This script has decent defaults, so you should be able to just run it:
python3 pick_text.py
3) tokenize the text. Suggestion:
nohup java -mx90g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file ~/truecase/wiki.raw.txt -ssplit.newlineIsSentenceBreak always -outputFormat tagged > wiki.tokenized.txt
If you don't have that much memory lying around, you can first split
the file into two or more pieces and tokenize each piece separately (see the example below).
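For example, a sketch with GNU coreutils split (the line count and output prefix here are arbitrary choices, not part of the original workflow):
split -l 2000000 ~/truecase/wiki.raw.txt wiki.raw.part.
Then run the tokenization command above on each wiki.raw.part.* file and concatenate the outputs.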
To rebuild the models:
use a machine with a large amount of RAM
run make (an example invocation is below)
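For instance, following the same nohup pattern as the other long-running commands in this file (the log file name is just a suggestion):
nohup make > make.log 2>&1 &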
Rough description of the models:
truecasing.fast.caseless.qn.ser.gz:
  reduced in size from the full version using the feature dropping mechanism
  operates on text without considering the original case
truecasing.fast.qn.ser.gz:
  also reduced in size,
  but uses the original case to determine the replacement case.
  In practice this is not particularly useful, given the way the Wikipedia data is built;
  it would need to be rebuilt using data with the casing intentionally fudged into common errors.
truecasing.full.qn.ser.gz:
  same, but just as useless and significantly larger, since it did not use feature pruning
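A minimal sketch of using one of these models through the CoreNLP truecase annotator is below. It assumes the standard "truecase" annotator and the truecase.model property; the model path and input sentence are placeholders, so adjust them to wherever the model actually lives.

  import java.util.Properties;
  import edu.stanford.nlp.ling.CoreAnnotations;
  import edu.stanford.nlp.ling.CoreLabel;
  import edu.stanford.nlp.pipeline.Annotation;
  import edu.stanford.nlp.pipeline.StanfordCoreNLP;

  public class TruecaseDemo {
    public static void main(String[] args) {
      Properties props = new Properties();
      // truecase needs pos and lemma in addition to tokenize and ssplit
      props.setProperty("annotators", "tokenize,ssplit,pos,lemma,truecase");
      // point the annotator at one of the models above (path is a placeholder)
      props.setProperty("truecase.model", "truecasing.4_5M.fast.caseless.qn.ser.gz");
      StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

      Annotation doc = new Annotation("i met barack obama in new york");
      pipeline.annotate(doc);
      // each token carries a TrueCaseTextAnnotation with the recased word form
      for (CoreLabel token : doc.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + " -> "
            + token.get(CoreAnnotations.TrueCaseTextAnnotation.class));
      }
    }
  }

The same thing from the command line would look roughly like the tokenization command above, with -annotators tokenize,ssplit,pos,lemma,truecase and -truecase.model pointing at the chosen model.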