Stanford Arabic Segmenter - v4.2.0 - 2020-11-17
-----------------------------------------------

(c) 2003-2020 The Board of Trustees of The Leland Stanford Junior
University. All Rights Reserved.

Arabic segmenter by Spence Green
CRF code by Jenny Finkel
Support code by Stanford JavaNLP members

The system requires Java 6 (JDK1.6) or higher.

The Arabic word segmenter is based on a conditional random field (CRF)
sequence classifier. The included TreebankPreprocessor package can be
used to generate training data for the model, assuming a copy of the
Penn Arabic Treebank (ATB) is available.

INSTALLATION

The segmenter does not require any platform-specific installation.
Unpack the gzip'd tar file in any convenient location. To invoke the
segmenter, you'll need to add the ".jar" files in the unpacked
directory to your Java classpath. If you don't know how to modify your
classpath, read this tutorial:

  http://docs.oracle.com/javase/tutorial/essential/environment/paths.html

The commands in the remainder of this README assume that you've added
the jars to your classpath.

USAGE

The segmenter expects a newline-delimited text file with UTF-8
encoding. If you're working with Arabic script, chances are that your
file is already encoded in UTF-8. If it is in some other encoding, you
can convert it with "iconv", a standard command-line tool for changing
file encodings.

Suppose that your raw Arabic file is called "my_arabic_file.txt". You
can segment the tokens in this file with the following command:

  java -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atb+bn+arztrain.ser.gz -textFile my_arabic_file.txt > my_arabic_file.txt.segmented

Additional command-line options are available to mark proclitics and
enclitics that were split by the segmenter. Suppose that the raw token
"AABBBCC" was split into three segments "AA BBB CC", where "AA" is a
proclitic and "CC" is an enclitic. You can direct the segmenter to
mark these clitics with these command-line options:

  -prefixMarker #
  -suffixMarker #

In this case, the segmenter would produce "AA# BBB #CC". Here '#' is
the segment marker, but the options accept any character as an
argument. You can use different markers for proclitics and enclitics.

ORTHOGRAPHIC NORMALIZATION

The segmenter contains a deterministic orthographic normalization
package:

  edu.stanford.nlp.international.arabic.process.ArabicTokenizer

ArabicTokenizer supports various orthographic normalization options
that can be configured in ArabicSegmenter using the -orthoOptions
flag. The argument to -orthoOptions is a comma-separated list of
normalization options. The following options are supported:

  useUTF8Ellipsis   : Replaces sequences of three or more full stops with \u2026
  normArDigits      : Convert Arabic digits to ASCII equivalents
  normArPunc        : Convert Arabic punctuation to ASCII equivalents
  normAlif          : Change all alif forms to bare alif
  normYa            : Map ya to alif maqsura
  removeDiacritics  : Strip all diacritics
  removeTatweel     : Strip tatweel elongation character
  removeQuranChars  : Remove diacritics that appear in the Quran
  removeProMarker   : Remove the ATB null pronoun marker
  removeSegMarker   : Remove the ATB clitic segmentation marker
  removeMorphMarker : Remove the ATB morpheme boundary markers
  removeLengthening : Replace sequences of three identical characters with one
  atbEscaping       : Replace left/right parentheses with ATB escape characters
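For example, an invocation that strips diacritics and the tatweel
character would look like the following. This is only a sketch of the
flag syntax: "my_model.ser.gz" is a placeholder, and (per the note
below) it must name a model that was trained with these same options:

  java -mx1g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier my_model.ser.gz -orthoOptions removeDiacritics,removeTatweel -textFile my_arabic_file.txt > my_arabic_file.txt.segmented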
The orthographic normalization options must match at both training and
test time! Consequently, if you want to apply an orthographic
normalization that differs from the default, you'll need to retrain
ArabicSegmenter.

SEGMENTING DIALECTAL TEXT

The segmenter supports segmentation of dialectal Arabic via domain
adaptation [Hal Daumé III, Frustratingly Easy Domain Adaptation, ACL
2007]. The model that comes with this distribution is trained to
support Egyptian dialect. To indicate that the provided text is in
Egyptian dialect, add the command-line option:

  -domain arz

You can also construct a file that specifies a dialect for each
newline-separated sentence by adding "atb" [MSA] or "arz" [Egyptian]
at the beginning of each line, followed by a space character. This
feature is enabled with the flag:

  -withDomains

See the bottom of the next section for information about training the
segmenter on your own dialectal data.

TRAINING THE SEGMENTER

The current model is trained on parts 1-3 of the ATB, parts 1-8 of the
ARZ treebank, and the Broadcast News treebank. To train a new model,
you need to create a data file from the unpacked LDC distributions.
You can create this data file with the script tb-preproc, which is
included in the segmenter release. You'll need:

  - Python 2.7 (for running the preprocessing scripts)

  - an unpacked LDC distribution with files in *integrated* format

  - a directory with four text files called dev, train, test, and all,
    each of which lists filenames of integrated files (these usually
    end in .su.txt), one per line. dev, train, and test can be left
    empty if you have no need for a train/test split. The splits that
    we use are included in the distribution. You can also find them at
    http://nlp.stanford.edu/software/parser-arabic-data-splits.shtml

Once you have these, run the tb-preproc script, providing the
necessary arguments:

  atb_base      - the most specific directory that is a parent of all
                  integrated files you wish to include. Files are
                  located recursively by name in this directory; having
                  several copies of the same distribution within it is
                  not recommended (though the only consequence is that
                  you will train on redundant data).

  splits_dir    - the directory containing the dev, train, and test
                  listings

  output_prefix - the location and filename prefix that will identify
                  the output files. The preprocessor appends
                  "-all.utf8.txt" to this argument to give the name of
                  the output file for the "all" split (and similarly
                  for dev, train, and test).

  domain        - [optional] a label for the Arabic dialect/genre of
                  this data. Our model uses "atb" for ATB1-3, "bn" for
                  Broadcast News, and "arz" for Egyptian. If a domain
                  is given, additional files will be generated (named
                  e.g. "output_prefix-withDomains-all.utf8.txt") for
                  training the domain adaptation model.

Suppose your output_prefix is "./atb". You should see files in the
current working directory named

  atb-dev.utf8.txt
  atb-train.utf8.txt
  atb-test.utf8.txt

You can use the train file to retrain the segmenter with this command:

  java -Xmx12g -Xms12g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -trainFile atb-train.utf8.txt -serializeTo my_trained_segmenter.ser.gz

This command will produce the serialized model
"my_trained_segmenter.ser.gz", which you can use for raw text
processing as described in the "USAGE" section above.

The command above trains the model with L2 regularization. The model
included in the distribution was trained with L1 regularization, which
decreases the model file size dramatically in exchange for a usually
negligible drop in accuracy. To use L1 regularization, add the
options:

  -useOWLQN -priorLambda 0.05
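For instance, the retraining command above becomes the following with
L1 regularization enabled (the file names are the same examples used
earlier in this section):

  java -Xmx12g -Xms12g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -trainFile atb-train.utf8.txt -useOWLQN -priorLambda 0.05 -serializeTo my_trained_segmenter.ser.gz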
TRAINING FOR DIALECT

To train a model with domain adaptation, first make sure you have
generated a training file with domain labels. You can create this
using the preprocessing script with the optional domain argument, or
do it yourself with a simple sed script (see the sketch at the end of
this README): the -withDomains files differ from the plain training
files only in a domain identifier, followed by a space, prepended to
each line. The domain labels can be arbitrary strings, as long as they
contain no whitespace characters; thus, if you have data available for
other dialects in ATB format, you can train your own system to support
those dialects.

For best results, include MSA data as well as your dialect data in
your training set. You can do this by simply concatenating the ATB1-3
-withDomains file and the dialect -withDomains file. (Adding data from
dialects other than your target dialect should not hurt performance,
as long as they are marked as different domains--it may even help!)

The training command for domain-labeled data is:

  java -Xmx64g -Xms64g edu.stanford.nlp.international.arabic.process.ArabicSegmenter -withDomains -trainFile atb+arz-withDomains-train.utf8.txt -serializeTo my_trained_segmenter.ser.gz

Warning: training with lots of data from several domains requires a
lot of memory and processor time. If you have enough memory to fit all
of the weights for the entire dataset in RAM (a bit less than 64G for
ATB1-3 + BN + ARZ), training will take about ten days of
single-threaded processor time. This can be parallelized by adding the
option:

  -multiThreadGrad

If you are not running on a machine with 64G of RAM, the training is
likely to take much longer. You have been warned.
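As a sketch of the sed approach mentioned above, the following command
prepends the domain label "arz" and a space to every line of a plain
training file (both file names here are hypothetical):

  sed 's/^/arz /' arz-train.utf8.txt > arz-withDomains-train.utf8.txt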