Architectural Overview of edu.stanford.nlp.loglinear:

The goal of this package is to provide fast, general structure log-linear modelling that's easy to use and to extend.

The package is broken into three parts: model, inference, and learning.

Model contains all of the basic storage elements, as well as means to serialize and deserialize for both storage and
network transit. Inference depends on model, and provides an implementation of the clique tree message passing
algorithm for efficient exact inference in tree-structured graphs. Learning depends on inference and model, and
provides a simple interface to efficient multithreaded batch learning, with an implementation of AdaGrad guarded by
backtracking.

We will go over model, then inference, then learning.

#####################################################

Model module overview:

#####################################################

***
ConcatVector:

The key to the speed of loglinear is the ConcatVector class. ConcatVector provides a useful abstraction for NLP machine
learning: a concatenation of vectors, treated as a single vector. The basic idea is to have each feature output a
vector, and to store those vectors in a ConcatVector (or 'concatenated vector'). When the dot product of two
ConcatVectors is taken, the result is the sum of the dot products of the corresponding concatenated components of the
two vectors. To write that out explicitly, if a feature ConcatVector f is composed of a number of vectors f_i, and a
weight ConcatVector w is composed of a number of vectors w_i, then dot(f,w) is:

\sum_i dot(f_i, w_i)

This leaves us with two key advantages over a regular vector: each component can be individually tuned for sparsity,
and each component has an isolated namespace, so an individual feature vector can grow after training begins (say, on
discovering a new word in a one-hot feature vector) and the weight vector will behave appropriately without hassle.
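
To make the arithmetic concrete, here is a hedged usage sketch. The method names (setDenseComponent,
setSparseComponent, dotProduct) are written from memory of the ConcatVector API and should be checked against the
current source; the feature layout itself is made up for illustration.

    import edu.stanford.nlp.loglinear.model.ConcatVector;

    public class ConcatVectorSketch {
      public static void main(String[] args) {
        // Component 0: a small dense feature; component 1: a one-hot word-identity feature, stored sparsely.
        ConcatVector features = new ConcatVector(2);
        features.setDenseComponent(0, new double[]{1.0, 2.0});
        features.setSparseComponent(1, 7, 1.0);   // one-hot: only index 7 is non-zero

        ConcatVector weights = new ConcatVector(2);
        weights.setDenseComponent(0, new double[]{0.5, -1.0});
        weights.setDenseComponent(1, new double[]{0, 0, 0, 0, 0, 0, 0, 3.0});

        // dot(f, w) = dot(f_0, w_0) + dot(f_1, w_1) = (1*0.5 + 2*-1.0) + (1.0*3.0) = 1.5
        System.out.println(features.dotProduct(weights));
      }
    }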

***
NDArray

We have a basic NDArray, which provides a standard iterator over possible assignments that creates a lot of int[]
arrays on the heap, and a more elaborate iterator that saves GC by mutating a single array that it hands back on every
step. You'll see the latter used throughout the code in hot loops marked by an "//OPTIMIZATION" comment.
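
The GC-saving iterator is easiest to see as an odometer over the joint assignments. This is an illustrative sketch of
the two iteration styles, not the real NDArray API:

    // Odometer-style advance over all joint assignments; returns false once every assignment has been visited.
    static boolean advance(int[] assignment, int[] dimensions) {
      for (int i = dimensions.length - 1; i >= 0; i--) {
        assignment[i]++;
        if (assignment[i] < dimensions[i]) return true;
        assignment[i] = 0;
      }
      return false;
    }

    static void allocatingLoop(int[] dimensions) {
      int[] assignment = new int[dimensions.length];
      do {
        int[] copy = assignment.clone();   // a fresh heap allocation per assignment, which the GC must clean up
        // ... use copy ...
      } while (advance(assignment, dimensions));
    }

    static void mutatingLoop(int[] dimensions) {
      int[] assignment = new int[dimensions.length];  // one array reused for the whole loop
      do {
        // ... use assignment here, but never hold a reference to it across iterations ...
      } while (advance(assignment, dimensions));
    }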

***
ConcatVectorTable

ConcatVectorTable is a subclass of NDArray that we use to store factor tables for the log-linear graphical model, where
each element of the table holds the features for one joint assignment to the variables the factor is associated with.
To get a factor like the ones you learned about in CS 228, each element of the table is dot-producted with the weights.
We don't do this at construction time, so that a single set of GraphicalModel objects can be reused throughout training
as the weights change.
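
In other words, for a weight vector w and a table cell holding features f(assignment), the number that ends up in the
factor is built from dot(w, f(assignment)); in a log-linear model the unnormalized potential is its exponential:

factor(assignment) = exp( dot(w, f(assignment)) )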

***
GraphicalModel

GraphicalModel is a super stripped-down implementation of a graphical model. It holds factors, each represented by a
list of neighbor indices and a ConcatVectorTable of features. We deliberately make all downstream annotations on the
model (like observations for inference or for training) go into a HashMap. This is to maintain easy backwards
compatibility with previous serialized versions as features change, and to make life more convenient for downstream
algorithms that may be passing GraphicalModel objects across module or network boundaries, and don't want to create
tons of little 'ride-along' objects that add annotations to the GraphicalModel.
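
Here is a hedged construction sketch. The method names (addFactor taking neighbor indices, variable dimensions, and a
per-assignment featurizer; getVariableMetaDataByReference returning the annotation HashMap) are written from memory and
should be checked against the current source, and the metadata key here is a made-up placeholder.

    import edu.stanford.nlp.loglinear.model.ConcatVector;
    import edu.stanford.nlp.loglinear.model.GraphicalModel;

    public class GraphicalModelSketch {
      public static void main(String[] args) {
        GraphicalModel model = new GraphicalModel();

        // A binary factor over variables 0 and 1, each of which can take 2 values.
        model.addFactor(new int[]{0, 1}, new int[]{2, 2}, (int[] assignment) -> {
          ConcatVector features = new ConcatVector(1);
          // Hypothetical feature: fires when the two variables agree.
          features.setSparseComponent(0, 0, assignment[0] == assignment[1] ? 1.0 : 0.0);
          return features;
        });

        // Downstream annotations ride along in per-variable HashMaps instead of wrapper objects.
        // "my.observed.value" is a placeholder key; the real keys are defined by the consumers.
        model.getVariableMetaDataByReference(0).put("my.observed.value", "1");
      }
    }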

#####################################################

Inference module overview:

#####################################################

***
TableFactor

This is the traditional 'factor' datatype that you're used to hearing about from Daphne Koller in CS 228 and
"Probabilistic Graphical Models". It's a subclass of NDArray, and has fast operations for product and marginalize
dataflows. It's the key building block for inference.
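
As a quick reminder of what those two operations mean: if phi_1(A,B) and phi_2(B,C) are factors, their product is the
factor psi(A,B,C) = phi_1(A,B) * phi_2(B,C), and marginalizing C out of psi gives

\sum_c psi(A,B,c)

which is again a factor over just A and B. The message passing below is built entirely out of these two operations.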

***
CliqueTree

This object takes a GraphicalModel at creation and provides high-speed tree-shaped message passing inference for both
exact marginals and exact MAP estimates. It exists as a new object for each GraphicalModel, rather than as a static
call per model, to allow some messages to be cached when repeated marginals are needed on only slightly changing
models.
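
A hedged usage sketch; the constructor and method names (CliqueTree(model, weights), calculateMarginals, calculateMAP)
are written from memory and should be checked against the current source:

    import edu.stanford.nlp.loglinear.inference.CliqueTree;
    import edu.stanford.nlp.loglinear.model.ConcatVector;
    import edu.stanford.nlp.loglinear.model.GraphicalModel;

    public class CliqueTreeSketch {
      public static void main(String[] args) {
        GraphicalModel model = new GraphicalModel();
        // A single binary variable with a feature that fires when it takes the value 1.
        model.addFactor(new int[]{0}, new int[]{2}, (int[] assignment) -> {
          ConcatVector f = new ConcatVector(1);
          f.setSparseComponent(0, 0, assignment[0] == 1 ? 1.0 : 0.0);
          return f;
        });

        ConcatVector weights = new ConcatVector(1);          // all-zero weights give uniform potentials
        CliqueTree tree = new CliqueTree(model, weights);
        double[][] marginals = tree.calculateMarginals();    // marginals[v][k] ~ P(variable v takes value k)
        int[] map = tree.calculateMAP();                      // one most-likely value per variable
      }
    }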

#####################################################

Learning module overview:

#####################################################

***
AbstractDifferentiableFunction

This follows the Optimize.jl package convention of providing both the gradient and the function value in a single
return value.
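
The rationale is that the value and the gradient usually share almost all of their computation, so computing them
together avoids doing that work twice. An illustrative shape for such an interface (not the actual
AbstractDifferentiableFunction signature):

    // A single call returns both quantities, since they share most of their computation.
    final class ValueGradient {
      final double value;
      final double[] gradient;
      ValueGradient(double value, double[] gradient) { this.value = value; this.gradient = gradient; }
    }

    interface DifferentiableAt {
      ValueGradient at(double[] x);
    }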

***
LogLikelihoodFunction

An implementation of AbstractDifferentiableFunction for calculating the log-likelihood of a log-linear model as given
by a GraphicalModel.
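
For reference, the quantity involved is the standard log-linear log-likelihood. For a training assignment y with
features f(y) and weights w,

log P(y) = dot(w, f(y)) - log Z        where Z = \sum_{y'} exp( dot(w, f(y')) )

and its gradient with respect to w is the observed features minus the model's expected features,

f(y) - \sum_{y'} P(y') f(y')

where the expectation (and Z) come from exact marginals, which is what CliqueTree provides.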

***
AbstractOnlineOptimizer

This is the basic interface for online optimizers to follow. It is sketched out right now, but no implementations have
been made yet.

***
AbstractBatchOptimizer

There is a fair amount of redundant complexity involved in writing an optimizer that needs to calculate the gradient on
the entire batch of examples at every update step. The work must be carefully balanced between threads so that the time
between the first thread finishing and the last, during which CPU utilization is far below 100%, is minimized. This is
managed by roughly estimating the amount of work each item represents, and then, once the system is running, updating
those estimates perceptron-style based on the CPU time used by each thread. We also implement a convenience function
here that lets the user interrupt training early if they are happy with the convergence so far, since that involves
some tricky Java threading to make work.
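
An illustrative sketch of the balancing idea only, not the actual AbstractBatchOptimizer code: examples carry an
estimated cost, threads are filled greedily so each gets roughly equal total estimated cost, and the estimates are then
corrected from measured running time.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class BatchBalanceSketch {
      // Greedy longest-processing-time assignment: hand each example (most expensive first)
      // to whichever thread currently has the least total estimated work.
      static List<List<Integer>> balance(double[] estimatedCost, int numThreads) {
        List<List<Integer>> buckets = new ArrayList<>();
        double[] bucketCost = new double[numThreads];
        for (int t = 0; t < numThreads; t++) buckets.add(new ArrayList<>());
        Integer[] order = new Integer[estimatedCost.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(estimatedCost[b], estimatedCost[a]));
        for (int example : order) {
          int lightest = 0;
          for (int t = 1; t < numThreads; t++) if (bucketCost[t] < bucketCost[lightest]) lightest = t;
          buckets.get(lightest).add(example);
          bucketCost[lightest] += estimatedCost[example];
        }
        return buckets;
      }

      // Perceptron-style correction: nudge an example's estimate toward the time that was actually measured.
      static void updateEstimate(double[] estimatedCost, int example, double measuredTime, double rate) {
        estimatedCost[example] += rate * (measuredTime - estimatedCost[example]);
      }
    }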

***
BacktrackingAdaGradOptimizer

This subclasses AbstractBatchOptimizer, and implements a simple AdaGrad gradient descent guarded by backtracking line
search to maximize an AbstractDifferentiableFunction.
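
An illustrative sketch of one such update, not the actual BacktrackingAdaGradOptimizer code (it reuses the
DifferentiableAt/ValueGradient sketch from the AbstractDifferentiableFunction section): AdaGrad scales each
coordinate's step by the root of its accumulated squared gradients, and a backtracking loop halves the step until the
maximized objective actually improves.

    public class AdaGradBacktrackingSketch {
      // One ascent step on weights w; sumSquaredGrad persists across calls and accumulates g_i^2 per coordinate.
      static void step(double[] w, double[] sumSquaredGrad, DifferentiableAt f, double learningRate) {
        ValueGradient vg = f.at(w);
        double[] direction = new double[w.length];
        for (int i = 0; i < w.length; i++) {
          sumSquaredGrad[i] += vg.gradient[i] * vg.gradient[i];
          direction[i] = learningRate * vg.gradient[i] / (Math.sqrt(sumSquaredGrad[i]) + 1e-9);
        }

        // Backtracking line search: halve the step until the objective improves, or give up.
        double alpha = 1.0;
        double[] candidate = new double[w.length];
        while (alpha > 1e-10) {
          for (int i = 0; i < w.length; i++) candidate[i] = w[i] + alpha * direction[i];
          if (f.at(candidate).value > vg.value) {
            System.arraycopy(candidate, 0, w, 0, w.length);  // accept the first improving point
            return;
          }
          alpha /= 2;
        }
        // No improving step found; leave w unchanged.
      }
    }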