Hidden Markov models for word representations
Learn discrete and continuous word representations with Hidden Markov models, including variants defined over unlabeled and labeled parse trees.
Author: Simon Šuster
If you use this code for any of the tree HMM types, please cite our paper:
Word Representations, Tree Models and Syntactic Functions. Simon Šuster, Gertjan van Noord and Ivan Titov. arXiv preprint arXiv:1508.07709, 2015. bibtex
Note that the running time is relatively slow and is especially sensitive to the number of states. A speed-up would be possible through the use of sparse matrices, but in several places the replacement is not trivial.
Three model architectures are available: a sequential Hidden Markov model, a Hidden Markov tree model, and a Hidden Markov tree model with syntactic relations.
Models can be trained by invoking run.py. Use --tree to train a Hidden Markov tree model, and --rel to train a Hidden Markov tree model with syntactic relations. A full list of options can be found by running:
python3.3 run.py --help
First prepare the data that you plan to use. This will replace infrequent tokens with *unk*:
python3.3 corpus_normalize.py --dataset data/sample.en --output $DATASET --freq_thresh 1
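For reference, the normalization step amounts to counting token frequencies and replacing rare tokens with *unk*. A minimal sketch of the idea (not the script itself; whitespace tokenization and the exact threshold comparison are assumptions):

```python
from collections import Counter

def normalize(lines, freq_thresh=1, unk="*unk*"):
    """Replace tokens whose corpus frequency is at or below freq_thresh with unk."""
    counts = Counter(tok for line in lines for tok in line.split())
    return [" ".join(tok if counts[tok] > freq_thresh else unk
                     for tok in line.split())
            for line in lines]

# usage sketch:
# with open("data/sample.en", encoding="utf-8") as f:
#     normalized = normalize(f.read().splitlines(), freq_thresh=1)
```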
Then, to train a sequential Hidden Markov model:
python3.3 run.py --dataset $DATASET --desired_n_states 60 --max_iter 2 --n_proc 1 --approx
This will create an output directory whose name starts with hmm_....
The following configuration will train the same model, but with the splitting procedure and Brown initialization:
python3.3 run.py --dataset $DATASET --start_n_states 30 --desired_n_states 60 -brown sample.en.30.paths --max_iter 2 --n_proc 1 --approx
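The -brown option points to a file with Brown cluster paths. As a rough illustration, such a file can be read into a word-to-cluster mapping; this sketch assumes the common three-column paths format (bit-string path, word, frequency) produced by standard Brown clustering tools, which may differ from what run.py expects:

```python
def read_brown_paths(path):
    """Map each word to its Brown cluster bit-string, e.g. 'dog' -> '0010'."""
    word2cluster = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _freq = line.rstrip("\n").split("\t")
            word2cluster[word] = bits
    return word2cluster

# clusters = read_brown_paths("sample.en.30.paths")
```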
To train a tree model, again start by normalizing the parsed data with *unk*:
python3.3 corpus_normalize.py --dataset data/sample.en.conll --output $DATASET_TREE --freq_thresh 1 --conll
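The tree models consume dependency trees in CoNLL format. For intuition, a sentence can be viewed as a list of (form, head index, relation) triples; the sketch below assumes the usual 10-column CoNLL-X layout (form in column 2, head in column 7, deprel in column 8) and is not the project's own reader:

```python
def read_conll(path):
    """Yield sentences as lists of (form, head_index, deprel) triples."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:           # blank line ends a sentence
                if sent:
                    yield sent
                    sent = []
                continue
            cols = line.split("\t")
            # CoNLL-X columns: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL ...
            sent.append((cols[1], int(cols[6]), cols[7]))
    if sent:
        yield sent
```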
Then to train a Hidden Markov tree model:
python3.3 run.py --tree --dataset $DATASET_TREE --start_n_states 30 --desired_n_states 60 --max_iter 2 --n_proc 1 --approx
To train a Hidden Markov tree model with syntactic relations:
python3.3 run.py --rel --dataset $DATASET_TREE --desired_n_states 60 --max_iter 2 --n_proc 1 --approx
To carry out NER evaluation on the CoNLL-2003 datasets for English:
python3.3 eng_ner_run.py -d posterior -rep hmm_.../ -o out --n_epochs 2
This runs an averaged structured perceptron, an adaptation of the LXMLS implementation, for 2 epochs (more are needed in practice). The training time is about 1 minute per epoch. Note that prior to training the perceptron, word representations are first inferred (decoded), which also takes a couple of minutes. We use posterior here (posterior decoding), which produces results comparable to Viterbi. If you choose the posterior_cont or posterior_cont_type decoding methods, keep in mind that they are very memory intensive.
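For intuition, posterior decoding assigns each token the state with the highest marginal posterior (obtained from forward-backward), whereas Viterbi picks the single best state sequence; the continuous variants keep the full posterior vector per token rather than its argmax, which is why they use much more memory. A minimal numpy sketch of the difference, assuming per-token posteriors `gamma` of shape (sentence_length, n_states) have already been computed by the trained model:

```python
import numpy as np

def posterior_decode(gamma):
    """Discrete representation: one state id per token (argmax of posteriors)."""
    return gamma.argmax(axis=1)

def posterior_cont(gamma):
    """Continuous representation: the full n_states-dimensional posterior
    vector per token (memory-heavy for large state spaces)."""
    return gamma

# gamma = forward_backward(sentence)   # hypothetical; produced by the trained HMM
# states = posterior_decode(gamma)
```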
See the scripts under output for model introspection utilities.