Automatic transliteration with LSTM
This is a tool to transliterate inconsistently romanized text. It is tested on Armenian (hy-AM). We invite everyone interested to add more languages. Instructions are below.
Read more in the corresponding blog post.
Install required packages:
pip install -r requirements.txt
Before training on the corpus, we need to compute the vocabularies with the following command:
python make_vocab.py --language hy-AM
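The exact behavior of make_vocab.py is not shown here, but a vocabulary step of this kind typically counts the characters in the training corpus and assigns each one an integer id. A minimal sketch, under that assumption:

```python
from collections import Counter

def make_vocab(text, min_count=1):
    # Count every character in the corpus and keep those seen at least
    # min_count times; assign each a stable integer id.
    # (A sketch only -- the real make_vocab.py may differ.)
    counts = Counter(text)
    chars = sorted(c for c, n in counts.items() if n >= min_count)
    return {c: i for i, c in enumerate(chars)}

vocab = make_vocab("բարեւ barev")
```

The resulting mapping is what lets the network consume characters as integer indices.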
The actual training is initiated by a command like this:
python -u train.py --hdim 1024 --depth 2 --batch_size 200 --seq_len 30 --language hy-AM &> log.txt
--hdim and --depth define the biLSTM parameters. --seq_len is the maximum length of a character sequence given to the network. The output will be written to log.txt.
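To illustrate what --seq_len does, here is a sketch (an assumption about the preprocessing, not the repo's actual code) of cutting a corpus into character chunks of bounded length before they are fed to the network:

```python
def to_sequences(text, seq_len=30):
    # Split the corpus into consecutive chunks of at most seq_len
    # characters, so each training example fits the network's window.
    return [text[i:i + seq_len] for i in range(0, len(text), seq_len)]

chunks = to_sequences("a" * 70, seq_len=30)
# chunk lengths: 30, 30, 10
```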
During training, the models are saved in the model folder. The following command will run the test set through the selected model:
python -u test.py --hdim 1024 --depth 2 --model {MODEL} --language hy-AM
The above command expects that the test set contains text in the original language. The next one takes a file with romanized text and prints the transliterated text:
python -u test.py --hdim 1024 --depth 2 --model {MODEL} --language hy-AM --translit_path {FILE_NAME}
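The network learns to invert romanization rules of the kind stored in transliteration.json, where one letter can have several romanizations. A hypothetical forward direction (randomly romanizing original text, with an invented subset of rules) can be sketched as:

```python
import random

# Hypothetical rules in the spirit of transliteration.json:
# each source letter maps to its possible romanizations.
RULES = {"ա": ["a"], "բ": ["b"], "ց": ["c", "ts"], "ու": ["u", "ou"]}

def romanize(word, rules, rng=None):
    # Greedy left-to-right romanization, trying longer source letters
    # (e.g. "ու") before single characters, picking a variant at random.
    rng = rng or random.Random(0)
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for k in keys:
            if word.startswith(k, i):
                out.append(rng.choice(rules[k]))
                i += len(k)
                break
        else:
            out.append(word[i])  # no rule: keep the symbol as-is
            i += 1
    return "".join(out)
```

Because several romanizations map back to the same letter (and vice versa), the inverse is ambiguous, which is exactly why a learned model is used instead of a rule table.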
Finally, the plot_loss.py command will draw the graphs of training and validation losses for the given log file. --ymax puts a limit on the y axis.
python plot_loss.py --log log.txt --window 10000 --ymax 3
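The --window flag suggests the losses are smoothed before plotting. A minimal sketch of such smoothing, assuming it is a moving average over the last window points (the real script may smooth differently):

```python
def smooth(losses, window):
    # Running mean over the last `window` loss values, computed in O(n)
    # by maintaining a sliding sum.
    out, s = [], 0.0
    for i, x in enumerate(losses):
        s += x
        if i >= window:
            s -= losses[i - window]
        out.append(s / min(i + 1, window))
    return out
```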
This is what we did for Armenian. A similar process will be needed for other languages.
First, we prepare the corpus: split it into three parts (80% - train.txt, 10% - val.txt, 10% - test.txt) and store them in the languages/LANG_CODE/data/ folder.

Next we add some language-specific configuration files:

- languages/LANG_CODE/transliteration.json, a file with the romanization rules, like this one
- languages/LANG_CODE/long_letters.json, a file with an array of the multi-symbol letters of the current language (Armenian has ու and two capitalizations of it: Ու and ՈՒ)

Then we run make_vocab.py to generate the “vocabulary”
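The role of long_letters.json is to keep multi-symbol letters like ու from being split into two vocabulary entries. A sketch of how such a letter-level tokenizer could work (an illustration, not the repo's actual code):

```python
LONG_LETTERS = ["ու", "Ու", "ՈՒ"]  # Armenian entries from long_letters.json

def split_letters(text, long_letters=LONG_LETTERS):
    # Walk the text left to right, matching multi-symbol letters first
    # so they stay intact, and falling back to single characters.
    longs = sorted(long_letters, key=len, reverse=True)
    letters, i = [], 0
    while i < len(text):
        for l in longs:
            if text.startswith(l, i):
                letters.append(l)
                i += len(l)
                break
        else:
            letters.append(text[i])
            i += 1
    return letters
```

With this, a word like ջուր is treated as three letters rather than four characters.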