Sequence-to-Sequence in TensorFlow
Sequence-to-Sequence (Seq2Seq) is a general end-to-end framework that maps sequences in a source domain to sequences in a target domain. A Seq2Seq model first reads the source sequence with an encoder to build vector-based ‘understanding’ representations, then passes them through a decoder to generate the target sequence; for this reason it is also referred to as the encoder-decoder architecture. Many NLP tasks have benefited from the Seq2Seq framework, including machine translation, text summarization and question answering. Seq2Seq models vary in terms of their exact architecture: a multi-layer bi-directional RNN (e.g. LSTM, GRU) for the encoder and a multi-layer uni-directional RNN with autoregressive decoding (e.g. greedy, beam search) for the decoder are natural choices for the vanilla Seq2Seq model. The attention mechanism was later introduced to allow the decoder to pay ‘attention’ to relevant encoder outputs directly, which brings significant improvement on top of the already successful vanilla Seq2Seq model. Furthermore, the ‘Transformer’, a novel architecture based on the self-attention mechanism, has been proposed and has outperformed both recurrent and convolutional models on various tasks. Although it is out of scope for this repo, I’d like to refer interested readers to this post for more details.
Figure 1: Encoder-Decoder architecture of Seq2Seq model
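Below is a minimal sketch of the encoder-decoder wiring described above, written with tf.keras; the vocabulary size, embedding dimension, and single-layer decoder are illustrative assumptions (the repo's actual model is built in seq2seq_run.py from the JSON config):

```python
import tensorflow as tf

# Illustrative toy dimensions; not taken from the repo's config.
vocab_size, embed_dim, unit_dim = 8000, 256, 512

# Encoder: embedding + single-layer bi-directional LSTM.
enc_inputs = tf.keras.Input(shape=(None,), dtype="int32")
enc_embed = tf.keras.layers.Embedding(vocab_size, embed_dim)(enc_inputs)
enc_outputs, fw_h, fw_c, bw_h, bw_c = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(unit_dim, return_sequences=True, return_state=True)
)(enc_embed)

# Bridge: concatenate forward/backward states to initialize the decoder.
init_state = [tf.keras.layers.Concatenate()([fw_h, bw_h]),
              tf.keras.layers.Concatenate()([fw_c, bw_c])]

# Decoder: embedding + uni-directional LSTM + linear projection to
# vocabulary logits (teacher forcing at train time).
dec_inputs = tf.keras.Input(shape=(None,), dtype="int32")
dec_embed = tf.keras.layers.Embedding(vocab_size, embed_dim)(dec_inputs)
dec_outputs = tf.keras.layers.LSTM(2 * unit_dim, return_sequences=True)(
    dec_embed, initial_state=init_state)
logits = tf.keras.layers.Dense(vocab_size)(dec_outputs)

model = tf.keras.Model([enc_inputs, dec_inputs], logits)
```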
```bash
# run experiment in train mode
python seq2seq_run.py --mode train --config config/config_seq2seq_template.xxx.json
# run experiment in eval mode
python seq2seq_run.py --mode eval --config config/config_seq2seq_template.xxx.json
# encode source as CoVe vector
python seq2seq_run.py --mode encode --config config/config_seq2seq_template.xxx.json
# random search hyper-parameters
python hparam_search.py --base-config config/config_seq2seq_template.xxx.json --search-config config/config_search_template.xxx.json --num-group 10 --random-seed 100 --output-dir config/search
# visualize summary via tensorboard
tensorboard --logdir=output
```
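The JSON configs above control the model and training hyper-parameters. As a rough illustration only (the real schema is defined by config/config_seq2seq_template.xxx.json; the key names here are hypothetical), a config matching the settings reported in the tables below might look like:

```json
{
    "encoder": { "model_type": "bi_lstm", "num_layers": 1, "unit_dim": 512 },
    "decoder": { "model_type": "lstm", "num_layers": 2, "unit_dim": 512, "beam_size": 10 },
    "pretrained_embedding": false,
    "max_len": 300
}
```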
Figure 2: Vanilla Seq2Seq architecture
| IWSLT’15 EN-VI | Perplexity | BLEU Score |
|---|---|---|
| Dev | 25.09 | 9.47 |
| Test | 25.87 | 9.35 |
Table 1: Performance of the vanilla Seq2Seq model on the IWSLT’15 English-Vietnamese task with the following settings: (1) encoder: model type = Bi-LSTM, num layers = 1, unit dim = 512; (2) decoder: model type = LSTM, num layers = 2, unit dim = 512, beam size = 10; (3) pre-trained embedding = false, max len = 300
| IWSLT’15 VI-EN | Perplexity | BLEU Score |
|---|---|---|
| Dev | 29.52 | 8.49 |
| Test | 33.16 | 7.88 |
Table 2: Performance of the vanilla Seq2Seq model on the IWSLT’15 Vietnamese-English task with the following settings: (1) encoder: model type = Bi-LSTM, num layers = 1, unit dim = 512; (2) decoder: model type = LSTM, num layers = 2, unit dim = 512, beam size = 10; (3) pre-trained embedding = false, max len = 300
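Both tables above decode with beam size = 10. For readers unfamiliar with the procedure, here is a minimal, framework-agnostic beam-search sketch; `step_fn` is a hypothetical stand-in for one decoder step that returns log-probabilities over the vocabulary (setting beam_size=1 reduces it to greedy decoding):

```python
import numpy as np

def beam_search(step_fn, start_id, end_id, beam_size=10, max_len=300):
    """Minimal beam search; `step_fn(seq)` is assumed to return a
    [vocab_size] array of next-token log-probabilities for `seq`."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:      # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)   # next-token distribution
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # keep the top `beam_size` hypotheses by cumulative log-prob
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]  # best-scoring hypothesis
```

Note this sketch omits length normalization, which practical beam-search implementations typically apply to avoid favoring short hypotheses.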
Figure 3: Attention-based Seq2Seq architecture
| IWSLT’15 EN-VI | Perplexity | BLEU Score |
|---|---|---|
| Dev | 12.56 | 22.41 |
| Test | 10.79 | 25.23 |
Table 3: Performance of the attention-based Seq2Seq model on the IWSLT’15 English-Vietnamese task with the following settings: (1) encoder: model type = Bi-LSTM, num layers = 1, unit dim = 512; (2) decoder: model type = LSTM, num layers = 2, unit dim = 512, beam size = 10; (3) pre-trained embedding = false, max len = 300, att type = scaled multiplicative
| IWSLT’15 VI-EN | Perplexity | BLEU Score |
|---|---|---|
| Dev | 11.83 | 19.37 |
| Test | 10.42 | 21.40 |
Table 4: Performance of the attention-based Seq2Seq model on the IWSLT’15 Vietnamese-English task with the following settings: (1) encoder: model type = Bi-LSTM, num layers = 1, unit dim = 512; (2) decoder: model type = LSTM, num layers = 2, unit dim = 512, beam size = 10; (3) pre-trained embedding = false, max len = 300, att type = scaled multiplicative
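Tables 3 and 4 use att type = scaled multiplicative. Below is a minimal sketch of one step of that attention, assuming a Luong-style multiplicative score scaled by the square root of the encoder dimension (the function name, shapes, and `W` are illustrative, not the repo's API):

```python
import tensorflow as tf

def scaled_multiplicative_attention(dec_state, enc_outputs, W):
    """One decoding step of scaled multiplicative attention.
    Assumed shapes:
      dec_state:   [batch, dec_dim]          current decoder state
      enc_outputs: [batch, src_len, enc_dim] encoder outputs
      W:           [dec_dim, enc_dim]        trainable projection
    """
    # score_j = (s W) . h_j / sqrt(enc_dim)
    projected = tf.matmul(dec_state, W)                       # [batch, enc_dim]
    scores = tf.einsum('be,ble->bl', projected, enc_outputs)  # [batch, src_len]
    scores /= tf.sqrt(tf.cast(tf.shape(enc_outputs)[-1], scores.dtype))
    weights = tf.nn.softmax(scores, axis=-1)                  # attention weights
    context = tf.einsum('bl,ble->be', weights, enc_outputs)   # weighted context
    return context, weights
```

The resulting context vector is then combined with the decoder state (e.g. concatenated and projected) before predicting the next token.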