Project author: vivekverma239

Project description:
Language Model Pretraining
Primary language: Python
Project address: git://github.com/vivekverma239/lm_pretraining.git
Created: 2019-02-19T19:08:56Z
Project community: https://github.com/vivekverma239/lm_pretraining


Language Model Pretraining for NLP Tasks

This repo studies the effect of language model pretraining on common NLP tasks. It also aims to provide
a simple interface for using pretrained language models in Keras.

General Idea

The general idea is simple: pretrain LSTM layers on a language modelling task, then reuse the trained weights in a downstream NLP task. It is essentially the same idea as ULMFiT, but packaged so it can be used as a model in Keras.
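To make the weight-transfer step concrete, here is a minimal, self-contained sketch of the general pattern in Keras. It is only an illustration, not this repo's actual implementation: the layer names, sizes, and the plain softmax head (the repo uses sampled softmax) are placeholder choices.

  # Illustrative sketch of LSTM weight transfer between a language model
  # and a downstream classifier (not this repo's actual code).
  from keras import layers
  from keras.models import Model

  vocab_size, embed_size, hidden_size, maxlen = 10000, 500, 500, 70

  # Phase 1: language model -- predict the next token at each position.
  lm_input = layers.Input(shape=(maxlen,), dtype="int32")
  embedding = layers.Embedding(vocab_size, embed_size, name="embedding")
  encoder = layers.LSTM(hidden_size, return_sequences=True, name="lstm_encoder")
  lm_output = layers.TimeDistributed(
      layers.Dense(vocab_size, activation="softmax"))(encoder(embedding(lm_input)))
  lm_model = Model(lm_input, lm_output)
  lm_model.compile("adam", loss="sparse_categorical_crossentropy")
  # ... fit lm_model on next-token targets here ...

  # Phase 2: downstream classifier that reuses the pretrained encoder weights.
  clf_input = layers.Input(shape=(maxlen,), dtype="int32")
  clf_embedding = layers.Embedding(vocab_size, embed_size)
  clf_encoder = layers.LSTM(hidden_size)  # return_sequences=False for classification
  clf_output = layers.Dense(1, activation="sigmoid")(clf_encoder(clf_embedding(clf_input)))
  clf_model = Model(clf_input, clf_output)
  clf_model.compile("adam", loss="binary_crossentropy", metrics=["acc"])

  # Transfer the pretrained weights into the matching layers.
  clf_embedding.set_weights(embedding.get_weights())
  clf_encoder.set_weights(encoder.get_weights())

The key point is that return_sequences only changes what the LSTM layer outputs, not its weights, so the same pretrained weights can be reused for sequence-level classification.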

How to use?

There are two steps: first pretrain the LSTM encoder, then use it inside a Keras model.

Pretrain LSTM Encoder:

  python pretrain.py --train_file TRAIN_FILE --valid_file VAL_FILE --tokenizer [TOKENIZER]

Params:

  • TRAIN_FILE: Training file for the language model; each sentence should be on a separate line (see the sketch after this parameter list).
  • VAL_FILE: Validation file for the language model, same format as above.
  • TOKENIZER [nltk]: (spacy/nltk) Which tokenizer to use; nltk is used by default.
  • config params:
    • batch_size: Batch size for training and evaluation (default 32)
    • hidden_size: LSTM hidden size (default 500)
    • num_layers: Number of LSTM Layers (default 1)
    • epochs: Number of epochs to train (default 10)
    • seq_length: BPTT sequence length for LM training (default: 70)
    • max_vocab_size: Max vocab size for the embedding (default: 60000)
    • embed_size: Embedding size (default: 500)
    • dropout: Dropout rate (default: 0.5)
    • num_candidate_samples: Number of candidates for sampled softmax (default: 2048)
    • clip: Gradient clipping threshold (default: 0.25)
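As a quick illustration of the expected input format, the sketch below writes a toy corpus with one sentence per line and splits it into training and validation files. The corpus, file names, and split ratio are arbitrary choices for illustration; pretrain.py only requires the one-sentence-per-line format.

  # Write a corpus in the one-sentence-per-line format expected by pretrain.py.
  # The sentences, file names, and 90/10 split below are purely illustrative.
  import random

  sentences = [
      "the movie was surprisingly good",
      "i would not watch this again",
      "a slow start but a strong finish",
      # ... one sentence per element ...
  ]

  random.shuffle(sentences)
  split = int(0.9 * len(sentences))

  with open("train.txt", "w") as f:
      f.write("\n".join(sentences[:split]))

  with open("valid.txt", "w") as f:
      f.write("\n".join(sentences[split:]))

  # Then pretrain the encoder, e.g.:
  #   python pretrain.py --train_file train.txt --valid_file valid.txt --tokenizer nltk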

Using the pretrained LSTM Encoder:

  from keras import layers
  from keras.models import Model
  import keras.backend as K
  from keras.callbacks import Callback
  import tensorflow as tf

  # PretrainedLSTM is provided by this repository; maxlen, pretrained_model_path,
  # x_train and y_train are defined by the user.

  # Model expects a sequence of tokens (words) as input
  input_ = layers.Input(shape=(maxlen,), dtype=tf.string)
  # pretrained_model_path -> path where the pretrained model is saved
  pretrained_model = PretrainedLSTM(pretrained_model_path, input_, return_sequences=False)
  encoder_output = pretrained_model.outputs[0]
  final_output = layers.Dense(1, activation="sigmoid")(encoder_output)
  model = Model(inputs=[input_], outputs=[final_output])
  model.compile("adam", loss="binary_crossentropy", metrics=['acc'])

  # This callback is needed to initialize the word-to-idx lookup table
  class TableInitializerCallback(Callback):
      """ Initialize tables """
      def on_train_begin(self, logs=None):
          K.get_session().run(tf.tables_initializer())

  callbacks = [TableInitializerCallback()]

  # Finally fit
  model.fit(x_train, y_train, epochs=10, callbacks=callbacks)
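Since the input layer is declared with dtype=tf.string, x_train should hold the word tokens themselves rather than integer ids. The sketch below shows one way such an array might be prepared; the tokenizer, the "<pad>" padding token, and maxlen=100 are assumptions made for illustration, not requirements documented by this repo.

  # Illustrative preparation of string-token inputs for the model above.
  # The padding token, tokenizer, and maxlen are assumptions.
  import numpy as np
  from nltk.tokenize import word_tokenize  # nltk.download("punkt") may be needed first

  maxlen = 100  # must match the maxlen used for the Input layer above

  def to_token_matrix(texts, maxlen=maxlen, pad="<pad>"):
      rows = []
      for text in texts:
          tokens = word_tokenize(text.lower())[:maxlen]
          tokens += [pad] * (maxlen - len(tokens))  # right-pad to a fixed length
          rows.append(tokens)
      return np.array(rows)

  x_train = to_token_matrix(["what a great film", "terrible acting and a dull plot"])
  y_train = np.array([1, 0])

The word-to-idx lookup table initialized by the callback above then maps these string tokens to ids at training time.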

Examples:

  • IMDB Movie Review Dataset (Open in Colab)

Improvement:

  • Data Efficiency: The biggest improvement is in data efficiency: with very few labelled examples, model performance increases significantly.
    For this example we used only the IMDB data itself for pretraining; including other sources should further improve performance.
  • Final Performance: The final performance of the model also improves; we achieved a final accuracy of 92.41% on the test data.

IMDB Learning Curve

TODOs:

  • Improve the language model
  • Add bidirectional LM support
  • Package it as a pip package