Project author: FrancescoGradi

Project description:
Comparison between RNNs and Attention in Document Classification
Primary language: Python
Project URL: git://github.com/FrancescoGradi/DocumentClassificationwithHANandBERT.git


Document Classification with HAN, LSTM and BERT

Thanks to Deep Learning, Natural Language Processing (NLP) has grown a lot over the past few years. This project deals
with some of the latest techniques for Document Classification, an important NLP task that consists of assigning a
document to a category. When the category is a sentiment (a numeric evaluation of the text), the task is called
Sentiment Analysis, as in the datasets used in this implementation.

This Project

In this project, we want to replicate some experiments from the literature and compare different approaches:

  • HAN is a sophisticated model based on Recurrent Neural Networks, in particular GRUs (a gated variant designed to
    capture long-term dependencies), combined with hierarchical attention mechanisms that treat words and sentences at
    different levels. The idea is to let the model pay more or less attention to individual words and sentences when
    constructing the representation of the document (required for classification) [1].

  • BERT has an architecture based on Transformers: layers built around strong multi-head attention together with
    other techniques such as positional encoding and residual connections, without any RNNs. The Base version stacks
    12 transformer encoders, while Large has 24 [2].

  • LSTM is an RNN architecture based on a single bidirectional LSTM layer (like the GRU, the LSTM uses gates to
    mitigate the vanishing gradient problem), with appropriate regularization and without attention mechanisms [3].

  • KD-LSTM: the authors of the previous model also proposed a Knowledge Distillation version of their LSTM, obtained
    thanks to BERT. The main idea is to use a big teacher model (BERT, in this case) to distill knowledge into a
    smaller, faster student network (the LSTM) in order to achieve better results [4]; see the sketch after this list.
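
A minimal sketch of the distillation idea, in PyTorch (hedged: this is the classic soft-target formulation, not
necessarily the exact objective used in this repository; distillation_loss, T and alpha are illustrative names):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
      # Soft targets: KL divergence between temperature-scaled distributions;
      # the T*T factor keeps gradient magnitudes comparable across temperatures
      soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction='batchmean') * (T * T)
      # Hard targets: usual cross-entropy against the gold labels
      hard = F.cross_entropy(student_logits, labels)
      return alpha * soft + (1.0 - alpha) * hard

During training the teacher (BERT) runs in evaluation mode with gradients disabled, and only the student (LSTM) is
updated.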

Datasets

To test the models, three Sentiment Analysis datasets were chosen: IMDB Small, the full IMDB and Yelp 2014 (see the
results table below).

Results

We report the accuracy on the test set for every dataset and model. BERT reaches the best results, but it is also the
heaviest network. The LSTM also achieves good results, without any attention mechanism. Almost all results are quite
close to those of the cited papers.

Model         IMDB Small   IMDB   Yelp 2014
HAN           86.6         46.4   69.0
BERT_base     94.6         57.7   77.4
LSTM_reg      94.2         52.7   71.1
KD-LSTM_reg   94.6         58.5   71.7

Visualization of Attention in HAN

The code also allows visualizing the attention of the HAN model (with the hanPredict function), because it is
relatively easy to extract the partial model weights and reconstruct the most attended words and sentences (a sketch
of the underlying computation appears after the examples). Below are two reviews from Yelp: blue marks the most
important sentences and red the most relevant words.

HAN PREDICTION: 5, TARGET: 5. Here the word 'bad' has been correctly interpreted on the basis of its context.

HAN PREDICTION: 1, TARGET: 1. The first and the last sentences received the most attention; the word 'recommend' has a
different meaning here, because the context is different.
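
For reference, here is a minimal sketch of the word-level attention computation of Yang et al. [1], which is what the
visualization reconstructs (the weight names W_w, b_w and u_w are illustrative; in practice they are extracted from
the trained attention layer):

  import numpy as np

  def word_attention_weights(h, W_w, b_w, u_w):
      # h: (num_words, hidden) GRU outputs for one sentence
      u = np.tanh(h @ W_w + b_w)             # u_it = tanh(W_w * h_it + b_w)
      scores = u @ u_w                       # similarity with the word context vector
      alpha = np.exp(scores - scores.max())
      return alpha / alpha.sum()             # softmax over the words of the sentence

Highlighting a review then amounts to coloring each word in proportion to its weight, and analogously each sentence
with the sentence-level attention weights.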

Reproducing Experiments

Dependencies

This project uses PyTorch and TensorFlow 2 (the latter only for the HAN model); a GPU is needed for training. The code
was developed and tested with these main dependencies:

  • Python 3.7.7
  • numpy 1.18.1
  • nltk 3.4.5
  • pandas 1.0.3
  • pytorch 1.4.0
  • tensorboard 2.1.0
  • tensorflow 2.1.0
  • transformers 2.10.0

After cloning this repository, all dependencies can be installed from the command line:

  $ pip install -r requirements.txt

For the HAN preprocessing, glove.6B.100d.txt is also required (the 100-dimensional version was chosen for this
project); it can be retrieved from the GloVe site.
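
As an illustration, here is a minimal sketch of how the GloVe file can be loaded into an embedding matrix
(assumptions: the file has one "word v1 v2 ... v100" entry per line, and word_index is a {word: id} vocabulary
produced by the preprocessing step):

  import numpy as np

  # Read the pre-trained vectors into a dictionary
  embeddings = {}
  with open('glove.6B.100d.txt', encoding='utf-8') as f:
      for line in f:
          values = line.split()
          embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

  # Map each vocabulary word to its GloVe vector (unknown words stay all-zero)
  embedding_matrix = np.zeros((len(word_index) + 1, 100))
  for word, i in word_index.items():
      vector = embeddings.get(word)
      if vector is not None:
          embedding_matrix[i] = vector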

How to make it work

The pipeline consists of getting the dataset in pandas DataFrame format (there are some utility functions; the code
expects the dataset in the local datasets/ directory), preprocessing (which automatically splits the data into train,
validation and test sets), training and evaluating. Here is a main.py example:

  from preprocessing import bertPreprocessing
  from train import lstmTrain
  from utils import readIMDB

  # Load the dataset as a pandas DataFrame
  dataset_name, n_classes, data_df = readIMDB()
  # Tokenize the documents and save the train/valid/test splits
  bertPreprocessing(dataset_name, data_df, MAX_LEN=128)
  # Train the LSTM model on the preprocessed data
  lstmTrain(dataset_name, n_classes, TRAIN_BATCH_SIZE=64, EPOCHS=20, LEARNING_RATE=1e-03)

Logs are continuously saved and updated during training. TensorBoard is a good tool for tracking and visualizing
metrics:

  $ tensorboard --logdir logs/IMDB_lstm

At the end, the model is saved in models/model_IMDB_lstm/; it is possible to evaluate the model on the test set by
adding the model path and running this function:

  from predict import lstmEvaluate

  # Evaluate the saved model ('IMDB' dataset, 10 classes) on the test set
  lstmEvaluate('IMDB', 10, model_path='models/model_IMDB_lstm/20200618-133908')

Report

A copy of the report (in Italian) can be found here.

References

[1] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention
networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages
4171–4186.

[3] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures
for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4046–4051.

[4] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification.
arXiv preprint arXiv:1904.08398.