Project author: FrancescoGradi

Project description:
Comparison between RNNs and Attention in Document Classification
Primary language: Python
Project URL: git://github.com/FrancescoGradi/DocumentClassificationwithHANandBERT.git


Document Classification with HAN, LSTM and BERT

Thanks to Deep Learning, Natural Language Processing (NLP) has grown a lot over the past few years. This project deals
with some of the latest techniques for Document Classification, an important NLP task that consists of assigning a
document to a category. When the category is a sentiment (a numeric evaluation of the text), the task is called
Sentiment Analysis, as in the datasets used in this implementation.

This Project

In this project, we want to replicate some experiments from the literature and compare different approaches:

  • HAN is a sophisticated model based on Recurrent Neural Networks, in particular GRUs (a gated variant designed to
    capture long-term dependencies), combined with hierarchical attention mechanisms that treat words and sentences at
    different levels. The idea is to let the model pay more or less attention to individual words and sentences when
    constructing the representation of the document (required for classification) [1].

  • BERT has an architecture based on Transformers: layers built around strong multi-head attention together with
    other techniques such as positional encoding and residual connections, without any RNNs. The Base version stacks
    12 transformer encoders, while Large has 24 [2].

  • LSTM is an RNN architecture based on a single bidirectional LSTM layer (like the GRU, the LSTM uses gates to
    mitigate the vanishing gradient problem), with appropriate regularization and without attention mechanisms [3].

  • KD-LSTM: the authors of the previous model also proposed a Knowledge Distillation version of their LSTM, obtained
    thanks to BERT. The main idea is to use a big teacher model (BERT, in this case) to distill knowledge into a
    smaller, faster student network (the LSTM) in order to achieve better results [4]; see the sketch after this list.
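
A minimal sketch of the distillation idea, in PyTorch (hedged: this is the classic soft-target formulation, not
necessarily the exact objective used in this repository; distillation_loss, T and alpha are illustrative names):

  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
      # Soft targets: KL divergence between temperature-scaled distributions;
      # the T*T factor keeps gradient magnitudes comparable across temperatures
      soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction='batchmean') * (T * T)
      # Hard targets: usual cross-entropy against the gold labels
      hard = F.cross_entropy(student_logits, labels)
      return alpha * soft + (1.0 - alpha) * hard

During training the teacher (BERT) runs in evaluation mode with gradients disabled, and only the student (LSTM) is
updated.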

Datasets

To test the models, three Sentiment Analysis datasets were chosen: IMDB Small, the full IMDB and Yelp 2014 (see the
results table below).

Results

We report the accuracy on the test set for every dataset and model. BERT reaches the best results, but it is also the
heaviest network. The LSTM also achieves good results, without any attention mechanism. Almost all results are quite
close to those of the cited papers.

Model         IMDB Small   IMDB   Yelp 2014
HAN           86.6         46.4   69.0
BERT_base     94.6         57.7   77.4
LSTM_reg      94.2         52.7   71.1
KD-LSTM_reg   94.6         58.5   71.7

Visualization of Attention in HAN

The code also allows visualizing the attention of the HAN model (with the hanPredict function), because it is
relatively easy to extract the partial model weights and reconstruct the most attended words and sentences (a sketch
of the underlying computation appears after the examples). Below are two reviews from Yelp: blue marks the most
important sentences and red the most relevant words.

HAN PREDICTION: 5, TARGET: 5. Here the word 'bad' has been correctly interpreted on the basis of its context.

HAN PREDICTION: 1, TARGET: 1. The first and the last sentences received the most attention; the word 'recommend' has a
different meaning here, because the context is different.
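
For reference, here is a minimal sketch of the word-level attention computation of Yang et al. [1], which is what the
visualization reconstructs (the weight names W_w, b_w and u_w are illustrative; in practice they are extracted from
the trained attention layer):

  import numpy as np

  def word_attention_weights(h, W_w, b_w, u_w):
      # h: (num_words, hidden) GRU outputs for one sentence
      u = np.tanh(h @ W_w + b_w)             # u_it = tanh(W_w * h_it + b_w)
      scores = u @ u_w                       # similarity with the word context vector
      alpha = np.exp(scores - scores.max())
      return alpha / alpha.sum()             # softmax over the words of the sentence

Highlighting a review then amounts to coloring each word in proportion to its weight, and analogously each sentence
with the sentence-level attention weights.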

Reproducing Experiments

Dependencies

This project uses PyTorch and TensorFlow 2 (the latter only for the HAN model); a GPU is needed for training. The code
was developed and tested with these main dependencies:

  • Python 3.7.7
  • numpy 1.18.1
  • nltk 3.4.5
  • pandas 1.0.3
  • pytorch 1.4.0
  • tensorboard 2.1.0
  • tensorflow 2.1.0
  • transformers 2.10.0

After cloning this repository, all dependencies can be installed from the command line:

  $ pip install -r requirements.txt

For the HAN preprocessing, glove.6B.100d.txt is also required (the 100-dimensional version was chosen for this
project); it can be retrieved from the GloVe site.
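
As an illustration, here is a minimal sketch of how the GloVe file can be loaded into an embedding matrix
(assumptions: the file has one "word v1 v2 ... v100" entry per line, and word_index is a {word: id} vocabulary
produced by the preprocessing step):

  import numpy as np

  # Read the pre-trained vectors into a dictionary
  embeddings = {}
  with open('glove.6B.100d.txt', encoding='utf-8') as f:
      for line in f:
          values = line.split()
          embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

  # Map each vocabulary word to its GloVe vector (unknown words stay all-zero)
  embedding_matrix = np.zeros((len(word_index) + 1, 100))
  for word, i in word_index.items():
      vector = embeddings.get(word)
      if vector is not None:
          embedding_matrix[i] = vector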

How to make it work

The pipeline consists of getting the dataset in pandas DataFrame format (there are some utility functions; the code
expects the dataset in the local datasets/ directory), preprocessing (which automatically splits the data into train,
validation and test sets), training and evaluating. Here is a main.py example:

  from preprocessing import bertPreprocessing
  from train import lstmTrain
  from utils import readIMDB

  # Load the dataset as a pandas DataFrame
  dataset_name, n_classes, data_df = readIMDB()
  # Tokenize the documents and save the train/valid/test splits
  bertPreprocessing(dataset_name, data_df, MAX_LEN=128)
  # Train the LSTM model on the preprocessed data
  lstmTrain(dataset_name, n_classes, TRAIN_BATCH_SIZE=64, EPOCHS=20, LEARNING_RATE=1e-03)

Logs are continuously saved and updated during training. TensorBoard is a good tool for tracking and visualizing
metrics:

  $ tensorboard --logdir logs/IMDB_lstm

At the end, the model is saved in models/model_IMDB_lstm/; it is possible to evaluate the model on the test set by
adding the model path and running this function:

  from predict import lstmEvaluate

  # Evaluate the saved model ('IMDB' dataset, 10 classes) on the test set
  lstmEvaluate('IMDB', 10, model_path='models/model_IMDB_lstm/20200618-133908')

Report

A copy of the report (in Italian) can be found here.

References

[1] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention
networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages
4171–4186.

[3] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures
for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4046–4051.

[4] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification.
arXiv preprint arXiv:1904.08398.