A hierarchical bi-LSTM model trained to identify the author of a given email (SMAI@IIIT-H 2017)
SMAI@IIIT-H (Monsoon 2017)
Course Instructor
Project Mentor
The goal is to classify emails from the Enron email dataset by authorship, and to use the trained classifier to identify the authors of test samples.
Available here, the dataset contains about 0.5 million emails from roughly 150 users, all of whom were Enron employees.
The classifiers use the authors as classes and the emails as samples to be assigned to those classes by authorship.
The number of author classes was fixed so as to maximise the number of emails per author while keeping the emails-per-author count similar across classes.
This worked out to 10 authors with 800-1000 emails each.
The Enron corpus contains all emails in raw form, including not only the message but also all the email metadata.
The data is cleaned to keep only the subject and body of each mail; attached forward chains, forwarded threads, and salutations are all removed.
The data is also tokenised by word, sentence and paragraph, and case-normalised.
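The cleaning and tokenisation steps can be sketched as follows. This is an illustrative outline, not the exact logic of dataProcessing.py; the forwarded-message markers and the regex-based tokenisers are assumptions.

```python
import re

def clean_email(raw):
    """Keep only the message text: drop everything from a forwarded-message
    marker onward, then case-normalise. (Illustrative sketch.)"""
    body = re.split(r"-+\s*Original Message\s*-+|-+\s*Forwarded by", raw)[0]
    return body.lower()

def tokenise(text):
    """Tokenise by paragraph, then sentence, then word."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [
        [re.findall(r"[a-z']+", s) for s in re.split(r"[.!?]+", p) if s.strip()]
        for p in paragraphs
    ]
```

The nested output (paragraphs of sentences of words) is what a hierarchical model consumes downstream.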
The following models were implemented and tested:
The CNN can identify groups of words and phrases commonly used by an author. It also captures localised chunks of information, which is useful for finding phrasal units within long texts.
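The way a convolution picks up local word groups can be sketched in NumPy. The toy embeddings and filter below are random placeholders, not the project's trained weights:

```python
import numpy as np

def conv1d_valid(embeddings, kernel):
    """Slide a width-k filter over a (seq_len, dim) embedding matrix,
    producing one activation per n-gram window (valid convolution)."""
    k = kernel.shape[0]
    n = embeddings.shape[0] - k + 1
    return np.array([np.sum(embeddings[i:i + k] * kernel) for i in range(n)])

# Toy example: 5 words with 3-dim embeddings, one width-2 (bigram) filter
rng = np.random.default_rng(0)
emb = rng.standard_normal((5, 3))
filt = rng.standard_normal((2, 3))
activations = conv1d_valid(emb, filt)   # one score per bigram window
pooled = activations.max()              # max-over-time pooling
```

Each filter scores every sliding window, and max-over-time pooling keeps the strongest phrasal match regardless of where it occurs in the email.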
There are three layers to the CNN:
The Bi-LSTM is a commonly used architecture for text classification.
LSTMs are a special kind of RNN that is better at remembering long-term dependencies in a sequence. This gives the classifier more context while processing a sequence of text, which helps in author identification.
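The bidirectional idea can be sketched with a plain tanh RNN cell standing in for the LSTM cell (an LSTM would add input, forget, and output gates); the weight shapes here are illustrative:

```python
import numpy as np

def rnn_pass(xs, Wx, Wh):
    """One directional pass of a plain tanh RNN over a (seq_len, dim) input."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(xs, Wx, Wh):
    """Run the sequence forward and backward, concatenating the per-step
    states so each position sees both left and right context."""
    fwd = rnn_pass(xs, Wx, Wh)
    bwd = rnn_pass(xs[::-1], Wx, Wh)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

The concatenated states double the hidden size but give every token a summary of the words on both sides of it.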
There are three layers to the model:
LSTMs are known to work best on sequences of 10-15 elements. In this implementation, however, the model takes the entire document, increasing the sequence length and hence the overall context available for classification.
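The two-level structure can be sketched as follows. Mean-pooling stands in for the Bi-LSTM at each level, purely to show the hierarchical composition (words → sentence vectors → one document embedding):

```python
import numpy as np

def encode_sentence(word_vecs):
    """First level: collapse a (n_words, dim) matrix into one sentence
    vector. (The project uses a Bi-LSTM here; mean-pooling is a stand-in.)"""
    return word_vecs.mean(axis=0)

def encode_document(sentences):
    """Second level: encode the sequence of sentence vectors into a single
    document embedding, again with a stand-in for the Bi-LSTM."""
    sent_vecs = np.stack([encode_sentence(s) for s in sentences])
    return sent_vecs.mean(axis=0)

doc = [np.ones((4, 8)), np.zeros((2, 8))]  # two sentences, 8-dim word vectors
doc_embedding = encode_document(doc)       # one 8-dim document embedding
```

Because each sentence is encoded separately, neither level ever sees a sequence as long as the raw document, which keeps the recurrences tractable.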
There are four layers to this model:
This model appends stylometric features to the final document embedding produced by the hierarchical Bi-LSTM, right before it is passed to the dense layer. Classification is then performed on these augmented document embeddings.
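The augmentation step amounts to a simple concatenation before the final dense softmax layer. A minimal sketch with illustrative dimensions and untrained weights:

```python
import numpy as np

def augment_embedding(doc_embedding, stylometric):
    """Append the hand-crafted stylometric vector to the learned document
    embedding before the final dense (softmax) layer."""
    return np.concatenate([doc_embedding, stylometric])

def dense_softmax(x, W, b):
    """Toy dense layer over the augmented embedding, producing a
    probability per author class."""
    logits = W @ x + b
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()
```

For a 16-dim document embedding and an 8-dim stylometric vector, the dense layer simply sees a 24-dim input; only its weight matrix grows.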
The stylometric features extracted from the data and experimented with are:
- adjectives per email
- average sentence length per email
- average word length per email
- characters per email
- function words per email
- personal pronouns per email
- unique-to-total word ratio per email
- words per email
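Several of these per-email features can be computed with a short sketch. This is illustrative only, using naive regex tokenisation rather than the exact logic of the extraction scripts below:

```python
import re

def stylometric_vector(email_text):
    """Compute a few of the per-email stylometric features (sketch)."""
    sentences = [s for s in re.split(r"[.!?]+", email_text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", email_text)
    n_words = len(words) or 1           # avoid division by zero
    return {
        "words_per_email": len(words),
        "chars_per_email": len(email_text),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sent_len": len(words) / (len(sentences) or 1),
        "uniq_by_total": len({w.lower() for w in words}) / n_words,
    }
```

Features needing part-of-speech information (adjectives, personal pronouns) would additionally require a POS tagger.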
root/
| data_preprocessing_scripts/
- dataProcessing.py
| extracted_features/
- adjperemail.txt
- avgsentlenperemail.txt
- avgwordlenperemail.txt
- charsperemail.txt
- funcwordsperemail.txt
- perpronperemail.txt
- stylometricVector.txt
- uniqbytotperemail.txt
- wordsperemail.txt
| feature_extraction_scripts/
- adjperemail.py
- avgsentlenperemail.py
- avgwordlenperemail.py
- charsperemail.py
- funcwordsperemail.py
- perpronperemail.py
- stylometricVector.py
- uniqbytotperemail.py
- wordsperemail.py
| models/
- CNN.py
- HierLSTM_withStylometry.py
- HierLSTM.py
- LSTM_final.py
- README.md