DeEpLearning models for MultIlingual haTespeech
Solving the problem of hate speech detection in 9 languages across 16 datasets.
Please look here to check model loading and inference.
Please cite our paper in any published work that uses any of these resources.
@inproceedings{aluru2021deep,
title={A Deep Dive into Multilingual Hate Speech Classification},
author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
booktitle={Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track: European Conference, ECML PKDD 2020, Ghent, Belgium, September 14--18, 2020, Proceedings, Part V},
pages={423--439},
year={2021},
organization={Springer International Publishing}
}
./Dataset --> Contains the dataset related files.
./BERT_Classifier --> Contains the code for the BERT classifiers performing binary classification on the dataset
./CNN_GRU --> Contains the code for the CNN-GRU model
./LASER+LR --> Contains the code for the logistic regression classifier used on top of LASER embeddings
Make sure to use Python 3 when running the scripts. The package requirements can be installed by running pip install -r requirements.txt.
Check out the Dataset folder to learn more about how we curated the dataset for different languages. A few datasets have to be crawled, so we cannot guarantee the retrieval of all the data points, as tweets may get deleted.
We release the code for training/fine-tuning the following models along with their hyperparameters.
Legend: best for high resource language, best for low resource language, fastest to train, slowest to train
mBERT Baseline:
This setting uses the multilingual BERT (mBERT) model, with training and testing data from the same language. Refer to the BERT_Classifier folder for the code and usage instructions.
mBERT All_but_one:
This setting uses the multilingual BERT (mBERT) model, with training data from multiple languages and validation and test data from a single target language. Refer to the BERT_Classifier folder for the code and usage instructions.
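The all-but-one data arrangement can be sketched in a few lines of Python. This is a minimal illustration, not the repository's loader: the function name `all_but_one`, the `(text, label)` tuple layout, the split keys, and the toy data are all hypothetical. It also assumes that the target language's own training split is left out of the training pool; whether some target-language samples are added back is a choice this sketch does not model.

```python
from typing import Dict, List, Tuple

# Assumed layout: (text, label) pairs, label 1 = hate, 0 = non-hate.
Example = Tuple[str, int]
# Assumed split keys: "train", "val", "test".
Splits = Dict[str, List[Example]]


def all_but_one(data: Dict[str, Splits], target: str) -> Splits:
    """Build an all-but-one split for the `target` language.

    Training data is pooled from every language except the target;
    validation and test data come from the target language only.
    """
    if target not in data:
        raise KeyError(f"unknown target language: {target}")
    train: List[Example] = []
    for lang, splits in data.items():
        if lang != target:
            train.extend(splits["train"])
    return {
        "train": train,
        "val": list(data[target]["val"]),
        "test": list(data[target]["test"]),
    }


# Toy data, invented purely for illustration.
toy = {
    "en": {"train": [("a", 0), ("b", 1)], "val": [("c", 0)], "test": [("d", 1)]},
    "es": {"train": [("e", 1)], "val": [("f", 0)], "test": [("g", 0)]},
    "de": {"train": [("h", 0)], "val": [("i", 1)], "test": [("j", 1)]},
}
split = all_but_one(toy, target="de")
print(len(split["train"]), len(split["val"]), len(split["test"]))
```

Keeping validation and test data strictly in the target language means the reported scores measure cross-lingual transfer rather than performance on the pooled languages.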
Translation + BERT Baseline:
This setting translates the datasets in other languages to English and fine-tunes the bert-base model on the translated datasets. Refer to the BERT_Classifier folder for the code and usage instructions.
CNN+GRU Baseline:
This setting uses MUSE word embeddings with a CNN-GRU based model, training and testing on the same language. Refer to the CNN_GRU folder for the code and usage instructions.
LASER+LR Baseline:
This setting trains a logistic regression model on the LASER embeddings of the dataset. The training and testing datasets are from the same language. Refer to the LASER+LR folder for the code and usage instructions.
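To make the idea of a logistic regression classifier on top of fixed sentence embeddings concrete, here is a dependency-free sketch. It is not the repository's implementation: the embeddings are random 8-dimensional stand-ins (real LASER sentence embeddings are 1024-dimensional), the helper names `train_lr`, `predict`, and `make_toy` are hypothetical, and the learning rate and epoch count are arbitrary choices rather than the released hyperparameters. In practice a library implementation of logistic regression would normally replace the hand-written gradient descent.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(w: Vector, b: float, x: Vector) -> float:
    # Linear score w.x + b for one embedding.
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_lr(X: List[Vector], y: List[int],
             lr: float = 0.5, epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi  # d(log-loss)/d(score)
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Vector, b: float, X: List[Vector]) -> List[int]:
    # A score of 0 corresponds to a predicted probability of 0.5.
    return [1 if score(w, b, x) >= 0.0 else 0 for x in X]


def make_toy(n_per_class: int, dim: int,
             rng: random.Random) -> Tuple[List[List[float]], List[int]]:
    # Stand-in "embeddings": two Gaussian clusters, one per class.
    X, y = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 0.5) for _ in range(dim)])
            y.append(label)
    return X, y


rng = random.Random(0)
DIM = 8  # toy size only; LASER vectors have 1024 dimensions
X_train, y_train = make_toy(20, DIM, rng)
X_test, y_test = make_toy(10, DIM, rng)
w, b = train_lr(X_train, y_train)
preds = predict(w, b, X_test)
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the embeddings are computed once and kept frozen, only the linear classifier is trained, which is what keeps this family of models cheap to fit.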
LASER+LR All_but_one:
This setting trains a logistic regression model on the LASER embeddings of the dataset. Datasets from other languages are also used to train the LR model. Refer to the LASER+LR folder for the code and usage instructions.
Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. 2020. “Deep Learning Models for Multilingual Hate Speech Detection”. ECML-PKDD.