Jigsaw Unintended Bias in Toxicity Classification
The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation.
A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion.
When the Conversation AI team first built toxicity models, they found that the models predicted a high likelihood of toxicity for comments containing certain identity terms (e.g. “gay”), even when those comments were not actually toxic (such as “I am a gay woman”). This happens because the training data was pulled from available sources where, unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model on data with these imbalances risks simply mirroring those biases back to users.
Here’s a good example from the training set that clarifies the main problem we face in this competition.
For the toxic comment classification task, the model is prone to predicting the last comment as toxic simply because it mentions ‘gay’.
To solve this problem, we first need a metric that can estimate whether the model does a good job on this type of case. That is exactly what the competition evaluation metric is designed for.
Basically, the final score is an average of 4 AUCs. 3 of them only take into account parts of the dataset, selected depending on whether the comment mentions an identity word like ‘gay’ and whether it is toxic.
Score = 1/4 Overall_AUC + 1/4 Subgroup_AUC + 1/4 BPSN_AUC + 1/4 BNSP_AUC
For a more detailed explanation, please check the link here.
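To make the metric concrete, here is a minimal sketch of how the four AUCs can be computed for a single identity subgroup. It assumes a pandas DataFrame with illustrative column names `target` (binary toxicity label), `pred` (model score) and a boolean `subgroup` flag marking comments that mention the identity; the real competition metric additionally averages the bias AUCs over all identity subgroups with a power mean, which we omit here.

```python
# Sketch of the 4-AUC score for ONE identity subgroup (column names are
# illustrative, not the competition's exact schema).
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auc(df):
    # AUC restricted to comments that mention the identity
    sub = df[df.subgroup]
    return roc_auc_score(sub.target, sub.pred)

def bpsn_auc(df):
    # Background Positive, Subgroup Negative:
    # non-toxic comments with the identity + toxic comments without it
    mask = (df.subgroup & (df.target == 0)) | (~df.subgroup & (df.target == 1))
    part = df[mask]
    return roc_auc_score(part.target, part.pred)

def bnsp_auc(df):
    # Background Negative, Subgroup Positive:
    # toxic comments with the identity + non-toxic comments without it
    mask = (df.subgroup & (df.target == 1)) | (~df.subgroup & (df.target == 0))
    part = df[mask]
    return roc_auc_score(part.target, part.pred)

def final_score(df):
    overall = roc_auc_score(df.target, df.pred)
    return np.mean([overall, subgroup_auc(df), bpsn_auc(df), bnsp_auc(df)])
```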
Here’s the validation score breakdown of one of our models. We can see that the subgroup AUC is pretty low, which drags the overall score down a lot.
Because of the 4-AUC average evaluation metric, we try to build a custom loss function instead of just using binary cross-entropy. It turns out that the custom loss function not only works well on its own (boosting the LSTM model’s AUC from 0.930 to 0.934 in our experiments), but also gives Bert & GPT2 a great boost when we use it during fine-tuning.
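Our exact loss is a bit more involved, but the core idea is to weight the binary cross-entropy so that the examples the bias AUCs care about (identity-mentioning comments, background positives and subgroup negatives) count more. Here is a minimal PyTorch sketch, assuming `target` and `identity` are 0/1 tensors; the weighting scheme itself is illustrative:

```python
# Weighted BCE sketch: up-weight the examples that the Subgroup / BPSN / BNSP
# AUCs focus on, then pass the per-example weights to BCE-with-logits.
import torch
import torch.nn.functional as F

def make_sample_weights(target, identity):
    weights = torch.ones_like(target)
    weights += identity                    # comments mentioning an identity
    weights += target * (1 - identity)     # background positives
    weights += (1 - target) * identity     # subgroup negatives
    return weights / weights.mean()        # keep the loss scale comparable

def weighted_bce(logits, target, identity):
    weights = make_sample_weights(target, identity)
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)
```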
Because we don’t have enough computation power (most of the time we only use a Kaggle kernel with a single Tesla P100 GPU), it’s hard for us to do even a single epoch of fine-tuning for Bert & GPT2.
But there’s still a lot of meaningful and interesting work to do with the LSTM, which later helps a lot when we move on to fine-tuning.
Transfer learning has been one of the most important methods for training a state-of-the-art NLP model since the ULMFiT paper & Google Bert came out.
Fine-tuning Bert & GPT2 requires huge computation power, but the conclusion is that it is totally worth it.
NLP is developing rapidly! A single Bert base model without any preprocessing or custom loss can easily reach a relatively high score, not to mention the great number of papers on advanced fine-tuning techniques.
For us, plug-and-play fine-tuning of Bert already slightly outperforms an elaborate LSTM model. We then do a bunch of work to improve the Bert model and reuse the same approach to fine-tune an OpenAI GPT2 model.
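For reference, this is roughly what “plug and play” fine-tuning looks like. The sketch below uses the current Hugging Face `transformers` API (at competition time we used the older pytorch-pretrained-bert package, but the idea is the same); `texts` and `targets` are placeholder names for the comments and labels, and the hyperparameters are illustrative.

```python
# Minimal BERT fine-tuning sketch (assumes `texts`: list of comment strings,
# `targets`: list of 0/1 labels, and a CUDA-capable GPU).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1).cuda()

enc = tokenizer(texts, padding=True, truncation=True,
                max_length=220, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(targets, dtype=torch.float))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for input_ids, attention_mask, y in loader:
    optimizer.zero_grad()
    logits = model(input_ids=input_ids.cuda(),
                   attention_mask=attention_mask.cuda()).logits.squeeze(-1)
    loss = loss_fn(logits, y.cuda())  # swap in the custom weighted loss here
    loss.backward()
    optimizer.step()
```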
Here’s the AUC comparison of the different models:
| Model | AUC |
| --- | --- |
| Plug and Play Bert | 0.9367 |
| Ensemble of 6 different LSTMs | 0.9368 |
| Best Bert | 0.9415 |
| Best GPT2 | 0.9388 |
Ensembling is a very powerful machine learning technique. We didn’t spend a lot of time on it, but simply blending all of the models gives us a great boost (about 0.003 over the single best model).
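The blend itself is nothing fancy: rank-normalize each model’s test predictions and take a weighted average, along the lines of the sketch below (the equal weights are just an illustrative default).

```python
# Simple blending sketch: `preds` is a list of prediction arrays, one per model.
import numpy as np
from scipy.stats import rankdata

def blend(preds, weights=None):
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)
    ranked = [rankdata(p) / len(p) for p in preds]  # map scores to [0, 1] ranks
    return sum(w * r for w, r in zip(weights, ranked))
```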
All in all, it’s an interesting competition, and we learned a lot from other Kagglers. We realize that every model is useful in some aspect: the LSTM is good for running experiments, while Bert & GPT2 are really powerful and accurate.
Finally, our ensemble reaches about 0.9446 on the LB, which ranks around 75th.