Indonesian-Twitter-Emotion-Dataset

This dataset contains 4,403 Indonesian tweets, each labeled with one of five emotion classes: love, anger, sadness, joy, and fear.

Data Format

Each line consists of a tweet and its emotion label, separated by a comma (,). The first line is a header. A tweet that contains a comma (,) in its text is enclosed in double quotes (" ") so that the comma is not treated as a column separator.
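The format described above is ordinary quoted CSV, so a standard CSV parser should handle it. Below is a minimal sketch using Python's built-in csv module. The column names "label" and "tweet" and the two sample rows are assumptions made up for illustration; check the header of the actual file before relying on them.

```python
import csv
import io
from collections import Counter

# Made-up sample in the described layout. The column names "label" and
# "tweet" are assumptions, not taken from the dataset itself.
SAMPLE = '''label,tweet
joy,"Akhirnya libur juga, senang banget [USERNAME]"
anger,Kesal sekali hari ini [URL]
'''


def read_rows(handle):
    """Return one dict per tweet, keyed by the names in the header line."""
    # csv.DictReader consumes the header and honours double-quoted fields,
    # so a comma inside a quoted tweet does not split the column.
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(SAMPLE))

# Count how many tweets carry each emotion label.
label_counts = Counter(row["label"] for row in rows)
```

To read a real file instead of the in-memory sample, pass a handle opened with open(path, newline="", encoding="utf-8"), where path is wherever you saved the dataset.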

The tweets in this dataset have been pre-processed according to the following criteria:

Username mentions (@) have been replaced with the term [USERNAME]
URLs/hyperlinks (http://… or https://…) have been replaced with the term [URL]
Sensitive numbers, such as phone numbers, invoice numbers, and courier tracking numbers, have been replaced with the term [SENSITIVE-NO]
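
To apply comparable masking to new tweets before feeding them to a model trained on this dataset, something like the following sketch may help. It is not the authors' pre-processing code: the regular expressions are assumptions, and in particular the rule for sensitive numbers (any run of eight or more digits) is a guess, since the original criteria for what counts as sensitive are not specified here.

```python
import re

# Assumed patterns; the dataset's actual pre-processing rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Hypothetical heuristic: treat any standalone run of 8+ digits as sensitive.
SENSITIVE_NO_RE = re.compile(r"\b\d{8,}\b")


def mask_tweet(text):
    """Replace URLs, mentions, and long digit runs with placeholder terms."""
    # URLs go first so that '@' or digits inside a link are not masked
    # separately by the later patterns.
    text = URL_RE.sub("[URL]", text)
    text = MENTION_RE.sub("[USERNAME]", text)
    text = SENSITIVE_NO_RE.sub("[SENSITIVE-NO]", text)
    return text


masked = mask_tweet("@budi cek https://example.com/a resi 123456789012")
```

The example input is invented; adjust the digit-length threshold or add separators such as dashes and spaces if your phone or tracking numbers are formatted differently.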

Pre-trained Word Embedding

We have trained Word2Vec and FastText vectors on 1 million Indonesian tweets. These pre-trained word embeddings can be downloaded here: /g/personal/mei_silviana_office_ui_ac_id/EkwS7R4C2F9FnYHVLvvkmeoBGRpJ3abNa7Lti9ceG2TWFw?e=dRzvmG

Citation

If you publish a paper that uses this dataset or the pre-trained word embeddings, please cite this publication:

Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani, “Emotion Classification on Indonesian Twitter Dataset”, in Proceedings of the International Conference on Asian Language Processing 2018. 2018.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.