Tweets preprocessor
Author: Igor Tannus Correa
This is a Java algorithm that executes several steps of pre-processing in a database. At first, it was written as a tweet’s pre-processor, but it can be adapted to other types of data.
The pre-processing steps that it can do are:
remove
hashtags and citations (#lalaland, @user -> lalaland, user)
tweets unrelated to the theme according to a list of words (add words to unrelated.txt)
links
special characters (e.g. ~!@#$%ˆ*&), numbers, and the query term
stopwords (e.g. a, the, you, with, etc)
spaces (when there’s more than one)
translate
slangs and abbreviations (e.g. omg, ily, brb -> oh my god, i love you, be right back — add words to dictionary.txt)
emoticons (e.g. :], <3 -> happy, love — add words to emoticons.txt)
replace
uppercase letters to lowercase
accented characters (ã, ê, ñ) to unaccented characters (a, e, n)
You can write new steps according to what you need or comment/delete the methods you don’t want to use.
This algorithm is part of the paper that I wrote, “Sentiment analysis of tweets related to the movies nominated for the 2017 Academy Awards”.
You can read it (in Portuguese or English) and understand how I used this tool in my paper.
If you use this tool, please cite the paper