ETL data pipeline to download tweets with Tweepy and the Twitter streaming API, load them into a MySQL database, and analyze tweet sentiment.
Install the dependencies listed in requirements.txt, preferably inside a virtual environment:
$ pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Run setup_configuration.py to enter all required configurations at a command prompt and generate a twitter_configuration.ini file.
Or, manually add the required CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET, MYSQL_DATABASE, MYSQL_TABLE, and MYSQL_PASSWORD to the twitter_sent_analysis/twitter_configuration.ini file. (Warning: DO NOT UPLOAD THIS CONFIGURATION FILE ONLINE.)
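For reference, a filled-in twitter_configuration.ini might look roughly like the sketch below. The section headers and layout are assumptions for illustration only; match them to whatever setup_configuration.py actually generates, and keep the real values secret.

; hypothetical layout -- verify against the file generated by setup_configuration.py
[twitter]
CONSUMER_KEY = your-consumer-key
CONSUMER_SECRET = your-consumer-secret
ACCESS_TOKEN = your-access-token
ACCESS_TOKEN_SECRET = your-access-token-secret

[mysql]
MYSQL_DATABASE = twitter_db
MYSQL_TABLE = TWEETS_BY_KEYWORD
MYSQL_PASSWORD = your-mysql-password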
Install a MySQL server and set up a database instance to store the downloaded tweets, for example: CREATE DATABASE twitter_db;
Based on the Twitter documentation online, we create a table with the following sample schema, provided in twitter_sent_analysis/sql/TWEETS_BY_KEYWORD_schema.sql, for downloading tweets by keyword. The sample schema for downloading tweets by userid is in twitter_sent_analysis/sql/TWEETS_BY_USERID_schema.sql:
CREATE TABLE TWEETS_BY_KEYWORD (
ID INT AUTO_INCREMENT,
tweet_id VARCHAR(255) NOT NULL,
tweet TEXT NOT NULL,
created_at VARCHAR(50),
tweet_place VARCHAR(255), /* Nullable object */
favorite_count INT(11), /* Nullable object */
retweet_count INT(11) NOT NULL,
reply_count INT(11) NOT NULL,
user_name VARCHAR(255) NOT NULL,
user_location VARCHAR(255), /* Nullable object */
user_followers_count INT(11) NOT NULL,
user_friends_count INT(11) NOT NULL,
PRIMARY KEY (ID)
);
Load the schemas into the twitter_db database using:
$ mysql -u root -p twitter_db < TWEETS_BY_KEYWORD_schema.sql;
$ mysql -u root -p twitter_db < TWEETS_BY_USERID_schema.sql;
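For orientation, inserting one downloaded tweet into this schema could look like the minimal sketch below. It uses mysql-connector-python with made-up placeholder values and is not the project's actual loader (that lives in download_tweets_data_to_mysql.py).

import mysql.connector  # pip install mysql-connector-python

# Hypothetical single-row insert matching the TWEETS_BY_KEYWORD schema above.
conn = mysql.connector.connect(user="root", password="your-mysql-password", database="twitter_db")
cursor = conn.cursor()
insert_sql = (
    "INSERT INTO TWEETS_BY_KEYWORD "
    "(tweet_id, tweet, created_at, tweet_place, favorite_count, retweet_count, "
    "reply_count, user_name, user_location, user_followers_count, user_friends_count) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
)
row = ("1234567890", "example tweet text", "Tue Jul 02 19:00:00 +0000 2019",
       None, 0, 3, 1, "example_user", None, 120, 80)
cursor.execute(insert_sql, row)  # %s placeholders let the driver escape the values
conn.commit()
cursor.close()
conn.close()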
IMPORTANT: The MySQL table names must be changed in twitter_configuration.ini to save data in different tables or to generate sentiment.
Make sure the Twitter secret keys and tokens are set up in twitter_configuration.ini.
From inside the twitter_sent_analysis directory, run:
$ python download_tweets_data_as_json.py [-FLAG] [FILENAME]
-FLAG can be set to -t for keyword search, -f for userid search, and -l for location search.
The input file should contain one search term per line, separated by newline characters (e.g. see inputs/keywords.txt).
Example, using keywords from inputs/keywords.txt:
$ python download_tweets_data_as_json.py -t inputs/keywords.txt
Make sure MYSQL_TABLE is set to the correct table for the download type. For example, for downloading using keyword filters, set MYSQL_TABLE in twitter_configuration.ini to TWEETS_BY_KEYWORD or the relevant table.
From inside the twitter_sent_analysis directory, run:
$ python download_tweets_data_to_mysql.py [-FLAG] [FILENAME]
Note: for usage help, run:
$ python download_tweets_data_to_mysql.py -h
-FLAG can be set to -t for keyword search, -f for userid search, and -l for location search.
The input file should contain one search term per line, separated by newline characters (e.g. see inputs/keywords.txt).
Examples:
Using keywords from inputs/keywords.txt:
$ python download_tweets_data_to_mysql.py -t inputs/keywords.txt
Using userids from inputs/userids.txt:
$ python download_tweets_data_to_mysql.py -f inputs/userids.txt
Using locations from inputs/locations.txt:
$ python download_tweets_data_to_mysql.py -l inputs/locations.txt
Using userids from inputs/userids.txt and keywords from inputs/keywords.txt:
$ python download_tweets_data_to_mysql.py -f inputs/userids.txt -t inputs/keywords.txt
For more documentation, see the Twitter Streaming API documentation page.
After the tweets have been loaded into the MySQL database, gen_tweets_sentiment_from_mysql.py can generate a CSV file of the tweets, sentiment results, and a word cloud based on word frequency.
$ python gen_tweets_sentiment_from_mysql.py -sent
$ python gen_tweets_sentiment_from_mysql.py -csv [csv_filename]
$ python gen_tweets_sentiment_from_mysql.py -wc [wc_filename]
Note: The flag options can also be used together. To generate the sentiment, the word cloud, and the CSV tweet file:
$ python gen_tweets_sentiment_from_mysql.py -wc [wc_filename] -csv [csv_filename] -sent
Sample wordcloud from tweets downloaded based on keywords ‘batman’ and ‘joker’.
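For reference, a word-frequency word cloud like the sample above can be produced with the wordcloud package. The sketch below is a standalone illustration with made-up tweet text, not the code path used inside gen_tweets_sentiment_from_mysql.py.

from wordcloud import WordCloud  # pip install wordcloud
import matplotlib.pyplot as plt

# Build a word cloud from (already cleaned) tweet texts and save it as an image.
tweets = ["batman saves the day", "the joker laughs again", "batman and joker clash"]
wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(tweets))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.savefig("wordcloud.png")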
Preprocessing steps for Natural Language Processing
Normalization: convert all words to lowercase and remove single characters, which carry little information.
Removing extraneous information: remove stop words (i, the, a, an; the nltk library has a decent list), punctuation, diacritical marks, and HTML.
Tokenization: convert the text into tokens using TextBlob.
Lemmatisation: convert words to their canonical form (e.g. eating and ate become eat).
Term Frequency-Inverse Document Frequency (TF-IDF): check the importance of words based on their frequency across the main document or multiple other documents.
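As a rough illustration of these preprocessing steps (assuming nltk's English stop-word list has been downloaded and TextBlob is installed; the function and variable names are made up for this example):

import re
from textblob import TextBlob
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    # Normalization and cleanup: lowercase, then strip HTML tags, punctuation, and single characters.
    text = re.sub(r"<[^>]+>", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\b[a-z]\b", " ", text)
    # Tokenization and lemmatisation via TextBlob (lemmatize() defaults to noun forms), dropping stop words.
    words = TextBlob(text).words
    return [w.lemmatize() for w in words if w not in STOP_WORDS]

print(preprocess("The Jokers <b>were eating</b> snacks!"))  # ['joker', 'eating', 'snack']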
We pass our pre-processed text into the TextBlob class and read the sentiment.polarity attribute of the object to get a sentiment score between -1 and 1, which can be converted to an integer of -1, 0, or 1 signalling negative, neutral, or positive sentiment respectively.
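A minimal sketch of that conversion, assuming simple thresholds at zero (the project's actual cutoffs may differ):

from textblob import TextBlob

def sentiment_label(text):
    # polarity is a float in [-1.0, 1.0]; map it to -1 (negative), 0 (neutral), or 1 (positive).
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0

print(sentiment_label("I love this movie"))  # 1
print(sentiment_label("I hate this movie"))  # -1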
Data Description
The dataset is comprised of tsv files with phrases from the Rotten Tomatoes dataset. The sentences have been shuffled from their original order, and each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId and each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.
train.tsv contains the phrases and their associated sentiment labels; SentenceId can be used to track which phrases belong to a single sentence.
We attempt to assign a sentiment label to each phrase in test.tsv, which contains just phrases.
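For orientation, the tsv files can be loaded and regrouped by sentence roughly as sketched below (the column names follow the description above; the file paths and the Phrase/Sentiment column labels are assumptions):

import pandas as pd

# Assumed file locations; train.tsv carries the sentiment labels, test.tsv only phrases.
train = pd.read_csv("train.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

# SentenceId ties each phrase back to the sentence it was parsed from.
phrases_per_sentence = train.groupby("SentenceId")["Phrase"].apply(list)
print(phrases_per_sentence.head())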
The sentiment labels are:
With the dependencies from requirements.txt installed and the venv activated, from inside the rotten_tomatoes_movie_reviews_sent_analysis directory, run:
$ python gen_rt_review_sentiment.py