U2 - AI&BA - M2 TBS - Big Data Tools for Business
You are a big data analyst for a communication agency that wants to analyze
Donald Trump’s communication on Twitter.
You have a history of all of Donald Trump’s tweets from 2009 to November 19th,
2020, in the form of a text file (trump_tweets.txt).
Each line of this text file has the form: text_of_the_tweet;date_of_the_tweet
Each tweet is either an original tweet or a retweet; each retweet starts with
the keyword ‘RT’.
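As an illustration of the line format described above, here is a minimal Python sketch of a parser for one line of the file. The names `Tweet` and `parse_line` are hypothetical, not part of the assignment. It assumes that the date field never contains a semicolon (so the line is split on its last `;`, since the tweet text may contain semicolons), and it keeps the date as a raw string because the brief does not specify the date format.

```python
import re
from typing import NamedTuple, Optional


class Tweet(NamedTuple):
    text: str
    date: str          # raw string: the date format is not given in the brief
    is_retweet: bool


def parse_line(line: str) -> Optional[Tweet]:
    """Parse one 'text_of_the_tweet;date_of_the_tweet' line.

    Returns None when the line contains no ';' separator.
    """
    line = line.rstrip("\n")
    # Split on the LAST ';' (assumption: the date never contains one,
    # while the tweet text may).
    text, sep, date = line.rpartition(";")
    if not sep:
        return None
    # A retweet starts with the keyword 'RT' as a whole word,
    # so text such as 'RTX ...' is not counted as a retweet.
    is_retweet = re.match(r"RT\b", text.lstrip()) is not None
    return Tweet(text=text, date=date.strip(), is_retweet=is_retweet)
```

With an existing SparkContext (assumed here to be named `sc`), such a function could be applied to every line with something like `sc.textFile("trump_tweets.txt").map(parse_line).filter(lambda t: t is not None)`.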
You are tasked with exploring this data using Spark, because your analysis should
also be applicable to very large data sets distributed on a Hadoop cluster, for
instance to analyze the communication of other public figures.
You should present your key findings in the form of lists, tables, or
visualizations. You can, for instance, search for:
The objective is to explore real estate data for major French cities.
Your company (Immo-Inv) is a real estate agency that wants to understand the
French real estate market very well.
You are the company's big data analyst and you have access to a 5-year history
of real estate transactions in France (real_estate_transactions.csv).
The dataset contains details for each transaction: sale date, location (city,
postal code), type of residence, type of sale, land area, living area, number of
rooms, price, etc.
You should use Spark for this analysis because it must also apply to a much
larger dataset distributed on a Hadoop cluster, for instance a big data file
covering the entire real estate market for all cities in France.
The challenge is to explore all possible aspects of this real estate market
(variables, relationships between variables, trends, patterns, outliers, etc.).
In the end, however, you should focus on at least 5 key findings (lists, tables,
or visualizations) in your final notebook. You can also comment on these
findings. You can explore, for instance: