Project author: P7h
Project description:
"Big Data" Specialization -- University of California, San Diego and Coursera
Primary language: Jupyter Notebook
Project URL: git://github.com/P7h/Coursera__UCSD__Big_Data_Specialization.git
“Big Data” Specialization — University of California, San Diego and Coursera
The University of California, San Diego created a “Big Data” specialization on Coursera: https://www.coursera.org/specializations/big-data
This repo contains Scala solutions, written for Spark 2.x [and, where possible, for 1.6.x as well], to the Spark-specific questions in the “Big Data” Specialization.
All solutions are provided as Apache Toree Jupyter notebooks.
Big Data Integration and Processing
Final Project
The analytics below are computed by processing two CSV files: one containing tweets and the other a list of countries.
- As a Sports Analyst, you are interested in how many different countries are mentioned in the tweets. Use Spark to calculate this number. Note that regardless of how many times a single country is mentioned, this country only contributes 1 to the total.
- Next, compute the total number of times any country is mentioned. This is different from the previous question since in this calculation, if a country is mentioned three times, then it contributes 3 to the total.
- Your next task is to determine the most popular countries. You can do this by finding the three countries mentioned the most.
- After exploring the dataset, you are now interested in how many times specific countries are mentioned. For example, how many times was France mentioned?
- Which country has the most mentions: Kenya, Wales, or Netherlands?
- Finally, what is the average number of times a country is mentioned?
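The questions above can be sketched with a word count followed by a join. This is only an illustrative outline, not necessarily how the notebooks in this repo solve it. The in-memory sample data stands in for the two CSV files, and all names (`tweets`, `countries`, `mentions`, and so on) are hypothetical. It also assumes that country names are single, case-sensitive tokens separated by whitespace, which real tweet text may not satisfy.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one exists; otherwise starts a local one.
val spark = SparkSession.builder()
  .appName("country-mentions-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data standing in for the tweets and countries CSV files.
val tweets = sc.parallelize(Seq(
  "France beat Wales",
  "Kenya drew with France",
  "France and Netherlands meet next",
  "Netherlands won"
))
val countries = sc.parallelize(Seq("France", "Wales", "Kenya", "Netherlands", "Spain"))

// Classic word count over the tweet text: (word, occurrences).
val wordCounts = tweets
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// The inner join keeps only the words that are country names: (country, mentions).
val mentions = countries
  .map(country => (country, ()))
  .join(wordCounts)
  .mapValues { case (_, count) => count }
  .cache()

// Number of different countries mentioned (each contributes 1).
val distinctCountries = mentions.count()

// Total number of mentions (each country contributes its full count).
val totalMentions = mentions.map(_._2).reduce(_ + _)

// The three most mentioned countries.
val top3 = mentions.sortBy(_._2, ascending = false).take(3)

// Mentions of one specific country; 0 if it never appears.
val franceMentions = mentions.lookup("France").headOption.getOrElse(0L)

// The most mentioned country among a fixed set of candidates.
val candidates = Set("Kenya", "Wales", "Netherlands")
val mostOfThree = mentions
  .filter { case (country, _) => candidates.contains(country) }
  .reduce((a, b) => if (a._2 >= b._2) a else b)

// Average mentions per mentioned country.
val average = totalMentions.toDouble / distinctCountries

println(s"distinct=$distinctCountries total=$totalMentions average=$average")
println(s"top3=${top3.mkString(", ")} france=$franceMentions most=${mostOfThree._1}")
```

With the real data, the two `parallelize` calls would be replaced by reads of the CSV files, and the tokenization would probably need to handle punctuation, letter case, and multi-word country names.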
Spark Joins