Project author: WinThitiwat

Project description:
ETL process to S3 Data Lake through EMR, Spark, Hadoop, Schema-on-Read
Primary language: Jupyter Notebook
Project repository: git://github.com/WinThitiwat/Data_Lake_with_Spark.git
Created: 2020-12-03T22:11:06Z
Project community: https://github.com/WinThitiwat/Data_Lake_with_Spark


Data Lakes with Spark

Overview of the project

This project simulates a scenario in which a music startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to. So, this project aims to:

  • Build an ETL pipeline for a data lake hosted on S3
  • Load JSON data from S3
  • Process the data into analytics tables using Spark (Amazon EMR Spark)
  • Load data back into S3 as parquet files

System Architecture

(System architecture diagram)

Datasets

Song Data

The song dataset is a subset of real data from the Million Song Dataset. The data is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Data:

  1. {"num_songs":1,"artist_id":"ARD7TVE1187B99BFB1","artist_latitude":null,"artist_longitude":null,"artist_location":"California - LA","artist_name":"Casual","song_id":"SOMZWCG12A8C13C480","title":"I Didn't Mean To","duration":218.93179,"year":0}

Log Data

The log dataset is in JSON format, generated by this event simulator based on the songs in the dataset above.

Sample Log Data:

  1. {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

ETL Pipeline Process

  1. Load data from S3. Note that the S3 bucket needs to be added to the config file. Check out Setup Configurations File.
  • Song data: s3a://udacity-dend/song-data/
  • Log data: s3a://udacity-dend/log-data/
  2. Process the data using Spark and transform it into 5 different tables as follows:

    Fact table:

  • songplays - records in log data associated with song plays, i.e. records with page NextSong

    songplay_id, start_time, userId, level, song_id, artist_id, session_id, location, userAgent, month, year

    Dimension tables:

  • users - users in the app

    userId, firstName, lastName, gender, level
  • songs - songs in the music database

    song_id, title, artist_id, year, duration
  • artists - artists in the music database

    artist_id, name, location, latitude, longitude
  • time - timestamps of records in songplays broken down into specific units

    ts, start_time, hour, day, week, month, year, weekday
  3. Load the processed data back into the S3 data lake by writing it out as parquet files. The following tables are saved as partitioned parquet files, partitioned by year and month (a minimal PySpark sketch of this read-transform-write flow follows the list):
    songs, time, songplays
    The S3 bucket can be designated in Setup Configurations File.
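
A minimal PySpark sketch of steps 2 and 3. The column choices mirror the songplays and songs definitions above, but the join condition, output path, and songplay_id generation are illustrative assumptions rather than the actual logic in etl.py:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()
    song_df = spark.read.json("s3a://udacity-dend/song-data/*/*/*/*.json")
    log_df = spark.read.json("s3a://udacity-dend/log-data/*/*/*.json")

    # Dimension table: songs (columns follow the table definition above)
    songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration") \
                         .dropDuplicates(["song_id"])

    # Fact table: songplays - log events with page == 'NextSong', joined back to the
    # song metadata to recover song_id and artist_id
    events = (log_df.filter(F.col("page") == "NextSong")
                    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                    .withColumn("month", F.month("start_time"))
                    .withColumn("year", F.year("start_time")))

    song_lookup = song_df.select("song_id", "artist_id", "title", "duration")
    songplays_table = (events.join(song_lookup,
                                   (events.song == song_lookup.title) &
                                   (events.length == song_lookup.duration),
                                   "left")
                             .select(F.monotonically_increasing_id().alias("songplay_id"),
                                     "start_time", "userId", "level", "song_id", "artist_id",
                                     F.col("sessionId").alias("session_id"),
                                     "location", "userAgent", "month", "year"))

    # Write back to the S3 data lake as parquet partitioned by year and month
    output_data = "s3://<your_s3_bucket>/data_outputs/"  # matches OUTPUT_DATA in dl.config
    songplays_table.write.mode("overwrite").partitionBy("year", "month") \
                   .parquet(output_data + "songplays/")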

Setup Configurations File

At the project root, create a config file named dl.config with the following configuration.

    [AWS] # AWS IAM Credential
    AWS_ACCESS_KEY_ID=<iam_access_key_id>
    AWS_SECRET_ACCESS_KEY=<iam_secret_access_key>
    [S3]
    INPUT_DATA=s3a://udacity-dend/
    OUTPUT_DATA=s3://<your_s3_bucket>/data_outputs/
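
etl.py can then read this file with Python's built-in configparser. A minimal sketch, assuming the section and key names above; the variable names and the idea of exporting the credentials as environment variables for Spark's S3A connector are illustrative, not taken from the actual etl.py:

    import os
    import configparser

    config = configparser.ConfigParser()
    config.read("dl.config")

    # Expose the IAM credentials so Spark's S3A connector can pick them up
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

    input_data = config["S3"]["INPUT_DATA"]    # s3a://udacity-dend/
    output_data = config["S3"]["OUTPUT_DATA"]  # s3://<your_s3_bucket>/data_outputs/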

How to run

This project runs on a Spark cluster set up on Amazon EMR with the following setup (a sample aws emr create-cluster command assembled from these options is shown after the list):

    --use-default-roles
    --release-label emr-5.32.0
    --instance-count 2
    --applications Name=Spark
    --bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh"
    --ec2-attributes KeyName=spark-cluster
    --instance-type m5.xlarge
    --instance-count 3
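
Assembled into a single AWS CLI call, this would look roughly as follows. Note that the option list above gives --instance-count twice (2 and 3); the sketch keeps only the later value, and the cluster name is illustrative:

    aws emr create-cluster \
        --name "sparkify-data-lake" \
        --use-default-roles \
        --release-label emr-5.32.0 \
        --applications Name=Spark \
        --bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh" \
        --ec2-attributes KeyName=spark-cluster \
        --instance-type m5.xlarge \
        --instance-count 3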

Note:

  1. Make sure that the file to be run is on your Spark cluster. If your development environment is your local laptop, copy the file to the EMR master node over SSH (scp):
    scp -i <key_pair_pem_file.pem> <local_file_path, e.g. etl.py> hadoop@<EMR_MasterNode_endpoint>:~/<path_location>
  2. Make sure that your EMR default role has an S3 access policy attached (one way to do this with the AWS CLI is shown below).
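
For example, with the AWS CLI (a sketch only; depending on your setup, the role that actually needs S3 access may be the EC2 instance profile role EMR_EC2_DefaultRole rather than the EMR_DefaultRole service role, and a narrower policy than AmazonS3FullAccess may be preferable):

    aws iam attach-role-policy \
        --role-name EMR_EC2_DefaultRole \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess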

Once the file and your EMR cluster are ready, kick off the Spark job:

    spark-submit --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 etl.py

Project Files

  • data - a small local subset of the source data for data profiling and testing
  • data_profiling.ipynb - a Python notebook for data profiling to examine and summarize the data source before the actual pipeline implementation
  • etl.py - a Python file implementing the actual ETL data pipeline process for all datasets
  • requirements.txt - a text file containing all mandatory dependencies for the project

Project Author

  • Author: Thitiwat Watanajaturaporn
  • Note: this project is part of Udacity’s Data Engineering Nanodegree Program.