ETL Process to an S3 Data Lake with EMR, Spark, Hadoop, and Schema-on-Read
This project simulates a scenario in which a music startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake. Its data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app.
As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, processes it with Spark, and loads it back into S3 as a set of dimensional tables, so that the analytics team can continue finding insights into what songs their users are listening to.
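At a high level, etl.py reads the raw JSON from S3, builds the tables with Spark, and writes them back to S3 as parquet. Below is a minimal sketch of that flow; the function names and the hadoop-aws package pin are assumptions, not necessarily what etl.py uses (on EMR the S3 connectors are already provided).

from pyspark.sql import SparkSession

def create_spark_session():
    # hadoop-aws provides the s3a:// filesystem; on EMR it is already on the
    # classpath, so this config mainly matters for local test runs.
    return (
        SparkSession.builder
        .appName("sparkify-data-lake-etl")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.7")
        .getOrCreate()
    )

def process_song_data(spark, input_data, output_data):
    ...  # read song JSON, build the songs and artists tables, write parquet

def process_log_data(spark, input_data, output_data):
    ...  # read log JSON, build the users, time and songplays tables, write parquet

def main():
    spark = create_spark_session()
    input_data = "s3a://udacity-dend/"
    output_data = "s3a://<your_s3_bucket>/data_outputs/"
    process_song_data(spark, input_data, output_data)
    process_log_data(spark, input_data, output_data)

if __name__ == "__main__":
    main()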
The song dataset is a subset of real data from the Million Song Dataset. It is in JSON format and contains metadata about each song and its artist.
Sample Song Data:
{"num_songs":1,"artist_id":"ARD7TVE1187B99BFB1","artist_latitude":null,"artist_longitude":null,"artist_location":"California - LA","artist_name":"Casual","song_id":"SOMZWCG12A8C13C480","title":"I Didn't Mean To","duration":218.93179,"year":0}
The log dataset is in JSON format, generated by an event simulator based on the songs in the dataset above.
Sample Log Data:
{"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
The raw data is read from the following S3 paths:
- Song data: s3a://udacity-dend/song-data/
- Log data: s3a://udacity-dend/log-data/
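A sketch of how these prefixes might be read into DataFrames follows; the wildcard depths are assumptions about how the JSON files are nested under each prefix and should be adjusted to the actual layout.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-read").getOrCreate()

# Wildcards are assumptions about the directory nesting under each prefix.
song_df = spark.read.json("s3a://udacity-dend/song-data/*/*/*/*.json")
log_df = spark.read.json("s3a://udacity-dend/log-data/*/*/*.json")

song_df.printSchema()
log_df.printSchema()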
The ETL produces the following dimensional tables, written back to S3 as parquet:
songplays
- records in the log data associated with song plays, i.e. records with page = NextSong
songplay_id, start_time, userId, level, song_id, artist_id, session_id, location, userAgent, month, year
users
- users in the app
userId, firstName, lastName, gender, level
songs
- songs in the music database
song_id, title, artist_id, year, duration
artists
- artists in the music database
artist_id, name, location, latitude, longitude
time
- timestamps of records in songplays broken down into specific units
ts, start_time, hour, day, week, month, year, weekday
The songs, time, and songplays tables are partitioned when written out: time and songplays by year and month, and songs by year. The output S3 bucket can be designated in the setup configuration file (see below).
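As a concrete example of the partitioned output, the sketch below derives the songplays table from the log and song DataFrames (read as in the earlier sketch) and writes it partitioned by year and month. The join condition and the use of monotonically_increasing_id for songplay_id are assumptions about the approach rather than a copy of etl.py.

from pyspark.sql import functions as F

def build_songplays(log_df, song_df, output_data):
    # Keep only the song-side columns needed for the lookup to avoid column clashes.
    songs_lookup = song_df.select("song_id", "artist_id", "title", "artist_name")

    songplays = (
        log_df
        .filter(F.col("page") == "NextSong")
        .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
        .withColumn("year", F.year("start_time"))
        .withColumn("month", F.month("start_time"))
        .join(
            songs_lookup,
            (F.col("song") == F.col("title")) & (F.col("artist") == F.col("artist_name")),
            "left",
        )
        .withColumn("songplay_id", F.monotonically_increasing_id())
        .select(
            "songplay_id", "start_time", "userId", "level", "song_id", "artist_id",
            F.col("sessionId").alias("session_id"), "location", "userAgent",
            "month", "year",
        )
    )

    # Partitioned parquet output, e.g. <OUTPUT_DATA>/songplays/year=.../month=.../
    songplays.write.mode("overwrite").partitionBy("year", "month").parquet(
        output_data + "songplays/"
    )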
At the project root, create a config file named dl.config
with the following configuration.
[AWS]
# AWS IAM credentials
AWS_ACCESS_KEY_ID=<iam_access_key_id>
AWS_SECRET_ACCESS_KEY=<iam_secret_access_key>
[S3]
INPUT_DATA=s3a://udacity-dend/
OUTPUT_DATA=s3://<your_s3_bucket>/data_outputs/
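A sketch of how etl.py might consume dl.config is shown below; exporting the AWS keys as environment variables is one way to let the s3a connector pick them up, though the exact mechanism in etl.py may differ.

import configparser
import os

config = configparser.ConfigParser()
config.read("dl.config")

# Credentials for the s3a connector.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

# Input and output locations used throughout the pipeline.
input_data = config["S3"]["INPUT_DATA"]
output_data = config["S3"]["OUTPUT_DATA"]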
This project runs on a Spark cluster set up on Amazon EMR with the following configuration (passed as aws emr create-cluster options):
--use-default-roles
--release-label emr-5.32.0
--applications Name=Spark
--bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh"
--ec2-attributes KeyName=spark-cluster
--instance-type m5.xlarge
--instance-count 3
Note: copy the project files (e.g. etl.py) to the EMR master node with scp:
scp -i <key_pair_pem_file.pem> <local_file_path i.e. etl.py> hadoop@<EMR_MasterNode_endpoint>:~/<path_location>
Once the files are copied and the EMR cluster is in a ready state, kick off the Spark job:
spark-submit --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 etl.py
Project files:
data
- a small local data source for data profiling and testing
data_profiling.ipynb
- a Python notebook for profiling, examining, and summarizing the data sources before the actual pipeline implementation
etl.py
- a Python script implementing the actual ETL data pipeline for all datasets
requirements.txt
- a text file containing all mandatory dependencies for the project