Udacity Data Engineer Nanodegree - Data Lake Spark ETL
Deivid Robim (LinkedIn)
A music streaming startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake.
Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables.
The etl.py script executes the following:
Data-Lakes-with-Spark
│   README.md                # Project description
│   requirements.txt         # Python dependencies
│
├───data                     # The datasets
│   │
│   ├───log_data
│   │   │   ...
│   └───song_data
│       │   ...
│
└───src                      # Source code
    │   etl.py               # ETL script
    │   dl.cfg               # Configuration file
    │   validation.ipynb     # Jupyter Notebook to validate data
Because loading files from S3 is slow, you’ll work with a subset of the two main datasets, which is provided in this repository:
data/song_data
data/log_data
REMEMBER: You can easily change the input data location in the dl.cfg file.
Song dataset:
It’s a subset of real data from the Million Song Dataset.
Each file is in JSON format and contains metadata about a song and the artist of that song.
{
"num_songs":1,
"artist_id":"ARD7TVE1187B99BFB1",
"artist_latitude":null,
"artist_longitude":null,
"artist_location":"California - LA",
"artist_name":"Casual",
"song_id":"SOMZWCG12A8C13C480",
"title":"I Didn't Mean To",
"duration":218.93179,
"year":0
}
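As a rough illustration of how one song file could feed two dimensional tables, the sketch below splits the sample record above into a songs row and an artists row. It uses plain Python rather than Spark so it can be run anywhere; the helper name `split_song_record` is hypothetical and is not taken from etl.py, and the output column names follow the songs and artists tables of this project.

```python
import json

# Sample record copied from the song dataset example above.
raw = '''{"num_songs":1,"artist_id":"ARD7TVE1187B99BFB1","artist_latitude":null,
"artist_longitude":null,"artist_location":"California - LA","artist_name":"Casual",
"song_id":"SOMZWCG12A8C13C480","title":"I Didn't Mean To","duration":218.93179,"year":0}'''


def split_song_record(record):
    """Hypothetical helper: return (song_row, artist_row) for one song record."""
    song_row = {
        "song_id": record["song_id"],
        "title": record["title"],
        "artist_id": record["artist_id"],
        "year": record["year"],
        "duration": record["duration"],
    }
    # The artist_* prefix is dropped to match the artists table columns.
    artist_row = {
        "artist_id": record["artist_id"],
        "name": record["artist_name"],
        "location": record["artist_location"],
        "latitude": record["artist_latitude"],
        "longitude": record["artist_longitude"],
    }
    return song_row, artist_row


song_row, artist_row = split_song_record(json.loads(raw))
print(song_row)
print(artist_row)
```

Note that JSON `null` values (here the artist coordinates) arrive as `None`, so downstream code should expect missing latitude and longitude.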
Log dataset:
It consists of log files in JSON format generated by this event simulator based on the songs in the dataset above.
These simulate activity logs from a music streaming app based on specified configurations.
{
"artist":null,
"auth":"Logged In",
"firstName":"Walter",
"gender":"M",
"itemInSession":0,
"lastName":"Frye",
"length":null,
"level":"free",
"location":"San Francisco-Oakland-Hayward, CA",
"method":"GET",
"page":"Home",
"registration":1540919166796.0,
"sessionId":38,
"song":null,
"status":200,
"ts":1541105830796,
"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"39"
}
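The `ts` field is an epoch timestamp in milliseconds, and only records whose `page` is `NextSong` represent song plays. The sketch below shows one way to filter and convert such records in plain Python; it is not the Spark code from etl.py. The function name `song_play_events` is hypothetical, the second sample line is made up for illustration, and interpreting timestamps in UTC is an assumption.

```python
import json
from datetime import datetime, timezone

# First line reuses values from the log example above; the second line is
# made up for illustration only.
log_lines = [
    '{"page": "Home", "ts": 1541105830796, "userId": "39", "song": null}',
    '{"page": "NextSong", "ts": 1541106106796, "userId": "8", "song": "example song"}',
]


def song_play_events(lines):
    """Hypothetical helper: keep NextSong records and add a start_time datetime."""
    events = []
    for line in lines:
        record = json.loads(line)
        if record.get("page") != "NextSong":
            continue  # not a song play (e.g. Home page views)
        # ts is in milliseconds; datetime expects seconds. UTC is an assumption.
        record["start_time"] = datetime.fromtimestamp(
            record["ts"] / 1000, tz=timezone.utc
        )
        events.append(record)
    return events


events = song_play_events(log_lines)
print(len(events), events[0]["start_time"].date())
```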
• songplays - records in the log data associated with song plays, i.e. records whose page is NextSong
table schema: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
• users - users in the app
table schema: user_id, first_name, last_name, gender, level
• songs - songs in music database
table schema: song_id, title, artist_id, year, duration
• artists - artists in music database
table schema: artist_id, name, location, latitude, longitude
• time - timestamps of records in songplays broken down into specific units
table schema: start_time, hour, day, week, month, year, weekday
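To make the time table concrete, here is a minimal sketch that breaks one `ts` value from the log data into the columns listed above. It is plain Python rather than the project's Spark code, and the function name `time_row` is hypothetical. Several conventions are assumptions that etl.py may handle differently: timestamps are read as UTC, `week` is the ISO week number, and `weekday` runs from 1 (Monday) to 7 (Sunday).

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Hypothetical helper: expand an epoch-milliseconds value into time columns."""
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)  # UTC assumed
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],   # ISO week number (assumption)
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.isoweekday(),    # 1 = Monday ... 7 = Sunday (assumption)
    }


# ts value taken from the log dataset example in this README.
row = time_row(1541105830796)
print(row)
```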
git clone https://github.com/drobim-data-engineering/Data-Lakes-with-Spark.git
cd Data-Lakes-with-Spark
python3 -m venv venv # create virtualenv
source venv/bin/activate # activate virtualenv
pip install -r requirements.txt # install requirements (this can take a couple of minutes)
The dl.cfg file holds the configuration variables used by the scripts to create and configure the AWS resources.
These are the variables you need to set before running the etl.py script.
AWS_ACCESS_KEY_ID = <ENTER AWS ACCESS KEY> # paste your user Access Key
AWS_SECRET_ACCESS_KEY = <ENTER AWS SECRET KEY> # paste your user Secret Key
REGION = <ENTER THE AWS REGION> # paste the AWS Region to create resources
OUTPUT_BUCKET = <ENTER THE OUTPUT BUCKET> # paste the AWS Bucket name to be created
INPUT_DATA = <ENTER THE INPUT DATA LOCATION> # paste the data location (for this exercise it is already set to read the data/ directory)
REMEMBER: Never share your AWS access key and secret key in scripts.
This project is only an experiment to get familiar with the AWS SDK for Python.
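For readers unfamiliar with .cfg files, the sketch below shows one way such settings could be read with Python's standard `configparser` module. It rests on an assumption: `configparser` requires a section header, and the listing above does not show one, so the `[AWS]` section name here is hypothetical, as are the function name `load_settings` and all placeholder values. Check the real dl.cfg and etl.py for the actual layout.

```python
import configparser

# Hypothetical file content. The [AWS] section name and every value below are
# placeholders for illustration, not real credentials or the project's layout.
sample_cfg = """
[AWS]
AWS_ACCESS_KEY_ID = example-access-key
AWS_SECRET_ACCESS_KEY = example-secret-key
REGION = us-west-2
OUTPUT_BUCKET = example-output-bucket
INPUT_DATA = data/
"""


def load_settings(text):
    """Hypothetical helper: parse config text and return the [AWS] section as a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names upper-case instead of lower-casing them
    parser.read_string(text)
    return dict(parser["AWS"])


settings = load_settings(sample_cfg)
print(settings["REGION"], settings["INPUT_DATA"])
```

To read from disk instead of a string, `parser.read("dl.cfg")` can replace `parser.read_string(text)`.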
cd src/
python etl.py # Entry point that kicks off the pipeline, from creating the Spark session to writing the modelled data to an S3 bucket.
To run this script, you need a working Spark environment.
Setting one up is outside the scope of this exercise. However, Udacity students can run the script in the workspace environment.
jupyter notebook # launch jupyter notebook app
# The notebook interface will appear in a new browser window or tab.
# Navigate to src/validation.ipynb and run SQL queries against the data lake