Data Engineering NanoDegree

Author

Deivid Robim (LinkedIn)

Project 4: Data Lake and Apache Spark

A music streaming startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake.
Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables.
The etl.py script executes the following:

Create a Spark Session using the Apache Hadoop Amazon Web Services Module.
Ingest Log and Song data files from the desired location (configurable in dl.cfg).
Clean and Process the data:
- Add unique identifiers to Fact & Dimension tables
- Remove duplicates
- Replace nulls with desired values
- Parse timestamp into Time and Date components
- Create Dimension & Fact tables
Write final tables to S3
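
The cleaning steps listed above can be illustrated in plain Python. This is only a sketch of the idea, not the code in etl.py: the function name clean_users, the row_id column, and the default level of "free" are assumptions made for the example. In PySpark, the closest built-ins would be DataFrame.dropDuplicates, DataFrame.fillna and monotonically_increasing_id; this README does not state which of them etl.py uses.

```python
from itertools import count


def clean_users(rows, default_level="free"):
    """Fill nulls, drop duplicates, and attach a surrogate key to user rows."""
    seen = set()
    ids = count(start=1)
    cleaned = []
    for row in rows:
        # Impute: replace a missing level with a default (hypothetical choice).
        level = row.get("level") or default_level
        key = (row["userId"], level)
        # Remove duplicates: keep only the first row per (userId, level).
        if key in seen:
            continue
        seen.add(key)
        # Unique identifier: a simple incrementing surrogate key.
        cleaned.append({"row_id": next(ids), "user_id": row["userId"], "level": level})
    return cleaned


rows = [
    {"userId": "39", "level": "free"},
    {"userId": "39", "level": "free"},  # exact duplicate, dropped
    {"userId": "8", "level": None},     # null level, imputed
]
cleaned = clean_users(rows)
print(cleaned)
```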

Project Structure

Data-Lakes-with-Spark
│   README.md              # Project description
│   requirements.txt       # Python dependencies
├───data                   # The datasets
│   ├───log_data
│   │   │  ...
│   └───song_data
│       │  ...
│
└───src                    # Source code
    │  etl.py              # ETL script
    │  dl.cfg              # Configuration file
    │  validation.ipynb    # Jupyter Notebook to validate data

Requirements for running locally

Python3
AWS Account

Datasets

Because loading files from S3 is slow, you’ll be working with a subset of the two main datasets, which is provided in this repository:

Song data: data/song_data
Log data: data/log_data

REMEMBER: You can easily change the input data location in the dl.cfg file.

Song dataset:

It’s a subset of real data from the Million Song Dataset.
Each file is in JSON format and contains metadata about a song and the artist of that song.

{
    "num_songs":1,
    "artist_id":"ARD7TVE1187B99BFB1",
    "artist_latitude":null,
    "artist_longitude":null,
    "artist_location":"California - LA",
    "artist_name":"Casual",
    "song_id":"SOMZWCG12A8C13C480",
    "title":"I Didn't Mean To",
    "duration":218.93179,
    "year":0
 }
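
Since each song file carries both song and artist metadata, one record can feed two dimension tables. The sketch below, in plain Python rather than Spark, splits the sample record into a songs row and an artists row using the column names from the table schemas in this README. The function name split_song_record is hypothetical, and the mapping from artist_name, artist_location, artist_latitude and artist_longitude to name, location, latitude and longitude is an assumption inferred from those schemas.

```python
import json

# The sample song record shown above.
SONG_JSON = """
{
    "num_songs": 1,
    "artist_id": "ARD7TVE1187B99BFB1",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "California - LA",
    "artist_name": "Casual",
    "song_id": "SOMZWCG12A8C13C480",
    "title": "I Didn't Mean To",
    "duration": 218.93179,
    "year": 0
}
"""


def split_song_record(record):
    """Return (songs_row, artists_row) built from one song metadata record."""
    song = {k: record[k] for k in ("song_id", "title", "artist_id", "year", "duration")}
    artist = {
        "artist_id": record["artist_id"],
        "name": record["artist_name"],
        "location": record["artist_location"],
        "latitude": record["artist_latitude"],
        "longitude": record["artist_longitude"],
    }
    return song, artist


song, artist = split_song_record(json.loads(SONG_JSON))
print(song)
print(artist)
```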

Log dataset:

It consists of log files in JSON format generated by this event simulator based on the songs in the dataset above.
These simulate activity logs from a music streaming app based on specified configurations.

{
   "artist":null,
   "auth":"Logged In",
   "firstName":"Walter",
   "gender":"M",
   "itemInSession":0,
   "lastName":"Frye",
   "length":null,
   "level":"free",
   "location":"San Francisco-Oakland-Hayward, CA",
   "method":"GET",
   "page":"Home",
   "registration":1540919166796.0,
   "sessionId":38,
   "song":null,
   "status":200,
   "ts":1541105830796,
   "userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
   "userId":"39"
}

Fact Table

• songplays - records in log data associated with song plays i.e. records with page NextSong
  table schema: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
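
As a rough illustration of the "page NextSong" rule, the plain-Python sketch below keeps only NextSong events and renames the log fields to the songplays columns that come straight from the log. The function name select_song_plays and the second sample event are hypothetical. song_id and artist_id are left out because they need a lookup against the song data, which is not shown here.

```python
def select_song_plays(events):
    """Keep only NextSong events and number them from 1."""
    plays = [e for e in events if e.get("page") == "NextSong"]
    return [
        {
            "songplay_id": i,
            "start_time": e["ts"],  # still epoch milliseconds at this point
            "user_id": e["userId"],
            "level": e["level"],
            "session_id": e["sessionId"],
            "location": e["location"],
            "user_agent": e["userAgent"],
            # song_id / artist_id would come from a join with the song data.
        }
        for i, e in enumerate(plays, start=1)
    ]


events = [
    # Trimmed version of the sample log record above (page is "Home").
    {"page": "Home", "ts": 1541105830796, "userId": "39", "level": "free",
     "sessionId": 38, "location": "San Francisco-Oakland-Hayward, CA",
     "userAgent": "Mozilla/5.0"},
    # Hypothetical song play event, invented for this example.
    {"page": "NextSong", "ts": 1541106106796, "userId": "8", "level": "free",
     "sessionId": 139, "location": "Phoenix-Mesa-Scottsdale, AZ",
     "userAgent": "Mozilla/5.0"},
]
song_plays = select_song_plays(events)
print(song_plays)
```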

Dimension Tables

• users - users in the app
  table schema: user_id, first_name, last_name, gender, level
• songs - songs in music database
  table schema: song_id, title, artist_id, year, duration
• artists - artists in music database
  table schema: artist_id, name, location, latitude, longitude
• time - timestamps of records in songplays broken down into specific units
  table schema: start_time, hour, day, week, month, year, weekday
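
To show what "broken down into specific units" means for the time table, here is a small plain-Python sketch that converts the millisecond ts value from the sample log record into the time columns. The function name time_row is hypothetical. Using UTC, the ISO week number, and ISO weekday numbering (Monday = 1) are assumptions; Spark's own date functions number weekdays differently, so etl.py may produce other weekday values.

```python
from datetime import datetime, timezone


def time_row(ts_ms):
    """Break an epoch-milliseconds timestamp into the time table columns."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": dt,
        "hour": dt.hour,
        "day": dt.day,
        "week": dt.isocalendar()[1],  # ISO week number
        "month": dt.month,
        "year": dt.year,
        "weekday": dt.isoweekday(),   # Monday = 1 ... Sunday = 7
    }


row = time_row(1541105830796)  # "ts" from the sample log record
print(row)
```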

Instructions for running locally

Clone repository to local machine

git clone https://github.com/drobim-data-engineering/Data-Lakes-with-Spark.git

Change directory to local repository

cd Data-Lakes-with-Spark

Create Python virtual environment

python3 -m venv venv             # create virtualenv
source venv/bin/activate         # activate virtualenv
pip install -r requirements.txt  # install requirements (this can take a couple of minutes)

Edit dl.cfg file

This file holds the configuration variables used by the scripts to create and configure the AWS resources.

These are the variables the user needs to set up before running the etl.py script.

AWS_ACCESS_KEY_ID = <ENTER AWS ACCESS KEY>   # paste your user Access Key
AWS_SECRET_ACCESS_KEY = <ENTER AWS SECRET KEY>  # paste your user Secret Key
REGION = <ENTER THE AWS REGION> # paste the AWS Region to create resources
OUTPUT_BUCKET = <ENTER THE OUTPUT BUCKET> # paste the AWS Bucket name to be created
INPUT_DATA = <ENTER THE INPUT DATA LOCATION> # paste the data location (for this exercise it is already set to read the data/ directory)

REMEMBER: Never share your AWS Access Key and Secret Key, and never hard-code them in scripts.

This is just an experiment to get familiar with the AWS SDK for Python.
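
A file like dl.cfg can be read with Python's standard configparser module. The sketch below makes several assumptions: the [AWS] section header is hypothetical (the listing above shows no section header, but configparser requires one), all values are placeholders, and the function name load_settings is invented for the example. Setting inline_comment_prefixes lets the trailing "# ..." comments be ignored.

```python
import configparser

# Placeholder configuration; the [AWS] header and all values are assumptions.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = EXAMPLE_KEY_ID      # placeholder, not a real key
AWS_SECRET_ACCESS_KEY = EXAMPLE_SECRET  # placeholder, not a real key
REGION = us-west-2                      # hypothetical region
OUTPUT_BUCKET = my-output-bucket        # hypothetical bucket name
INPUT_DATA = data/                      # local sample data
"""


def load_settings(text, section="AWS"):
    """Parse config text and return one section as a plain dict."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    # configparser lowercases option names by default.
    return dict(parser[section])


settings = load_settings(SAMPLE_CFG)
print(settings["region"], settings["input_data"])
```

To read the real file instead of a string, parser.read("dl.cfg") can replace the read_string call.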

Run script

cd src/
python etl.py # Entry point that kicks off the whole pipeline, from creating the Spark session to writing the modelled data to an S3 bucket.

Check results

Running etl.py requires a working Spark environment.
Setting one up is out of scope for this exercise. However, Udacity students can run the script in the workspace environment.

jupyter notebook  # launch jupyter notebook app
# The notebook interface will appear in a new browser window or tab.
# Navigate to src/validation.ipynb and run SQL queries against the data lake