Data Engineer (Udacity): Project 3 Data Warehouse (DWH) with RedShift on Amazon Web Service (AWS)
Author: Tianlin He
Date: 19 Sep 2020
Tag: #Udacity #Data Engineering #AWS #RedShift #Python SDK
A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As their data engineer, you are tasked with building an ETL pipeline that
You’ll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.
You’ll be working with two datasets that reside in **S3 with the following link:
s3://udacity-dend/song_data
s3://udacity-dend/log_data
s3://udacity-dend/log_json_path.json
to load the Log data.The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song’s track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The log data are in JSON format files generated by this event simulator based on the songs in the dataset above. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:
The project is developed in a local enviornment (MacOS) and based on pure Python SDK codes.
The steps, including creating a cluster, create and load tables, testing, were pipelined in a Jupyter notebook etl.ipynb
sql_queries.py
to create empty tables, copy data from S3 and insert data into tablesdwh.cfg
dwh.cfg
DWH_ENTPONT
and IAM_ROLE
displayed to dwh.cfg
sql tool
create_table.py
to create empty tablesetl.py
(no modification is needed) to 1) copy data from S3 to staging 2) insert data from staging to tablessql
commandetl.ipynb
sql_queries.py
create_tables.py
(no modification is needed)etl.py
(no modification is needed)sample_dwh.cfg
is a copy of dwh.cfg
without key and secret, becuase they ought NOT to be accessible to the public
%sql SELECT COUNT(*) FROM staging_songs
count |
---|
14896 |
%sql SELECT COUNT(*) FROM staging_events
count |
---|
8056 |
songplays
table
%sql SELECT * FROM songplays LIMIT 5
songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
---|---|---|---|---|---|---|---|---|
0 | 2018-11-03 01:04:33 | None | free | None | None | 52 | None | None |
64 | 2018-11-03 17:39:13 | 15 | paid | None | None | 199 | Chicago-Naperville-Elgin, IL-IN-WI | “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36” |
128 | 2018-11-03 18:36:39 | 15 | paid | None | None | 199 | Chicago-Naperville-Elgin, IL-IN-WI | “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36” |
192 | 2018-11-05 06:32:52 | 49 | free | None | None | 224 | San Francisco-Oakland-Hayward, CA | Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0 |
256 | 2018-11-05 10:09:06 | 95 | paid | None | None | 222 | Winston-Salem, NC | “Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53” |
%%sql SELECT *
FROM songplays
WHERE start_time=(SELECT MAX(start_time) FROM songplays);
songplay_id | start_time | user_id | level | song_id | artist_id | session_id | location | user_agent |
---|---|---|---|---|---|---|---|---|
6266 | 2018-11-30 19:54:24 | 5 | free | None | None | 985 | Detroit-Warren-Dearborn, MI | “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.77.4 (KHTML, like Gecko) Version/7.0.5 Safari/537.77.4” |
%sql SELECT level, COUNT(*) FROM users GROUP BY level
level | count |
---|---|
free | 83 |
paid | 22 |