PostgreSQL, Data Modeling, Star Schema, ETL, Data Engineering
Sparkify, a startup, wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. Their data is in a directory of JSON logs on user activity on the app and another directory with JSON metadata on the songs in their app.
The goal of this project is to create a Postgres databsae that allow analytics team to optimize queries on song play analysis.
The database structure uses Star Schema with following tables:
test.ipynb
displays the first few rows of each table.
create_tables.py
drops and creates tables.
etl.ipynb
reads and processes a single file from song_data and log_data and loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.
etl.py
reads and processes files from song_data and log_data and loads them into tables.
sql_queries.py
contains all sql queries, and is imported into the last three files above.