Project author: xwilchen

Project description:
PostgreSQL, Data Modeling, Star Schema, ETL, Data Engineering
Primary language: Jupyter Notebook
Repository: git://github.com/xwilchen/Data_Modeling_with_Postgres.git
Created: 2019-06-07T02:48:52Z
Project page: https://github.com/xwilchen/Data_Modeling_with_Postgres


Data Modeling with Postgres


Sparkify, a startup, wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Their data resides in one directory of JSON logs of user activity on the app and another directory of JSON metadata on the songs in their app.
The goal of this project is to create a Postgres database that allows the analytics team to optimize queries for song play analysis.

Data Modeling Structure


The database uses a star schema with the following tables (a rough DDL sketch follows the list):

Fact Table

  1. songplay: records in log data associated with song plays, i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables

  2. users: users in the app
  3. songs: songs in the music database
    • song_id, title, artist_id, year, duration
  4. artists: artists in the music database
    • artist_id, name, location, latitude, longitude
  5. time: timestamps of records in songplay broken down into specific units
    • start_time, hour, day, week, month, year, weekday
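
The actual table definitions live in sql_queries.py. The sketch below is only a minimal illustration of how the fact and dimension tables above could be declared as Postgres statements kept in Python strings, in the style of sql_queries.py; the column types, constraints, and the sample analysis query are assumptions, not the project's exact code.

    # Hypothetical sketch of the star-schema DDL, kept as Python strings
    # in the style of sql_queries.py. Types and constraints are assumptions.

    songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplay (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP,
        user_id     INT,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
    """

    songs_table_create = """
    CREATE TABLE IF NOT EXISTS songs (
        song_id   VARCHAR PRIMARY KEY,
        title     VARCHAR,
        artist_id VARCHAR,
        year      INT,
        duration  FLOAT
    );
    """

    # The star schema keeps song-play analysis simple: join the fact table
    # to its dimensions and aggregate, e.g. to find the most-played songs.
    top_songs_query = """
    SELECT s.title, a.name, COUNT(*) AS plays
    FROM songplay sp
    JOIN songs   s ON sp.song_id = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY plays DESC
    LIMIT 10;
    """

Because every dimension joins to the fact table on a single key, analytical queries like the one above stay flat and easy to optimize.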

How to Run


  1. Run create_tables.py
  2. Run etl.py (example commands below)
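
Assuming Python and a local Postgres instance are available and the commands are run from the repository root, the two steps above amount to:

    python create_tables.py
    python etl.py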

Files


  1. test.ipynb displays the first few rows of each table.

  2. create_tables.py drops and creates tables.

  3. etl.ipynb reads and processes a single file from song_data and log_data and loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.

  4. etl.py reads and processes files from song_data and log_data and loads them into tables (a sketch of this pattern follows the list).

  5. sql_queries.py contains all SQL queries, and is imported into the last three files above.
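
To make the ETL step concrete, here is a minimal sketch of the pattern etl.py follows for the song files: read one JSON record with pandas and insert it through psycopg2. The connection parameters, the insert statement, and the helper name process_song_file are illustrative assumptions, not the project's exact code.

    import glob

    import pandas as pd
    import psycopg2

    # Illustrative insert statement; the real ones live in sql_queries.py.
    song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
    """


    def process_song_file(cur, filepath):
        """Read one song JSON file and load its record into the songs table."""
        df = pd.read_json(filepath, lines=True)
        song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
        cur.execute(song_table_insert, song_data)


    if __name__ == "__main__":
        # Connection parameters are assumptions for a local Postgres setup.
        conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
        conn.autocommit = True
        cur = conn.cursor()

        # Walk the song_data directory and process every JSON file found.
        for filepath in glob.glob("data/song_data/**/*.json", recursive=True):
            process_song_file(cur, filepath)

        conn.close()

etl.py applies the same read-then-insert pattern to the log files to populate the remaining tables.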