Project author: donjude

Project description:
This project is part of the Udacity Nanodegree programme. It consists of building a data warehouse using Amazon Redshift and creating an ETL process to load data from AWS S3 storage into the data warehouse using Python and SQL.
Primary language: Jupyter Notebook
Repository: git://github.com/donjude/cloud-data-warehouse.git
Created: 2021-04-21T10:29:52Z
Project community: https://github.com/donjude/cloud-data-warehouse

License:


Cloud Data Warehouse with Amazon Redshift

Project Summary

Sparkify is a music streaming startup that has grown its user base and song database and wants to move its processes and data onto the cloud. Their data consists of logs of user activity on the music streaming app, as well as metadata on the songs in the app. These data currently reside in an Amazon S3 bucket on Amazon Web Services (AWS).

Sparkify requires an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so their analytics team can continue finding insights into what songs their users are listening to. The database and ETL pipeline are tested by running queries provided by the Sparkify analytics team and comparing the output with their expected results.

Project Description

In this project an ETL pipeline is built for a data warehouse hosted on Amazon Redshift, on the Amazon Web Services (AWS) cloud computing platform. The pipeline loads data from S3 into staging tables on Redshift and then executes SQL statements that populate the analytics tables from those staging tables.

Datasets

There are two datasets:

  • Song datasets: Each file is in JSON format and contains metadata about a song and the artist of that song.

    Location: s3://udacity-dend/song_data

    Sample:

    1. {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
  • Log datasets: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

    Location: s3://udacity-dend/log_data

    log data JSON path: s3://udacity-dend/log_json_path.json

    Sample:

    1. {"artist":null, "auth":"Logged In", "firstName":"Walter", "gender":"M", "itemInSession":0, "lastName":"Frye", "length":null, "level":"free", "location":"San Francisco-Oakland-Hayward, CA", "method":"GET", "page":"Home", "registration":1540919166796.0, "sessionId":38,"song":null, "status":200, "ts":1541105830796, "userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId":"39"}

Database Schema

The database schema consists of two staging tables used to load the two datasets from the Amazon S3 bucket into the data warehouse, plus a star schema (the analytics database) made up of one fact table containing the measurable, quantitative data about each song play. The fact table references four dimension tables that describe the attributes related to it.

The schema is built this way because the ETL process requires a staging area where data can be cleaned, transformed and merged before it is inserted into the analytics database, where Sparkify's analytics team can run their queries.

Staging Events & Songs tables

The two tables below are the staging tables into which the raw files are loaded before any transformation takes place.

    CREATE TABLE IF NOT EXISTS staging_events
    (
    artist VARCHAR
    ,auth VARCHAR
    ,firstName VARCHAR
    ,gender VARCHAR
    ,itemInSession INTEGER
    ,lastName VARCHAR
    ,length FLOAT
    ,level VARCHAR
    ,location VARCHAR
    ,method VARCHAR
    ,page VARCHAR
    ,registration FLOAT
    ,sessionId INTEGER
    ,song VARCHAR
    ,status INTEGER
    ,ts TIMESTAMP
    ,userAgent VARCHAR
    ,userId INTEGER
    );

    CREATE TABLE IF NOT EXISTS staging_songs
    (
    num_songs INTEGER
    ,artist_id VARCHAR
    ,artist_latitude FLOAT
    ,artist_longitude FLOAT
    ,artist_location VARCHAR
    ,artist_name VARCHAR
    ,song_id VARCHAR
    ,title VARCHAR
    ,duration FLOAT
    ,year INTEGER
    );
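
The staging tables are loaded directly from S3 with Redshift COPY commands rather than row-by-row inserts. Below is a minimal sketch of how those COPY statements might be defined as string constants in sql_queries.py; the region, the TIMEFORMAT option and the exact config keys are assumptions, with the S3 paths and IAM role ARN read from dwh.cfg.

    # Sketch of possible COPY statements in sql_queries.py (names and options assumed).
    import configparser

    config = configparser.ConfigParser()
    config.read("dwh.cfg")  # expects the [S3] and [IAM_ROLE] sections shown under Run Scripts

    # Log events: use the provided JSONPaths file and convert the epoch-millisecond
    # ts field into the TIMESTAMP column of staging_events.
    staging_events_copy = """
        COPY staging_events
        FROM {}
        IAM_ROLE '{}'
        REGION 'us-west-2'
        FORMAT AS JSON {}
        TIMEFORMAT AS 'epochmillisecs';
    """.format(config["S3"]["LOG_DATA"], config["IAM_ROLE"]["ARN"], config["S3"]["LOG_JSONPATH"])

    # Song metadata: the JSON attributes match the column names, so 'auto' is enough.
    staging_songs_copy = """
        COPY staging_songs
        FROM {}
        IAM_ROLE '{}'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """.format(config["S3"]["SONG_DATA"], config["IAM_ROLE"]["ARN"])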

Fact Table

This table contains the measurable, quantitative data about each song play. Below is the single fact table and its SQL code.

songplays - records in log data associated with song plays.

    CREATE TABLE IF NOT EXISTS songplays
    (
    songplay_id INTEGER IDENTITY(0,1)
    ,start_time TIMESTAMP NOT NULL
    ,user_id INTEGER NOT NULL
    ,level VARCHAR(20)
    ,song_id VARCHAR(800) NOT NULL
    ,artist_id VARCHAR(800) NOT NULL
    ,session_id INTEGER NOT NULL
    ,location VARCHAR(800)
    ,user_agent VARCHAR(800)
    ,PRIMARY KEY(songplay_id)
    ,FOREIGN KEY(start_time) REFERENCES time(start_time)
    ,FOREIGN KEY(user_id) REFERENCES users(user_id)
    ,FOREIGN KEY(song_id) REFERENCES songs(song_id)
    ,FOREIGN KEY(artist_id) REFERENCES artists(artist_id)
    )
    DISTKEY(start_time)
    SORTKEY(start_time);
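
The fact table is populated from the two staging tables by matching each log event against the song metadata. The sketch below shows one way the insert could be written; the join on title, artist name and duration and the page = 'NextSong' filter are the usual approach for this dataset, not necessarily the exact query used in the repository.

    # Possible songplays insert (sketch): join log events to song metadata and
    # keep only rows that represent an actual song play.
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        SELECT DISTINCT
               e.ts,
               e.userId,
               e.level,
               s.song_id,
               s.artist_id,
               e.sessionId,
               e.location,
               e.userAgent
        FROM staging_events e
        JOIN staging_songs s
          ON e.song = s.title
         AND e.artist = s.artist_name
         AND e.length = s.duration
        WHERE e.page = 'NextSong';
    """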

Dimension Tables

These tables are referenced by the fact table and contain the descriptive attributes related to it. Below are the four dimension tables and their SQL code.

users - users in the app

    CREATE TABLE IF NOT EXISTS users
    (
    user_id INTEGER NOT NULL
    ,first_name VARCHAR(256) NOT NULL
    ,last_name VARCHAR(256) NOT NULL
    ,gender VARCHAR(256)
    ,level VARCHAR(20)
    ,PRIMARY KEY(user_id)
    )
    SORTKEY(user_id);
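
A user's level can switch between free and paid over time, so the users load typically keeps a single row per user_id. A hedged sketch of one way to do that, using the most recent level seen in the event log (not necessarily the exact query in the repository):

    # Possible users insert (sketch): one row per user, taking the latest level seen.
    user_table_insert = """
        INSERT INTO users (user_id, first_name, last_name, gender, level)
        SELECT userId, firstName, lastName, gender, level
        FROM (
            SELECT userId, firstName, lastName, gender, level,
                   ROW_NUMBER() OVER (PARTITION BY userId ORDER BY ts DESC) AS rn
            FROM staging_events
            WHERE page = 'NextSong' AND userId IS NOT NULL
        ) AS ranked
        WHERE rn = 1;
    """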

songs - songs in music database

    CREATE TABLE IF NOT EXISTS songs
    (
    song_id VARCHAR(256)
    ,title VARCHAR(800) NOT NULL
    ,artist_id VARCHAR(800) NOT NULL
    ,year INTEGER
    ,duration FLOAT NOT NULL
    ,PRIMARY KEY(song_id)
    )
    SORTKEY(song_id);

artists - artists in music database

    CREATE TABLE IF NOT EXISTS artists
    (
    artist_id VARCHAR(800) NOT NULL
    ,name VARCHAR(800) NOT NULL
    ,location VARCHAR(800)
    ,lattitude FLOAT
    ,longitude FLOAT
    ,PRIMARY KEY(artist_id)
    )
    SORTKEY(artist_id);

time - timestamps of records in songplays broken down into specific units

    CREATE TABLE IF NOT EXISTS time
    (
    start_time TIMESTAMP NOT NULL
    ,hour INTEGER
    ,day INTEGER
    ,week INTEGER
    ,month INTEGER
    ,year INTEGER
    ,weekday VARCHAR(100)
    ,PRIMARY KEY(start_time)
    )
    DISTKEY(start_time)
    SORTKEY(start_time);
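
The time table is derived entirely from the songplay timestamps, with each calendar unit pulled out using EXTRACT. A minimal sketch of how that insert might look (again an assumption, not the exact query in the repository):

    # Possible time insert (sketch): break each distinct play timestamp
    # into the units stored in the time dimension.
    time_table_insert = """
        INSERT INTO time (start_time, hour, day, week, month, year, weekday)
        SELECT DISTINCT
               start_time,
               EXTRACT(hour FROM start_time),
               EXTRACT(day FROM start_time),
               EXTRACT(week FROM start_time),
               EXTRACT(month FROM start_time),
               EXTRACT(year FROM start_time),
               TO_CHAR(start_time, 'Day')
        FROM songplays;
    """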

Project Setup

The project setup consists of all the scripts required to reproduce this project.

  • analytics.py - This script contains Python functions that invoke the analytics queries from sql_queries.py and print the results to the terminal.
  • aws_sdk_create_redshift_cluster.ipynb - This notebook contains the Infrastructure as Code (IaC) script to create an Amazon Redshift cluster and set up the necessary policies and configurations so the project can be run from the shell/terminal.
  • sql_queries.py - This script contains all the SQL queries for dropping existing tables, creating all the tables, inserting data into the staging and analytics tables, and running the analytics queries.
  • create_tables.py - This script contains Python functions that establish a connection to the data warehouse and execute the queries from sql_queries.py to drop any existing tables and create all the tables (see the sketch after this list).
  • etl.py - This script implements the code from the etl.ipynb notebook to extract, transform and load the datasets into the database.
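
Both create_tables.py and etl.py follow the same pattern: read dwh.cfg, connect to the cluster with psycopg2, and execute the query lists defined in sql_queries.py. The sketch below illustrates that pattern for create_tables.py; the function names and query-list names are assumptions rather than code copied from the repository.

    # Sketch of the connect-and-execute pattern used by create_tables.py and etl.py.
    # drop_table_queries and create_table_queries are assumed to be lists of SQL
    # strings defined in sql_queries.py.
    import configparser
    import psycopg2
    from sql_queries import drop_table_queries, create_table_queries


    def execute_queries(cur, conn, queries):
        """Run each SQL statement in `queries`, committing after every one."""
        for query in queries:
            cur.execute(query)
            conn.commit()


    def main():
        config = configparser.ConfigParser()
        config.read("dwh.cfg")

        # The [CLUSTER] section lists HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT
        # in exactly the order expected by the connection string below.
        conn = psycopg2.connect(
            "host={} dbname={} user={} password={} port={}".format(
                *config["CLUSTER"].values()
            )
        )
        cur = conn.cursor()

        execute_queries(cur, conn, drop_table_queries)
        execute_queries(cur, conn, create_table_queries)

        conn.close()


    if __name__ == "__main__":
        main()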

Run Scripts

  1. Install the AWS Command Line Interface (awscli) from the terminal/Bash shell.

    On Linux:

    $ sudo apt install awscli

    On Windows:

    C:\> msiexec.exe /i https://awscli.amazonaws.com/AWSCLIV2.msi

  2. Run pip to install the AWS Python SDK (boto3), pandas and psycopg2-binary.

    $ pip install boto3 pandas psycopg2-binary

  3. Run the command below in your local Git Bash/terminal to clone the GitHub repository.

    $ git clone https://github.com/donjude/cloud-data-warehouse.git

  4. Create an AWS account. Create a configuration file named dwh.cfg in the same folder as the notebook aws_sdk_create_redshift_cluster.ipynb and fill in the details shown below. You can read more about IAM users, access keys and secret keys at the following link: Create IAM User.

    [AWS]
    KEY =
    SECRET =

    [CLUSTER]
    HOST =
    DB_NAME =
    DB_USER =
    DB_PASSWORD =
    DB_PORT = 5439

    [IAM_ROLE]
    ARN =

    [S3]
    LOG_DATA = 's3://udacity-dend/log_data'
    LOG_JSONPATH = 's3://udacity-dend/log_json_path.json'
    SONG_DATA = 's3://udacity-dend/song_data'

    [DWH]
    DWH_CLUSTER_TYPE = multi-node
    DWH_NUM_NODES = 4
    DWH_NODE_TYPE = dc2.large
    DWH_IAM_ROLE_NAME =
    DWH_CLUSTER_IDENTIFIER =
    DWH_DB =
    DWH_DB_USER =
    DWH_DB_PASSWORD =
    DWH_PORT = 5439
  5. In the cloned repository on your local computer, open the Jupyter notebook aws_sdk_create_redshift_cluster.ipynb, follow the instructions and run the notebook cells to create a Redshift database on AWS (a sketch of the core boto3 call appears after this list).

  6. Run the Python file create_tables.py in the terminal/shell to create the staging and analytics tables.

    $ python create_tables.py

  7. Run the Python file etl.py to execute the ETL process that loads the data from the staging tables into the analytics tables.

    $ python etl.py

  8. Run analytics.py to check the record counts in each table and to query the tables for the artists and songs with the highest and lowest streaming durations.
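
The notebook in step 5 uses the AWS Python SDK (boto3) to request the cluster described in the [DWH] section of dwh.cfg. A hedged sketch of the core call it makes, assuming the IAM role has already been created and granted read access to S3:

    # Sketch of the cluster creation performed by aws_sdk_create_redshift_cluster.ipynb.
    # Assumes dwh.cfg already holds the [AWS], [DWH] and [IAM_ROLE] values.
    import configparser
    import boto3

    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    redshift = boto3.client(
        "redshift",
        region_name="us-west-2",  # assumed region; use the same region as the S3 bucket
        aws_access_key_id=config["AWS"]["KEY"],
        aws_secret_access_key=config["AWS"]["SECRET"],
    )

    response = redshift.create_cluster(
        ClusterType=config["DWH"]["DWH_CLUSTER_TYPE"],
        NodeType=config["DWH"]["DWH_NODE_TYPE"],
        NumberOfNodes=int(config["DWH"]["DWH_NUM_NODES"]),
        ClusterIdentifier=config["DWH"]["DWH_CLUSTER_IDENTIFIER"],
        DBName=config["DWH"]["DWH_DB"],
        MasterUsername=config["DWH"]["DWH_DB_USER"],
        MasterUserPassword=config["DWH"]["DWH_DB_PASSWORD"],
        IamRoles=[config["IAM_ROLE"]["ARN"]],
    )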

Query Results

The results below show all the tables in the data warehouse and their record counts after executing the analytics queries.

The last two results show the artist and song with the highest and the lowest streaming duration, respectively.
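
For illustration, below is a hedged sketch of how such a query might be written, joining songplays to songs and artists and summing the play duration per song; it is an assumed formulation, not necessarily the exact query in analytics.py.

    # Possible analytics query (sketch): total streaming duration per song and artist,
    # ordered so the first row is the most streamed and the last row the least streamed.
    streaming_duration_by_song = """
        SELECT a.name AS artist, s.title AS song, SUM(s.duration) AS total_duration
        FROM songplays sp
        JOIN songs s ON sp.song_id = s.song_id
        JOIN artists a ON sp.artist_id = a.artist_id
        GROUP BY a.name, s.title
        ORDER BY total_duration DESC;
    """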

Analytics query results

References:

  1. https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_examples.html
  2. https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-set-up.html