Project author: bytesbysophie
Project description: [Data Engineering] This project creates a cloud-based Redshift data warehouse on AWS and transforms and loads data from S3 into the database.
Language: Python
Project URL: git://github.com/bytesbysophie/sparkify-redshift-data-warehouse.git
Creating a Redshift Data Warehouse on AWS for the Music Streaming Service Sparkify
Table of Contents
- Project Motivation and Description
- Installation
- File Descriptions
- Authors and Acknowledgements
" class="reference-link">Project Motivation and Description
The analytics team of the fictional music streaming service Sparkify wants to understand which songs its users are listening to.
This project supports that need by modeling log data that resides in S3 in JSON format and setting up a Redshift data warehouse in the Amazon cloud to make the data available for analysis.
More specifically, a cloud-based ETL pipeline loads the data from S3, transforms it, and loads it into the newly designed and created Redshift database.
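A rough sketch of that flow is shown below, assuming psycopg2 as the database driver and a Redshift COPY of JSON data from S3; the table names, S3 paths, SQL, and config keys are illustrative assumptions rather than the repository's actual identifiers.

```python
"""Illustrative S3 -> staging -> analytics flow (not the repo's exact code)."""
import configparser

import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

# COPY bulk-loads the raw JSON logs from S3 into a staging table on Redshift.
COPY_EVENTS = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{role_arn}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

# An INSERT ... SELECT then reshapes the staged rows into an analytics table.
INSERT_SONGPLAYS = """
    INSERT INTO songplays (start_time, user_id, song_id, artist_id)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, s.song_id, s.artist_id
    FROM staging_events e
    JOIN staging_songs s ON e.song = s.title
    WHERE e.page = 'NextSong';
"""

conn = psycopg2.connect(
    host=config.get("DWH", "DWH_ENDPOINT"),      # section/key names assumed
    dbname="dwh",
    user="dwhadmin",
    password=config.get("DWH", "DWH_PASSWORD"),  # key name assumed
    port=5439,
)
cur = conn.cursor()
cur.execute(COPY_EVENTS.format(role_arn=config.get("DWH", "DWH_ROLE_ARN")))
cur.execute(INSERT_SONGPLAYS)
conn.commit()
conn.close()
```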
" class="reference-link">Installation
AWS:
- Create an IAM user (dwhadmin) in the AWS Management Console
Files:
- Add the AWS KEY and SECRET of the dwhadmin user to dwh_example.cfg and save the file as dwh.cfg (an illustrative skeleton is shown at the end of this section)
- Run create_cluster.py to create an AWS Redshift data warehouse
- Add DWH_ENDPOINT and DWH_ROLE_ARN to dwh.cfg (both are logged when create_cluster.py runs in the previous step)
- Run create_tables.py to create the tables in AWS Redshift
- Run etl.py to load data from staging tables to analytics tables on Redshift
Make sure to delete your Redshift cluster when it is no longer needed (you can use drop_cluster.py) to prevent unnecessary costs.
To check whether your cluster is still running, use check_running_cluster.py.
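For orientation, dwh.cfg might look roughly like the skeleton below; the section and key names are assumptions based on the steps above, since only KEY, SECRET, DWH_ENDPOINT, and DWH_ROLE_ARN are named in this README.

```ini
; Illustrative dwh.cfg skeleton; section and key names are assumptions.
[AWS]
KEY=YOUR_AWS_KEY
SECRET=YOUR_AWS_SECRET

[DWH]
DWH_ENDPOINT=your-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
DWH_ROLE_ARN=arn:aws:iam::123456789012:role/dwhRole
DWH_PASSWORD=YOUR_DB_PASSWORD
```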
" class="reference-link">File Descriptions
- check_running_cluster.py: Lists the running Redshift clusters of the configured user
- create_cluster.py: Creates the AWS Redshift cluster and the DWH IAM role, and logs the resulting endpoint and role ARN (see the boto3 sketch below)
- create_tables.py: Creates the fact and dimension tables defined in sql_queries.py in Redshift
- drop_cluster.py: Deletes the Redshift cluster and the IAM role created by create_cluster.py
- dwh_example.cfg: An example of the configuration file
- etl.py: Implements the ETL pipeline that loads data from S3 into staging tables on Redshift, processes it, and finally loads it into the Redshift analytics tables
- sql_queries.py: Contains all SQL statements used by the files above
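The cluster-management scripts presumably wrap boto3 calls along the lines of the sketch below; the cluster identifier, node type, region, and config keys are illustrative assumptions, not values taken from the repository.

```python
"""Illustrative boto3 sketch of create_cluster.py, check_running_cluster.py,
and drop_cluster.py; identifiers and sizes are assumptions."""
import configparser

import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                            # region assumed
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

# create_cluster.py (roughly): provision the warehouse.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    ClusterIdentifier="dwhCluster",
    MasterUsername="dwhadmin",
    MasterUserPassword=config.get("DWH", "DWH_PASSWORD"),  # key name assumed
    IamRoles=[config.get("DWH", "DWH_ROLE_ARN")],
)

# check_running_cluster.py (roughly): list clusters and their status.
for cluster in redshift.describe_clusters()["Clusters"]:
    print(cluster["ClusterIdentifier"], cluster["ClusterStatus"])

# drop_cluster.py (roughly): tear the cluster down to avoid ongoing costs.
redshift.delete_cluster(
    ClusterIdentifier="dwhCluster",
    SkipFinalClusterSnapshot=True,
)
```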
" class="reference-link">Authors and Acknowledgements
This project was implemented as part of the Udacity Data Engineering Nanodegree program. The data as well as the project structure and file templates were provided by Udacity.