Project author: WinThitiwat

Project description:
ETL process to S3 Data Lake through EMR, Spark, Hadoop, Schema-on-Read
Primary language: Jupyter Notebook
Project repository: git://github.com/WinThitiwat/Data_Lake_with_Spark.git
Created: 2020-12-03T22:11:06Z
Project community: https://github.com/WinThitiwat/Data_Lake_with_Spark


Data Lakes with Spark

Overview of the project

This project simulates a scenario in which a music startup, Sparkify, has grown its user base and song database even further and wants to move its data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to. So, this project aims to:

  • Build an ETL pipeline for a data lake hosted on S3
  • Load JSON data from S3
  • Process the data into analytics tables using Spark (Amazon EMR Spark)
  • Load data back into S3 as parquet files

System Architecture

(System architecture diagram)

Datasets

Song Data

The song dataset is a subset of real data from the Million Song Dataset. The data is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Data:

  1. {"num_songs":1,"artist_id":"ARD7TVE1187B99BFB1","artist_latitude":null,"artist_longitude":null,"artist_location":"California - LA","artist_name":"Casual","song_id":"SOMZWCG12A8C13C480","title":"I Didn't Mean To","duration":218.93179,"year":0}

Log Data

The log dataset is in JSON format, generated by this event simulator based on the songs in the dataset above.

Sample Log Data:

  1. {"artist":"Des'ree","auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":1,"lastName":"Summers","length":246.30812,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"PUT","page":"NextSong","registration":1540344794796.0,"sessionId":139,"song":"You Gotta Be","status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}

ETL Pipeline Process

  1. Load data from S3. Note that the S3 bucket needs to be added to the config file. Check out Setup Configurations File.
  • Song data: s3a://udacity-dend/song-data/
  • Log data: s3a://udacity-dend/log-data/
  2. Process the data using Spark and transform it into 5 different tables as follows:

    Fact table:

  • songplays - records in log data associated with song plays, i.e. records with page NextSong

    songplay_id, start_time, userId, level, song_id, artist_id, session_id, location, userAgent, month, year

    Dimension tables:

  • users - users in the app

    userId, firstName, lastName, gender, level
  • songs - songs in the music database

    song_id, title, artist_id, year, duration
  • artists - artists in the music database

    artist_id, name, location, latitude, longitude
  • time - timestamps of records in songplays broken down into specific units

    ts, start_time, hour, day, week, month, year, weekday
  3. Load the processed data back into the S3 data lake by writing it out as parquet files. The following tables are saved as partitioned parquet files, partitioned by year and month (a minimal PySpark sketch of this read-transform-write flow follows the list):
    songs, time, songplays
    The S3 bucket can be designated in Setup Configurations File.
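
A minimal PySpark sketch of steps 2 and 3. The column choices mirror the songplays and songs definitions above, but the join condition, output path, and songplay_id generation are illustrative assumptions rather than the actual logic in etl.py:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("sparkify-etl-sketch").getOrCreate()
    song_df = spark.read.json("s3a://udacity-dend/song-data/*/*/*/*.json")
    log_df = spark.read.json("s3a://udacity-dend/log-data/*/*/*.json")

    # Dimension table: songs (columns follow the table definition above)
    songs_table = song_df.select("song_id", "title", "artist_id", "year", "duration") \
                         .dropDuplicates(["song_id"])

    # Fact table: songplays - log events with page == 'NextSong', joined back to the
    # song metadata to recover song_id and artist_id
    events = (log_df.filter(F.col("page") == "NextSong")
                    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
                    .withColumn("month", F.month("start_time"))
                    .withColumn("year", F.year("start_time")))

    song_lookup = song_df.select("song_id", "artist_id", "title", "duration")
    songplays_table = (events.join(song_lookup,
                                   (events.song == song_lookup.title) &
                                   (events.length == song_lookup.duration),
                                   "left")
                             .select(F.monotonically_increasing_id().alias("songplay_id"),
                                     "start_time", "userId", "level", "song_id", "artist_id",
                                     F.col("sessionId").alias("session_id"),
                                     "location", "userAgent", "month", "year"))

    # Write back to the S3 data lake as parquet partitioned by year and month
    output_data = "s3://<your_s3_bucket>/data_outputs/"  # matches OUTPUT_DATA in dl.config
    songplays_table.write.mode("overwrite").partitionBy("year", "month") \
                   .parquet(output_data + "songplays/")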

Setup Configurations File

At the project root, create a config file named dl.config with the following configuration.

    [AWS] # AWS IAM Credential
    AWS_ACCESS_KEY_ID=<iam_access_key_id>
    AWS_SECRET_ACCESS_KEY=<iam_secret_access_key>
    [S3]
    INPUT_DATA=s3a://udacity-dend/
    OUTPUT_DATA=s3://<your_s3_bucket>/data_outputs/
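
etl.py can then read this file with Python's built-in configparser. A minimal sketch, assuming the section and key names above; the variable names and the idea of exporting the credentials as environment variables for Spark's S3A connector are illustrative, not taken from the actual etl.py:

    import os
    import configparser

    config = configparser.ConfigParser()
    config.read("dl.config")

    # Expose the IAM credentials so Spark's S3A connector can pick them up
    os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
    os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

    input_data = config["S3"]["INPUT_DATA"]    # s3a://udacity-dend/
    output_data = config["S3"]["OUTPUT_DATA"]  # s3://<your_s3_bucket>/data_outputs/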

How to run

This project runs on a Spark cluster set up on Amazon EMR with the following setup (a sample aws emr create-cluster command assembled from these options is shown after the list):

    --use-default-roles
    --release-label emr-5.32.0
    --instance-count 2
    --applications Name=Spark
    --bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh"
    --ec2-attributes KeyName=spark-cluster
    --instance-type m5.xlarge
    --instance-count 3
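
Assembled into a single AWS CLI call, this would look roughly as follows. Note that the option list above gives --instance-count twice (2 and 3); the sketch keeps only the later value, and the cluster name is illustrative:

    aws emr create-cluster \
        --name "sparkify-data-lake" \
        --use-default-roles \
        --release-label emr-5.32.0 \
        --applications Name=Spark \
        --bootstrap-actions Path="s3://<your_S3_bucket>/path/to/bootstrap_emr.sh" \
        --ec2-attributes KeyName=spark-cluster \
        --instance-type m5.xlarge \
        --instance-count 3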

Note:

  1. Make sure that the file to be run is on your Spark cluster. If your development environment is your local laptop, copy the file to the EMR master node over SSH (scp):
    scp -i <key_pair_pem_file.pem> <local_file_path, e.g. etl.py> hadoop@<EMR_MasterNode_endpoint>:~/<path_location>
  2. Make sure that your EMR default role has an S3 access policy attached (one way to do this with the AWS CLI is shown below).
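
For example, with the AWS CLI (a sketch only; depending on your setup, the role that actually needs S3 access may be the EC2 instance profile role EMR_EC2_DefaultRole rather than the EMR_DefaultRole service role, and a narrower policy than AmazonS3FullAccess may be preferable):

    aws iam attach-role-policy \
        --role-name EMR_EC2_DefaultRole \
        --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess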

Once the file and your EMR cluster are ready, kick off the Spark job:

    spark-submit --master yarn --deploy-mode client --driver-memory 4g --num-executors 2 --executor-memory 2g --executor-cores 2 etl.py

Project Files

  • data - a small local subset of the source data for data profiling and testing
  • data_profiling.ipynb - a Python notebook for data profiling to examine and summarize the data source before the actual pipeline implementation
  • etl.py - a Python file implementing the actual ETL data pipeline process for all datasets
  • requirements.txt - a text file containing all mandatory dependencies for the project

Project Author

  • Author: Thitiwat Watanajaturaporn
  • Note: this project is part of Udacity’s Data Engineering Nanodegree Program.