项目作者: pratikwatwani

项目描述 :
An ETL model designed using Postgres SQL for Sparkify database 🗄, modeling user activity data to create a database and ETL pipeline🔀 for a music streaming app 🎼.
高级语言: Jupyter Notebook
项目地址: git://github.com/pratikwatwani/ETL-pipeline-for-Sparkify.git
创建时间: 2020-05-27T22:53:27Z
项目社区:https://github.com/pratikwatwani/ETL-pipeline-for-Sparkify

开源协议:

下载


ETL Pipeline for Sparkify

Description

An ETL model designed using Postgres SQL for Sparkify database, modeling user activity data to create a database and ETL pipeline for a music streaming app.

Database

The database sparkify consists of 5 tables consisting fact and dimension tables.

There is one fact table :

  1. and four dimension tables:
  1. users
  2. songs
  3. artists
  4. time
    ```

Directory Structure

  1. ├── README.md
  2. ├── create_tables.py
  3. ├── etl.py
  4. ├── sql_queries.py
  5. ├── test.ipynb
  6. ├── data
  7. └── log_data
  8. └── 2018
  9. └── 11
  10. └── 2018-11-01-events.json
  11. └── 2018-11-012-events.json
  12. .
  13. .
  14. └── 22018-11-30-events.json
  15. └── song_data
  16. └── A
  17. └── A
  18. └── A
  19. └── B
  20. └── C
  21. └── B
  22. └── A
  23. └── B
  24. └── C

Running the Project

Step 1: In a shell, run create_tables.py with python create_tables.py which will call queries from sql_queries.py

Step 2: Run etl.py to execute the pipeline

Step 3: Use test.ipynb notebook to view tables and other SQL queries (EDA shown below)

Exploratory Data Analysis

1. Streaming by Location (Top 10)

  1. location count
  2. --------------------------------------------------
  3. San Francisco-Oakland-Hayward, CA | 691
  4. Portland-South Portland, ME | 665
  5. Lansing-East Lansing, MI | 557
  6. Chicago-Naperville-Elgin, IL-IN-WI | 475
  7. Atlanta-Sandy Springs-Roswell, GA | 456
  8. Waterloo-Cedar Falls, IA | 397
  9. Lake Havasu City-Kingman, AZ | 321
  10. Tampa-St. Petersburg-Clearwater, FL | 307
  11. San Jose-Sunnyvale-Santa Clara, CA | 292
  12. Sacramento--Roseville--Arden-Arcade, CA | 270

2. Streaming by Platform

  1. platforms count
  2. ----------------------
  3. Macintosh | 3000
  4. X11 | 1153
  5. Windows NT 5.1 | 955
  6. Windows NT 6.1 | 884
  7. Windows NT 6.3 | 576
  8. iPhone | 239
  9. compatible | 11
  10. Windows NT 6.2 | 2

3. Total album length for each artist (Top 10)

  1. artist total_album_minutes
  2. ----------------------------------------
  3. Clp | 605
  4. Faiz Ali Faiz | 599
  5. Montserrat Caballé | 511
  6. Blue Rodeo | 491
  7. John Wesley | 485
  8. Casual | 478
  9. Trafik | 424
  10. Jinx | 407
  11. Steve Morse | 364
  12. Terry Callier | 343

4. Streaming by week

  1. week count
  2. ----------------
  3. 1 | 6813

5. Service type weight

  1. level count
  2. -------------
  3. free | 1229
  4. paid | 5591

6. Type(Level) of user by Gender

  1. gender level sum
  2. -------------------------
  3. F | free | 820
  4. F | paid | 120
  5. M | free | 595
  6. M | paid | 28

7. Top artists by each location (Top 10)

  1. location name counts
  2. --------------------------------------------------------------------------
  3. San Francisco-Oakland-Hayward, CA | Marc Shaiman | 691
  4. Portland-South Portland, ME | Marc Shaiman | 665
  5. Lansing-East Lansing, MI | Marc Shaiman | 557
  6. Chicago-Naperville-Elgin, IL-IN-WI | Marc Shaiman | 475
  7. Atlanta-Sandy Springs-Roswell, GA | Marc Shaiman | 456
  8. Waterloo-Cedar Falls, IA | Marc Shaiman | 397
  9. Lake Havasu City-Kingman, AZ | Marc Shaiman | 321
  10. Tampa-St. Petersburg-Clearwater, FL | Marc Shaiman | 307
  11. San Jose-Sunnyvale-Santa Clara, CA | Marc Shaiman | 292
  12. Sacramento--Roseville--Arden-Arcade, CA | Marc Shaiman | 270

8. Average album length over years

  1. year average_album_length
  2. 0 | 245.57
  3. 1961 | 156.39
  4. 1964 | 164.81
  5. 1969 | 148.04
  6. 1972 | 342.57
  7. 1982 | 233.40
  8. 1984 | 269.82
  9. 1985 | 124.86
  10. 1986 | 307.38
  11. 1987 | 491.13
  12. 1992 | 133.33
  13. 1993 | 270.60
  14. 1994 | 270.67
  15. 1997 | 240.69
  16. 1999 | 236.25
  17. 2000 | 203.61
  18. 2003 | 136.44
  19. 2004 | 249.02
  20. 2005 | 262.09
  21. 2007 | 209.61
  22. 2008 | 149.86