Project author: RuthLD

Project description:
Using the ETL process to clean and merge data.
Primary language: Jupyter Notebook
Project URL: git://github.com/RuthLD/Movie_ETL.git
Created: 2021-03-29T20:21:38Z
Project community: https://github.com/RuthLD/Movie_ETL

License:


Movie_ETL

Using the ETL process to clean and merge data.

Goal

📽️ Extract the movie data from the Wikipedia and Kaggle source files, transform the datasets by cleaning and merging them, then load the cleaned dataset into a SQL database.
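
The merge and load steps of this goal can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses plain dictionaries and Python's built-in sqlite3 module as a stand-in for the project's SQL database, and the join key `imdb_id`, the two-column table layout, and the function names `merge_records` and `load_movies` are all assumptions made for the example.

```python
import sqlite3


def merge_records(wiki_movies, kaggle_movies, key="imdb_id"):
    """Inner-join two lists of dicts on a shared key (hypothetical: imdb_id)."""
    kaggle_by_key = {m[key]: m for m in kaggle_movies if key in m}
    merged = []
    for wiki in wiki_movies:
        match = kaggle_by_key.get(wiki.get(key))
        if match is not None:
            # In this sketch, Kaggle values win when both sources share a field.
            merged.append({**wiki, **match})
    return merged


def load_movies(conn, movies):
    """Write the merged records into a 'movies' table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies (imdb_id TEXT PRIMARY KEY, title TEXT)"
    )
    rows = [{"imdb_id": m["imdb_id"], "title": m["title"]} for m in movies]
    conn.executemany(
        "INSERT OR REPLACE INTO movies (imdb_id, title) VALUES (:imdb_id, :title)",
        rows,
    )
    conn.commit()


# Made-up sample data, used only to exercise the functions.
wiki = [{"imdb_id": "tt1", "title": "Example Film"}, {"imdb_id": "tt2", "title": "Other"}]
kaggle = [{"imdb_id": "tt1", "title": "Example Film", "budget": 5}]

conn = sqlite3.connect(":memory:")
load_movies(conn, merge_records(wiki, kaggle))
```

An inner join is used here so that only films present in both sources reach the database; whether the original notebook joins the same way is not stated in this README.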

ETL Process

Two examples of how the movie information from Wikipedia was cleaned are the identification of alternate titles for the films and the standardization of the column names.

  • alt_titles.png
  • column_names.png
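
The two cleaning steps above might look something like the sketch below. The screenshots are not reproduced here, so the specific key names in `ALT_TITLE_KEYS` and `COLUMN_RENAMES` are hypothetical placeholders, and `clean_movie` is an illustrative function name rather than the notebook's own.

```python
# Hypothetical key names; the actual Wikipedia data may use different ones.
ALT_TITLE_KEYS = {"Also known as", "Original title", "Literally", "Romanized"}
COLUMN_RENAMES = {
    "Directed by": "Director",
    "Length": "Running time",
}


def clean_movie(movie):
    """Return a cleaned copy of one movie record (a dict)."""
    cleaned = {}
    alt_titles = {}
    for key, value in movie.items():
        if key in ALT_TITLE_KEYS:
            # Collect every alternate-title field into one place.
            alt_titles[key] = value
        else:
            # Rename the column if a standard name is defined, else keep it.
            cleaned[COLUMN_RENAMES.get(key, key)] = value
    if alt_titles:
        cleaned["alt_titles"] = alt_titles
    return cleaned


raw = {
    "title": "Example Film",
    "Also known as": "Sample Movie",
    "Directed by": "A. Director",
    "Length": "90 minutes",
}
print(clean_movie(raw))
```

Gathering alternate titles under a single key keeps the number of columns small, and a rename map makes the column standardization easy to audit in one place.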

Another way the information was condensed was to filter out TV programs using an if statement.

  • no_tv.png
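
A filter of this kind could be written as shown below. This is only a sketch: it assumes that TV programs can be recognized by the presence of a "No. of episodes" field, which is a guess for illustration and may not match the condition used in the notebook. The names `is_film` and `drop_tv_programs` are likewise hypothetical.

```python
def is_film(movie):
    """Treat a record as a film unless it carries an episode count.

    Assumption: TV programs have a "No. of episodes" key and films do not.
    """
    return "No. of episodes" not in movie


def drop_tv_programs(movies):
    """Keep only the records that pass the is_film check."""
    kept = []
    for movie in movies:
        if is_film(movie):
            kept.append(movie)
    return kept


# Made-up sample records, used only to exercise the filter.
records = [
    {"title": "Example Film"},
    {"title": "Example Series", "No. of episodes": "12"},
]
films = drop_tv_programs(records)
```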

Tables in Database

The “movies” table contains 6,052 rows based on the Kaggle and Wikipedia data.

  • movie_query.png

The “ratings” table includes 26,024,289 rows of data.

  • ratings_query.png
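
Row counts like the ones above are typically checked with a `SELECT COUNT(*)` query. The sketch below shows the idea against a small in-memory sqlite3 database, used here as a stand-in for the project's SQL database; the sample rows are invented, the single-column table layouts are assumptions, and `count_rows` is a hypothetical helper.

```python
import sqlite3

ALLOWED_TABLES = {"movies", "ratings"}


def count_rows(conn, table):
    """Return the number of rows in one of the known tables."""
    # Table names cannot be bound as query parameters, so check them
    # against a fixed set before building the SQL string.
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Tiny demonstration database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.execute("CREATE TABLE ratings (rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?)", [("A",), ("B",)])
conn.executemany("INSERT INTO ratings VALUES (?)", [(4.0,), (3.5,), (5.0,)])

movie_count = count_rows(conn, "movies")
rating_count = count_rows(conn, "ratings")
```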