ETL Workflow Automation with Apache Airflow
Automated data pipeline workflows, built with Apache Airflow, that load and process data from Amazon S3 into an Amazon Redshift cloud data warehouse for analytics.
The analytics team at the music streaming startup Sparkify wants to automate and better monitor its data warehouse ETL pipelines using Apache Airflow, an open-source workflow management platform.
The goal of this project is to author a data pipeline workflow built from custom Airflow operators that perform tasks such as staging the data, populating the data warehouse, and running quality checks. The end result is a pipeline definition as illustrated below.
The pipeline transforms the data into a set of fact and dimension tables in Redshift.
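A minimal sketch of how such a pipeline definition might be wired up is shown below. It uses recent Airflow 2.x syntax with EmptyOperator placeholders so it runs standalone; the task names, schedule, and dimension names are illustrative assumptions, and the real DAG in sparkify_workflow_dag.py uses the custom operators described below instead of placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sparkify_workflow",
    start_date=datetime(2024, 1, 1),   # placeholder start date
    schedule="@hourly",                # assumed schedule; the project may differ
    catchup=False,
) as dag:
    begin = EmptyOperator(task_id="begin_execution")
    stage_events = EmptyOperator(task_id="stage_events")   # hypothetical staging tasks
    stage_songs = EmptyOperator(task_id="stage_songs")
    load_fact = EmptyOperator(task_id="load_fact_table")
    load_dimensions = [
        EmptyOperator(task_id=f"load_{name}_dimension")
        for name in ("user", "song", "artist", "time")     # hypothetical dimension names
    ]
    quality_checks = EmptyOperator(task_id="run_quality_checks")
    end = EmptyOperator(task_id="stop_execution")

    # Stage the source datasets in parallel, load the fact table, then the
    # dimension tables, run the quality checks, and finish.
    begin >> [stage_events, stage_songs] >> load_fact >> load_dimensions >> quality_checks >> end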
The project includes reusable operators that implement the functional pieces of the data pipeline (a sketch of one such operator follows this list):
Stage operator - Loads JSON-formatted files from S3 into Redshift by running a SQL COPY statement built from parameter values for the S3 bucket and the target Redshift table.
Fact operator - Loads the output of the staging step into the fact table. The SQL statement and target table are passed as parameters.
Dimension operator - Loads data from the fact table into the dimension tables. The SQL statement and target table are passed as parameters.
Data quality operator - Runs quality checks on the transformed data.
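As a concrete illustration of this operator pattern, below is a rough sketch of how the stage operator could be implemented with Airflow's PostgresHook and AwsBaseHook. The class name (StageToRedshiftOperator), constructor parameters, and hook usage are assumptions for illustration and may not match the actual code in this repository.

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class StageToRedshiftOperator(BaseOperator):
    """Copies JSON-formatted files from S3 into a Redshift staging table."""

    copy_sql = """
        COPY {table}
        FROM 's3://{bucket}/{prefix}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON 'auto'
    """

    def __init__(self, redshift_conn_id, aws_credentials_id,
                 table, s3_bucket, s3_prefix, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials_id = aws_credentials_id
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_prefix = s3_prefix

    def execute(self, context):
        # Resolve the AWS credentials and Redshift connection configured in Airflow.
        credentials = AwsBaseHook(self.aws_credentials_id, client_type="s3").get_credentials()
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)

        # Clear the staging table, then bulk-load the JSON files with COPY.
        redshift.run(f"TRUNCATE {self.table}")
        redshift.run(self.copy_sql.format(
            table=self.table,
            bucket=self.s3_bucket,
            prefix=self.s3_prefix,
            access_key=credentials.access_key,
            secret_key=credentials.secret_key,
        ))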
In addition to files for implementing the above operator functionality, the project also includes:
sparkify_workflow_dag.py - DAG (Directed Acyclic Graph) definition script
redshift.ipynb - Jupyter Notebook for creating the Redshift cluster
create_tables.sql - SQL for creating the Redshift tables
sql_queries.py - script with SQL for the operators to import data
dwh.cfg - configuration values for AWS services
Execute the steps in the Jupyter Notebook to create the Redshift cluster.
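The notebook's exact contents are not reproduced here, but provisioning a cluster programmatically typically comes down to a boto3 create_cluster call along these lines; every value below (node type, cluster size, identifiers, credentials, IAM role) is a placeholder, not the notebook's actual parameters.

import boto3

# Placeholder credentials: in practice these are the access key and secret key
# from the [AWS] section of dwh.cfg (see the configuration steps below).
redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                      # placeholder region
    aws_access_key_id="YOURACCESSKEYGOESHERE",
    aws_secret_access_key="PUTyourSECRETaccessKEYhereTHISisREQUIRED",
)

redshift.create_cluster(
    ClusterType="multi-node",
    NumberOfNodes=4,                              # placeholder cluster size
    NodeType="dc2.large",                         # placeholder node type
    DBName="dwh",
    ClusterIdentifier="sparkify-cluster",
    MasterUsername="dwhuser",
    MasterUserPassword="YourPassword1",
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # placeholder role ARN
)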
An AWS IAM user with the following policies (or equivalent permissions) is required:
AmazonEC2FullAccess
The access key and secret key need to be added to the [AWS] section in the dwh.cfg file.
[AWS]
KEY=YOURACCESSKEYGOESHERE
SECRET=PUTyourSECRETaccessKEYhereTHISisREQUIRED
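For reference, a script can read those values back out of dwh.cfg with the standard configparser module, roughly as follows; the actual project code may load its configuration differently.

import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

aws_key = config["AWS"]["KEY"]        # access key from the [AWS] section
aws_secret = config["AWS"]["SECRET"]  # secret key from the [AWS] section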
Run the queries in create_tables.sql to create the Redshift tables.
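One way to run those statements against the cluster is sketched below with psycopg2; the connection values are placeholders (in practice they come from dwh.cfg or the cluster created by the notebook), and the SQL can just as well be run from the Redshift query editor or another client.

import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="your-cluster.abc123xyz.us-west-2.redshift.amazonaws.com",
    dbname="dwh",
    user="dwhuser",
    password="YourPassword1",
    port=5439,
)
conn.autocommit = True

# Naive split on ';' is enough for simple DDL files without embedded semicolons.
with open("create_tables.sql") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

with conn.cursor() as cur:
    for statement in statements:
        cur.execute(statement)

conn.close()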
Start Airflow and toggle “ON” the DAG named sparkify_workflow