Project author: Manny-Brar

Project description:
Capstone Project for Udacity's Data Engineering Nanodegree ~ ETL pipeline with a very basic ML implementation to predict future cryptocurrency prices
Primary language: Jupyter Notebook
Project URL: git://github.com/Manny-Brar/DataEngineeringNanodegree-Capstone-Project.git


Introduction

In this Capstone project, I wanted to perform an ETL process on a dataset of my choosing, and I chose to work with historical cryptocurrency data for Litecoin (LTC) and Ethereum (ETH). I wanted to create an ETL pipeline that would also include a basic machine learning implementation, using linear regression and decision tree models to predict future closing prices.
The pipeline extracts CSV and JSON data from an S3 bucket, performs data cleaning tasks, applies the ML models, and saves the results into a new table. The tables are then written to another S3 bucket in parquet format for analytical use.
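
The sketch below shows the overall shape of this pipeline in PySpark. It is a minimal illustration, not the project's actual code: the bucket names, file paths, and feature columns are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor

spark = SparkSession.builder.appName("crypto-etl").getOrCreate()

# Extract: raw minute-level LTC data from the source bucket (path is hypothetical)
ltc = spark.read.csv("s3a://source-bucket/ltcusd.csv", header=True, inferSchema=True)

# Clean: drop rows with missing prices and de-duplicate timestamps
ltc = ltc.dropna(subset=["open", "close", "high", "low"]).dropDuplicates(["time"])

# Assemble feature vectors for the two simple regressors
assembler = VectorAssembler(inputCols=["open", "high", "low", "volume"], outputCol="features")
train = assembler.transform(ltc)

lr_model = LinearRegression(featuresCol="features", labelCol="close").fit(train)
dt_model = DecisionTreeRegressor(featuresCol="features", labelCol="close").fit(train)

# Collect both models' predictions into one table
preds = lr_model.transform(train).withColumnRenamed("prediction", "lr_prediction")
preds = dt_model.transform(preds).withColumnRenamed("prediction", "dt_prediction")

# Load: write the results to the analytics bucket as parquet
preds.drop("features").write.mode("overwrite").parquet("s3a://analytics-bucket/ltc_predictions/")
```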

Data Source


https://www.kaggle.com/tencars/392-crypto-currency-pairs-at-minute-resolution


Tools


AWS S3, PySpark, AWS EMR, Jupyter Notebook

Data

LTC Dataset


LTC data shape: (1663435, 6)

columns = 'time', 'open', 'close', 'high', 'low', 'volume'

example values = '1368976980000', '3.1491', '3.1491', '3.1491', '3.1491', '10.000000'

Definitions:
'time' = timestamp of the trades made, represented as epoch milliseconds
'open' = cryptocurrency price at the open of the period
'close' = cryptocurrency price at the close of the period
'high' = highest price of the currency for the period
'low' = lowest price of the currency for the period
'volume' = size or volume of trades per timestamp
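
Since 'time' is stored as epoch milliseconds, it generally needs converting before any date-based analysis. A minimal sketch, assuming df is the loaded LTC DataFrame:

```python
from pyspark.sql import functions as F

# Divide by 1000 because the epoch values are in milliseconds, not seconds
df = df.withColumn("datetime", F.from_unixtime(F.col("time") / 1000).cast("timestamp"))
# e.g. 1368976980000 -> 2013-05-19 15:23:00 (UTC; from_unixtime uses the session time zone)
```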

ETH Dataset


columns = 'time', 'open', 'close', 'high', 'low', 'volume'

example values = {'time': 1595809620000, 'open': 312.89333, 'close': …}

Definitions:
'time' = timestamp of the trades made, represented as epoch milliseconds
'open' = cryptocurrency price at the open of the period
'close' = cryptocurrency price at the close of the period
'high' = highest price of the currency for the period
'low' = lowest price of the currency for the period
'volume' = size or volume of trades per timestamp
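
Since the ETH records arrive as JSON, it helps to read them with an explicit schema instead of relying on inference. A hedged sketch (the S3 path is a placeholder, and spark is the session from the earlier sketch):

```python
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

# An explicit schema avoids a full scan for schema inference on large JSON input
eth_schema = StructType([
    StructField("time", LongType(), True),
    StructField("open", DoubleType(), True),
    StructField("close", DoubleType(), True),
    StructField("high", DoubleType(), True),
    StructField("low", DoubleType(), True),
    StructField("volume", DoubleType(), True),
])

eth = spark.read.schema(eth_schema).json("s3a://source-bucket/ethusd.json")
```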

Implementation

  1. Create an AWS account and obtain credentials for the dl.cfg file (see the config sketch after this list)
  2. Open a terminal and execute LTC_ETL.py for Litecoin (LTC) or ETH_ETL.py for Ethereum (ETH)
  3. Alternatively, you can run through the Jupyter notebook
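
A minimal sketch of how dl.cfg credentials are typically wired into the environment; the section and key names here are assumptions, not necessarily the project's actual layout:

```python
import os
import configparser

# Read AWS credentials from dl.cfg and expose them to Spark/boto3
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```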

Future Scenarios and Updates

Data Updates


For this project I used a batch approach; in the future, this data would ideally be set up for streaming and real-time, live predictions of cryptocurrency prices.
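
A hedged sketch of what that streaming version could look like with Spark Structured Streaming. Reading from a Kafka topic is just one common choice; the broker address, topic name, and paths are purely illustrative:

```python
# Read a live trade feed (hypothetical Kafka topic) as a streaming DataFrame
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "ltc-trades")
    .load()
)

# Continuously land the raw messages in S3 for downstream prediction jobs
query = (
    stream.selectExpr("CAST(value AS STRING) AS raw")
    .writeStream
    .format("parquet")
    .option("path", "s3a://analytics-bucket/ltc_stream/")
    .option("checkpointLocation", "s3a://analytics-bucket/checkpoints/ltc/")
    .start()
)
```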

Data Size


If the data size increased by 100x, we would ideally make some changes to the architecture and would have different tools and options available. For scaling up, AWS EMR is a fantastic option and allows for a lot of customization of your Spark clusters. However, if we were setting up live data streaming, looking at Redshift would make sense, as you can have a cluster running 24/7, whereas EMR is ideal if you don't need to run your cluster 24/7.
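
At that scale, one concrete change worth sketching is partitioning the parquet output so downstream queries only scan the slices they need. An illustrative example, reusing preds from the earlier sketch (the date-based layout is an assumption):

```python
from pyspark.sql import functions as F

(preds
    .withColumn("date", F.to_date(F.from_unixtime(F.col("time") / 1000)))
    .repartition("date")       # group rows by date before writing
    .write.mode("overwrite")
    .partitionBy("date")       # one directory per date enables pruned scans
    .parquet("s3a://analytics-bucket/ltc_predictions/"))
```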

Daily Run


For scheduling a daily run of the ETL pipeline, using Airflow would be a must. We would need to set up DAGs and adjust the Python scripts to run the pipeline daily (a minimal DAG sketch follows below). I would not suggest running through a notebook environment for the actual implementation.
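
A minimal daily-schedule sketch with Airflow; the DAG id, operator choice, and spark-submit command are assumptions, not the project's actual setup:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="crypto_etl_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # run the pipeline once per day
    catchup=False,
) as dag:
    run_ltc_etl = BashOperator(
        task_id="run_ltc_etl",
        bash_command="spark-submit LTC_ETL.py",  # hypothetical invocation
    )
```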

Access


If the database needed to be accessed by 100+ people, it would not pose a major issue. The AWS access protocols would need to be followed for the larger user scale: S3 access, IAM users, and cluster access would need to be assessed to make sure each user has the appropriate level of access.
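
To make that concrete, a hypothetical sketch of scoping read-only S3 access for analyst users with boto3; the policy name and bucket ARNs are placeholders:

```python
import json
import boto3

iam = boto3.client("iam")

# Read-only access to the analytics bucket, suitable for attaching to an analyst group
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::analytics-bucket",
            "arn:aws:s3:::analytics-bucket/*",
        ],
    }],
}

iam.create_policy(
    PolicyName="AnalyticsReadOnly",
    PolicyDocument=json.dumps(read_only_policy),
)
```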