Project author: mehd-io

Project description:
Pyspark boilerplate for running prod ready data pipeline
Language: Python
Repository: git://github.com/mehd-io/pyspark-boilerplate-mehdio.git
Created: 2019-04-15T14:44:14Z
Project community: https://github.com/mehd-io/pyspark-boilerplate-mehdio

License: MIT License



PySpark Boilerplate Mehdio :fire:

Introduction

Running a production-ready PySpark app can be difficult for several reasons: packaging, handling extra JARs, and setting up easy local testing.

This boilerplate solves those problems by providing:

  • A proper folder structure for ETL applications
  • Logging, configuration, and Spark session helpers (a session-helper sketch follows this list)
  • Test examples (with data!)
  • Helper functions for packaging your application, plus spark-submit command examples
  • A dev Docker image (to be used with VS Code or through the make docker commands) for smooth local development
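
As an illustration of the kind of helper this refers to, a Spark session helper can be a thin wrapper around SparkSession.builder. The function name and signature below are illustrative assumptions, not the boilerplate's actual helper:

# Hypothetical session helper -- the boilerplate's actual helper may differ.
from pyspark.sql import SparkSession


def get_spark_session(app_name: str = "datajob") -> SparkSession:
    """Create (or reuse) a SparkSession for jobs and local tests."""
    return SparkSession.builder.appName(app_name).getOrCreate()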

This project was initially forked from https://github.com/AlexIoannides/pyspark-example-project

Requirements

  • Docker
  • make
  • Any cloud service that can run PySpark (AWS Glue, AWS EMR, GCP Dataproc…)

Development

Build the dev image

make build-docker

Run the tests

make test-docker

Run the spark job demo_job

make run-docker

Folder Structure

├── LICENCE
├── Makefile
├── README.md
├── datajob
│   ├── cli.py     // Entry point of the spark job
│   ├── configs    // holds the static config of the ETL
│   ├── helpers
│   └── jobs       // main logic
├── docker         // for the dev container
├── jars           // extra jars (for reading from excel files)
├── poetry.lock
├── pyproject.toml
├── setup.cfg
├── setup.py
└── tests          // tests with fixture data

Configuration

Static configuration can be done through ./datajob/configs/etl_config.py.
The job name is provided at run time through --job-name (here demo_job as the value).
For instance:

spark-submit \
--jars jars/spark-excel_2.11-0.9.9.jar \
--py-files datajob.zip \
datajob/cli.py \
--job-name demo_job

will run the job datajob/jobs/demo_job.py with the associated config values from ./datajob/configs/etl_config.py.
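
The exact shape of etl_config.py is up to you; a minimal sketch, assuming a plain dictionary keyed by job name (the keys and values below are illustrative, not the boilerplate's actual schema):

# datajob/configs/etl_config.py -- hypothetical layout, keyed by job name.
CONFIG = {
    "demo_job": {
        "input_path": "data/input.xlsx",   # illustrative value
        "output_path": "data/output/",     # illustrative value
    },
}

With a shape like this, the entry point can look up the section matching --job-name and hand it to the job.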

How to package my application and run a spark-submit job

Packaging your source code and python dependencies

You need the following files:

  • Your Python dependencies and Spark source code, packaged as a .zip
  • Your external JARs, if your cluster is not connected to the internet (this boilerplate ships the spark-excel lib so that Spark can read Excel files)
  • The entry-point file: cli.py (located at datajob/cli.py)

To package the dependencies folder as a .zip, run:

make package

This will generate datajob.zip.

Writing your first data pipeline

The different jobs (data pipelines) go under datajob/jobs/my_job.py.
All jobs are treated as modules, so you can launch a specific job directly from the spark-submit command with the --job-name argument.

E.g., we have the demo_job module in datajob/jobs/demo_job.py:

spark-submit […] datajob/cli.py --job-name demo_job
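
The boilerplate does not spell out the job interface here, but a job module typically exposes a single entry function that receives the Spark session and its config. A minimal sketch, assuming a run(spark, config) entry point and the illustrative config keys from above (both are assumptions, not the boilerplate's exact contract):

# datajob/jobs/my_job.py -- hypothetical job module, names are illustrative.
from pyspark.sql import DataFrame, SparkSession


def run(spark: SparkSession, config: dict) -> None:
    """Read the input, apply the transformation, and write the result."""
    # Any Spark reader works here; demo_job reads Excel via the spark-excel data source.
    df = spark.read.csv(config["input_path"], header=True)
    transformed = transform(df)
    transformed.write.mode("overwrite").parquet(config["output_path"])


def transform(df: DataFrame) -> DataFrame:
    # Keep the pure transformation separate so it can be unit tested with fixture data.
    return df.dropDuplicates()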

Launching your spark job

spark-submit \
--jars jars/spark-excel_2.11-0.9.9.jar \
--py-files datajob.zip \
datajob/cli.py \
--job-name demo_job

--jars: your local JAR dependencies. If your cluster is connected to the internet, you can instead use --packages with the Maven coordinates to pull them directly from Maven. In this boilerplate, the JAR shows the use of a library to read Excel files in Spark; see demo_job.py.

--py-files: Python libraries and your Python source code (here, datajob.zip).

datajob/cli.py: the entry-point file.

--job-name: a custom job parameter giving the job name, which maps to a job module.
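
Internally, the entry point has to turn that job name into a module import. A minimal sketch of how such a dispatch could look (the argument parsing and the run(spark, config) contract shown are assumptions, not the boilerplate's exact cli.py):

# Hypothetical sketch of an entry point in the spirit of datajob/cli.py.
import argparse
import importlib

from pyspark.sql import SparkSession


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a datajob pipeline")
    parser.add_argument("--job-name", required=True, help="Module name under datajob/jobs")
    args = parser.parse_args()

    # e.g. --job-name demo_job resolves to the module datajob.jobs.demo_job
    job_module = importlib.import_module(f"datajob.jobs.{args.job_name}")

    spark = SparkSession.builder.appName(args.job_name).getOrCreate()
    job_module.run(spark, config={})  # assumes the run(spark, config) contract sketched above
    spark.stop()


if __name__ == "__main__":
    main()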

Extra resources

https://github.com/AlexIoannides/pyspark-example-project
https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f
https://stackoverflow.com/questions/47227406/pipenv-vs-setup-py
https://pipenv.readthedocs.io/en/latest/advanced/#pipfile-vs-setup-py
https://realpython.com/pipenv-guide/