Project author: c-stephenson

Project description:
Introductory Spark workshop - IPYNB notebook and data
Primary language: Jupyter Notebook
Repository URL: git://github.com/c-stephenson/workshops-spark_intro.git
Created: 2020-04-22T15:45:29Z
Project community: https://github.com/c-stephenson/workshops-spark_intro

License: Apache License 2.0


workshops-spark_intro

Introductory workshops for beginners in Apache Spark, using Python (PySpark) and SQL (Spark SQL). The repository includes IPYNB notebooks and data.

Note: file paths in the notebooks must be updated to match your environment.

I - Intro

Covers core concepts of using Spark for data analysis, including:

  • Loading data
  • Spark SQL & basic data transformations
  • Writing data
  • Caching data for performance

II - Tidy Data

Demonstrates the concept of “Tidy Data” with example code in Apache Spark, tidying five common types of untidy data:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.