big_data

Big Data for beginners

Explore a variety of tutorials and interactive demonstrations focused on Big Data technologies like Hadoop, Spark, and more, primarily presented in the format of Jupyter notebooks. Most notebooks are self-contained, with instructions for installing all required services. They can be run on Google Colab or in a virtual Ubuntu machine/container.

Setting Up Hadoop: Single-Node Configuration

Hadoop_Setting_up_a_Single_Node_Cluster.ipynb Set up a single-node Hadoop cluster on Google Colab and run some basic HDFS and MapReduce examples
Hadoop_single_node_cluster_setup_Python.ipynb Set up a single-node Hadoop cluster on Google Colab using Python
Hadoop_minicluster.ipynb Deploy a test Hadoop cluster with a single command and no need for configuration.

Running Apache Spark in Standalone Mode

Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb Set up a single-node Spark server on Google Colab and estimate π with a Monte Carlo method
Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb Set up a single-node Spark server on Google Colab using the Bigtop distribution and utilities, estimate π with a Monte Carlo method, and run another Java ML example.
Run_Spark_on_Google_Colab.ipynb Set up a single-node standalone Spark server on Google Colab including Web UI and History Server - compact version
Spark_Standalone_Architecture_on_Google_Colab.ipynb Explore the Spark architecture through the immersive experience of deploying a standalone setup.
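
The Monte Carlo estimate of π mentioned above can be sketched in plain Python, without Spark. This is a minimal illustration and not the code from the notebooks: the function name `estimate_pi` and its parameters are assumptions made for this example. It draws random points in the unit square and counts how many fall inside the quarter circle of radius 1; that fraction approximates π/4.

```python
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling points uniformly in the unit square.

    Hypothetical helper for illustration; the notebooks run the same
    idea as a distributed Spark job.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point lies inside the quarter circle if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Area of quarter circle / area of square = pi / 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

In a Spark version of this sketch, the sampling loop would typically be split across partitions and the per-partition counts summed, which is why the problem is a common first example for parallel frameworks.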

MapReduce Tutorials

MapReduce_Primer_HelloWorld.ipynb A MapReduce Primer with “Hello, World!”
MapReduce_Primer_HelloWorld_bash.ipynb A MapReduce Primer with “Hello, World!” in Bash with just a few lines of code
mapreduce_with_bash.ipynb An introduction to MapReduce using MapReduce Streaming and Bash to create the mapper and reducer
simplest_mapreduce_bash_wordcount.ipynb A very basic MapReduce wordcount example
mrjob_wordcount.ipynb A simple MapReduce job with mrjob
Hadoop_spilling.ipynb Hadoop spilling explained
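
The word count job that several of these notebooks build can be sketched in plain Python to show the three phases of MapReduce: map, shuffle, and reduce. This is a single-process illustration under stated assumptions, not the notebooks' code; the names `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical, and in Hadoop the shuffle step is performed by the framework rather than by user code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hello world", "Hello Hadoop"]))
```

Keeping the mapper and reducer as separate functions mirrors how streaming jobs are usually organized, where each is a standalone program reading from standard input.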

PySpark Tutorials

PySpark_On_Google_Colab.ipynb Explore the inner workings of PySpark on Google Colab
PySpark_miscellanea.ipynb Tips, tricks, and insights related to PySpark.
demoSparkSQLPython.ipynb PySpark basic demo
ngrams_with_pyspark.ipynb Basic example of n-grams extraction with PySpark
generate_data_with_Faker.ipynb Data Generation and Aggregation with Python’s Faker Library and PySpark
Encoding+dataframe+columns.ipynb DataFrame Column Encoding with PySpark and Parquet Format
Apache_Sedona_with_PySpark.ipynb Apache Sedona™ is a high-performance cluster computing system for processing large-scale spatial data, extending the capabilities of Apache Spark for advanced geospatial analytics. Run a basic example with PySpark on Google Colab
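
To make the n-gram extraction mentioned above concrete, here is a minimal plain-Python sketch of the idea. It is not taken from ngrams_with_pyspark.ipynb, and the function name `ngrams` is an assumption for this example: it slides a window of length `n` over a list of tokens and returns each window as a tuple.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows from the token sequence.

    Hypothetical helper for illustration only.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # When the sequence is shorter than n, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams("to be or not to be".split(), 2))
```

In a PySpark setting the same windowing would usually be applied per row or per document, with the resulting n-grams then counted across the dataset.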

Miscellaneous Tutorials

GutenbergBooks.ipynb Explore and download books from the Gutenberg books collection.
TestDFSio.ipynb Demo of TestDFSIO for benchmarking Hadoop clusters
Unicode.ipynb Exploring Unicode categories
polynomial_regression.ipynb Worked-out example of polynomial regression with NumPy and Matplotlib
downloadSpark.ipynb How to download and verify the Spark distribution
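
Verifying a downloaded distribution usually means comparing a checksum of the local file against a published value. The sketch below shows that comparison with Python's standard `hashlib` module. It is an illustration only and does not reproduce downloadSpark.ipynb: the function names are hypothetical, and it assumes the expected digest is available as a plain SHA-512 hexadecimal string.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA-512 digest equals the expected hex string.

    Assumes expected_hex is a bare hex digest; surrounding whitespace
    and letter case are ignored.
    """
    return sha512_of_file(path) == expected_hex.strip().lower()
```

A checksum only confirms that the file was not corrupted or altered relative to the published digest; checking a cryptographic signature is a separate step.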

Virtualization and Cloud Automation

docker_for_beginners.md Docker for beginners: an introduction to the world of containers
Terraform for beginners.md Getting started with Terraform
Terraform in 5 minutes A short introduction to Terraform, the powerful and popular tool for infrastructure provisioning and management

Big Data Learning Pathways

online_resources.md Online resources for learning Big Data

About this repository

Notebooks Testing and CI

Most executable Jupyter notebooks are tested on an Ubuntu virtual machine through an automated GitHub workflow. The log file for successful executions is named action_log.txt (see also: Google Colab vs. GitHub Ubuntu Runner).


The GitHub workflow is a starting point for what is known as Continuous Integration (CI) in DevOps/Platform Engineering circles.