Author: fixcer

Description:
bigdata
Language: Jupyter Notebook
Repository: git://github.com/fixcer/bigdata.git
Created: 2020-11-11T14:24:32Z
Project page: https://github.com/fixcer/bigdata


BigData

Learn Big Data through its Python (PySpark) API by running the Jupyter notebooks with examples on how to read, process and write data.
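As a rough illustration of the read, process and write cycle mentioned above, here is a minimal PySpark sketch. It is not taken from the repository's notebooks: the file names, the column names (`name`, `amount`), the helper names and the `local[*]` master are all assumptions chosen for the example.

```python
import os
import tempfile


def write_sample_csv(path):
    """Write a tiny CSV file (hypothetical data) so the example has input."""
    rows = ["name,amount", "alice,10", "bob,5", "alice,7"]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(rows) + "\n")
    return len(rows) - 1  # number of data rows, excluding the header


def run(master="local[*]"):
    """Read a CSV, aggregate it, write Parquet, and return the totals."""
    # Imported here so write_sample_csv works without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master(master)  # assumption: local mode, not the cluster master
        .appName("read-process-write")
        .getOrCreate()
    )
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "sales.csv")
    write_sample_csv(src)

    # Read: load the CSV with a header row and inferred column types.
    df = spark.read.csv(src, header=True, inferSchema=True)

    # Process: sum the amounts per name.
    totals = df.groupBy("name").agg(F.sum("amount").alias("total"))

    # Write: store the result as Parquet, replacing any previous output.
    out = os.path.join(workdir, "totals.parquet")
    totals.write.mode("overwrite").parquet(out)

    # Read the Parquet output back to confirm the round trip.
    result = {r["name"]: r["total"] for r in spark.read.parquet(out).collect()}
    spark.stop()
    return result
```

Calling `run()` in a JupyterLab cell should return the per-name totals. To target the cluster instead of local mode, pass the Spark master URL of your setup as `master`; that URL is not stated in this README.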


Quick Start

Cluster overview

Application URL
Hadoop localhost:9870
MapReduce localhost:8089
HUE localhost:8088
Mongo Cluster localhost:27017
Kafka Manager localhost:9000
JupyterLab localhost:8888
Spark Master localhost:8080
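Once the cluster is up, a quick way to see which of the services above are reachable is to try a TCP connection to each port. The sketch below uses only the Python standard library; the port numbers are copied from the table, while the function names and the one-second timeout are assumptions made for this example.

```python
import socket

# Ports copied from the table above; the host is assumed to be localhost.
SERVICES = {
    "Hadoop": 9870,
    "MapReduce": 8089,
    "HUE": 8088,
    "Mongo Cluster": 27017,
    "Kafka Manager": 9000,
    "JupyterLab": 8888,
    "Spark Master": 8080,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolved hosts.
        return False


def check_services(host="localhost", services=SERVICES, timeout=1.0):
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_port_open(host, port, timeout)
        for name, port in services.items()
    }
```

For example, `check_services()` returns a dictionary such as `{"Hadoop": True, ...}`. An open port only shows that something is listening there, not that the service is healthy.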

Prerequisites

Build from Docker Hub

  1. Download the source code or clone the repository
  2. Build the cluster
       docker-compose up -d
       ./config.sh
  3. Remove the cluster by typing
       docker-compose down

Tech

Hadoop

Apache Spark Standalone Cluster

Mongo Sharded Cluster

WARNING (Windows & OS X)

The default Docker setup on Windows and OS X uses a VirtualBox VM to host the Docker daemon.
Unfortunately, the mechanism VirtualBox uses to share folders between the host system and
the Docker container is not compatible with the memory-mapped files used by MongoDB
(see the vbox bug, docs.mongodb.org, and the related jira.mongodb.org bug).
This means that it is not possible to run a MongoDB container with the data directory mapped to the host.

– Docker Hub (source here or here)

Mongo Components
  • Config Server (3-member replica set): configsvr01, configsvr02, configsvr03
  • 3 Shards (each a 3-member PSS replica set):
    • shard01-a, shard01-b, shard01-c
    • shard02-a, shard02-b, shard02-c
    • shard03-a, shard03-b, shard03-c
  • 2 Routers (mongos): router01, router02
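The component list above can be written down as plain Python data, which makes it easy to compare against what a router reports. This is a sketch under assumptions: it presumes that `localhost:27017` from the cluster overview is served by a mongos router and that `pymongo` is installed; the names `router_uri` and `list_shards` are hypothetical helpers, not part of the repository.

```python
# Topology as listed above; member names are taken from the list.
CONFIG_SERVERS = ["configsvr01", "configsvr02", "configsvr03"]
SHARDS = {
    f"shard0{i}": [f"shard0{i}-{member}" for member in "abc"]
    for i in (1, 2, 3)
}
ROUTERS = ["router01", "router02"]


def router_uri(host="localhost", port=27017):
    """Build a connection string for a mongos router (assumed address)."""
    return f"mongodb://{host}:{port}"


def list_shards(uri=None):
    """Ask a mongos router which shards it knows about."""
    # Imported here so the topology data is usable without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri or router_uri(), serverSelectionTimeoutMS=3000)
    try:
        # listShards is an admin command answered by mongos.
        reply = client.admin.command("listShards")
        return {shard["_id"]: shard["host"] for shard in reply["shards"]}
    finally:
        client.close()
```

With the cluster running, `list_shards()` should return one entry per shard, which can then be checked against `SHARDS`. The exact shard IDs and host strings depend on how `config.sh` registers them.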

References