Project author: software-competence-center-hagenberg

Project description:
GitHub repository for a versatile usable Big Data infrastructure (AVUBDI)
Primary language: Shell
Repository URL: git://github.com/software-competence-center-hagenberg/AVUBDI.git
Created: 2021-02-23T10:43:13Z

License: Apache License 2.0



GitHub repository for a Versatile Usable Big Data Infrastructure (AVUBDI) in Docker.

Development Environment

  • Dell XPS 7590
  • Intel Core i7-9750H (6 Cores)
  • 64 GB DDR4-2666 SODIMM Memory
  • 2TB NVMe PCIe M.2 SSD

Docker Host Environment

  • VMware Workstation 15 Player
  • CentOS 8 with Docker Engine and Docker Compose installed
  • 50 GB Memory
  • 4 Cores

Big Data Components

The big data components are split into three stacks to make the setup easier to understand.

Master Stack / Head Stack / Coordination Stack

This group consists of technologies responsible for data ingestion, distribution, validation, management and coordination.

Component | Description | Docker Image
Kafka | Distributed and scalable streaming platform that supports real-time and batch processing with high throughput. | confluentinc/cp-kafka:5.5.0
Kafka Connect | Framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems. | confluentinc/cp-kafka-connect:5.5.0
Kafka REST Proxy | Provides a RESTful interface to a Kafka cluster. Example use cases include reporting data to Kafka from any frontend app built in any language, ingesting messages into a stream processing framework that doesn’t yet support Kafka, and scripting administrative actions. | confluentinc/cp-kafka-rest:5.5.0
Schema Registry | Provides a serving layer for metadata, with a RESTful interface for storing and retrieving Avro®, JSON Schema, and Protobuf schemas. Combined with Kafka, it lets us keep the whole infrastructure in a schema-consistent state. | confluentinc/cp-schema-registry:5.5.0
ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Distributed applications use all of these kinds of services in some form. | confluentinc/cp-zookeeper:5.5.0
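
The RESTful interfaces mentioned in the table can be exercised with curl once the containers are running. The following is a minimal sketch, not part of the repository: the variable and function names are made up for illustration, and the host and ports are assumptions (8081 and 8082 are the upstream defaults for Schema Registry and the REST Proxy; the values used by this project are set in its .env file).

```shell
#!/usr/bin/env bash
# Sketch: talk to Schema Registry and the Kafka REST Proxy over HTTP.
# Assumption: host and ports are placeholders, adjust them to your .env.
AVUBDI_HOST="${AVUBDI_HOST:-localhost}"
SCHEMA_REGISTRY_PORT="${SCHEMA_REGISTRY_PORT:-8081}"
REST_PROXY_PORT="${REST_PROXY_PORT:-8082}"

# Hypothetical helpers that build the base URLs.
schema_registry_url() { echo "http://${AVUBDI_HOST}:${SCHEMA_REGISTRY_PORT}$1"; }
rest_proxy_url() { echo "http://${AVUBDI_HOST}:${REST_PROXY_PORT}$1"; }

# List the subjects registered in Schema Registry (GET /subjects).
list_subjects() { curl -fsS "$(schema_registry_url /subjects)"; }

# List the Kafka topics visible through the REST Proxy (GET /topics).
list_topics() { curl -fsS "$(rest_proxy_url /topics)"; }

# Produce one JSON record to a topic through the REST Proxy (v2 API).
# usage: produce_json TOPIC JSON_VALUE
produce_json() {
  curl -fsS -X POST \
    -H "Content-Type: application/vnd.kafka.json.v2+json" \
    --data "{\"records\":[{\"value\":$2}]}" \
    "$(rest_proxy_url "/topics/$1")"
}
```

For example, `produce_json sensor-data '{"temperature": 21.5}'` would send one record to a topic named sensor-data (a hypothetical topic name), and `list_topics` should then include it.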

Slave Stack / Worker Stack / Analytical Stack

This group consists of technologies responsible for complex data analytics and visualization on stream and batch data.

Component | Description | Docker Image
Spark-Master | Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Any Spark job can be deployed to it. | bde2020/spark-master
Spark-Worker (x2) | Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. | bde2020/spark-worker
Zeppelin | | apache/zeppelin:0.9.0
InfluxDB | InfluxDB is the leading open source time series database for monitoring metrics and events and providing real-time visibility into stacks, sensors, and systems. | influxdb:1.8.0
Chronograf | Chronograf is a visualization tool for time series data in InfluxDB. | chronograf:1.8.4
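
To get a first time series into InfluxDB 1.8, and therefore something to look at in Chronograf, the InfluxDB 1.x HTTP API can be used directly. This is a minimal sketch under assumptions: the URL (port 8086 is the InfluxDB default), the helper names, the database name and the measurement are all illustrative and do not come from the repository.

```shell
#!/usr/bin/env bash
# Sketch: write and query one point using the InfluxDB 1.x HTTP API.
# Assumption: INFLUX_URL is a placeholder, adjust it to your .env.
INFLUX_URL="${INFLUX_URL:-http://localhost:8086}"

# Build one line-protocol record with a single tag and a single field.
# usage: influx_line MEASUREMENT TAG_KEY TAG_VALUE FIELD_KEY FIELD_VALUE
influx_line() {
  printf '%s,%s=%s %s=%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Create a database. usage: create_db NAME
create_db() {
  curl -fsS -X POST "${INFLUX_URL}/query" --data-urlencode "q=CREATE DATABASE $1"
}

# Write one line-protocol record. usage: write_point DB LINE
write_point() {
  curl -fsS -X POST "${INFLUX_URL}/write?db=$1" --data-binary "$2"
}

# Run an InfluxQL query. usage: query_db DB QUERY
query_db() {
  curl -fsS -G "${INFLUX_URL}/query" \
    --data-urlencode "db=$1" --data-urlencode "q=$2"
}

# Show what a record looks like without contacting the server.
influx_line temperature sensor s1 value 21.5
```

With the stack running, `create_db demo`, then `write_point demo "$(influx_line temperature sensor s1 value 21.5)"`, then `query_db demo 'SELECT * FROM temperature'` would round-trip one point.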

Monitoring Stack / Management Stack

Component | Description | Docker Image
Kafka Connect UI | Kafka Connect UI is a web tool for setting up and managing Kafka Connect connectors across multiple Connect clusters. | landoop/kafka-connect-ui
Kafka Cluster UI | Kafdrop is a UI for monitoring Apache Kafka clusters. It displays information such as brokers, topics, and partitions, and lets you view messages. | obsidiandynamics/kafdrop
Schema Registry UI | The Schema Registry UI is a fully featured tool for the underlying Schema Registry that allows visualization and exploration of registered schemas. | landoop/schema-registry-ui
Docker Container Management UI | Portainer is a lightweight management UI for the Docker host or a Swarm cluster. | portainer/portainer
Grafana | Grafana is an open source analytics and monitoring solution that supports many databases (in our case InfluxDB). | grafana/grafana:7.0.6


What is Docker Engine

Docker Engine is an open source containerization technology for building and containerizing your applications. Docker Engine acts as a client-server application with:

  • A server with a long-running daemon process, dockerd.
  • APIs which specify interfaces that programs can use to talk to and instruct the Docker daemon.
  • A command line interface (CLI) client, docker.

What is Docker Compose

Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services.
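
As an illustration of what such a YAML file looks like, the sketch below writes a one-service Compose file to a temporary directory and, if docker-compose is installed, asks it to validate the file. The fragment is hypothetical and is not the repository's docker-compose.yml; only the image name is taken from the component table, and ZOOKEEPER_CLIENT_PORT is the client port setting the cp-zookeeper image expects.

```shell
#!/usr/bin/env bash
# Sketch: a minimal, hypothetical Compose file with a single service.
compose_dir="$(mktemp -d)"
compose_file="${compose_dir}/docker-compose.yml"

cat > "$compose_file" <<'EOF'
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:5.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
EOF

# "docker-compose config" parses the file and prints the resolved
# configuration without starting any container.
if command -v docker-compose >/dev/null 2>&1; then
  docker-compose -f "$compose_file" config
else
  echo "docker-compose not found; wrote $compose_file only"
fi
```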

Installation of Docker Engine


Install the yum-utils package (which provides the yum-config-manager utility) and set up the stable repository.

    sudo yum install -y yum-utils
    sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

Install the latest version of Docker Engine and containerd.

    sudo yum install docker-ce docker-ce-cli containerd.io

Start Docker

    sudo systemctl start docker
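
Whether the daemon actually came up can be checked with systemctl. The following is a small sketch; describe_service is a hypothetical helper written for this example and is not part of Docker or systemd.

```shell
#!/usr/bin/env bash
# Sketch: report whether the Docker daemon is running.

# usage: describe_service NAME STATE
describe_service() {
  case "$2" in
    active) echo "$1 is running" ;;
    *) echo "$1 is not running (state: $2)" ;;
  esac
}

if command -v systemctl >/dev/null 2>&1; then
  # is-active prints the unit state and exits non-zero unless the unit
  # is active, hence the "|| true".
  state="$(systemctl is-active docker 2>/dev/null || true)"
else
  state=""
fi
describe_service docker "${state:-unknown}"
```

To have Docker start automatically after a reboot, `sudo systemctl enable docker` can be run in addition.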

Install Docker Compose

    sudo curl -L "https://github.com/docker/compose/releases/download/1.26.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

Make the Docker Compose binary executable

    sudo chmod +x /usr/local/bin/docker-compose

Verify that Docker Engine and Docker Compose are installed correctly by starting the stack defined in the Cogniplant docker-compose.yml file.

    sudo docker-compose up -d --build

The output should look like the following:

    [mmayr@localhost Cogniplant]$ docker-compose up -d
    Creating spark-master ... done
    Creating zookeeper-1 ... done
    Creating influxdb ... done
    Creating portainer ... done
    Creating cogniplant_chronograf_1 ... done
    Creating cogniplant_grafana_1 ... done
    Creating kafka-1 ... done
    Creating spark-worker-2 ... done
    Creating spark-worker-1 ... done
    Creating kafka-schema-registry ... done
    Creating kafdrop ... done
    Creating schema-registry-ui ... done
    Creating kafka-rest-proxy ... done
    Creating kafka-connect ... done
    Creating kafka-connect-ui ... done
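
A container that was created may still exit shortly afterwards, so it can help to compare the names from the output above with what docker ps reports. This is a sketch only: missing_containers is a hypothetical helper, the two names prefixed with cogniplant_ depend on the project directory name, and docker ps may need sudo on your host.

```shell
#!/usr/bin/env bash
# Sketch: list expected containers that are not currently running.

# Container names taken from the "docker-compose up" output above.
expected="spark-master zookeeper-1 influxdb portainer cogniplant_chronograf_1 cogniplant_grafana_1 kafka-1 spark-worker-2 spark-worker-1 kafka-schema-registry kafdrop schema-registry-ui kafka-rest-proxy kafka-connect kafka-connect-ui"

# Print every name from the first list that is absent from the second.
# Both arguments are space-separated lists of names.
# usage: missing_containers "EXPECTED NAMES" "RUNNING NAMES"
missing_containers() {
  local name
  for name in $1; do
    case " $2 " in
      *" $name "*) ;;          # exact name found among the running ones
      *) echo "$name" ;;
    esac
  done
}

if command -v docker >/dev/null 2>&1; then
  running="$(docker ps --format '{{.Names}}' 2>/dev/null | tr '\n' ' ')"
  missing="$(missing_containers "$expected" "$running")"
  if [ -z "$missing" ]; then
    echo "all expected containers are running"
  else
    echo "not running:"
    echo "$missing"
  fi
else
  echo "docker not found on PATH"
fi
```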

Dashboard UIs


Use the IP address of the virtualization host to connect to the different UIs. The IP address and the ports can be configured in the .env file.
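
The sketch below shows one way to derive a UI address from such a file. The key names HOST_IP and KAFDROP_PORT and their values are hypothetical and used only for this demonstration; check the repository's .env file for the real keys. env_value is likewise a helper invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: read KEY=VALUE pairs from an .env-style file and build a URL.

# Print the value of the last KEY=VALUE line for KEY in FILE.
# usage: env_value FILE KEY
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Demo file with hypothetical keys and values.
demo_env="$(mktemp)"
cat > "$demo_env" <<'EOF'
# hypothetical keys, not taken from the repository
HOST_IP=192.168.0.10
KAFDROP_PORT=9000
EOF

echo "Kafdrop: http://$(env_value "$demo_env" HOST_IP):$(env_value "$demo_env" KAFDROP_PORT)"
```

Pointing env_value at the real .env file instead of the demo file gives the addresses to open in the browser.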


Dashboard Portainer

Kafka Monitoring UI (Kafdrop)

Spark Stream & Batch Master UI

Spark Stream Master UI

Spark Batch Master UI

Kafka Connect UI

Schema Registry UI



