Project author: pientaa

Project description:
Deep dive into Spark UDFs' characteristics.
Language: Scala
Project address: git://github.com/pientaa/opening-black-box.git
Created: 2020-12-31T08:35:59Z
Project community: https://github.com/pientaa/opening-black-box

License: Apache License 2.0


Opening a black-box

Run cluster locally

  1. If this is your first time here, download the data with getdata.sh in the database directory.
  2. Go to the spark-config directory.
  3. Run run_cluster_locally.sh.
  4. Access the spark-master UI at http://localhost:8080/
  5. Submit the jar with submit.sh in the black-box directory (the whole flow is sketched below).
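Taken together, and assuming the database, spark-config, and black-box directories sit next to each other at the repository root, the local workflow is roughly:

  cd database && ./getdata.sh              # first run only: downloads the data
  cd ../spark-config && ./run_cluster_locally.sh
  # spark-master UI is now available at http://localhost:8080/
  cd ../black-box && ./submit.sh           # submit the jar to the local cluster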

Run cluster remotely

Configure ssh connection

It’s recommended to use aliases for connecting to the cluster; otherwise, some scripts won’t work. Modify ~/.ssh/config
following this pattern:

  Host <number_of_node>
    Port 22
    User magisterka
    HostName <node_ip_address>
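For example, an alias 1 pointing at the master node address used later in this README (both the alias and the address are illustrative):

  Host 1
    Port 22
    User magisterka
    HostName 192.168.55.20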

Create env file with password

  cd scripts
  touch password.env
  echo <your_password> > password.env
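Note that echo leaves the password in your shell history; a minimal alternative in plain bash (no project-specific assumptions) avoids that:

  # prompt without echoing the password, then write it to the env file
  read -s -p "Cluster password: " CLUSTER_PASSWORD && \
    printf '%s\n' "$CLUSTER_PASSWORD" > password.env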

Configure and run cluster

  scripts/prepare_nodes.sh <git_branch_to_checkout:-main>
  scripts/start_master.sh
  scripts/start_workers.sh
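For example, to prepare the nodes on a branch other than the default main (the branch name below is hypothetical):

  scripts/prepare_nodes.sh my-experiment-branch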

Generate TPC-DS data

Make sure the TPC-DS tool is available on the master node - if not, go to database/README.

Parametrize the script with the desired data size. The default is 1 GB; TPC-DS can generate anywhere from 1 GB to 10 TB of data.

  database/generate_tpc_ds.sh <data_size_in_GB>
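For example, to generate a 10 GB dataset (assuming the script takes a bare number of gigabytes, as the placeholder suggests):

  database/generate_tpc_ds.sh 10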

Submit jar to the cluster with script

  scripts/submit.sh <function_name>

Expected output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{
    "action" : "CreateSubmissionResponse",
    "message" : "Driver successfully submitted as driver-20210402161642-0000",
    "serverSparkVersion" : "3.0.2",
    "submissionId" : "driver-20210402161642-0000",
    "success" : true
  }100   779  100   223  100   556    888   2215 --:--:-- --:--:-- --:--:--  3103

Submit jar to the cluster via REST API

  curl --location --request POST '192.168.55.20:5000/submit' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "function_name": "averageTemperatureByDeviceIdSeason"
    }'

Expected response:

  {
    "action": "CreateSubmissionResponse",
    "message": "Driver successfully submitted as driver-20210407145229-0000",
    "serverSparkVersion": "3.0.2",
    "submissionId": "driver-20210407145229-0000",
    "success": true
  }
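The submissionId in this response is what the status endpoint below expects. Assuming jq is installed, it can be captured directly when submitting:

  # submit and keep only the driver id from the JSON response
  SUBMISSION_ID=$(curl -s --location --request POST '192.168.55.20:5000/submit' \
    --header 'Content-Type: application/json' \
    --data-raw '{"function_name": "averageTemperatureByDeviceIdSeason"}' \
    | jq -r '.submissionId')
  echo "$SUBMISSION_ID"   # e.g. driver-20210407145229-0000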

Get the driver status via REST API

  curl --location --request GET '192.168.55.20:5000/status' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "driver_id": "driver-20210407145229-0000"
    }'

Expected response:

  {
    "action": "SubmissionStatusResponse",
    "driverState": "FINISHED",
    "serverSparkVersion": "3.0.2",
    "submissionId": "driver-20210407145229-0000",
    "success": true,
    "workerHostPort": "10.5.0.6:40829",
    "workerId": "worker-20210407145657-10.5.0.6-40829"
  }

Stop cluster

  scripts/stop_all.sh

Potential problems

If you get an error like:

  Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded

inspect the Docker network (spark-network) on the master node and make sure that it took the addresses:

  • 10.5.0.2
  • 10.5.0.3
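A quick way to verify (standard Docker CLI; the network name spark-network is the one mentioned above):

  # lists the container IPs assigned on the spark-network
  docker network inspect spark-network | grep IPv4Address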

Run experiments

Make sure you have a hosts_info.csv file in the monitor-manager directory. The file should end with an empty line.

  host_ip        container_name
  192.168.55.20  spark-master
  192.168.55.11  spark-worker-1
  192.168.55.12  spark-worker-2
  192.168.55.13  spark-worker-3

Create an experiments plan CSV file

For example:

  function_name              dataset_size  iterations
  countDistinctTicketNumber  1GB           25

Check if monitor-manager is running

  curl --location --request GET 'http://192.168.55.20:8888/'

Start experiments

  curl --location --request POST 'http://192.168.55.20:8888/experiments'

Get experiments data

  scripts/get_experiments_data.sh

Data preprocessing

To calculate the mean CPU usage, RAM usage, and duration across all experiment iterations (per node), run notebook/prepare_data.ipynb.

The notebook creates individual plots for each iteration (per node) as well as plots of the mean values across all iterations. The calculated mean RAM and CPU values, together with a new column containing the mean experiment duration, are stored in a new file, experiment_mean_data.csv.
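If you prefer to run the notebook headlessly, and assuming Jupyter is installed, nbconvert can execute it from the repository root:

  # runs all cells and writes an executed copy of the notebook
  jupyter nbconvert --to notebook --execute notebook/prepare_data.ipynb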