Project author: ssavvides

Project description:
TPC-H queries in Apache Spark SQL using native DataFrames API
Primary language: C
Project address: git://github.com/ssavvides/tpch-spark.git
Created: 2015-07-08T13:25:11Z
Project community: https://github.com/ssavvides/tpch-spark

License: MIT License

tpch-spark

TPC-H queries implemented in Spark using the DataFrames API.

TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. The benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
https://www.tpc.org/tpch


Getting started

Prerequisites: Apache Spark

tpch-spark requires that Spark is installed on your machine. You can download Spark from https://spark.apache.org/downloads.html. At a high level, to install Spark you have to:

  # Step 1: untar the downloaded archive
  tar xvfz <the tgz file you downloaded>
  # Step 2: optionally move/rename the untarred directory wherever you want, say, $HOME/spark
  # Step 3: add the Spark binaries to your PATH
  export PATH=$PATH:$HOME/spark/bin
  # or better yet, add the export line above to your .bashrc (or equivalent) and source it.

A. Get the code

  git clone https://github.com/ssavvides/tpch-spark
  cd tpch-spark

B. Generate input data tables

Navigate to the data generator directory dbgen and build the data generator:

  cd dbgen
  make

This should generate an executable called dbgen. Use the -h flag to see the various options the tool offers.

  ./dbgen -h

The simplest case is running the dbgen executable with no flags.

  ./dbgen

The above generates tables with the extension .tbl at scale factor 1 (the default), for a total of roughly 1GB across all tables.

  $ ls -hl *.tbl
  -rw-rw-r-- 1 savvas savvas 24M May 28 12:39 customer.tbl
  -rw-rw-r-- 1 savvas savvas 725M May 28 12:39 lineitem.tbl
  -rw-rw-r-- 1 savvas savvas 2.2K May 28 12:39 nation.tbl
  -rw-rw-r-- 1 savvas savvas 164M May 28 12:39 orders.tbl
  -rw-rw-r-- 1 savvas savvas 114M May 28 12:39 partsupp.tbl
  -rw-rw-r-- 1 savvas savvas 24M May 28 12:39 part.tbl
  -rw-rw-r-- 1 savvas savvas 389 May 28 12:39 region.tbl
  -rw-rw-r-- 1 savvas savvas 1.4M May 28 12:39 supplier.tbl

For different size tables you can use the -s (scale) option. For example,

  ./dbgen -s 10

will generate roughly 10GB of input data.

Note that by default, dbgen uses | as the column separator and appends a trailing | at the end of each row.

  $ cat region.tbl
  0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
  1|AMERICA|hs use ironic, even requests. s|
  2|ASIA|ges. thinly even pinto beans ca|
  3|EUROPE|ly final courts cajole furiously final excuse|
  4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|
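As an aside on that trailing separator: on the JVM, `String.split` drops trailing empty strings, so a plain-Scala sketch (not part of tpch-spark, no Spark required) of parsing one such row looks like this:

```scala
object TblRow {
  // Split a dbgen row on the pipe separator. The Char overload avoids
  // regex interpretation of "|", and trailing empty strings are dropped,
  // so the trailing "|" does not produce an extra empty field.
  def parse(line: String): Array[String] = line.split('|')

  def main(args: Array[String]): Unit = {
    val row = parse("3|EUROPE|ly final courts cajole furiously final excuse|")
    println(row.length)          // 3 fields, despite the trailing "|"
    println(row.mkString(" / "))
  }
}
```

When these files are instead loaded through a DataFrame reader with `|` as the delimiter, the trailing separator can surface as an extra empty column, which is worth keeping in mind if you preprocess the .tbl files yourself.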

You can find the schemas of the generated tables in the TPC-H specification.

C. Build tpch-spark

tpch-spark is written in Scala as a self-contained Spark application.

Use the provided sbt build file to package tpch-spark as a Spark application.

  cd tpch-spark
  sbt package

The above command packages the application into a jar file, e.g., ./target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar, which you will need in the next step.

D. Run tpch-spark

You can run all TPC-H queries from Q01 to Q22 by running:

  spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar

If you want to run a specific query you can use

  spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar <query number>

where <query number> is the number of the query to run, i.e., 1, 2, …, 22.
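The selection behavior described above (all 22 queries when no argument is given, a single query otherwise) can be sketched in plain Scala. This is an illustration only; the actual argument handling inside TpchQuery may differ:

```scala
object QuerySelection {
  // Resolve which TPC-H queries to run from an optional command-line argument.
  // No argument: run all of Q01..Q22; otherwise run just the requested query.
  def queriesToRun(args: Array[String]): Seq[Int] =
    if (args.isEmpty) 1 to 22
    else {
      val n = args(0).toInt
      require(n >= 1 && n <= 22, s"query number must be in 1..22, got $n")
      Seq(n)
    }

  def main(args: Array[String]): Unit =
    println(queriesToRun(args).mkString(", "))
}
```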

N.B.:

  • By default, tpch-spark looks for the input data files (the *.tbl files generated by dbgen) in "<current working directory>/dbgen". You can point to another location by setting the environment variable TPCH_INPUT_DATA_DIR.
  • By default, the query results are stored under "${TPCH_INPUT_DATA_DIR}/output/{Q01, Q02, ...}", or in whatever location TPCH_QUERY_OUTPUT_DIR is set to.
  • The execution times for each query run are stored in a file at "<current working directory>/tpch_execution_times.txt", or in whatever file path TPCH_EXECUTION_TIMES points to.

For example, to override the default locations you can use:

  export TPCH_INPUT_DATA_DIR="$HOME/tpch-data"
  export TPCH_QUERY_OUTPUT_DIR="$HOME/tpch-results"
  export TPCH_EXECUTION_TIMES="$HOME/tpch-times.txt"
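Putting the three defaults together, the resolution logic described above can be sketched in plain Scala. The environment-variable names are taken from this README, but the real resolution code in tpch-spark may differ in detail:

```scala
object TpchPaths {
  // Resolve input, output, and timing-file locations using the documented
  // defaults: input under <cwd>/dbgen, output under <input>/output, and
  // execution times in <cwd>/tpch_execution_times.txt.
  def resolve(env: Map[String, String], cwd: String): (String, String, String) = {
    val input  = env.getOrElse("TPCH_INPUT_DATA_DIR", s"$cwd/dbgen")
    val output = env.getOrElse("TPCH_QUERY_OUTPUT_DIR", s"$input/output")
    val times  = env.getOrElse("TPCH_EXECUTION_TIMES", s"$cwd/tpch_execution_times.txt")
    (input, output, times)
  }

  def main(args: Array[String]): Unit =
    println(resolve(sys.env, System.getProperty("user.dir")))
}
```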

Other Implementations

  1. Data generator (http://www.tpc.org/tpch/)
  2. TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
  3. TPC-H for PIG (https://github.com/ssavvides/tpch-pig)