Project author: aisuresh

Project description: An automation pipeline to ingest data and run calculations on a big data platform using Hadoop, Hive, and Spark.
Primary language: Java
Project URL: git://github.com/aisuresh/Health-Record-ETL-Big-Data.git
Created: 2019-07-10T15:27:49Z
Project community: https://github.com/aisuresh/Health-Record-ETL-Big-Data


Health Record ETL Big Data application

An automation pipeline to ingest data and run calculations on a big data platform using Hadoop HDFS, Hive, and Spark.

Prerequisites

  1. Download the Hortonworks Sandbox (VirtualBox / Docker) from https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
  2. Ensure VirtualBox (v6.0) or Docker is installed
  3. Note that a computer with at least 10 GB of RAM dedicated to VirtualBox / Docker is required

Environment setup

  1. Import Hortonworks Sandbox

Open VirtualBox and navigate to File -> Import Appliance. Select the sandbox image you downloaded and click Open.

Make sure to allocate at least 10 GB (10240 MB) of RAM for the sandbox.

Click Import and wait for VirtualBox to import the sandbox.

  2. Start Hortonworks Sandbox
    Once the sandbox has finished importing, start it by selecting it and clicking “Start” from the VirtualBox menu.

  3. A console window opens and displays the boot process. This takes a few minutes. Once the boot completes, you may begin using the sandbox.

  4. Map the sandbox IP to your desired hostname in the hosts file by executing the command below in a Linux shell:

    echo '127.0.0.1 sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts

SSH into the HDP sandbox

ssh root@sandbox-hdp.hortonworks.com -p 2201

The default credentials are root / hadoop.

Remember to change the password after your first login!

Run the ‘ambari-admin-password-reset’ command to reset the default Ambari admin password (default: admin / admin).

Access Shell Web Client

Alternatively, you can access the sandbox from the shell web client (remember to map the hostname sandbox-hdp.hortonworks.com in your hosts file):
http://sandbox-hdp.hortonworks.com:4200/ or http://localhost:4200

Log in with the default credentials: root / hadoop

You must change the password when you log in for the first time. The new password is then used for accessing the sandbox through both SSH and the web client.

Access HDP sandbox UI

Access the HDP sandbox Ambari UI at http://sandbox-hdp.hortonworks.com:8080 or http://localhost:8080

User credentials:
raj_ops / raj_ops
maria_dev / maria_dev
admin / admin

Data Ingestion

SSH to the sandbox:
ssh root@sandbox-hdp.hortonworks.com -p 2201

HDFS

  1. Download the source data file:

    wget https://chronicdata.cdc.gov/views/735e-byxc/rows.csv

  2. Create an HDFS directory:

    hdfs dfs -mkdir /assignment

  3. Copy the file from the local file system to HDFS:

    hdfs dfs -put rows.csv /assignment

  4. Verify the file was copied to HDFS (a programmatic alternative is sketched after this list):

    hdfs dfs -ls /assignment
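If you prefer to verify the upload programmatically, HDFS also exposes the WebHDFS REST interface. The sketch below is an illustration that makes assumptions not covered by this README: WebHDFS is enabled on the sandbox's default NameNode HTTP port (50070 on HDP 2.x) and the Python requests package is available.

    # Sketch: list /assignment over WebHDFS to confirm rows.csv landed.
    # Assumptions: WebHDFS enabled; NameNode HTTP port 50070 (HDP 2.x default).
    import requests

    url = "http://sandbox-hdp.hortonworks.com:50070/webhdfs/v1/assignment"
    resp = requests.get(url, params={"op": "LISTSTATUS"}, timeout=30)
    resp.raise_for_status()

    # Print each file's name and size, as `hdfs dfs -ls` would.
    for f in resp.json()["FileStatuses"]["FileStatus"]:
        print(f["pathSuffix"], f["length"], "bytes")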

Hive

  1. Create the Hive table (it is created in the default schema):

    CREATE TABLE IF NOT EXISTS health_record(
      YearStart INT, YearEnd INT, LocationAbbr STRING, LocationDesc STRING,
      Datasource STRING, Class STRING, Topic STRING, Question STRING,
      Data_Value_Unit STRING, Data_Value_Type STRING, Data_Value DOUBLE,
      Data_Value_Alt DOUBLE, Data_Value_Footnote_Symbol STRING,
      Data_Value_Footnote STRING, Low_Confidence_Limit DOUBLE,
      High_Confidence_Limit DOUBLE, Sample_Size INT, Total STRING,
      `Age(months)` STRING, Gender STRING, `Race/Ethnicity` STRING,
      GeoLocation STRING, ClassID STRING, TopicID STRING, QuestionID STRING,
      DataValueTypeID STRING, LocationID INT, StratificationCategory1 STRING,
      Stratification1 STRING, StratificationCategoryId1 STRING,
      StratificationID1 STRING)
    COMMENT 'This is health record table'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    TBLPROPERTIES("skip.header.line.count"="1");

  2. Describe the table:

    describe health_record;

  3. Load the data from HDFS into the Hive table:

    LOAD DATA INPATH '/assignment/rows.csv' OVERWRITE INTO TABLE health_record;

  4. Check the row count to verify the data loaded correctly:

    select count(*) from health_record;

  5. Query the table data:

    SELECT * FROM health_record limit 5;

You can view the executed Hive queries in the Hive Query UI.
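The table can also be queried programmatically from Python. The sketch below is illustrative and makes assumptions not covered by this README: the pyhive package is installed (pip install 'pyhive[hive]') and HiveServer2 is reachable on the sandbox's default port 10000; the username is illustrative.

    # Sketch: run the same row-count sanity check over HiveServer2.
    # Assumptions: pyhive installed; HiveServer2 on default port 10000;
    # the username below is illustrative.
    from pyhive import hive

    conn = hive.Connection(
        host="sandbox-hdp.hortonworks.com",  # hostname mapped in /etc/hosts
        port=10000,
        username="raj_ops",
    )
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM health_record")
    print("rows loaded:", cursor.fetchone()[0])

    cursor.close()
    conn.close()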

Spark

Running the Spark application

  1. Copy the PySpark script health_avg_calculation.py to the HDP sandbox:

    scp -P 2222 ./health_avg_calculation.py root@sandbox-hdp.hortonworks.com:/root
  2. Execute spark-submit to run the program (options must precede the script):

    spark-submit \
      --master yarn \
      --deploy-mode client \
      --num-executors 1 \
      --driver-memory 1g \
      --executor-memory 1g \
      --executor-cores 1 \
      health_avg_calculation.py

Navigate to the Spark UI page to view the Spark job history.

The calculated average results are persisted in Hive.
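The script itself is not included in this README. As an illustration only (not the author's actual code), the sketch below shows what health_avg_calculation.py might look like, assuming the averages of Data_Value are grouped by year and question (the fields the REST API returns) and written to hypothetical Hive tables named health_avg and health_female_avg:

    # Illustrative sketch only -- not the actual health_avg_calculation.py.
    # Assumptions: Data_Value averaged per YearStart and Question; results
    # stored in hypothetical Hive tables `health_avg` / `health_female_avg`.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("HealthAvgCalculation")
             .enableHiveSupport()   # needed to read from / write to Hive
             .getOrCreate())

    # Read the table ingested during the Hive step.
    df = spark.table("health_record")

    # Average across all age groups, per year and question.
    avg_all = (df.groupBy("YearStart", "Question")
                 .agg(F.avg("Data_Value").alias("average")))

    # Average restricted to female records (assumes Gender holds "Female").
    avg_female = (df.filter(F.col("Gender") == "Female")
                    .groupBy("YearStart", "Question")
                    .agg(F.avg("Data_Value").alias("average")))

    # Persist results back to Hive for the REST API to serve.
    avg_all.write.mode("overwrite").saveAsTable("health_avg")
    avg_female.write.mode("overwrite").saveAsTable("health_female_avg")

    spark.stop()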

REST API

API to access results stored in Hive

A RESTful API service was developed using Spring Boot to access the results from Hive.

Go to the rest-api directory and run mvn install to build the project.

Ensure that the port (7755) configured in src/main/application.properties is available.

Once the build is successful, it generates restapi-1.0-SNAPSHOT.jar in the target directory.

Copy it wherever you want and execute java -jar ‘path of the restapi jar’. The command below brings up the REST API server, which listens on port 7755 for API requests.

  1. java -jar target/restapi-1.0-SNAPSHOT.jar

REST API Request through curl

Open a new terminal window and call the API using curl.

To get the list of databases:

    curl http://localhost:7755/api/hive/database

To get the list of tables:

    curl http://localhost:7755/api/hive/table

To get the average across all age groups:

    curl http://localhost:7755/api/hive/average

To get the average for females only:

    curl http://localhost:7755/api/hive/femaleAverage

Alternatively, access the APIs directly from a browser to get the API response:

    http://localhost:7755/api/hive/femaleAverage
    http://localhost:7755/api/hive/average

Each REST API response contains the year, question, and average.
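The same endpoints can be exercised from a short Python script. The sketch below assumes the service is running locally on port 7755 and that the responses are JSON (the response format is not stated in this README):

    # Illustrative client sketch; JSON responses are an assumption.
    import requests

    BASE = "http://localhost:7755/api/hive"

    for endpoint in ("database", "table", "average", "femaleAverage"):
        resp = requests.get(f"{BASE}/{endpoint}", timeout=30)
        resp.raise_for_status()             # fail loudly on HTTP errors
        print(endpoint, "->", resp.json())  # rows: year, question, average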