Automation pipeline to ingest and calculate data in a Big Data platform using Hadoop HDFS, Hive, and Spark.
Download the Hortonworks Sandbox (VirtualBox / Docker) from
https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
Ensure VirtualBox (v6.0 or later) or Docker is installed.
Please note that a computer with at least 10 GB of RAM dedicated to VirtualBox / Docker is required.
Open VirtualBox and navigate to File -> Import Appliance. Select the sandbox image you downloaded and click Open.
Make sure to allocate at least 10 GB (10240 MB) of RAM for the sandbox.
Click Import and wait for VirtualBox to import the sandbox.
echo '127.0.0.1 sandbox-hdp.hortonworks.com' | sudo tee -a /etc/hosts
ssh root@sandbox-hdp.hortonworks.com -p 2201
The default credentials are root / hadoop.
Please remember to change the password after the first login!
Run the 'ambari-admin-password-reset' command to reset the default Ambari admin password (e.g. admin/admin).
Alternatively, you can access the sandbox from the shell web client (remember to add the hostname sandbox-hdp.hortonworks.com to /etc/hosts).
http://sandbox-hdp.hortonworks.com:4200/ or http://localhost:4200
Log in with the default credentials: root / hadoop.
You must change the password when you log in for the first time. The new password is then used for accessing the sandbox through SSH and the web client.
Access the HDP sandbox Ambari UI through http://sandbox-hdp.hortonworks.com:8080 or http://localhost:8080
User credentials:
raj_ops / raj_ops
maria_dev / maria_dev
admin / admin
SSH to sandbox
ssh root@sandbox-hdp.hortonworks.com -p 2201
Download the source data file:
wget https://chronicdata.cdc.gov/views/735e-byxc/rows.csv
Create an HDFS directory, upload the file, and verify:
hdfs dfs -mkdir /assignment
hdfs dfs -put rows.csv /assignment
hdfs dfs -ls /assignment
CREATE TABLE IF NOT EXISTS health_record (
  YearStart INT,
  YearEnd INT,
  LocationAbbr STRING,
  LocationDesc STRING,
  Datasource STRING,
  Class STRING,
  Topic STRING,
  Question STRING,
  Data_Value_Unit STRING,
  Data_Value_Type STRING,
  Data_Value DOUBLE,
  Data_Value_Alt DOUBLE,
  Data_Value_Footnote_Symbol STRING,
  Data_Value_Footnote STRING,
  Low_Confidence_Limit DOUBLE,
  High_Confidence_Limit DOUBLE,
  Sample_Size INT,
  Total STRING,
  `Age(months)` STRING,
  Gender STRING,
  `Race/Ethnicity` STRING,
  GeoLocation STRING,
  ClassID STRING,
  TopicID STRING,
  QuestionID STRING,
  DataValueTypeID STRING,
  LocationID INT,
  StratificationCategory1 STRING,
  Stratification1 STRING,
  StratificationCategoryId1 STRING,
  StratificationID1 STRING
)
COMMENT 'This is health record table'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
TBLPROPERTIES ("skip.header.line.count"="1");
Describe the table:
describe health_record;
Load the data from HDFS into the Hive table:
LOAD DATA INPATH '/assignment/rows.csv' OVERWRITE INTO TABLE health_record;
Check the row count to verify the data loaded correctly:
select count(*) from health_record;
SELECT * FROM health_record limit 5;
You can view the Hive queries in the Hive Query UI.
Copy the PySpark script health_avg_calculation.py
to the HDP sandbox:
scp -P 2222 ./health_avg_calculation.py root@sandbox-hdp.hortonworks.com:/root
Execute spark-submit to run the program (note that spark-submit options must come before the application script, otherwise they are passed to the script as arguments):
spark-submit \
  --master yarn \
  --deploy-mode client \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  health_avg_calculation.py
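The contents of health_avg_calculation.py are not included here. Assuming the job averages Data_Value per year and question before persisting the result to Hive, the core aggregation it performs can be sketched in plain Python (the sample rows below are hypothetical stand-ins for rows read from the health_record table):

```python
from collections import defaultdict

# Hypothetical sample of (YearStart, Question, Data_Value) tuples;
# the real Spark job would read these from the health_record Hive table.
rows = [
    (2016, "Percent of adults aged 18 years and older who have obesity", 30.1),
    (2016, "Percent of adults aged 18 years and older who have obesity", 28.7),
    (2017, "Percent of adults aged 18 years and older who have obesity", 31.3),
]

def average_by_year_question(rows):
    """Group rows by (year, question) and average Data_Value."""
    sums = defaultdict(lambda: [0.0, 0])  # (year, question) -> [running sum, count]
    for year, question, value in rows:
        acc = sums[(year, question)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

averages = average_by_year_question(rows)
for (year, question), avg in sorted(averages.items()):
    print(year, question, round(avg, 2))
```

In Spark the same computation is a groupBy on the year and question columns followed by an average aggregation, with the result written back to a Hive table.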
Navigate to the Spark UI page to view the Spark job history:
The calculated average results are persisted in Hive.
Developed a RESTful API service using Spring Boot to access the results from Hive.
Go to the rest-api directory and run mvn install
to build the project.
Ensure that the port (7755) specified in src/main/application.properties is available.
Once the build is successful, it generates restapi-1.0-SNAPSHOT.jar in the target directory.
Copy it anywhere you like and execute java -jar <path of the restapi jar>. The command below brings up the REST API server, listening on port 7755 for API requests.
java -jar target/restapi-1.0-SNAPSHOT.jar
Open a new terminal window and call the APIs using curl.
To get the list of databases:
curl http://localhost:7755/api/hive/database
To get the list of tables:
curl http://localhost:7755/api/hive/table
To get the average for all age groups:
curl http://localhost:7755/api/hive/average
To get the average for females only:
curl http://localhost:7755/api/hive/femaleAverage
Alternatively, access the APIs from a browser or any REST client to get the API response:
http://localhost:7755/api/hive/femaleAverage
http://localhost:7755/api/hive/average
The REST API response contains year, question, and average.
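The exact JSON shape depends on the Spring Boot implementation; assuming each endpoint returns a list of objects with year, question, and average fields, a client could consume the response as sketched below (the payload is a hypothetical sample standing in for a live call to http://localhost:7755/api/hive/average):

```python
import json

# Hypothetical payload standing in for the body of
# GET http://localhost:7755/api/hive/average
payload = '''[
  {"year": 2016, "question": "Percent of adults aged 18 years and older who have obesity", "average": 29.4},
  {"year": 2017, "question": "Percent of adults aged 18 years and older who have obesity", "average": 31.3}
]'''

records = json.loads(payload)

# Index the averages by year for easy lookup
avg_by_year = {rec["year"]: rec["average"] for rec in records}
print(avg_by_year)
```

To consume the real endpoint, replace the sample payload with the body of an HTTP GET against the running server.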