Data pipeline project
(I am maintaining this project and adding more demos: Hadoop distributed mode, Hadoop deployment on cloud, Spark high performance, Spark Streaming applications, Spark distributed clusters, etc. Please star the repository to support it.)
../zookeeper/bin/zkServer.sh start
../zookeeper/bin/zkServer.sh status
../zookeeper/bin/zkServer.sh stop
Nimbus master daemon
./storm nimbus
./storm supervisor
./storm ui
./storm logviewer
JZMQ
./storm jar <path-to-topology-jar> <class-with-the-main> <arg1> … <argN>
More configs: https://github.com/apache/storm/blob/master/conf/defaults.yaml
Topology
Worker
Executor
Task
Tuning parallelism in Storm
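In Storm, a worker is a JVM process, an executor is a thread inside a worker, and a task is an instance of a spout or bolt run by an executor. The sketch below is a toy model of how those three levels nest for a single component. It is not Storm's scheduler: the object `ParallelismSketch`, the helper `spread`, and the numbers (6 tasks, 3 executors, 2 workers) are all hypothetical and chosen only to make the nesting visible.

```scala
// Illustrative model only; Storm's real scheduler may assign things differently.
object ParallelismSketch {
  // Split the indices 0 until n into `buckets` groups as evenly as possible.
  // Buckets that would be empty (n < buckets) are simply omitted.
  def spread(n: Int, buckets: Int): Vector[Vector[Int]] = {
    require(n >= 0 && buckets > 0, "need n >= 0 and buckets > 0")
    (0 until n).toVector
      .groupBy(_ % buckets)
      .toVector
      .sortBy(_._1)
      .map(_._2)
  }

  def main(args: Array[String]): Unit = {
    val tasks = 6     // hypothetical: tasks declared for one component
    val executors = 3 // hypothetical: executor threads for that component
    val workers = 2   // hypothetical: worker processes for the topology

    val tasksPerExecutor = spread(tasks, executors)
    val executorsPerWorker = spread(executors, workers)

    // Print the worker -> executor -> tasks nesting.
    executorsPerWorker.zipWithIndex.foreach { case (execs, w) =>
      execs.foreach { e =>
        println(s"worker $w -> executor $e -> tasks ${tasksPerExecutor(e).mkString(",")}")
      }
    }
  }
}
```

Raising the executor count while keeping the task count fixed gives each thread fewer tasks, which is the basic knob this section refers to.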
Manage Storm
Machines
Storm and Zookeeper setup
Tech stack
export NOTEBOOKS_DIR="$(pwd)/notebooks"
./bin/spark-notebook
Structured Streaming API (in progress)
Streaming job for video data
The job processes VideoPlayed events and uses the timestamp embedded in each event to determine the time-based aggregation.
VideoPlayed(video-id, client-id, timestamp)
DStream[VideoPlayed]
trackVideoHits function
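A minimal sketch of the event-time idea, using plain Scala collections rather than a DStream so it runs without Spark. The field types of `VideoPlayed`, the fixed one-minute window, the helper `windowStart`, and this signature for `trackVideoHits` are all assumptions; the project's actual function may look different.

```scala
object VideoHitsSketch {
  // Field types are assumptions; the source only names the three fields.
  final case class VideoPlayed(videoId: String, clientId: String, timestamp: Long)

  // Start of the fixed window that contains a non-negative timestamp (milliseconds).
  def windowStart(timestamp: Long, windowMillis: Long): Long =
    timestamp - (timestamp % windowMillis)

  // Count plays per (window start, video id). The bucket comes from the
  // timestamp carried by the event, not from the order in which events arrive.
  def trackVideoHits(events: Seq[VideoPlayed], windowMillis: Long): Map[(Long, String), Int] =
    events
      .groupBy(e => (windowStart(e.timestamp, windowMillis), e.videoId))
      .map { case (key, plays) => key -> plays.size }

  def main(args: Array[String]): Unit = {
    val minute = 60000L
    val events = Seq(
      VideoPlayed("v1", "c1", 1000L),
      VideoPlayed("v1", "c2", 59000L),
      VideoPlayed("v2", "c1", 61000L),
      VideoPlayed("v1", "c3", 5000L) // arrives last but belongs to the first window
    )
    trackVideoHits(events, minute).toSeq.sorted.foreach { case ((start, video), hits) =>
      println(s"window=$start video=$video hits=$hits")
    }
  }
}
```

Because the window is derived from the event's own timestamp, the late event for `v1` is still counted in the first window.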
http://<host>:4040/jobs/job/?id=0
/usr/local/Cellar/hadoop
Check : https://www.slideshare.net/SunilkumarMohanty3/install-apache-hadoop-on-mac-os-sierra-76275019
http://localhost:50070/dfshealth.html#tab-overview
Start : hstart
Hadoop command:
hadoop fs -ls
hadoop fs -mkdir /hbp
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
http://localhost:50070/explorer.html#/hbp/ibm-stock
For development: https://github.com/kiwenlau/hadoop-cluster-docker
HDFS
http://{NAMENODE}:50070/
Distributed mode
yarn-site.xml
$HADOOP_HOME/sbin/start-yarn.sh
Distributed providers : HDP, Cloudera
head, awk, sed, grep
hadoop jar /hbp/ibm-stock/ibm-stock-1.0-SNAPSHOT.jar /hbp/ibm-stock/ibm-stock.csv /hbp/ibm-stock/output
hadoop fs -ls /hbp/ibm-stock/output
hadoop fs -get /hbp/ibm-stock/output/part-r-00000 home/Users/hien/results.csv
head home/Users/hien/results.csv
DFSIO
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
TestDFSIO -write -nrFiles 32 -fileSize 1000
Terasort
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teragen 10000000 tera-in
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
terasort tera-in tera-out
hadoop jar \
$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
teravalidate tera-out tera-validate
Export data
Set up stack:
Spark ecosystem :
Apache Spark component:
localhost:4040
Run spark-shell: $SPARK_HOME/bin/spark-shell
Word count
// stringRdd is assumed to be an existing RDD[String] with one word per element
val pairRDD = stringRdd.map(s => (s, 1))
val wordCountRDD = pairRDD.reduceByKey((x, y) => x + y)
val wordCountList = wordCountRDD.collect
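To see what the map and reduceByKey steps compute without starting a cluster, here is a rough local equivalent on plain Scala collections. It only mirrors the semantics (pair each word with 1, then sum the values per key); it does not use the Spark API, and the object name `WordCountSketch` and the input words are made up for illustration.

```scala
object WordCountSketch {
  // Local stand-in for map + reduceByKey: pair each word with 1,
  // group the pairs by word, then sum the 1s within each group.
  def wordCount(words: Seq[String]): Map[String, Int] =
    words
      .map(w => (w, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "storm", "spark", "hadoop", "spark") // made-up input
    wordCount(words).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w: $n") }
  }
}
```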
Find the sum of the even integers
val intRDD = sc.parallelize(Array(1, 4, 5, 6, 7, 10, 15))
val evenNumbersRDD = intRDD.filter(i => i % 2 == 0)
val sum = evenNumbersRDD.sum
Count the number of words in a file:
cat people.txt
val file = sc.textFile("/usr/local/spark/examples/src/main/resources/people.txt")
val flattenFile = file.flatMap(s => s.split(", "))
flattenFile.collect
val count = flattenFile.count
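The flatMap step is what turns a count of lines into a count of tokens. The sketch below shows this on plain Scala collections, assuming each line has the form `name, age`. The three sample lines and the object name `TokenCountSketch` are illustrative assumptions, not output read from the cluster.

```scala
object TokenCountSketch {
  // Split every line on ", " and flatten the pieces into one sequence.
  def tokens(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split(", "))

  def main(args: Array[String]): Unit = {
    val lines = Seq("Michael, 29", "Andy, 30", "Justin, 19") // assumed sample lines
    val all = tokens(lines)
    println(all.mkString(" | "))
    println(s"lines=${lines.size} tokens=${all.size}")
  }
}
```

Note that the ages are counted as tokens too, so the result is the number of comma-separated fields rather than the number of names.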
spark-submit
script on the cluster