Spark and MLlib evaluation and stress test
In this project, we evaluate the performance of Apache Spark and its MLlib
library as a function of several parameters: data size, number of slave nodes,
CPU cores, and so on.
The data used in the tests comes from a poker hand data set; you can find it here.
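Each record of that data set is a comma-separated line of ten integer card attributes followed by the hand class (0-9). As a minimal sketch of how such a file can be parsed into MLlib LabeledPoint records in PySpark (the file name is an assumption; init.sh fetches the actual data):

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="poker-load-sketch")

# Hypothetical file name; init.sh downloads and unzips the real data set.
raw = sc.textFile("poker-hand-training-true.data")

# The last column is the hand class (the label); the rest are the features.
points = (raw.map(lambda line: [int(v) for v in line.split(",")])
             .map(lambda vals: LabeledPoint(vals[-1], vals[:-1])))

print(points.first())
```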
The repository contains the following files:

- Results: this folder contains our graphs and explanations about our tests
- environment.sh: a script you will certainly need to adapt to your Spark installation
- init.sh: an init script to download the data, unzip it and set up the environment
- performance.py: the main script to run
- testTree.py: the training method of our program (see the sketch after this list)
- training.py: called by performance.py to execute the tests and return the results
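As a rough idea of the kind of work these scripts perform (a hedged sketch only, not the actual code of performance.py, testTree.py or training.py), an MLlib decision tree can be trained on the parsed data and timed like this:

```python
import time
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext(appName="training-sketch")

# Parse the data as in the earlier sketch (file name is an assumption).
points = (sc.textFile("poker-hand-training-true.data")
            .map(lambda line: [int(v) for v in line.split(",")])
            .map(lambda vals: LabeledPoint(vals[-1], vals[:-1]))
            .cache())
points.count()  # materialise the cached RDD so only training is timed

start = time.time()
model = DecisionTree.trainClassifier(points, numClasses=10,
                                     categoricalFeaturesInfo={})
print("training took %.2f s" % (time.time() - start))
```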
Therefore, to run the project:

./init.sh
$SPARK_HOME/sbin   (start your Spark cluster with the launch scripts in this directory, e.g. start-all.sh)
spark-submit performance.py outputfile NUMBER_OF_PARTITION
Note: NUMBER_OF_PARTITION defines how the RDD will be partitioned inside the Spark
system. With too few partitions you will end up with not enough parallelisation
of the work. For our experiments, we set this number equal to the number of
available cores.
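For illustration, here is a minimal sketch of how the partition count can be read from the command line and applied to the RDD (argument position and variable names are assumptions):

```python
import sys
from pyspark import SparkContext

sc = SparkContext(appName="partition-sketch")

# NUMBER_OF_PARTITION from the command line, falling back to the number of
# cores Spark reports as available (sc.defaultParallelism).
num_partitions = int(sys.argv[2]) if len(sys.argv) > 2 else sc.defaultParallelism

# minPartitions hints the split count at load time; repartition() enforces it.
raw = sc.textFile("poker-hand-training-true.data", minPartitions=num_partitions)
raw = raw.repartition(num_partitions)

print("RDD has %d partitions" % raw.getNumPartitions())
```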