Document Classification with Apache Spark.pdf



© 2015 MapR Technologies
Document Classification with Apache Spark
Joseph Blue, @joebluems
August 19, 2015
Background survey
q  Data Science
q  Supervised vs. Unsupervised Scenarios
q  Classification Algorithms: Naïve Bayes, Linear, Decision Trees, etc.
q Model metrics: KS, AuROC, etc.
q  Boosting, Stacking, Bagging, etc.
q  TF-IDF Feature Extraction
q  Apache Spark: RDD, DAG, Scala shell, MLlib
q  Applying Machine Learning to Business Problems
Big Data Solution Workflow
EXTRACT: Import data into Hadoop and transform into format appropriate for solution.
FEATURE CREATION: Extract from the raw data inputs that the ML algorithms will use for pattern detection.
OUTLIER REMOVAL: Identify and remove / adjust records that negatively affect the ability to achieve good performance.
MODELING: Application of the app... (text truncated in source)
ENSEMBLE
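
The first three workflow stages can be sketched in a few lines of plain Python. This is a toy illustration only, not the Spark/MLlib code from the talk: the function names, the sample records, and the `min_terms` outlier rule are all assumptions made for the example, and the modeling and ensemble stages are left out.

```python
import math
from collections import Counter

def extract(raw_docs):
    """EXTRACT: turn raw (label, text) records into lowercase token lists."""
    return [(label, text.lower().split()) for label, text in raw_docs]

def create_features(docs):
    """FEATURE CREATION: TF-IDF weight per term, tf * log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for _, toks in docs for term in set(toks))
    features = []
    for label, toks in docs:
        tf = Counter(toks)
        vec = {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        features.append((label, vec))
    return features

def remove_outliers(features, min_terms=2):
    """OUTLIER REMOVAL: drop records with too few weighted terms.

    The min_terms threshold is a hypothetical rule chosen for this sketch.
    """
    return [
        (label, vec)
        for label, vec in features
        if sum(1 for w in vec.values() if w > 0) >= min_terms
    ]

# Hypothetical sample records, invented for illustration.
raw = [
    ("sports", "the team won the match"),
    ("sports", "the coach praised the team"),
    ("finance", "the bank raised the rate"),
    ("finance", "ok"),
]

features = remove_outliers(create_features(extract(raw)))
for label, vec in features:
    print(label, sorted(vec, key=vec.get, reverse=True)[:2])
```

The one-word record is dropped by the outlier rule, and the remaining vectors are what a classifier in the modeling stage would consume.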

