Project author: crackcell

Project description:
Feature engineering toolkit for Spark MLlib.
Language: Scala
Repository: git://github.com/crackcell/mlfeature.git
Created: 2017-02-15T15:56:29Z
Community: https://github.com/crackcell/mlfeature

License: Apache License 2.0

MLfeature

Feature engineering toolkit for Spark MLlib:

  • Data preprocessing:
    • Handle imbalanced dataset: DataBalancer
    • Handle missing values: (Implemented in Spark 2.2, SPARK-13568)
      • Impute continuous missing values with mean: MissingValueMeanImputer
  • Feature selection:
    • VarianceSelector: remove features with low variance
    • UnivariateSelector: feature selection with univariate metrics
    • ByModelSelector: feature selection with a model
  • Feature transformers:
    • Enhanced Bucketizer: MyBucketizer (Waiting to be merged, SPARK-19781)
    • Enhanced StringIndexer: MyStringIndexer (Merged into Spark 2.2, SPARK-17233)

Handle imbalanced dataset

  • DataBalancer: Make a balanced dataset with multiple strategies:
    • Re-sampling:
      • over-sampling
      • under-sampling
      • middle-sampling
    • SMOTE: TODO
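
The three re-sampling strategies can be illustrated on plain collections. This is a hypothetical sketch, not the DataBalancer implementation (which operates on DataFrames); in particular, the exact target count used by middle-sampling is an assumption here.

```scala
object ResamplingSketch {
  // Replicate or trim each label group to exactly `target` rows.
  def resample(rows: Seq[String], target: Int): Seq[String] =
    rows.groupBy(identity).values.toSeq.flatMap { group =>
      // Cycling through the group grows it (over-sampling);
      // take() alone shrinks it (under-sampling).
      Iterator.continually(group).flatten.take(target).toSeq
    }

  def main(args: Array[String]): Unit = {
    val data = Seq("a", "a", "a", "a", "b", "b", "b", "c")
    val counts = data.groupBy(identity).map { case (k, v) => k -> v.size }
    // over-sampling: grow every class to the majority count (4)
    val over = resample(data, counts.values.max)
    // under-sampling: shrink every class to the minority count (1)
    val under = resample(data, counts.values.min)
    // middle-sampling: a target between the two (assumed semantics)
    val middle = resample(data, 2)
    println(s"${over.size} ${under.size} ${middle.size}")  // 12 3 6
  }
}
```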

Examples (over-, under-, and middle-sampling):

```scala
import spark.implicits._  // required for toDF

val data = Array("a", "a", "b", "c")
val dataFrame = data.toSeq.toDF("feature")
val balancer = new DataBalancer()
  .setStrategy("oversampling")
  .setInputCol("feature")
val result = balancer.transform(dataFrame)
result.show(100)
```

```scala
val data: Seq[String] = Seq("a", "a", "a", "a", "b", "b", "b", "c")
val dataFrame = data.toDF("feature")
val balancer = new DataBalancer()
  .setStrategy("undersampling")
  .setInputCol("feature")
val result = balancer.transform(dataFrame)
result.show(100)
```

```scala
val data: Seq[String] = Seq("a", "a", "a", "a", "b", "b", "b", "c")
val dataFrame = data.toDF("feature")
val balancer = new DataBalancer()
  .setStrategy("middlesampling")
  .setInputCol("feature")
val result = balancer.transform(dataFrame)
result.show(100)
```

Handle missing values

  • MissingValueMeanImputer: Impute continuous missing values with mean
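
The imputation step itself is simple: compute the mean over the observed values and substitute it for each missing one. A minimal plain-Scala sketch, assuming missing values are encoded as Double.NaN (the real transformer works on a DataFrame column):

```scala
object MeanImputeSketch {
  def imputeMean(values: Seq[Double]): Seq[Double] = {
    val present = values.filterNot(_.isNaN)
    val mean = present.sum / present.size          // mean of observed values only
    values.map(v => if (v.isNaN) mean else v)      // substitute the mean for NaN
  }

  def main(args: Array[String]): Unit = {
    println(imputeMean(Seq(1.0, Double.NaN, 3.0))) // NaN replaced by 2.0
  }
}
```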

Feature Selection

VarianceSelector

VarianceSelector is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

```scala
import spark.implicits._  // required for toDF

val data = Array(
  Vectors.dense(0, 1.0, 0),
  Vectors.dense(0, 3.0, 0),
  Vectors.dense(0, 4.0, 0),
  Vectors.dense(0, 5.0, 0),
  Vectors.dense(1, 6.0, 0)
)
val expected = Array(
  Vectors.dense(1.0),
  Vectors.dense(3.0),
  Vectors.dense(4.0),
  Vectors.dense(5.0),
  Vectors.dense(6.0)
)
val df = data.zip(expected).toSeq.toDF("features", "expected")
val selector = new VarianceSelector()
  .setInputCol("features")
  .setOutputCol("selected")
  .setThreshold(3)
val result = selector.transform(df)
result.select("expected", "selected").collect()
  .foreach { case Row(vector1: Vector, vector2: Vector) =>
    assert(vector1.equals(vector2), "Transformed vector is different from expected.")
  }
```
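
The criterion behind the example can be sketched in plain Scala: compute each column's variance and keep the columns above the threshold. Whether VarianceSelector uses sample or population variance, and strict or non-strict comparison, is an assumption here; sample variance with a strict comparison is consistent with the threshold-3 example above (the middle column has sample variance 3.7).

```scala
object VarianceSketch {
  // Sample variance (divide by n - 1); an assumption about the selector.
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => math.pow(x - mean, 2)).sum / (xs.size - 1)
  }

  // Indices of columns whose variance exceeds the threshold.
  def selectedColumns(rows: Seq[Array[Double]], threshold: Double): Seq[Int] = {
    val nCols = rows.head.length
    (0 until nCols).filter(c => variance(rows.map(_(c))) > threshold)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Array(0.0, 1.0, 0.0), Array(0.0, 3.0, 0.0), Array(0.0, 4.0, 0.0),
      Array(0.0, 5.0, 0.0), Array(1.0, 6.0, 0.0))
    println(selectedColumns(rows, 3.0))  // only the middle column survives
  }
}
```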

UnivariateSelector

TODO

ByModelSelector

TODO

Feature transformers

MyBucketizer: Enhanced Bucketizer

Puts NULLs, NaNs, and out-of-bounds values into a special extra bucket.

Example:

```scala
val splits = Array(-0.5, 0.0, 0.5)
val validData = Array(-0.5, -0.3, 0.0, 0.2)
val expectedBuckets = Array(0.0, 0.0, 1.0, 1.0)
val dataFrame: DataFrame = validData.zip(expectedBuckets).toSeq.toDF("feature", "expected")
val bucketizer: MyBucketizer = new MyBucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
val transformed = bucketizer.transform(dataFrame)
```
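
The bucket assignment can be sketched in plain Scala: with n splits there are n - 1 regular buckets, and one extra bucket at index n - 1 catches NaN and out-of-range values (the real transformer also sends NULLs there). The exact index of the special bucket is an assumption of this sketch.

```scala
object BucketizeSketch {
  def bucketize(splits: Array[Double])(v: Double): Double = {
    val special = splits.length - 1  // assumed index of the extra bucket
    if (v.isNaN || v < splits.head || v > splits.last) special.toDouble
    else {
      // last split <= v; values equal to the top split fall in the last regular bucket
      val i = splits.lastIndexWhere(_ <= v)
      math.min(i, splits.length - 2).toDouble
    }
  }

  def main(args: Array[String]): Unit = {
    val f = bucketize(Array(-0.5, 0.0, 0.5)) _
    println(Seq(-0.5, -0.3, 0.0, 0.2, Double.NaN, 9.9).map(f))
    // List(0.0, 0.0, 1.0, 1.0, 2.0, 2.0) — matches the expected buckets above,
    // with NaN and 9.9 landing in the extra bucket
  }
}
```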

MyStringIndexer: Enhanced StringIndexer

Gives NULLs and unseen labels a special index.

Example:

```scala
val data = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
val df = data.toDF("id", "label")
val indexer = new MyStringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
  .fit(df)
val transformed = indexer.transform(df)
```
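
The indexing behavior can be sketched on plain collections: like Spark's StringIndexer, labels are assumed to be indexed by descending frequency (tie-breaking order is an assumption here), and any label not seen at fit time maps to one extra index past the known ones (the real transformer also sends NULLs there).

```scala
object StringIndexSketch {
  // Fit: most frequent label gets index 0, ties broken alphabetically (assumed).
  def fit(labels: Seq[String]): Map[String, Double] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occ) => (-occ.size, label) }
      .map(_._1).zipWithIndex
      .map { case (label, i) => label -> i.toDouble }.toMap

  // Transform: unseen labels get the special extra index.
  def transform(index: Map[String, Double])(label: String): Double =
    index.getOrElse(label, index.size.toDouble)

  def main(args: Array[String]): Unit = {
    val index = fit(Seq("a", "b", "c", "a", "a", "c"))   // a:3, c:2, b:1
    println(Seq("a", "b", "c", "d").map(transform(index)))
    // List(0.0, 2.0, 1.0, 3.0) — "d" is unseen and gets the extra index
  }
}
```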