Project author: kekepins

Project description:
Spark percentile user-defined aggregation in Java
Language: Java
Project URL: git://github.com/kekepins/spark-percentile.git
Created: 2018-11-30T08:50:53Z
Project community: https://github.com/kekepins/spark-percentile


spark-percentile

From the Wikipedia page about percentiles:
https://en.wikipedia.org/wiki/Percentile

Let’s do some experiments with Spark.

Spark has two built-in percentile implementations:

  • percentile
  • percentile_approx

Let’s implement three more, as described on the Wikipedia page:

  • nearest rank
  • interpolation C1
  • interpolation C0
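
To make the three methods concrete, here is a minimal plain-Java sketch of the rank formulas as given on the Wikipedia page. It does not use Spark and is not the repository's implementation; the class name PercentileSketch and its method names are hypothetical, and p is assumed to be a fraction in (0, 1].

```java
import java.util.Arrays;

// Illustrative sketch only: names are hypothetical, not taken from the repository.
final class PercentileSketch {

    private PercentileSketch() {}

    /** Nearest rank: the value at 1-based rank ceil(p * N). */
    static double nearestRank(double[] values, double p) {
        double[] sorted = sortedCopy(values);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Linear interpolation, C = 1: fractional rank x = p * (N - 1) + 1. */
    static double interpolationC1(double[] values, double p) {
        double[] sorted = sortedCopy(values);
        return valueAtRank(sorted, p * (sorted.length - 1) + 1);
    }

    /** Linear interpolation, C = 0: fractional rank x = p * (N + 1). */
    static double interpolationC0(double[] values, double p) {
        double[] sorted = sortedCopy(values);
        return valueAtRank(sorted, p * (sorted.length + 1));
    }

    /** Interpolates between the two values surrounding the 1-based fractional rank x. */
    private static double valueAtRank(double[] sorted, double x) {
        int n = sorted.length;
        double clamped = Math.min(Math.max(x, 1.0), n); // keep the rank inside [1, N]
        int lower = (int) Math.floor(clamped);
        if (lower >= n) {
            return sorted[n - 1];
        }
        double fraction = clamped - lower;
        return sorted[lower - 1] + fraction * (sorted[lower] - sorted[lower - 1]);
    }

    private static double[] sortedCopy(double[] values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }
}
```

For the list 15, 20, 35, 40, 50 and p = 0.40, these formulas give 29 for C1, 26 for C0 and 20 for nearest rank (up to floating-point rounding).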

In Spark this can be done with a UserDefinedAggregateFunction; the Java code is in the repository.
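
An exact percentile aggregate has to keep every input value until the end, so its buffer is essentially a growing list that is merged across partitions and sorted once at evaluation time. The sketch below mimics that lifecycle in plain Java. It is a hypothetical, Spark-free stand-in: the class PercentileBufferSketch does not extend UserDefinedAggregateFunction and is not the repository's MyPercentile; only the step names (initialize, update, merge, evaluate) follow Spark's aggregate contract.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the buffer of an exact percentile aggregate.
final class PercentileBufferSketch {

    // "initialize": each partition starts with an empty buffer.
    private final List<Double> values = new ArrayList<>();

    // "update": called once per input row; null inputs are skipped.
    void update(Double input) {
        if (input != null) {
            values.add(input);
        }
    }

    // "merge": combines the partial buffer of another partition into this one.
    void merge(PercentileBufferSketch other) {
        values.addAll(other.values);
    }

    // "evaluate": sorts the collected values and applies the nearest-rank rule.
    double evaluateNearestRank(double p) {
        if (values.isEmpty()) {
            throw new IllegalStateException("no values aggregated");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size()); // 1-based rank
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Because the buffer holds all values, memory grows with the group size; that cost is the trade-off for exact results.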

Now let’s test these on the examples from the Wikipedia page; the test code is also in the repository.

With code:

    // Static imports needed by the calls below
    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;

    // Register one UDAF per percentile method
    // (MyPercentile, PercentileMode and fromArray come from this repository)
    sparkSession.udf().register("percentileC1", new MyPercentile(percentiles, PercentileMode.INTERPOLATION_C1));
    sparkSession.udf().register("percentileC0", new MyPercentile(percentiles, PercentileMode.INTERPOLATION_C0));
    sparkSession.udf().register("percentileNearestRank", new MyPercentile(percentiles, PercentileMode.NEAREST_RANK));

    // Get a dataset with a single "data" column
    Dataset<Row> ds = fromArray(sparkSession, values);
    ds.show(false);

    // Compute percentiles with the three UDAFs and the two Spark built-ins
    ds = ds.select(
        callUDF("percentileC1", col("data")).as("Percentile C1"),
        callUDF("percentileC0", col("data")).as("Percentile C0"),
        callUDF("percentileNearestRank", col("data")).as("Percentile Nearest Rank"),
        callUDF("percentile_approx", col("data"), lit(percentiles)).as("percentile_approx (spark builtin)"),
        callUDF("percentile", col("data"), lit(percentiles)).as("percentile (spark builtin)")
    );
    ds.show(false);

Test 1, Nearest rank example 1

data: 15.0, 20.0, 35.0, 40.0, 50.0

  • Percentile C1: 16.0, 23.0, 29.0, 35.0, 50.0
  • Percentile C0: 15.0, 19.0, 26.0, 35.0, 50.0
  • Percentile Nearest Rank: 15.0, 20.0, 20.0, 35.0, 50.0
  • percentile_approx (spark builtin): 20.0, 20.0, 20.0, 35.0, 50.0
  • percentile (spark builtin): 16.0, 23.0, 29.0, 35.0, 50.0

Test 2, Nearest rank example 2

data: 3.0, 6.0, 7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 16.0, 20.0

  • Percentile C1: 7.25, 9.0, 14.5, 20.0
  • Percentile C0: 6.75, 9.0, 15.25, 20.0
  • Percentile Nearest Rank: 7.0, 8.0, 15.0, 20.0
  • percentile_approx (spark builtin): 7.0, 8.0, 15.0, 20.0
  • percentile (spark builtin): 7.25, 9.0, 14.5, 20.0

Test 3, Nearest rank example 3

data: 3.0, 6.0, 7.0, 8.0, 8.0, 9.0, 10.0, 13.0, 15.0, 16.0, 20.0

  • Percentile C1: 7.5, 9.0, 14.0, 20.0
  • Percentile C0: 7.0, 9.0, 15.0, 20.0
  • Percentile Nearest Rank: 7.0, 9.0, 15.0, 20.0
  • percentile_approx (spark builtin): 7.0, 9.0, 15.0, 20.0
  • percentile (spark builtin): 7.5, 9.0, 14.0, 20.0

Test 4, Interpolation between closest rank (C=1) example 1 (second variant)

data: 15.0, 20.0, 35.0, 40.0, 50.0

  • Percentile C1: 29.0
  • Percentile C0: 26.0
  • Percentile Nearest Rank: 20.0
  • percentile_approx (spark builtin): 20.0
  • percentile (spark builtin): 29.0

Test 5, Interpolation between closest rank (C=1) example 2

data: 1.0, 2.0, 3.0, 4.0

  • Percentile C1: 3.25
  • Percentile C0: 3.75
  • Percentile Nearest Rank: 3.0
  • percentile_approx (spark builtin): 3.0
  • percentile (spark builtin): 3.25
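
The Test 5 values can be checked by hand. This is a worked sketch, assuming the requested percentile is the 75th (p = 0.75; the percentile list is not printed here, so this is inferred from the results) and using the rank formulas from the Wikipedia page, with N = 4 and sorted values v_1, ..., v_4 = 1, 2, 3, 4:

```latex
\begin{aligned}
\text{Nearest rank:}\quad & n = \lceil pN \rceil = \lceil 0.75 \cdot 4 \rceil = 3, && v = v_3 = 3.0 \\
\text{C = 1:}\quad & x = p(N-1) + 1 = 0.75 \cdot 3 + 1 = 3.25, && v = v_3 + 0.25\,(v_4 - v_3) = 3.25 \\
\text{C = 0:}\quad & x = p(N+1) = 0.75 \cdot 5 = 3.75, && v = v_3 + 0.75\,(v_4 - v_3) = 3.75
\end{aligned}
```

For this input, the built-in percentile matches the C = 1 result and percentile_approx matches the nearest-rank result.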

Test 6, Interpolation between closest rank (C=0) example 1 (third variant)

data: 15.0, 20.0, 35.0, 40.0, 50.0

  • Percentile C1: 16.0, 23.0, 29.0, 48.0
  • Percentile C0: 15.0, 19.0, 26.0, 50.0
  • Percentile Nearest Rank: 15.0, 20.0, 20.0, 50.0
  • percentile_approx (spark builtin): 20.0, 20.0, 20.0, 50.0
  • percentile (spark builtin): 16.0, 23.0, 29.0, 48.0