Project author: cfmcgrady

Project description:
Kungfu Panda is a library for registering Python pandas UDFs in Spark SQL.
Language: Scala
Repository: git://github.com/cfmcgrady/kungfu-panda.git
Created: 2019-09-02T08:09:39Z
Project page: https://github.com/cfmcgrady/kungfu-panda


Kungfu Panda

Kungfu Panda is a library for registering Python pandas UDFs in Spark SQL.

Quick Start

  1. Download the project.

       git clone https://github.com/cfmcgrady/kungfu-panda.git
  2. Create the Python environment with conda.

       conda env create -f path/to/conda.yaml -p /tmp/kungfu-panda
  3. Train a K-means model with MLflow.

       /tmp/kungfu-panda/bin/python path/to/train.py
  4. Register the model as a Spark SQL function and call it.

       import org.apache.spark.sql.SparkSession
       import org.apache.spark.sql.types.IntegerType
       // PandasFunctionManager comes from kungfu-panda; import it from the project's package.

       val spark = SparkSession
         .builder()
         .appName("kungfu panda example")
         .master("local[4]")
         .getOrCreate()

       val python = "/tmp/kungfu-panda/bin/python"
       val artifactRoot = "."
       // Replace with your own run ID; find it with MLflow.
       val runid = "9c6c59d0f57f40dfbbded01816896687"
       val pythonExec = Option(python)

       // Register the MLflow model as a SQL function named "test" that returns an integer.
       PandasFunctionManager.registerMLFlowPythonUDF(
         spark, "test",
         returnType = Option(IntegerType),
         artifactRoot = Option(artifactRoot),
         runId = runid,
         driverPythonExec = pythonExec,
         driverPythonVer = None,
         pythonExec = pythonExec,
         pythonVer = None)

       // Call the registered function from Spark SQL.
       spark.sql(
         """
           |select test(x, y) from (
           |  select 1 as x, 1 as y
           |)
           |""".stripMargin)
         .show()
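The run ID in step 4 has to be looked up by hand. As a rough sketch, and assuming `train.py` logged to MLflow's default local file store (an `mlruns` directory laid out as `mlruns/<experiment_id>/<run_id>/meta.yaml`), the candidate run IDs could be listed from Scala as follows. `MlflowRuns`, `listRunIds`, and `trackingDir` are hypothetical names used for illustration and are not part of Kungfu Panda; the directory layout is an assumption about your MLflow setup.

```scala
import java.io.File

// Hypothetical helper, not part of kungfu-panda.
object MlflowRuns {
  // Sub-directories of `dir`; empty if `dir` does not exist or is not a directory.
  private def subDirs(dir: File): Seq[File] =
    Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty).filter(_.isDirectory)

  // Assumed layout: <trackingDir>/mlruns/<experiment_id>/<run_id>/meta.yaml
  def listRunIds(trackingDir: String): Seq[String] =
    for {
      experiment <- subDirs(new File(trackingDir, "mlruns"))
      // Skip hidden folders such as .trash and the model-registry folder.
      if !experiment.getName.startsWith(".") && experiment.getName != "models"
      run <- subDirs(experiment)
      if new File(run, "meta.yaml").isFile
    } yield run.getName

  // Prints one run ID per line for the directory given as the first argument (default ".").
  def main(args: Array[String]): Unit =
    listRunIds(args.headOption.getOrElse(".")).foreach(println)
}
```

If the model was logged to a remote tracking server or a non-default location, this sketch does not apply; use the MLflow UI or client to find the run ID instead.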

Register Function With Spark SQL

  1. Add the parser extension when creating the SparkSession.

       val spark = SparkSession
         .builder()
         .appName("panda sql example")
         .master("local[4]")
         .withExtensions(CreateFunctionParser.extBuilder)
         .getOrCreate()
  2. Register the MLflow function. Here ${runid}, ${artifactRoot}, and ${python} stand for the values defined in Quick Start.

       CREATE FUNCTION `test` AS '${runid}' USING `type` 'mlflow', `returns` 'integer', `artifactRoot` '${artifactRoot}', `pythonExec` '${python}'

See PandaSqlExample for a full example.
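The CREATE FUNCTION statement in step 2 contains placeholders that must be filled in before it is passed to `spark.sql`. A minimal sketch of assembling it from plain values is shown here; `CreateFunctionSql` and `mlflowFunction` are hypothetical names for illustration, not part of Kungfu Panda, and the option keys are copied from the statement in step 2.

```scala
// Hypothetical helper, not part of kungfu-panda.
object CreateFunctionSql {
  // Wraps a value in single quotes. Values are inserted verbatim, so a value
  // containing a single quote is rejected rather than escaped.
  private def quote(value: String): String = {
    require(!value.contains("'"), s"value must not contain a single quote: $value")
    "'" + value + "'"
  }

  // Builds the statement from step 2 for an MLflow-backed function.
  def mlflowFunction(
      name: String,
      runId: String,
      returns: String,
      artifactRoot: String,
      pythonExec: String): String = {
    val options = Seq(
      "type" -> "mlflow",
      "returns" -> returns,
      "artifactRoot" -> artifactRoot,
      "pythonExec" -> pythonExec)
      .map { case (key, value) => s"`$key` ${quote(value)}" }
      .mkString(", ")
    s"CREATE FUNCTION `$name` AS ${quote(runId)} USING $options"
  }

  // Prints the statement for the Quick Start values.
  def main(args: Array[String]): Unit =
    println(mlflowFunction(
      "test", "9c6c59d0f57f40dfbbded01816896687", "integer", ".", "/tmp/kungfu-panda/bin/python"))
}
```

The resulting string would then be passed to `spark.sql(...)` on a session created with the parser extension from step 1.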

Run On Yarn Cluster

// todo