I am trying to build a basic recommender in Scala using Spark and Mahout. I compiled Mahout against Scala 2.11 and Spark 2.1.2 from the following Mahout repo: mahout_fork
To run my code I use spark-submit …
That repo is only a subset of Mahout. It is documented as being used in a PredictionIO recommender template called The Universal Recommender. That said, to use it outside the template you must also set up serialization so Spark recognizes the internal Mahout data structures. That is probably the cause of the serialization problem above. In the UR we do this by setting the Spark configuration with the following key/value pairs (shown in JSON):
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
"spark.kryo.referenceTracking": "false",
"spark.kryoserializer.buffer": "300m"
Try passing these to spark-submit, or setting them on the context in your driver code.
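As a minimal sketch of the second option, the same key/value pairs can be set on a SparkConf in the driver before the SparkContext is created (this assumes Spark and the Mahout Spark bindings are on the classpath; the app name is arbitrary):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Register Mahout's Kryo serializers before any Mahout data structures
// are shipped across the cluster.
val conf = new SparkConf()
  .setAppName("mahout-recommender") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
       "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
  .set("spark.kryo.referenceTracking", "false")
  .set("spark.kryoserializer.buffer", "300m")

val sc = new SparkContext(conf)
```

The equivalent on the command line is repeated `--conf key=value` flags to spark-submit.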
You have added a build problem above; please try to stick to one problem per SO question. I would suggest that you use the binaries we host on GitHub with SBT, via something like:

val mahoutVersion = "0.13.0"
val sparkVersion = "2.1.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "org.xerial.snappy" % "snappy-java" % "1.1.1.7",
  // Mahout's Spark libs. They're custom compiled for Scala 2.11
  "org.apache.mahout" %% "mahout-math-scala" % mahoutVersion,
  "org.apache.mahout" %% "mahout-spark" % mahoutVersion
    exclude("org.apache.spark", "spark-core_2.11"),
  "org.apache.mahout" % "mahout-math" % mahoutVersion,
  "org.apache.mahout" % "mahout-hdfs" % mahoutVersion
    exclude("com.thoughtworks.xstream", "xstream")
    exclude("org.apache.hadoop", "hadoop-client")
  // other external libs
)

resolvers += "Temp Scala 2.11 build of Mahout" at "https://github.com/actionml/mahout_2.11/raw/mvn-repo/"
You don't want to build that branch; it exists only to publish the modules needed for the Scala/Samsara part of Mahout to a specially formatted repo that is compatible with SBT.
The Mahout people (including me) are working on a release that supports SBT, Scala 2.11 and 2.12, and newer versions of Spark. It is in Apache's master branch and should be released soon. For now, the above should get you going.