Project author: mskimm

Project description: Building Annoy Index on Apache Spark
Language: Scala
Project address: git://github.com/mskimm/spark-annoy.git
Created: 2016-09-07T23:33:36Z
Project community: https://github.com/mskimm/spark-annoy

License: Apache License 2.0


spark-annoy (WIP)

Build an Annoy index on Apache Spark, then query neighbors using Annoy.

Note

I built an index of 117M 64-dimensional vectors on 100 nodes in 5 minutes. The settings were:

  // version: 0.1.4
  // spark.executor.instances = 100
  // spark.executor.memory = 8g
  // spark.driver.memory = 8g
  val fraction = 0.00086 // for about 100k samples
  val numTrees = 2
  val numPartitions = 100
  val annoyModel = new Annoy().setFraction(fraction).setNumTrees(numTrees).fit(dataset)
  annoyModel.saveAsAnnoyBinary("/hdfs/path/to/index", numPartitions)

The size of the index is about 33G.
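
The fraction controls how much data is sampled to build the parent tree on the driver (see "How does it work?" below). A small sketch of how one might derive it from a target sample size, reusing the dataset from the snippet above; targetSamples and the extra count() pass are illustrative assumptions, not part of the library:

  // pick a fraction that yields roughly `targetSamples` sampled rows
  val targetSamples = 100000L                 // ~100k samples, as in the note above
  val totalRows = dataset.count()             // 117M in the example above
  val fraction = math.min(1.0, targetSamples.toDouble / totalRows) // ~ 0.00086
  val annoyModel = new Annoy().setFraction(fraction).setNumTrees(2).fit(dataset)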

Distributed Builds

  import spark.implicits._

  val data = spark.read.textFile("data/annoy/sample-glove-25-angular.txt")
    .map { str =>
      val Array(id, features) = str.split("\t")
      (id.toInt, features.split(",").map(_.toFloat))
    }
    .toDF("id", "features")

  val ann = new Annoy()
    .setNumTrees(2)

  val annModel = ann.fit(data)
  annModel.saveAsAnnoyBinary("/path/to/dump/annoy-binary")

Dependency

Since version 0.1.2, releases are published to Maven.

  libraryDependencies += "com.github.mskimm" %% "ann4s" % "0.1.5"
  • 0.1.5 is built with Apache Spark 2.3.0
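
In an sbt project, the matching Spark modules are typically declared alongside ann4s. A sketch of a build.sbt fragment, assuming a typical Spark application setup (the Provided scope and the spark-mllib module, needed only for the ALS example below, are assumptions, not requirements of ann4s):

  // build.sbt (sketch): ann4s next to the Spark version it is built against
  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-sql"   % "2.3.0" % Provided,
    "org.apache.spark"  %% "spark-mllib" % "2.3.0" % Provided, // only for the ALS use case below
    "com.github.mskimm" %% "ann4s"       % "0.1.5"
  )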

How does it work?

  1. builds a parent tree using sampled data on the Spark master
  2. groups all data by the leaf nodes of the parent tree on the Spark executors
  3. builds a subtree from the grouped data on each Spark executor
  4. aggregates all nodes of the subtrees into the parent tree on the Spark master (a simplified sketch of these four steps follows)
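
Below is a minimal, standalone sketch of these four steps. It is not the ann4s implementation: it splits on the median of a randomly chosen dimension instead of Annoy's two-means split, and a plain item count stands in for the real subtree build; all names (DistributedBuildSketch, buildParentTree, leafOf) are illustrative.

  import org.apache.spark.sql.SparkSession
  import scala.util.Random

  object DistributedBuildSketch {

    // a parent-tree node: either an internal split or a numbered leaf
    sealed trait Node extends Serializable
    case class Split(dim: Int, threshold: Float, left: Node, right: Node) extends Node
    case class Leaf(id: Int) extends Node

    // step 1: build a shallow parent tree from sampled vectors on the driver,
    // splitting on the median of a randomly chosen dimension at each level
    def buildParentTree(samples: Array[Array[Float]], depth: Int, nextLeafId: () => Int): Node =
      if (depth == 0 || samples.length < 2) Leaf(nextLeafId())
      else {
        val dim = Random.nextInt(samples.head.length)
        val threshold = samples.map(_(dim)).sorted.apply(samples.length / 2)
        val (lo, hi) = samples.partition(_(dim) < threshold)
        if (lo.isEmpty || hi.isEmpty) Leaf(nextLeafId())
        else Split(dim, threshold,
          buildParentTree(lo, depth - 1, nextLeafId),
          buildParentTree(hi, depth - 1, nextLeafId))
      }

    // route a vector down the parent tree to the id of the leaf it falls into
    def leafOf(node: Node, v: Array[Float]): Int = node match {
      case Leaf(id)                    => id
      case Split(dim, threshold, l, r) => if (v(dim) < threshold) leafOf(l, v) else leafOf(r, v)
    }

    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("build-sketch").master("local[*]").getOrCreate()
      val sc = spark.sparkContext

      // toy data: (id, 25-dimensional feature vector)
      val data = sc.parallelize(0 until 10000).map(i => (i, Array.fill(25)(Random.nextFloat())))

      // step 1: sample a small fraction and build the parent tree on the driver
      val samples = data.sample(withReplacement = false, fraction = 0.01).map(_._2).collect()
      var leafCount = 0
      val parentTree = buildParentTree(samples, depth = 4, nextLeafId = () => { leafCount += 1; leafCount - 1 })
      val bcParentTree = sc.broadcast(parentTree)

      // step 2: group every vector by the parent-tree leaf it falls into
      val grouped = data.groupBy { case (_, v) => leafOf(bcParentTree.value, v) }

      // step 3: build a subtree per leaf on the executors; here only the size
      // of each group is computed in place of a real Annoy subtree
      val subtrees = grouped.mapValues(_.size).collect()

      // step 4: the real implementation grafts each subtree onto its parent-tree
      // leaf on the driver to form the final index; here we just report sizes
      subtrees.sortBy(_._1).foreach { case (leaf, size) => println(s"leaf $leaf -> $size items") }

      spark.stop()
    }
  }

In the actual library, step 3 builds real Annoy subtrees and step 4 grafts them under the corresponding parent-tree leaves, so the aggregated result can be written as an Annoy-compatible binary (see saveAsAnnoyBinary above).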

Use Case

Index ALS User/Item Factors

  • src/test/scala/ann4s/spark/example/ALSBasedUserItemIndexing.scala
  ...
  val training: DataFrame = _
  val als = new ALS()
    .setMaxIter(5)
    .setRegParam(0.01)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
  val model = als.fit(training)

  val ann = new Annoy()
    .setNumTrees(2)
    .setFraction(0.1)
    .setIdCol("id")
    .setFeaturesCol("features")

  val userAnnModel = ann.fit(model.userFactors)
  userAnnModel.writeAnnoyBinary("exp/als/user_factors.ann")

  val itemAnnModel = ann.fit(model.itemFactors)
  itemAnnModel.writeAnnoyBinary("exp/als/item_factors.ann")
  ...

Comment

I personally started this project to study Scala. I found that Annoy is a fairly
good library for nearest neighbor search and that a distributed version can be
implemented using Apache Spark. Recently, various bindings and implementations
have been actively developed. In particular, the purpose and usability of this
project overlap with projects such as annoy4s and annoy-java in terms of
running on the JVM.

To continue contributing, from now on this project focuses on building the index
on Apache Spark for distributed builds. This will support building with 1 billion
or more items and writing an Annoy-compatible binary.

References