Project author: KristofferC

Project description:
High performance nearest neighbor data structures and algorithms for Julia.
Language: Julia
Repository: git://github.com/KristofferC/NearestNeighbors.jl.git
Created: 2015-10-31T23:16:37Z
Project community: https://github.com/KristofferC/NearestNeighbors.jl

License: Other


NearestNeighbors.jl

NearestNeighbors.jl is a Julia package for performing nearest neighbor searches.

Creating a Tree

There are currently three types of trees available:

  • KDTree: Recursively splits points into groups using hyper-planes.
  • BallTree: Recursively splits points into groups bounded by hyper-spheres.
  • BruteTree: Not actually a tree; it linearly searches all points in a brute-force manner.

These trees can be created using the following syntax:

```julia
KDTree(data, metric; leafsize, reorder)
BallTree(data, metric; leafsize, reorder)
BruteTree(data, metric; leafsize, reorder) # leafsize and reorder are unused for BruteTree
```

  • data: The points to build the tree from, either as
    • A matrix of size nd × np where nd is the dimensionality and np is the number of points, or
    • A vector of vectors with fixed dimensionality nd, i.e., data should be a Vector{V} where V is a subtype of AbstractVector with defined length(V). For example a Vector{V} where V = SVector{3, Float64} is ok because length(V) = 3 is defined.
  • metric: The Metric (from Distances.jl) to use, defaults to Euclidean. KDTree works with axis-aligned metrics: Euclidean, Chebyshev, Minkowski, and Cityblock while for BallTree and BruteTree other pre-defined Metrics can be used as well as custom metrics (that are subtypes of Metric).
  • leafsize: Determines the number of points (default 10) at which to stop splitting the tree. This is a trade-off: a larger leafsize means fewer nodes to traverse but more metric evaluations per leaf (a tuning sketch follows this list).
  • reorder: If true (default), during tree construction this rearranges points to improve cache locality during querying. This will create a copy of the original data.
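
Since the best leafsize depends on the data and the metric, it can be tuned empirically. A minimal sketch, assuming the third-party BenchmarkTools package is installed:

```julia
using NearestNeighbors, BenchmarkTools

data = rand(3, 10^4)
point = rand(3)

# Compare query time for a few leaf sizes.
for ls in (1, 10, 25, 100)
    tree = KDTree(data; leafsize = ls)
    t = @belapsed knn($tree, $point, 3)
    println("leafsize = ", ls, ": ", t, " s")
end
```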

All trees in NearestNeighbors.jl are static, meaning points cannot be added or removed after creation.
Note that this package is not suitable for very high-dimensional points, due to high compilation time and inefficient queries on the trees.

Example of creating trees:

```julia
using NearestNeighbors
data = rand(3, 10^4)

# Create trees
kdtree = KDTree(data; leafsize = 25)
balltree = BallTree(data, Minkowski(3.5); reorder = false)
brutetree = BruteTree(data)
```
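
As described above, data can also be a vector of fixed-size vectors, and BallTree and BruteTree accept custom metrics. A minimal sketch, assuming Distances.jl's convention that a custom metric subtypes Metric and is made callable (MyCityblock is a hypothetical name, equivalent to the built-in Cityblock):

```julia
using NearestNeighbors, StaticArrays
using Distances

# Vector-of-vectors input: each element is a fixed-size point.
svdata = [@SVector(rand(3)) for _ in 1:10^4]
kdtree_sv = KDTree(svdata)

# Hypothetical custom metric for BallTree/BruteTree.
struct MyCityblock <: Metric end
(::MyCityblock)(a, b) = sum(abs, a .- b)

balltree_custom = BallTree(svdata, MyCityblock())
```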

k-Nearest Neighbor (kNN) Searches

A kNN search finds the k nearest neighbors to a given point or points. This is done with the methods:

```julia
knn(tree, point[s], k, sortres = false, skip = always_false) -> idxs, dists
knn!(idxs, dists, tree, point, k, sortres = false, skip = always_false)
```

  • tree: The tree instance.
  • points: A vector or matrix of points to find the k nearest neighbors for. A vector of numbers represents a single point; for a matrix, the k nearest neighbors for each point (column) are computed. points can also be a vector of vectors.
  • sortres (optional): If true, the returned indices and distances are sorted by distance, nearest first.
  • skip (optional): A predicate to skip certain points, e.g., points already visited (see the sketch after this list).
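
A common use of skip is to exclude a query point that is itself stored in the tree. A minimal sketch, assuming the predicate receives a candidate point's index into data:

```julia
using NearestNeighbors

data = rand(3, 10^4)
kdtree = KDTree(data)

# Query with the first data point; skip index 1 so the point
# does not return itself as its own nearest neighbor.
idxs, dists = knn(kdtree, data[:, 1], 3, true, i -> i == 1)
```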

For the single closest neighbor, you can use nn:

```julia
nn(tree, points, skip = always_false) -> idxs, dists
```

Examples:

```julia
using NearestNeighbors
data = rand(3, 10^4)
k = 3
point = rand(3)

kdtree = KDTree(data)
idxs, dists = knn(kdtree, point, k, true)

idxs
# 3-element Array{Int64,1}:
#  4683
#  6119
#  3278

dists
# 3-element Array{Float64,1}:
#  0.039032201026256215
#  0.04134193711411951
#  0.042974090446474184

# Multiple points
points = rand(3, 4)
idxs, dists = knn(kdtree, points, k, true)

idxs
# 4-element Array{Array{Int64,1},1}:
#  [3330, 4072, 2696]
#  [1825, 7799, 8358]
#  [3497, 2169, 3737]
#  [1845, 9796, 2908]

dists
# 4-element Array{Array{Float64,1},1}:
#  [0.0298932, 0.0327349, 0.0365979]
#  [0.0348751, 0.0498355, 0.0506802]
#  [0.0318547, 0.037291, 0.0421208]
#  [0.03321, 0.0360935, 0.0411951]

# Static vectors
using StaticArrays
v = @SVector[0.5, 0.3, 0.2];

idxs, dists = knn(kdtree, v, k, true)

idxs
# 3-element Array{Int64,1}:
#  842
#  3075
#  3046

dists
# 3-element Array{Float64,1}:
#  0.04178677766255837
#  0.04556078331418939
#  0.049967238112417205

# Preallocating input results
idxs, dists = zeros(Int32, k), zeros(Float32, k)
knn!(idxs, dists, kdtree, v, k)
```
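
nn is equivalent to k = 1, with a scalar index and distance per query point. Continuing the session above, a minimal sketch:

```julia
idx, dist = nn(kdtree, point)    # single nearest neighbor of one point
idxs, dists = nn(kdtree, points) # one index and distance per column of points
```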

Range Searches

A range search finds all neighbors within radius r of the given point(s). This is done with the methods:

```julia
inrange(tree, points, r) -> idxs
inrange!(idxs, tree, point, r)
```

Distances are not returned.

Example:

```julia
using NearestNeighbors
data = rand(3, 10^4)
r = 0.05
point = rand(3)

balltree = BallTree(data)
idxs = inrange(balltree, point, r)
# 4-element Array{Int64,1}:
#  317
#  983
#  4577
#  8675

# Updates `idxs`
idxs = Int32[]
inrange!(idxs, balltree, point, r)

# Counts points without allocating index arrays
neighborscount = inrangecount(balltree, point, r)
```
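
Since inrange does not return distances, they can be recomputed from the indices with the metric from Distances.jl, and, as with knn, a matrix of query points yields one index vector per column. A sketch continuing the session above (idxs_per_point is a hypothetical name):

```julia
using Distances

# Recompute the distance to each in-range point.
dists = [euclidean(point, data[:, i]) for i in idxs]

# A matrix of query points returns one index vector per column.
points = rand(3, 4)
idxs_per_point = inrange(balltree, points, r)
```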

Using On-Disk Data Sets

By default, trees store a copy of the data provided during construction. For data sets larger than available memory, DataFreeTree can be used to strip a tree of its data field and re-link it later.

Example with a large on-disk data set:

```julia
using Mmap
using NearestNeighbors

ndim = 2
ndata = 10_000_000_000
# `datafilename` is the path of the file backing the memory map
data = Mmap.mmap(datafilename, Matrix{Float32}, (ndim, ndata))
data[:] = rand(Float32, ndim, ndata) # create example data
dftree = DataFreeTree(KDTree, data)
```

dftree stores the indexing data structures. To perform look-ups, re-link the tree to the data:

```julia
tree = injectdata(dftree, data) # yields a KDTree
knn(tree, data[:,1], 3) # perform operations as usual
```