项目作者: kernelmethod

项目描述 :
Locality-sensitive hashing (LSH) in Julia.
高级语言: Julia
项目地址: git://github.com/kernelmethod/LSHFunctions.jl.git
创建时间: 2019-07-19T04:19:07Z
项目社区:https://github.com/kernelmethod/LSHFunctions.jl

开源协议:BSD 3-Clause "New" or "Revised" License

下载


LSHFunctions.jl

Stable docs Dev docs
Build Status
Codecov
DOI

A Julia package for locality-sensitive
hashing
to accelerate
similarity search.

What’s LSH?

Traditionally, if you have a data point x, and want to find the most similar
point(s) to x in your database, you would compute the similarity between x
and all of the points in your database, and keep whichever points were the most
similar. For instance, this type of approach is used by the classic k-nearest
neighbors algorithm
.
However, it has two major problems:

  • The time to find the most similar point(s) to x is linear in the number of
    points in your database. This can make similarity search prohibitively
    expensive for even moderately large datasets.
  • In addition, the time complexity to compute the similarity between two
    datapoints is typically linear in the number of dimensions of those
    datapoints. If your data are high-dimensional (i.e. in the thousands to
    millions of dimensions), every similarity computation you perform can be
    fairly costly.

Locality-sensitive hashing (LSH) is a technique for accelerating these kinds
of similarity searches. Instead of measuring how similar your query point is to
every point in your database, you calculate a few hashes of the query point and
only compare it against those points with which it experiences a hash collision.
Locality-sensitive hash functions are randomly generated, with the fundamental
property that as the similarity between x and y increases, the probability
of a hash collision between x and y also increases.

Installation

You can install LSHFunctions.jl from the Julia REPL with

  1. pkg> add LSHFunctions

Supported similarity functions

So far, there are hash functions for the similarity functions:

  • Cosine similarity (SimHash)
  • Jaccard similarity (MinHash)
  • L1 (Manhattan / “taxicab”) distance: L1Hash
  • L2 (Euclidean) distance: L2Hash
  • Inner product
    • SignALSH (recommended)
    • MIPSHash
  • Function-space hashes (supports L1, L2, and cosine similarity)
    • MonteCarloHash
    • ChebHash

This package still needs a lot of work, including improvement to the
documentation and API.

Examples

The easiest way to start constructing new hash functions is by calling
LSHFunction with the following syntax:

  1. hashfn = LSHFunction(similarity function,
  2. number of hash functions to generate;
  3. [LSH family-specific keyword arguments])

For example, the following snippet generates 10 locality-sensitive hash
functions (bundled together into a single SimHash ) for cosine similarity:

  1. julia> using LSHFunctions;
  2. julia> hashfn = LSHFunction(cossim, 10);
  3. julia> n_hashes(hashfn)
  4. 10
  5. julia> similarity(hashfn)
  6. cossim

You can hash inputs by calling hashfn():

  1. julia> x = randn(128);
  2. julia> x_hashes = hashfn(x);

For more details, check out the LSHFunctions.jl
documentation
.