项目作者: matsui528

项目描述 :
Fast and memory-efficient ANN with a subset-search functionality
高级语言: Python
项目地址: git://github.com/matsui528/rii.git
创建时间: 2018-07-24T09:12:54Z
项目社区:https://github.com/matsui528/rii

开源协议:MIT License

下载


Build
Documentation Status
PyPI version
Downloads

Reconfigurable Inverted Index (Rii): IVFPQ-based fast and memory efficient approximate nearest neighbor search method
with a subset-search functionality.

Reference:

Summary of features

The search can be operated for a subset of a database. Rii remains fast even after many new items are added.
  • Fast and memory efficient ANN. Rii enables you to run billion-scale search in less than 10 ms.
  • You can run the search over a subset of the whole database
  • Rii Remains fast even after many vectors are newly added (i.e., the data structure can be reconfigured)

Installing

You can install the package via pip. This library works with Python 3.6+ on linux/mac/wsl/Windows10

  1. pip install rii

For windows (maintained by @ashleyabraham)

### Installing in Windows 10 via pip install
Requires MS Visual Studio Build tools C++ 14.0 or 14.1 toolset or above to compile and install via pip install

### Pre-compiled binary for Windows 10
Pre-compiled binaries doesn’t require MS Visual Studio Build tools

#Python 3.8 pip install https://github.com/ashleyabraham/rii/releases/download/v0.2.8/rii-0.2.8-cp38-cp38-win_amd64.whl
#Python 3.7 pip install https://github.com/ashleyabraham/rii/releases/download/v0.2.8/rii-0.2.8-cp37-cp37m-win_amd64.whl

#### OpenMP
OpenMP requires libomp140_x86_64.dll to compile in windows, which is part of MS Visual Studio Build tools and it is not redistributable.

In order to use OpenMP 3.0 /openmp:llvm flag is used which causes warnings of multiple libs loading, use at your descretion when used with other parallel processing library loadings. To supress use

SET KMP_DUPLICATE_LIB_OK=TRUE

#### SIMD
The /arch:AVX2 flag is used in MSVC to set appropriate SIMD preprocessors and compiler intrinsics

Documentation

Usage

Basic ANN

  1. import rii
  2. import nanopq
  3. import numpy as np
  4. N, Nt, D = 10000, 1000, 128
  5. X = np.random.random((N, D)).astype(np.float32) # 10,000 128-dim vectors to be searched
  6. Xt = np.random.random((Nt, D)).astype(np.float32) # 1,000 128-dim vectors for training
  7. q = np.random.random((D,)).astype(np.float32) # a 128-dim vector
  8. # Prepare a PQ/OPQ codec with M=32 sub spaces
  9. codec = nanopq.PQ(M=32).fit(vecs=Xt) # Trained using Xt
  10. # Instantiate a Rii class with the codec
  11. e = rii.Rii(fine_quantizer=codec)
  12. # Add vectors
  13. e.add_configure(vecs=X)
  14. # Search
  15. ids, dists = e.query(q=q, topk=3)
  16. print(ids, dists) # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]

Note that you can construct a PQ codec and instantiate the Rii class at the same time if you want.

  1. e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt))
  2. e.add_configure(vecs=X)

Furthermore, you can even write them in one line by chaining a function.

  1. e = rii.Rii(fine_quantizer=nanopq.PQ(M=32).fit(vecs=Xt)).add_configure(vecs=X)
  1. # The search can be conducted over a subset of the database
  2. target_ids = np.array([85, 132, 236, 551, 694, 728, 992, 1234]) # Specified by IDs
  3. # For windows, you must specify dtype=np.int64 as follows.
  4. # target_ids = np.array([85, 132, 236, 551, 694, 728, 992, 1234], dtype=np.int64)
  5. ids, dists = e.query(q=q, topk=3, target_ids=target_ids)
  6. print(ids, dists) # e.g., [728 85 132] [14.80522156 15.92787838 16.28690338]

Data addition and reconfiguration

  1. # Add new vectors
  2. X2 = np.random.random((1000, D)).astype(np.float32)
  3. e.add(vecs=X2) # Now N is 11000
  4. e.query(q=q) # Ok. (0.12 msec / query)
  5. # However, if you add quite a lot of vectors, the search might become slower
  6. # because the data structure has been optimized for the initial item size (N=10000)
  7. X3 = np.random.random((1000000, D)).astype(np.float32)
  8. e.add(vecs=X3) # A lot. Now N is 1011000
  9. e.query(q=q) # Slower (0.96 msec/query)
  10. # In such case, run the reconfigure function. That updates the data structure
  11. e.reconfigure()
  12. e.query(q=q) # Ok. (0.21 msec / query)

I/O by pickling

  1. import pickle
  2. with open('rii.pkl', 'wb') as f:
  3. pickle.dump(e, f)
  4. with open('rii.pkl', 'rb') as f:
  5. e_dumped = pickle.load(f) # e_dumped is identical to e

Util functions

  1. # Print the current parameters
  2. e.print_params()
  3. # Delete all PQ-codes and posting lists. fine_quantizer is kept.
  4. e.clear()
  5. # You can switch the verbose flag
  6. e.verbose = False
  7. # You can merge two Rii instances if they have the same fine_quantizer
  8. e1 = rii.Rii(fine_quantizer=codec)
  9. e2 = rii.Rii(fine_quantizer=codec)
  10. e1.add_configure(vecs=X1)
  11. e2.add_configure(vecs=X2)
  12. e1.merge(e2) # Now e1 contains both X1 and X2

Examples

Author

Credits