From fa974f8054e859155f21d8982fbe3d81ef64afd1 Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Mon, 17 Aug 2020 17:50:28 -0700 Subject: [PATCH 01/29] Update the documentation of HNSWLib --- README.md | 318 +++++++++++++++++++++++------------------------------- TESTS.md | 44 ++++++++ 2 files changed, 176 insertions(+), 186 deletions(-) create mode 100644 TESTS.md diff --git a/README.md b/README.md index 559c5dfd..aec7e7b1 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,100 @@ -# Hnswlib - fast approximate nearest neighbor search -Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment -**NEWS:** -* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the perfromance/memory should not degrade as you update the element embeddinds).** +# HNSWLIB - Fast Approximate Nearest Neighbor Search -* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not mutiple of 4** +Hnswlib is a C++ library with Python bindings for highly performant implementation of [HNSW](https://arxiv.org/abs/1603.09320) *(Hierarchical Navigable Small World Graphs)* algorithm to perform fast and efficient vector similarity search in high dimensional spaces . It achieves state-of-the-art performance on diverse datasets and one of the top-most leaders in ANN performance benchmarks as show in *[ann-benchmarks.com](http://ann-benchmarks.com)*. +HNSW algorithm is being leveraged globally for performing fast and efficient similarity search. Some public examples for the usage are ***Facebook*** ([Faiss](https://github.com/facebookresearch/faiss)), ***Twitter*** ([Paper](KDD paper link)), ***Pinterest*** ([Paper]([https://arxiv.org/pdf/2007.03634.pdf](https://arxiv.org/pdf/2007.03634.pdf))), ***Amazon*** ([Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)), ***Microsoft*** ([HNSW .NET]([https://github.com/microsoft/HNSW.Net](https://github.com/microsoft/HNSW.Net))), ***Open Distro*** ([Blog](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)) etc. -* **Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib can now be installed via pip!** -Highlights: -1) Lightweight, header-only, no dependencies other than C++ 11. -2) Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw). -3) Has full support for incremental index construction. Has support for element deletions -(currently, without actual freeing of the memory). -4) Can work with custom user defined distances (C++). -5) Significantly less memory footprint and faster build time compared to current nmslib's implementation. +### News -Description of the algorithm parameters can be found in [ALGO_PARAMS.md](ALGO_PARAMS.md). 
+* *Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates, i.e. feature vectors of existing elements can be updated incrementally without rebuilding the index from scratch (the interface remains the same as for element insertion).*

+* *Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not a multiple of 4.*

-### Python bindings

+* *Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)), hnswlib can now be installed via pip!*
+
+### Highlights
+
+1) Very lightweight and header-only, with no dependencies other than C++ 11.
+2) Works well for both low and high dimensional datasets.
+3) It belongs to the class of unrestricted-memory ANN methods that store the vectors in memory; having no bound on RAM allows the best performance in terms of speed and resulting accuracy.
+4) Interfaces and bindings for C++, Python. External bindings for [R](https://github.com/jlmelville/rcpphnsw) and [Java](https://github.com/stepstone-tech/hnswlib-jna) contributed by community.
+5) Other external implementation of the algorithm available in diverse languages like .Net, Go, Java, Python etc. Refer [other implementations](#other-implementations)
+6) Significantly less memory footprint and faster build time compared to current NMSLIB's implementation.
+
+
+### Supported Operations
+
+1) Supports batch(offline) and realtime(online) index.
+2) Supports incremental query, insert, update and deletion of vectors.
+3) Highly performant and efficient support for multi-threaded reads and writes in parallel (query/insert/update/delete). Multi-thread Performance scales with number of cpu cores in machine.
+4) Supports efficient serialization and deserialization of index to/from disk.
+5) Can support user defined arbitrary and exotic similarity metrics like Hyperbolic distances (Poincare/Lorentzian), Jaccard distance, Manhattan distance etc. (In C++ version)
+
+*Note: Currently deletions of elements from the index does not free the associated memory of the vectors to be deleted.*
+
+### Installation
+It can be installed from sources:
+```bash
+apt-get install -y python-setuptools python-pip
+pip3 install pybind11 numpy setuptools
+cd python_bindings
+python3 setup.py install
+```
+
+or it can be installed via pip:
+`pip install hnswlib`
+
+### Python code example
+***Example***: Performing in-memory queries, inserts, updates, deletes and serialization/deserialization of the index. For algorithm construction and runtime parameter details, please refer to [Params](ALGO_PARAMS.md).
+
+```python
+import hnswlib
+import numpy as np
+
+################## Declaring and Initializing index ##################
+dim = 128
+num_elements = 10000
+index = hnswlib.Index(space = 'l2', dim = dim) # possible space options: [l2/cosine/ip]
+# Set number of threads used during batch search/construction
+# By default using all available cores
+index.set_num_threads(4)
+# Initializing index - the maximum number of elements should be known beforehand
+index.init_index(max_elements = num_elements, ef_construction = 200, M = 16)
+
+################## Perform element insertion ##################
+data = np.float32(np.random.random((num_elements, dim))) # Generating sample data
+data_labels = np.arange(num_elements)
+index.add_items(data, data_labels) # Element insertion (can be called several times)
+
+################## Perform element feature vector updates ##################
+element_labels_to_update = data_labels[0:1] # Update the feature vector of the first element
+element_updated_vectors = np.float32(np.random.random((1, dim))) # Corresponding updated vector
+index.add_items(element_updated_vectors, element_labels_to_update) # Perform update (data first, labels second)
+
+################## Perform element deletion ##################
+element_labels_to_delete = data_labels[0:2] # Delete the first two elements
+for label in element_labels_to_delete:
+    index.mark_deleted(label) # mark_deleted takes a single label per call
+
+################## Perform nearest neighbor querying ##################
+index.set_ef(50) # Controlling the recall by setting ef, it should always be > k
+# Query dataset, k - number of closest elements (returns 2 numpy arrays)
+labels, distances = index.knn_query(data, k = 1)
+
+################## Serializing the index to disk ##################
+index_path = 'first_half.bin'
+index.save_index(index_path)
+del index
+
+################## Loading the index from disk and increasing the capacity of the index ##################
+index = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.
+# If required, total capacity of the index can be increased while loading the index using `max_elements`, so that it will be able to handle insertion of new data
+index.load_index(index_path, max_elements = 2 * num_elements) # Increase capacity of the index by 2x
+new_data = np.float32(np.random.random((num_elements, dim))) # Generate new sample data
+index.add_items(new_data) # Add new data to index
+
+```

 #### Supported distances:

@@ -30,33 +104,31 @@ Description of the algorithm parameters can be found in [ALGO_PARAMS.md](ALGO_PA
 |Inner product |'ip' | d = 1.0 - sum(Ai\*Bi) |
 |Cosine similarity |'cosine' | d = 1.0 - sum(Ai\*Bi) / sqrt(sum(Ai\*Ai) * sum(Bi\*Bi))|
 
-Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.
+*Note: Inner product is not an actual distance metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.*
 
-For other spaces use the nmslib library https://github.com/nmslib/nmslib.
+For other spaces [nmslib](https://github.com/nmslib/nmslib) library can be used, or users can define their own distance/similarity metric in the C++ version of the hnswlib library.
 
-#### Short API description
-* `hnswlib.Index(space, dim)` creates a non-initialized index an HNSW in space `space` with integer dimension `dim`. 
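A quick way to sanity-check the update and delete semantics used in the example above is a minimal sketch along the following lines (sizes and labels here are arbitrary): re-adding an existing label replaces its stored vector, and a label marked as deleted should no longer appear in query results.

```python
import hnswlib
import numpy as np

dim = 16
num_elements = 1000

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=100, M=16)

data = np.float32(np.random.random((num_elements, dim)))
labels = np.arange(num_elements)
index.add_items(data, labels)

# Re-adding an existing label updates its stored feature vector in place
updated_vector = np.float32(np.random.random((1, dim)))
index.add_items(updated_vector, labels[0:1])
assert np.allclose(index.get_items([0]), updated_vector)

# A label marked as deleted is omitted from query results (its memory is not freed)
index.mark_deleted(1)
index.set_ef(50)
found, _ = index.knn_query(data[1:2], k=5)
assert 1 not in found[0]
```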
-
-Index methods:
+#### Short Python API description
+* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.
 * `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index from with no elements.
-    * `max_elements` defines the maximum number of elements that can be stored in the structure(can be increased/shrunk).
-    * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
-    * `M` defines tha maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).
+    * `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk). An exception is thrown if this limit is exceeded during insertion of an element. The capacity can be increased by saving/loading the index as shown in the Python example above.
+    * `ef_construction` defines a construction time/accuracy trade-off (See [Params](ALGO_PARAMS.md)).
+    * `M` defines the maximum number of outgoing connections in the graph (See [Params](ALGO_PARAMS.md)).
 
-* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
+* `add_items(data, data_labels, num_threads = -1)` - **inserts/updates** the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
    * `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
    * `num_threads` sets the number of cpu threads to use (-1 means use default).
-   * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
+   * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their feature vectors will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
    * Thread-safe with other `add_items` calls, but not with `knn_query`.
+   * ####### May be expose multi-threaded query method and add documentation ####
 
-* `mark_deleted(data_label)` - marks the element as deleted, so it will be ommited from search results.
+* `mark_deleted(data_label)` - marks the element as **deleted**, so it will be omitted from search results.
 
 * `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
 
-* `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
-[ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
+* `set_ef(ef)` - sets the runtime query accuracy/speed trade-off, defined by the `ef` parameter (See [Params](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading (the value does not have to match the build-time setting; users can tune it to their performance needs).
 
-* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closests elements for each element of the
+* `knn_query(data, k = 1, num_threads = -1)` makes a batch query for the `k` nearest neighbors for each element of the
 * `data` (shape:`N*dim`). Returns a numpy array of (shape:`N*k`).
    * `num_threads` sets the number of cpu threads to use (-1 means use default). 
* Thread-safe with other `knn_query` calls, but not with `add_items`. @@ -76,122 +148,25 @@ Index methods: * `get_current_count()` - returns the current number of element stored in the index - - - - -#### Python bindings examples -```python -import hnswlib -import numpy as np - -dim = 128 -num_elements = 10000 - -# Generating sample data -data = np.float32(np.random.random((num_elements, dim))) -data_labels = np.arange(num_elements) - -# Declaring index -p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip - -# Initing index - the maximum number of elements should be known beforehand -p.init_index(max_elements = num_elements, ef_construction = 200, M = 16) - -# Element insertion (can be called several times): -p.add_items(data, data_labels) - -# Controlling the recall by setting ef: -p.set_ef(50) # ef should always be > k - -# Query dataset, k - number of closest elements (returns 2 numpy arrays) -labels, distances = p.knn_query(data, k = 1) -``` - -An example with updates after serialization/deserialization: -```python -import hnswlib -import numpy as np - -dim = 16 -num_elements = 10000 - -# Generating sample data -data = np.float32(np.random.random((num_elements, dim))) - -# We split the data in two batches: -data1 = data[:num_elements // 2] -data2 = data[num_elements // 2:] - -# Declaring index -p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - -# Initing index -# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded -# during insertion of an element. -# The capacity can be increased by saving/loading the index, see below. -# -# ef_construction - controls index search speed/build speed tradeoff -# -# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) -# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction - -p.init_index(max_elements=num_elements//2, ef_construction=100, M=16) - -# Controlling the recall by setting ef: -# higher ef leads to better accuracy, but slower search -p.set_ef(10) - -# Set number of threads used during batch search/construction -# By default using all available cores -p.set_num_threads(4) - - -print("Adding first batch of %d elements" % (len(data1))) -p.add_items(data1) - -# Query the elements for themselves and measure recall: -labels, distances = p.knn_query(data1, k=1) -print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n") - -# Serializing and deleting the index: -index_path='first_half.bin' -print("Saving index to '%s'" % index_path) -p.save_index("first_half.bin") -del p - -# Reiniting, loading the index -p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function. +### Tests +To reproduce performance benchmark results on 200M SIFT dataset as described in HNSW paper or to run tests for feature vector updates please refer [Tests](TESTS.md). -print("\nLoading index from 'first_half.bin'\n") +### Authors -# Increase the total capacity (max_elements), so that it will handle the new data -p.load_index("first_half.bin", max_elements = num_elements) +- [Yury Malkov](https://github.com/yurymalkov) is the lead author and developer of the HNSW algorithm and Hnswlib library. +- [Apoorv Sharma](https://github.com/apoorv-sharma) co-authored an algorithm for performing dynamic updates of feature vectors in HNSW with [Yury Malkov](https://github.com/yurymalkov) and implemented it in HnswLib. 
+- [User2](https://github.com/user) User2 implemented delete... +- [User3](https://github.com/user) User3.... -print("Adding the second batch of %d elements" % (len(data2))) -p.add_items(data2) -# Query the elements for themselves and measure recall: -labels, distances = p.knn_query(data, k=1) -print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n") -``` - -### Bindings installation - -You can install from sources: -```bash -apt-get install -y python-setuptools python-pip -pip3 install pybind11 numpy setuptools -cd python_bindings -python3 setup.py install -``` - -or you can install via pip: -`pip install hnswlib` +### Contributing to the repository and HNSW Community +Contributions are highly welcome! +Please make pull requests against the `develop` branch. +Please feel free to ask questions, report bugs and raise new feature requests at [issues page](https://github.com/nmslib/hnswlib/issues) of the repository. ### Other implementations * Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib -* Faiss libary by facebook, uses own HNSW implementation for coarse quantization (python, C++): +* Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++): https://github.com/facebookresearch/faiss * Code for the paper ["Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors"](https://arxiv.org/abs/1802.02422) @@ -205,56 +180,27 @@ https://github.com/dbaranchuk/ivf-hnsw * Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna * .Net implementation: https://github.com/microsoft/HNSW.Net -### Contributing to the repository -Contributions are highly welcome! - -Please make pull requests against the `develop` branch. - -### 200M SIFT test reproduction -To download and extract the bigann dataset: -```bash -python3 download_bigann.py -``` -To compile: -```bash -cmake . -make all -``` - -To run the test on 200M SIFT subset: -```bash -./main -``` - -The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**. - -### Updates test -To generate testing data (from root directory): -```bash -cd examples -python update_gen_data.py -``` -To compile (from root directory): -```bash -mkdir build -cd build -cmake .. -make -``` -To run test **without** updates (from `build` directory) -```bash -./test_updates -``` - -To run test **with** updates (from `build` directory) -```bash -./test_updates update -``` - ### HNSW example demos - Visual search engine for 1M amazon products (MXNet + HNSW): [website](https://thomasdelteil.github.io/VisualSearch_MXNet/), [code](https://github.com/ThomasDelteil/VisualSearch_MXNet), demo by [@ThomasDelteil](https://github.com/ThomasDelteil) ### References -Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320 +Reference to cite when you use HNSW or Hnswlib in a research paper: +``` +@article{DBLP:journals/corr/MalkovY16, + author = {Yury A. Malkov and + D. A. 
Yashunin}, + title = {Efficient and robust approximate nearest neighbor search using Hierarchical + Navigable Small World graphs}, + journal = {CoRR}, + volume = {abs/1603.09320}, + year = {2016}, + url = {http://arxiv.org/abs/1603.09320}, + archivePrefix = {arXiv}, + eprint = {1603.09320}, + timestamp = {Mon, 13 Aug 2018 16:46:53 +0200}, + biburl = {https://dblp.org/rec/journals/corr/MalkovY16.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} +``` diff --git a/TESTS.md b/TESTS.md new file mode 100644 index 00000000..fed82364 --- /dev/null +++ b/TESTS.md @@ -0,0 +1,44 @@ + +# Tests + + +### 200M SIFT test reproduction +To download and extract the bigann dataset: +```bash +python3 download_bigann.py +``` +To compile: +```bash +cmake . +make all +``` + +To run the test on 200M SIFT subset: +```bash +./main +``` + +The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**. + +### Feature Vector Updates test +To generate testing data (from root directory): +```bash +cd examples +python update_gen_data.py +``` +To compile (from root directory): +```bash +mkdir build +cd build +cmake .. +make +``` +To run test **without** updates (from `build` directory) +```bash +./test_updates +``` + +To run test **with** updates (from `build` directory) +```bash +./test_updates update +``` \ No newline at end of file From 42541d8515568be60433c4b2a9e04290ad6e7ddb Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Tue, 18 Aug 2020 11:08:39 -0700 Subject: [PATCH 02/29] Update --- README.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index aec7e7b1..9d80b9c7 100644 --- a/README.md +++ b/README.md @@ -27,8 +27,8 @@ HNSW algorithm is being leveraged globally for performing fast and efficient sim ### Supported Operations 1) Supports batch(offline) and realtime(online) index. -2) Supports incremental query, insert, update and deletion of vectors. -3) Highly performant and efficient support for multi-threaded reads and writes in parallel (query/insert/update/delete). Multi-thread Performance scales with number of cpu cores in machine. +2) Supports multi-threaded incremental query, insert, update and deletion of vectors. +3) Highly performant and efficient locking implementation to support multi-threaded reads and writes in parallel i.e multi-threaded query/insert/update/delete in parallel (Currently in C++ version only). Performance scales with number of cpu cores in machine . 4) Supports efficient serialization and deserialization of index to/from disk. 5) Can support user defined arbitrary and exotic similarity metrics like Hyperbolic distances (Poincare/Lorentzian), Jaccard distance, Manhattan distance etc. (In C++ version) @@ -120,7 +120,6 @@ For other spaces [nmslib](https://github.com/nmslib/nmslib) library can be used * `num_threads` sets the number of cpu threads to use (-1 means use default). * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their feature vectors will be updated. Note that update procedure is slower than insertion of a new element, but more memory and query-efficient. * Thread-safe with other `add_items` calls, but not with `knn_query`. - * ####### May be expose multi-threaded query method and add documentation #### * `mark_deleted(data_label)` - marks the element as **deleted**, so it will be omitted from search results. 
@@ -151,7 +150,7 @@ For other spaces [nmslib](https://github.com/nmslib/nmslib) library can be used ### Tests To reproduce performance benchmark results on 200M SIFT dataset as described in HNSW paper or to run tests for feature vector updates please refer [Tests](TESTS.md). -### Authors +### Authors and Contributors - [Yury Malkov](https://github.com/yurymalkov) is the lead author and developer of the HNSW algorithm and Hnswlib library. - [Apoorv Sharma](https://github.com/apoorv-sharma) co-authored an algorithm for performing dynamic updates of feature vectors in HNSW with [Yury Malkov](https://github.com/yurymalkov) and implemented it in HnswLib. From b65f5e8323119c3b762486afbed3d091ff5ab377 Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Tue, 18 Aug 2020 11:10:39 -0700 Subject: [PATCH 03/29] Update --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9d80b9c7..9ac24a06 100644 --- a/README.md +++ b/README.md @@ -28,11 +28,11 @@ HNSW algorithm is being leveraged globally for performing fast and efficient sim 1) Supports batch(offline) and realtime(online) index. 2) Supports multi-threaded incremental query, insert, update and deletion of vectors. -3) Highly performant and efficient locking implementation to support multi-threaded reads and writes in parallel i.e multi-threaded query/insert/update/delete in parallel (Currently in C++ version only). Performance scales with number of cpu cores in machine . +3) Highly performant and efficient locking implementation to support multi-threaded reads and writes in parallel i.e multi-threaded query/insert/update/delete in parallel (Currently exposed in C++ version only). 4) Supports efficient serialization and deserialization of index to/from disk. 5) Can support user defined arbitrary and exotic similarity metrics like Hyperbolic distances (Poincare/Lorentzian), Jaccard distance, Manhattan distance etc. (In C++ version) -*Note: Currently deletions of elements from the index does not free the associated memory of the vectors to be deleted.* +*Note: Currently deletions of elements from the index does not free the associated memory of the vectors to be deleted and performance of read/writes scales with number of cpu cores in machine.* ### Installation It can be installed from sources: From 034201f8c42ae9ea352814a899b25109eab73953 Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Tue, 18 Aug 2020 11:23:02 -0700 Subject: [PATCH 04/29] Update --- README.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9ac24a06..fcd0a79a 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,10 @@ + # HNSWLIB - Fast Approximate Nearest Neighbor Search Hnswlib is a C++ library with Python bindings for highly performant implementation of [HNSW](https://arxiv.org/abs/1603.09320) *(Hierarchical Navigable Small World Graphs)* algorithm to perform fast and efficient vector similarity search in high dimensional spaces . It achieves state-of-the-art performance on diverse datasets and one of the top-most leaders in ANN performance benchmarks as show in *[ann-benchmarks.com](http://ann-benchmarks.com)*. -HNSW algorithm is being leveraged globally for performing fast and efficient similarity search. 
Some public examples for the usage are ***Facebook*** ([Faiss](https://github.com/facebookresearch/faiss)), ***Twitter*** ([Paper](KDD paper link)), ***Pinterest*** ([Paper]([https://arxiv.org/pdf/2007.03634.pdf](https://arxiv.org/pdf/2007.03634.pdf))), ***Amazon*** ([Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)), ***Microsoft*** ([HNSW .NET]([https://github.com/microsoft/HNSW.Net](https://github.com/microsoft/HNSW.Net))), ***Open Distro*** ([Blog](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)) etc. +HNSW algorithm is being leveraged globally for performing fast and efficient similarity search. Some public examples for the usage in the industry are *[Facebook](https://github.com/facebookresearch/faiss)*, *[Twitter](link)*, *[Pinterest](https://arxiv.org/pdf/2007.03634.pdf)*, *[Amazon](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)*, *[Microsoft](https://github.com/microsoft/HNSW.Net)* and *[Open Distro](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)* . ### News @@ -20,7 +21,7 @@ HNSW algorithm is being leveraged globally for performing fast and efficient sim 2) Works well for both low and high dimensional datasets. 3) It belongs to unrestricted memory ANN that allow the vectors to be stored in memory. No bound on RAM allows the best performance in terms of speed and resulting accuracy. 4) Interfaces and bindings for C++, Python. External bindings for [R](https://github.com/jlmelville/rcpphnsw) and [Java](https://github.com/stepstone-tech/hnswlib-jna) contributed by community. -5) Other external implementation of the algorithm available in diverse languages like .Net, Go, Java, Python etc. Refer [other implementations](#other-implementations) +5) Other external implementation of the algorithm available in diverse languages like .Net, Go, Java, Python, Rust, Julia etc. Refer this [section](#other-implementations) for more details. 6) Significantly less memory footprint and faster build time compared to current NMSLIB's implementation. @@ -178,6 +179,8 @@ https://github.com/dbaranchuk/ivf-hnsw * Java implementation: https://github.com/jelmerk/hnswlib * Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna * .Net implementation: https://github.com/microsoft/HNSW.Net +* Rust implementation: https://github.com/rust-cv/hnsw +* Julia implementation: https://juliapackages.com/p/hnsw ### HNSW example demos From 5d16c1fee5c5d992450222047dfc8b27237f70ae Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Tue, 18 Aug 2020 11:35:38 -0700 Subject: [PATCH 05/29] Update --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index fcd0a79a..96718bae 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,11 @@ + # HNSWLIB - Fast Approximate Nearest Neighbor Search -Hnswlib is a C++ library with Python bindings for highly performant implementation of [HNSW](https://arxiv.org/abs/1603.09320) *(Hierarchical Navigable Small World Graphs)* algorithm to perform fast and efficient vector similarity search in high dimensional spaces . It achieves state-of-the-art performance on diverse datasets and one of the top-most leaders in ANN performance benchmarks as show in *[ann-benchmarks.com](http://ann-benchmarks.com)*. 
-HNSW algorithm is being leveraged globally for performing fast and efficient similarity search. Some public examples for the usage in the industry are *[Facebook](https://github.com/facebookresearch/faiss)*, *[Twitter](link)*, *[Pinterest](https://arxiv.org/pdf/2007.03634.pdf)*, *[Amazon](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)*, *[Microsoft](https://github.com/microsoft/HNSW.Net)* and *[Open Distro](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)* . +Hnswlib is a C++ library with Python bindings to perform fast and efficient vector similarity search in high dimensional spaces. It is high performance implementation of [HNSW](https://arxiv.org/abs/1603.09320) *(Hierarchical Navigable Small World Graphs)* algorithm. It achieves state-of-the-art performance on diverse datasets and one of the top-most leaders in ANN performance benchmarks as shown in *[ann-benchmarks.com](http://ann-benchmarks.com)*. +It is a popular algorithm globally for performing similarity search. Some public examples for the algorithm usage in the industry are *[Facebook](https://github.com/facebookresearch/faiss)*, *[Twitter](link)*, *[Pinterest](https://arxiv.org/pdf/2007.03634.pdf)*, *[Amazon](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)*, *[Microsoft](https://github.com/microsoft/HNSW.Net)* and *[Open Distro](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)* . ### News From f44512402ba9222c83ad809a817fad8dc53c90c1 Mon Sep 17 00:00:00 2001 From: apoorv sharma Date: Wed, 26 Aug 2020 10:49:07 -0700 Subject: [PATCH 06/29] Update --- README.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/README.md b/README.md index 96718bae..b9c63e81 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,7 @@ - - - - # HNSWLIB - Fast Approximate Nearest Neighbor Search Hnswlib is a C++ library with Python bindings to perform fast and efficient vector similarity search in high dimensional spaces. It is high performance implementation of [HNSW](https://arxiv.org/abs/1603.09320) *(Hierarchical Navigable Small World Graphs)* algorithm. It achieves state-of-the-art performance on diverse datasets and one of the top-most leaders in ANN performance benchmarks as shown in *[ann-benchmarks.com](http://ann-benchmarks.com)*. -It is a popular algorithm globally for performing similarity search. Some public examples for the algorithm usage in the industry are *[Facebook](https://github.com/facebookresearch/faiss)*, *[Twitter](link)*, *[Pinterest](https://arxiv.org/pdf/2007.03634.pdf)*, *[Amazon](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)*, *[Microsoft](https://github.com/microsoft/HNSW.Net)* and *[Open Distro](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)* . +It is a popular algorithm globally for performing similarity search. 
Some public examples for the algorithm usage in the industry are *[Facebook](https://github.com/facebookresearch/faiss)*, *[Twitter](https://irsworkshop.github.io/2020/publications/paper_2_%20Virani_Twitter.pdf)*, *[Pinterest](https://arxiv.org/pdf/2007.03634.pdf)*, *[Amazon](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)*, *[Microsoft](https://github.com/microsoft/HNSW.Net)* and *[Open Distro](https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch)* . ### News From 1a6878ad1f20aea6b561e356822520c26dd79e6a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:49:56 +0100 Subject: [PATCH 07/29] Add support for Python 3.8 --- .travis.yml | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/.travis.yml b/.travis.yml index 2c3c9960..b89259de 100644 --- a/.travis.yml +++ b/.travis.yml @@ -9,7 +9,11 @@ jobs: - name: Linux Python 3.7 os: linux python: 3.7 - + + - name: Linux Python 3.8 + os: linux + python: 3.8 + - name: Windows Python 3.6 os: windows language: shell # 'language: python' is an error on Travis CI Windows @@ -28,6 +32,15 @@ jobs: - python --version env: PATH=/c/Python37:/c/Python37/Scripts:$PATH + - name: Windows Python 3.8 + os: windows + language: shell # 'language: python' is an error on Travis CI Windows + before_install: + - choco install python --version 3.8.0 + - python -m pip install --upgrade pip + - python --version + env: PATH=/c/Python38:/c/Python38/Scripts:$PATH + install: - | python -m pip install . From 0869ed59b1d281729d7ede84b36897f76e9dd7d3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:50:48 +0100 Subject: [PATCH 08/29] Add support for Python 3.9 --- .travis.yml | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/.travis.yml b/.travis.yml index b89259de..76f7d7d4 100644 --- a/.travis.yml +++ b/.travis.yml @@ -14,6 +14,10 @@ jobs: os: linux python: 3.8 + - name: Linux Python 3.9 + os: linux + python: 3.9 + - name: Windows Python 3.6 os: windows language: shell # 'language: python' is an error on Travis CI Windows @@ -41,6 +45,15 @@ jobs: - python --version env: PATH=/c/Python38:/c/Python38/Scripts:$PATH + - name: Windows Python 3.9 + os: windows + language: shell # 'language: python' is an error on Travis CI Windows + before_install: + - choco install python --version 3.9.0 + - python -m pip install --upgrade pip + - python --version + env: PATH=/c/Python39:/c/Python39/Scripts:$PATH + install: - | python -m pip install . 
From 243783086ec1f7336e41cabbef38a4958de3cc3f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:57:37 +0100 Subject: [PATCH 09/29] List comprehension improves speed of KNN query --- examples/pyw_hnswlib.py | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/examples/pyw_hnswlib.py b/examples/pyw_hnswlib.py index dc300173..e450aa15 100644 --- a/examples/pyw_hnswlib.py +++ b/examples/pyw_hnswlib.py @@ -55,8 +55,7 @@ def knn_query(self, data, k=1): labels_int, distances = self.index.knn_query(data=data, k=k) labels = [] for li in labels_int: - line = [] - for l in li: - line.append(self.dict_labels[l]) - labels.append(line) + labels.append( + [self.dict_labels[l] for l in li] + ) return labels, distances From dfcade75882c5cd5f726cf260e82272522b1a7d6 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:32:06 +0100 Subject: [PATCH 10/29] Fix PEP 8: E251 unexpected spaces around keyword / parameter equals --- examples/pyw_hnswlib.py | 4 ++-- python_bindings/tests/bindings_test.py | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/examples/pyw_hnswlib.py b/examples/pyw_hnswlib.py index dc300173..2d1e70bc 100644 --- a/examples/pyw_hnswlib.py +++ b/examples/pyw_hnswlib.py @@ -11,8 +11,8 @@ def __init__(self, space, dim): self.dict_labels = {} self.cur_ind = 0 - def init_index(self, max_elements, ef_construction = 200, M = 16): - self.index.init_index(max_elements = max_elements, ef_construction = ef_construction, M = M) + def init_index(self, max_elements, ef_construction=200, M=16): + self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M) def add_items(self, data, ids=None): if ids is not None: diff --git a/python_bindings/tests/bindings_test.py b/python_bindings/tests/bindings_test.py index d718bc3b..771218a6 100644 --- a/python_bindings/tests/bindings_test.py +++ b/python_bindings/tests/bindings_test.py @@ -26,7 +26,7 @@ def testRandomSelf(self): # M - is tightly connected with internal dimensionality of the data # stronlgy affects the memory consumption - p.init_index(max_elements = num_elements, ef_construction = 100, M = 16) + p.init_index(max_elements=num_elements, ef_construction=100, M=16) # Controlling the recall by setting ef: # higher ef leads to better accuracy, but slower search From 76347dfc194449efa6546102d5534a72b435413c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:34:02 +0100 Subject: [PATCH 11/29] Fix PEP 8: E111 indentation is not a multiple of four --- python_bindings/tests/bindings_test_resize.py | 98 +++++++++---------- 1 file changed, 49 insertions(+), 49 deletions(-) diff --git a/python_bindings/tests/bindings_test_resize.py b/python_bindings/tests/bindings_test_resize.py index 3c4e3e4f..212c6ac2 100644 --- a/python_bindings/tests/bindings_test_resize.py +++ b/python_bindings/tests/bindings_test_resize.py @@ -7,71 +7,71 @@ class RandomSelfTestCase(unittest.TestCase): def testRandomSelf(self): - for idx in range(16): - print("\n**** Index resize test ****\n") + for idx in range(16): + print("\n**** Index resize test ****\n") - np.random.seed(idx) - dim = 16 - num_elements = 10000 + np.random.seed(idx) + dim = 16 + num_elements = 10000 - # Generating sample data - data = np.float32(np.random.random((num_elements, dim))) + # Generating sample data + data = np.float32(np.random.random((num_elements, dim))) - # Declaring index - p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, 
cosine or ip + # Declaring index + p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - # Initing index - # max_elements - the maximum number of elements, should be known beforehand - # (probably will be made optional in the future) - # - # ef_construction - controls index search speed/build speed tradeoff - # M - is tightly connected with internal dimensionality of the data - # stronlgy affects the memory consumption + # Initing index + # max_elements - the maximum number of elements, should be known beforehand + # (probably will be made optional in the future) + # + # ef_construction - controls index search speed/build speed tradeoff + # M - is tightly connected with internal dimensionality of the data + # stronlgy affects the memory consumption - p.init_index(max_elements=num_elements//2, ef_construction=100, M=16) + p.init_index(max_elements=num_elements//2, ef_construction=100, M=16) - # Controlling the recall by setting ef: - # higher ef leads to better accuracy, but slower search - p.set_ef(20) + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + p.set_ef(20) - p.set_num_threads(idx%8) # by default using all available cores + p.set_num_threads(idx%8) # by default using all available cores - # We split the data in two batches: - data1 = data[:num_elements // 2] - data2 = data[num_elements // 2:] + # We split the data in two batches: + data1 = data[:num_elements // 2] + data2 = data[num_elements // 2:] - print("Adding first batch of %d elements" % (len(data1))) - p.add_items(data1) + print("Adding first batch of %d elements" % (len(data1))) + p.add_items(data1) - # Query the elements for themselves and measure recall: - labels, distances = p.knn_query(data1, k=1) + # Query the elements for themselves and measure recall: + labels, distances = p.knn_query(data1, k=1) - items = p.get_items(list(range(len(data1)))) + items = p.get_items(list(range(len(data1)))) - # Check the recall: - self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3) + # Check the recall: + self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3) - # Check that the returned element data is correct: - diff_with_gt_labels = np.max(np.abs(data1-items)) - self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) + # Check that the returned element data is correct: + diff_with_gt_labels = np.max(np.abs(data1-items)) + self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) - print("Resizing the index") - p.resize_index(num_elements) + print("Resizing the index") + p.resize_index(num_elements) - print("Adding the second batch of %d elements" % (len(data2))) - p.add_items(data2) + print("Adding the second batch of %d elements" % (len(data2))) + p.add_items(data2) - # Query the elements for themselves and measure recall: - labels, distances = p.knn_query(data, k=1) - items=p.get_items(list(range(num_elements))) + # Query the elements for themselves and measure recall: + labels, distances = p.knn_query(data, k=1) + items=p.get_items(list(range(num_elements))) - # Check the recall: - self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3) + # Check the recall: + self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3) - # Check that the returned element data is correct: - diff_with_gt_labels=np.max(np.abs(data-items)) - self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) + # Check that the returned element data is correct: + 
diff_with_gt_labels=np.max(np.abs(data-items)) + self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) - # Checking that all labels are returned correcly: - sorted_labels=sorted(p.get_ids_list()) - self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0) + # Checking that all labels are returned correcly: + sorted_labels=sorted(p.get_ids_list()) + self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0) From 177c3622509328c30afc5ad437539b0f539cab43 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:35:13 +0100 Subject: [PATCH 12/29] Fix PEP 8: E228 missing whitespace around modulo operator --- python_bindings/tests/bindings_test_resize.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python_bindings/tests/bindings_test_resize.py b/python_bindings/tests/bindings_test_resize.py index 212c6ac2..3d7a1987 100644 --- a/python_bindings/tests/bindings_test_resize.py +++ b/python_bindings/tests/bindings_test_resize.py @@ -34,7 +34,7 @@ def testRandomSelf(self): # higher ef leads to better accuracy, but slower search p.set_ef(20) - p.set_num_threads(idx%8) # by default using all available cores + p.set_num_threads(idx % 8) # by default using all available cores # We split the data in two batches: data1 = data[:num_elements // 2] From 4643db6aa3d15a9bcdee790d1646805c0dde6907 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:38:40 +0100 Subject: [PATCH 13/29] Fix PEP 8: E225 missing whitespace around operator --- python_bindings/tests/bindings_test_labels.py | 6 +++--- python_bindings/tests/bindings_test_pickle.py | 6 +++--- python_bindings/tests/bindings_test_resize.py | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/python_bindings/tests/bindings_test_labels.py b/python_bindings/tests/bindings_test_labels.py index 5c13e198..ab74df27 100644 --- a/python_bindings/tests/bindings_test_labels.py +++ b/python_bindings/tests/bindings_test_labels.py @@ -47,7 +47,7 @@ def testRandomSelf(self): # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data1, k=1) - items=p.get_items(labels) + items = p.get_items(labels) # Check the recall: self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3) @@ -86,11 +86,11 @@ def testRandomSelf(self): self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3) # Check that the returned element data is correct: - diff_with_gt_labels=np.mean(np.abs(data-items)) + diff_with_gt_labels = np.mean(np.abs(data-items)) self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # deleting index. 
# Checking that all labels are returned correctly: - sorted_labels=sorted(p.get_ids_list()) + sorted_labels = sorted(p.get_ids_list()) self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0) # Delete data1 diff --git a/python_bindings/tests/bindings_test_pickle.py b/python_bindings/tests/bindings_test_pickle.py index 3a42df2e..c9ac6344 100644 --- a/python_bindings/tests/bindings_test_pickle.py +++ b/python_bindings/tests/bindings_test_pickle.py @@ -86,9 +86,9 @@ def test_space_main(self, space, dim): l1, d1 = p1.knn_query(test_data, k=self.k) l2, d2 = p2.knn_query(test_data, k=self.k) - self.assertLessEqual(np.sum(((d-d0)**2.)>1e-3), self.dists_err_thresh, msg=f"knn distances returned by p and p0 must match") - self.assertLessEqual(np.sum(((d0-d1)**2.)>1e-3), self.dists_err_thresh, msg=f"knn distances returned by p0 and p1 must match") - self.assertLessEqual(np.sum(((d1-d2)**2.)>1e-3), self.dists_err_thresh, msg=f"knn distances returned by p1 and p2 must match") + self.assertLessEqual(np.sum(((d-d0)**2.) > 1e-3), self.dists_err_thresh, msg=f"knn distances returned by p and p0 must match") + self.assertLessEqual(np.sum(((d0-d1)**2.) > 1e-3), self.dists_err_thresh, msg=f"knn distances returned by p0 and p1 must match") + self.assertLessEqual(np.sum(((d1-d2)**2.) > 1e-3), self.dists_err_thresh, msg=f"knn distances returned by p1 and p2 must match") ### check if ann results match brute-force search ### allow for 2 labels to be missing from ann results diff --git a/python_bindings/tests/bindings_test_resize.py b/python_bindings/tests/bindings_test_resize.py index 3d7a1987..bbe2ebff 100644 --- a/python_bindings/tests/bindings_test_resize.py +++ b/python_bindings/tests/bindings_test_resize.py @@ -69,9 +69,9 @@ def testRandomSelf(self): self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3) # Check that the returned element data is correct: - diff_with_gt_labels=np.max(np.abs(data-items)) + diff_with_gt_labels = np.max(np.abs(data-items)) self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # Checking that all labels are returned correcly: - sorted_labels=sorted(p.get_ids_list()) + sorted_labels = sorted(p.get_ids_list()) self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0) From 0d15d869990dd752e3fe8a594aea3f8a9a546ccc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:38:53 +0100 Subject: [PATCH 14/29] Typo fixes --- python_bindings/tests/bindings_test_getdata.py | 2 +- python_bindings/tests/bindings_test_labels.py | 4 ++-- python_bindings/tests/bindings_test_resize.py | 6 +++--- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/python_bindings/tests/bindings_test_getdata.py b/python_bindings/tests/bindings_test_getdata.py index 8655d7f8..35c20f61 100644 --- a/python_bindings/tests/bindings_test_getdata.py +++ b/python_bindings/tests/bindings_test_getdata.py @@ -19,7 +19,7 @@ def testGettingItems(self): # Declaring index p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - # Initing index + # Initiating index # max_elements - the maximum number of elements, should be known beforehand # (probably will be made optional in the future) # diff --git a/python_bindings/tests/bindings_test_labels.py b/python_bindings/tests/bindings_test_labels.py index ab74df27..592dd2de 100644 --- a/python_bindings/tests/bindings_test_labels.py +++ b/python_bindings/tests/bindings_test_labels.py @@ -67,8 +67,8 @@ def testRandomSelf(self): 
print("Deleted") print("\n**** Mark delete test ****\n") - # Reiniting, loading the index - print("Reiniting") + # Re-initiating, loading the index + print("Re-initiating") p = hnswlib.Index(space='l2', dim=dim) print("\nLoading index from '%s'\n" % index_path) diff --git a/python_bindings/tests/bindings_test_resize.py b/python_bindings/tests/bindings_test_resize.py index bbe2ebff..b5bceeb1 100644 --- a/python_bindings/tests/bindings_test_resize.py +++ b/python_bindings/tests/bindings_test_resize.py @@ -20,13 +20,13 @@ def testRandomSelf(self): # Declaring index p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - # Initing index + # Initiating index # max_elements - the maximum number of elements, should be known beforehand # (probably will be made optional in the future) # # ef_construction - controls index search speed/build speed tradeoff # M - is tightly connected with internal dimensionality of the data - # stronlgy affects the memory consumption + # strongly affects the memory consumption p.init_index(max_elements=num_elements//2, ef_construction=100, M=16) @@ -72,6 +72,6 @@ def testRandomSelf(self): diff_with_gt_labels = np.max(np.abs(data-items)) self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) - # Checking that all labels are returned correcly: + # Checking that all labels are returned correctly: sorted_labels = sorted(p.get_ids_list()) self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0) From 20555b79320d0a5325d52756b48d86ad608d615a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Tue, 2 Feb 2021 09:40:18 +0100 Subject: [PATCH 15/29] Fix PEP 8: E266 too many leading '#' for block comment --- python_bindings/tests/bindings_test_pickle.py | 30 +++++++++---------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/python_bindings/tests/bindings_test_pickle.py b/python_bindings/tests/bindings_test_pickle.py index c9ac6344..07820b1d 100644 --- a/python_bindings/tests/bindings_test_pickle.py +++ b/python_bindings/tests/bindings_test_pickle.py @@ -60,27 +60,27 @@ def test_space_main(self, space, dim): p.num_threads = self.num_threads # by default using all available cores - p0 = pickle.loads(pickle.dumps(p)) ### pickle un-initialized Index + p0 = pickle.loads(pickle.dumps(p)) # pickle un-initialized Index p.init_index(max_elements=self.num_elements, ef_construction=self.ef_construction, M=self.M) p0.init_index(max_elements=self.num_elements, ef_construction=self.ef_construction, M=self.M) p.ef = self.ef p0.ef = self.ef - p1 = pickle.loads(pickle.dumps(p)) ### pickle Index before adding items + p1 = pickle.loads(pickle.dumps(p)) # pickle Index before adding items - ### add items to ann index p,p0,p1 + # add items to ann index p,p0,p1 p.add_items(data) p1.add_items(data) p0.add_items(data) - p2=pickle.loads(pickle.dumps(p)) ### pickle Index before adding items + p2=pickle.loads(pickle.dumps(p)) # pickle Index before adding items self.assertTrue(np.allclose(p.get_items(), p0.get_items()), "items for p and p0 must be same") self.assertTrue(np.allclose(p0.get_items(), p1.get_items()), "items for p0 and p1 must be same") self.assertTrue(np.allclose(p1.get_items(), p2.get_items()), "items for p1 and p2 must be same") - ### Test if returned distances are same + # Test if returned distances are same l, d = p.knn_query(test_data, k=self.k) l0, d0 = p0.knn_query(test_data, k=self.k) l1, d1 = p1.knn_query(test_data, k=self.k) @@ -90,8 +90,8 @@ def test_space_main(self, space, dim): 
self.assertLessEqual(np.sum(((d0-d1)**2.) > 1e-3), self.dists_err_thresh, msg=f"knn distances returned by p0 and p1 must match") self.assertLessEqual(np.sum(((d1-d2)**2.) > 1e-3), self.dists_err_thresh, msg=f"knn distances returned by p1 and p2 must match") - ### check if ann results match brute-force search - ### allow for 2 labels to be missing from ann results + # check if ann results match brute-force search + # allow for 2 labels to be missing from ann results check_ann_results(self, space, data, test_data, self.k, l, d, err_thresh=self.label_err_thresh, total_thresh=self.item_err_thresh, @@ -102,19 +102,19 @@ def test_space_main(self, space, dim): total_thresh=self.item_err_thresh, dists_thresh=self.dists_err_thresh) - ### Check ef parameter value + # Check ef parameter value self.assertEqual(p.ef, self.ef, "incorrect value of p.ef") self.assertEqual(p0.ef, self.ef, "incorrect value of p0.ef") self.assertEqual(p2.ef, self.ef, "incorrect value of p2.ef") self.assertEqual(p1.ef, self.ef, "incorrect value of p1.ef") - ### Check M parameter value + # Check M parameter value self.assertEqual(p.M, self.M, "incorrect value of p.M") self.assertEqual(p0.M, self.M, "incorrect value of p0.M") self.assertEqual(p1.M, self.M, "incorrect value of p1.M") self.assertEqual(p2.M, self.M, "incorrect value of p2.M") - ### Check ef_construction parameter value + # Check ef_construction parameter value self.assertEqual(p.ef_construction, self.ef_construction, "incorrect value of p.ef_construction") self.assertEqual(p0.ef_construction, self.ef_construction, "incorrect value of p0.ef_construction") self.assertEqual(p1.ef_construction, self.ef_construction, "incorrect value of p1.ef_construction") @@ -135,12 +135,12 @@ def setUp(self): self.num_threads = 4 self.k = 25 - self.label_err_thresh = 5 ### max number of missing labels allowed per test item - self.item_err_thresh = 5 ### max number of items allowed with incorrect labels + self.label_err_thresh = 5 # max number of missing labels allowed per test item + self.item_err_thresh = 5 # max number of items allowed with incorrect labels - self.dists_err_thresh = 50 ### for two matrices, d1 and d2, dists_err_thresh controls max - ### number of value pairs that are allowed to be different in d1 and d2 - ### i.e., number of values that are (d1-d2)**2>1e-3 + self.dists_err_thresh = 50 # for two matrices, d1 and d2, dists_err_thresh controls max + # number of value pairs that are allowed to be different in d1 and d2 + # i.e., number of values that are (d1-d2)**2>1e-3 def test_inner_product_space(self): test_space_main(self, 'ip', 48) From 0e3845f879af115149e9dd79e63e7d640a1a39de Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Mon, 8 Feb 2021 16:31:42 +0100 Subject: [PATCH 16/29] Fix missed typo --- python_bindings/tests/bindings_test.py | 6 +++--- python_bindings/tests/bindings_test_getdata.py | 2 +- python_bindings/tests/bindings_test_labels.py | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/python_bindings/tests/bindings_test.py b/python_bindings/tests/bindings_test.py index 771218a6..f9b3092f 100644 --- a/python_bindings/tests/bindings_test.py +++ b/python_bindings/tests/bindings_test.py @@ -18,13 +18,13 @@ def testRandomSelf(self): # Declaring index p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - # Initing index + # Initiating index # max_elements - the maximum number of elements, should be known beforehand # (probably will be made optional in the future) # # ef_construction - controls 
index search speed/build speed tradeoff # M - is tightly connected with internal dimensionality of the data - # stronlgy affects the memory consumption + # strongly affects the memory consumption p.init_index(max_elements=num_elements, ef_construction=100, M=16) @@ -51,7 +51,7 @@ def testRandomSelf(self): p.save_index(index_path) del p - # Reiniting, loading the index + # Re-initiating, loading the index p = hnswlib.Index(space='l2', dim=dim) # you can change the sa print("\nLoading index from '%s'\n" % index_path) diff --git a/python_bindings/tests/bindings_test_getdata.py b/python_bindings/tests/bindings_test_getdata.py index 35c20f61..2985c1dd 100644 --- a/python_bindings/tests/bindings_test_getdata.py +++ b/python_bindings/tests/bindings_test_getdata.py @@ -25,7 +25,7 @@ def testGettingItems(self): # # ef_construction - controls index search speed/build speed tradeoff # M - is tightly connected with internal dimensionality of the data - # stronlgy affects the memory consumption + # strongly affects the memory consumption p.init_index(max_elements=num_elements, ef_construction=100, M=16) diff --git a/python_bindings/tests/bindings_test_labels.py b/python_bindings/tests/bindings_test_labels.py index 592dd2de..b3cbfcf1 100644 --- a/python_bindings/tests/bindings_test_labels.py +++ b/python_bindings/tests/bindings_test_labels.py @@ -21,13 +21,13 @@ def testRandomSelf(self): # Declaring index p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip - # Initing index + # Initiating index # max_elements - the maximum number of elements, should be known beforehand # (probably will be made optional in the future) # # ef_construction - controls index search speed/build speed tradeoff # M - is tightly connected with internal dimensionality of the data - # stronlgy affects the memory consumption + # strongly affects the memory consumption p.init_index(max_elements=num_elements, ef_construction=100, M=16) From 8481a4bf1fcf30c894e587a3926d6b895d3d6174 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Hanu=C5=A1?= Date: Mon, 8 Feb 2021 16:34:27 +0100 Subject: [PATCH 17/29] Fix PEP 8: E225 missing whitespace around operator --- python_bindings/tests/bindings_test_labels.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python_bindings/tests/bindings_test_labels.py b/python_bindings/tests/bindings_test_labels.py index b3cbfcf1..668d7694 100644 --- a/python_bindings/tests/bindings_test_labels.py +++ b/python_bindings/tests/bindings_test_labels.py @@ -80,7 +80,7 @@ def testRandomSelf(self): # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data, k=1) - items=p.get_items(labels) + items = p.get_items(labels) # Check the recall: self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3) From e2000e9eddab3369673a745c65393739c36c8ce9 Mon Sep 17 00:00:00 2001 From: TakaakiFuruse Date: Thu, 25 Feb 2021 15:36:17 +0900 Subject: [PATCH 18/29] Improved description of `add_items` This is just a suggestion of doc improvement. For `add_items` description, I have made `labels` part and `data_labels` part together since... 1. There's no argument called `labels` for `add_items` func. 2. It felt like `labels` were a typo of `data_labels` from a commit 5c20009. --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 8d139fdc..4acdfa6b 100644 --- a/README.md +++ b/README.md @@ -47,9 +47,9 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib. 
* `M` defines tha maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)). * `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure. - * `labels` is an optional N-size numpy array of integer labels for all elements in `data`. * `num_threads` sets the number of cpu threads to use (-1 means use default). - * `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient. + * `data_labels` are optional N-size numpy array of integer labels for all elements in `data`. + - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient. * Thread-safe with other `add_items` calls, but not with `knn_query`. * `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results. From 95d6b0275a57785ebb497a947afc4198f5d00eee Mon Sep 17 00:00:00 2001 From: TakaakiFuruse Date: Sat, 6 Mar 2021 14:33:44 +0900 Subject: [PATCH 19/29] data_labels => ids ref: https://github.com/nmslib/hnswlib/pull/289#issuecomment-789353096 --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 4acdfa6b..45547478 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,9 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib. * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)). * `M` defines tha maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)). -* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure. +* `add_items(data, ids, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure. * `num_threads` sets the number of cpu threads to use (-1 means use default). - * `data_labels` are optional N-size numpy array of integer labels for all elements in `data`. + * `ids` are optional N-size numpy array of integer labels for all elements in `data`. - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient. * Thread-safe with other `add_items` calls, but not with `knn_query`. From 9ba16e24be66b5595b9b9702d7819b2e2797e278 Mon Sep 17 00:00:00 2001 From: TakaakiFuruse Date: Sat, 6 Mar 2021 14:38:22 +0900 Subject: [PATCH 20/29] data_label => label --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 45547478..3f14a05c 100644 --- a/README.md +++ b/README.md @@ -52,7 +52,7 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib. - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient. * Thread-safe with other `add_items` calls, but not with `knn_query`. -* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results. +* `mark_deleted(label)` - marks the element as deleted, so it will be omitted from search results. 
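To make the documented `add_items` semantics concrete, here is a minimal Python sketch (array sizes and variable names are illustrative, not part of the patch): passing explicit integer `ids` on the first call inserts the vectors, calling `add_items` again with the same `ids` updates the stored vectors in place instead of growing the index, and `mark_deleted` hides an individual label from search results.

```python
import hnswlib
import numpy as np

dim = 16
num_elements = 1000

p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Initial insertion under explicit integer ids
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)
p.add_items(data, ids)

# Re-adding with the same ids updates the stored vectors in place
# (slower than a fresh insert, but more memory- and query-efficient)
new_data = np.float32(np.random.random((num_elements, dim)))
p.add_items(new_data, ids)

# Marking a label as deleted omits it from future search results
p.mark_deleted(0)

labels, distances = p.knn_query(new_data, k=3)
```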
* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`. From 1afdce0e908a773af447f44247fe68113d1f2b27 Mon Sep 17 00:00:00 2001 From: TakaakiFuruse Date: Sat, 6 Mar 2021 14:54:41 +0900 Subject: [PATCH 21/29] fixed sample code, data_labels => ids --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 3f14a05c..7057d58f 100644 --- a/README.md +++ b/README.md @@ -113,7 +113,7 @@ num_elements = 10000 # Generating sample data data = np.float32(np.random.random((num_elements, dim))) -data_labels = np.arange(num_elements) +ids = np.arange(num_elements) # Declaring index p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip @@ -122,7 +122,7 @@ p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or p.init_index(max_elements = num_elements, ef_construction = 200, M = 16) # Element insertion (can be called several times): -p.add_items(data, data_labels) +p.add_items(data, ids) # Controlling the recall by setting ef: p.set_ef(50) # ef should always be > k From af284e6c3f408d31e9123f783dfeeb77eb54b8c6 Mon Sep 17 00:00:00 2001 From: TakaakiFuruse Date: Sat, 6 Mar 2021 15:00:42 +0900 Subject: [PATCH 22/29] changed order of args --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7057d58f..b2c0166c 100644 --- a/README.md +++ b/README.md @@ -41,7 +41,7 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib. * `hnswlib.Index(space, dim)` creates a non-initialized index an HNSW in space `space` with integer dimension `dim`. `hnswlib.Index` methods: -* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index from with no elements. +* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index from with no elements. * `max_elements` defines the maximum number of elements that can be stored in the structure(can be increased/shrunk). * `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)). * `M` defines tha maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)). 
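Because the documented positional order of `M` and `ef_construction` has shifted between doc revisions, calling `init_index` with keyword arguments sidesteps the ordering question entirely. A short sketch under the same assumptions as the README example (values are illustrative):

```python
import hnswlib
import numpy as np

dim = 32
num_elements = 5000

data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

p = hnswlib.Index(space='cosine', dim=dim)

# Keyword arguments keep the call correct regardless of whether the
# signature lists M before or after ef_construction
p.init_index(max_elements=num_elements, M=16, ef_construction=200, random_seed=100)
p.add_items(data, ids)

p.set_ef(50)  # ef should always be > k
labels, distances = p.knn_query(data[:10], k=5)
```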
From afaaeb5a42d34ba077278c10f7bb2c2cd8cc03c5 Mon Sep 17 00:00:00 2001 From: "shengjun.li" Date: Wed, 10 Mar 2021 10:09:30 +0800 Subject: [PATCH 23/29] Use realloc to simplify the code Signed-off-by: shengjun.li --- hnswlib/hnswalg.h | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) diff --git a/hnswlib/hnswalg.h b/hnswlib/hnswalg.h index a2f72dc7..10e26f64 100644 --- a/hnswlib/hnswalg.h +++ b/hnswlib/hnswalg.h @@ -573,29 +573,23 @@ namespace hnswlib { visited_list_pool_ = new VisitedListPool(1, new_max_elements); - element_levels_.resize(new_max_elements); std::vector(new_max_elements).swap(link_list_locks_); // Reallocate base layer - char * data_level0_memory_new = (char *) malloc(new_max_elements * size_data_per_element_); + char * data_level0_memory_new = (char *) realloc(data_level0_memory_, new_max_elements * size_data_per_element_); if (data_level0_memory_new == nullptr) throw std::runtime_error("Not enough memory: resizeIndex failed to allocate base layer"); - memcpy(data_level0_memory_new, data_level0_memory_,cur_element_count * size_data_per_element_); - free(data_level0_memory_); - data_level0_memory_=data_level0_memory_new; + data_level0_memory_ = data_level0_memory_new; // Reallocate all other layers - char ** linkLists_new = (char **) malloc(sizeof(void *) * new_max_elements); + char ** linkLists_new = (char **) realloc(linkLists_, sizeof(void *) * new_max_elements); if (linkLists_new == nullptr) throw std::runtime_error("Not enough memory: resizeIndex failed to allocate other layers"); - memcpy(linkLists_new, linkLists_,cur_element_count * sizeof(void *)); - free(linkLists_); - linkLists_=linkLists_new; - - max_elements_=new_max_elements; + linkLists_ = linkLists_new; + max_elements_ = new_max_elements; } void saveIndex(const std::string &location) { From 1437b1ef989479f7c3036c8ed853de6787435393 Mon Sep 17 00:00:00 2001 From: Peter Sobot Date: Tue, 23 Mar 2021 09:32:28 -0400 Subject: [PATCH 24/29] Throw an exception if passed an unrecognized space_name. --- python_bindings/bindings.cpp | 2 ++ 1 file changed, 2 insertions(+) diff --git a/python_bindings/bindings.cpp b/python_bindings/bindings.cpp index 93b79a60..cc481b3e 100644 --- a/python_bindings/bindings.cpp +++ b/python_bindings/bindings.cpp @@ -97,6 +97,8 @@ class Index { else if(space_name=="cosine") { l2space = new hnswlib::InnerProductSpace(dim); normalize=true; + } else { + throw new std::runtime_error("Space name must be one of l2, ip, or cosine."); } appr_alg = NULL; ep_added = true; From 1492527bf70321f25fa6008bef3b565a6a825df0 Mon Sep 17 00:00:00 2001 From: apoorvsharma Date: Sun, 9 May 2021 19:29:37 -0700 Subject: [PATCH 25/29] Modify hnsw update logic to unmark the deleted element --- TESTS.md | 44 -------------------------------------------- hnswlib/hnswalg.h | 6 +++++- 2 files changed, 5 insertions(+), 45 deletions(-) delete mode 100644 TESTS.md diff --git a/TESTS.md b/TESTS.md deleted file mode 100644 index fed82364..00000000 --- a/TESTS.md +++ /dev/null @@ -1,44 +0,0 @@ - -# Tests - - -### 200M SIFT test reproduction -To download and extract the bigann dataset: -```bash -python3 download_bigann.py -``` -To compile: -```bash -cmake . -make all -``` - -To run the test on 200M SIFT subset: -```bash -./main -``` - -The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**. 
- -### Feature Vector Updates test -To generate testing data (from root directory): -```bash -cd examples -python update_gen_data.py -``` -To compile (from root directory): -```bash -mkdir build -cd build -cmake .. -make -``` -To run test **without** updates (from `build` directory) -```bash -./test_updates -``` - -To run test **with** updates (from `build` directory) -```bash -./test_updates update -``` \ No newline at end of file diff --git a/hnswlib/hnswalg.h b/hnswlib/hnswalg.h index a2f72dc7..0bad925d 100644 --- a/hnswlib/hnswalg.h +++ b/hnswlib/hnswalg.h @@ -987,11 +987,15 @@ namespace hnswlib { auto search = label_lookup_.find(label); if (search != label_lookup_.end()) { tableint existingInternalId = search->second; - templock_curr.unlock(); std::unique_lock lock_el_update(link_list_update_locks_[(existingInternalId & (max_update_element_locks - 1))]); + + if (isMarkedDeleted(existingInternalId)) { + unmarkDeletedInternal(existingInternalId); + } updatePoint(data_point, existingInternalId, 1.0); + return existingInternalId; } From 6ec9bada8d7293123e06d7d358cf5f8a60ef3e73 Mon Sep 17 00:00:00 2001 From: Yury Malkov Date: Tue, 1 Jun 2021 22:06:48 -0700 Subject: [PATCH 26/29] fix forgotten flag --- python_bindings/bindings.cpp | 1 + 1 file changed, 1 insertion(+) diff --git a/python_bindings/bindings.cpp b/python_bindings/bindings.cpp index cc481b3e..285b5185 100644 --- a/python_bindings/bindings.cpp +++ b/python_bindings/bindings.cpp @@ -164,6 +164,7 @@ class Index { } appr_alg = new hnswlib::HierarchicalNSW(l2space, path_to_index, false, max_elements); cur_l = appr_alg->cur_element_count; + index_inited = true; } void normalize_vector(float *data, float *norm_array){ From aa3de3e5c0fe7ac2a77b89888bf8ea837e749c43 Mon Sep 17 00:00:00 2001 From: Yury Malkov Date: Mon, 28 Jun 2021 20:53:48 -0700 Subject: [PATCH 27/29] Bump version --- setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/setup.py b/setup.py index d9e57086..92a8ee61 100644 --- a/setup.py +++ b/setup.py @@ -7,7 +7,7 @@ from setuptools import Extension, setup from setuptools.command.build_ext import build_ext -__version__ = '0.5.1' +__version__ = '0.5.2' include_dirs = [ From 8992ebb37af290903707f6d9fa8310b4ebe6bd79 Mon Sep 17 00:00:00 2001 From: Yury Malkov Date: Mon, 28 Jun 2021 22:28:04 -0700 Subject: [PATCH 28/29] add information about the 0.5.2 release --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d6171e15..b5cf39dc 100644 --- a/README.md +++ b/README.md @@ -3,8 +3,9 @@ Header-only C++ HNSW implementation with python bindings. Paper's code for the H **NEWS:** +* **Hnswlib is now 0.5.2**. Bugfixes - thanks [@marekhanus](https://github.com/marekhanus) for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; [@apoorv-sharma](https://github.com/apoorv-sharma) for fixing the bug int the insertion/deletion logic; [@shengjun1985](https://github.com/shengjun1985) for simplifying the memory reallocation logic; [@TakaakiFuruse](https://github.com/TakaakiFuruse) for improved description of `add_items`; [@psobot ](https://github.com/psobot) for improving error handling; [@ShuAiii](https://github.com/ShuAiii) for reporting the bug in the python interface -* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. 
Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!** +* **Hnswlib is now 0.5.0**. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)! * **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the performance/memory should not degrade as you update the element embeddings).** @@ -295,4 +296,4 @@ To run test **with** updates (from `build` directory) ### References -Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320 \ No newline at end of file +Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320 From 2235aad105d988768f5c40628874ee64fef1b821 Mon Sep 17 00:00:00 2001 From: Yury Malkov Date: Tue, 29 Jun 2021 21:15:27 -0700 Subject: [PATCH 29/29] Update README.md --- README.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index b5cf39dc..4ca5584d 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # Hnswlib - fast approximate nearest neighbor search -Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment +Header-only C++ HNSW implementation with python bindings. **NEWS:** @@ -296,4 +296,13 @@ To run test **with** updates (from `build` directory) ### References -Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320 +@article{malkov2018efficient, + title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs}, + author={Malkov, Yu A and Yashunin, Dmitry A}, + journal={IEEE transactions on pattern analysis and machine intelligence}, + volume={42}, + number={4}, + pages={824--836}, + year={2018}, + publisher={IEEE} +}
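As a rough end-to-end illustration of the update/unmark change from PATCH 25 above, assuming it is exposed through the standard Python bindings: re-adding a vector under a label that was previously marked deleted both updates the stored vector and makes the label eligible to appear in search results again. The sketch below is illustrative and not part of any patch.

```python
import hnswlib
import numpy as np

dim = 8
num_elements = 100

p = hnswlib.Index(space='l2', dim=dim)
p.init_index(max_elements=num_elements, M=16, ef_construction=100)
p.add_items(np.float32(np.random.random((num_elements, dim))), np.arange(num_elements))

# Hide label 5 from search results
p.mark_deleted(5)

# Re-adding under the same label updates the vector and, per the patched
# logic, unmarks the deletion so label 5 can be returned by knn_query again
p.add_items(np.float32(np.random.random((1, dim))), [5])
labels, distances = p.knn_query(p.get_items([5]), k=1)
```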