Merge pull request #323 from nmslib/develop
Merge 0.5.2 changes into master
yurymalkov authored Jun 30, 2021
2 parents d59f8d9 + 2235aad commit 1866a1d
Showing 11 changed files with 147 additions and 111 deletions.
28 changes: 27 additions & 1 deletion .travis.yml
@@ -9,7 +9,15 @@ jobs:
- name: Linux Python 3.7
os: linux
python: 3.7


- name: Linux Python 3.8
os: linux
python: 3.8

- name: Linux Python 3.9
os: linux
python: 3.9

- name: Windows Python 3.6
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
@@ -28,6 +36,24 @@ jobs:
- python --version
env: PATH=/c/Python37:/c/Python37/Scripts:$PATH

- name: Windows Python 3.8
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.8.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python38:/c/Python38/Scripts:$PATH

- name: Windows Python 3.9
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.9.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python39:/c/Python39/Scripts:$PATH

install:
- |
python -m pip install .
30 changes: 20 additions & 10 deletions README.md
@@ -1,10 +1,11 @@
# Hnswlib - fast approximate nearest neighbor search
Header-only C++ HNSW implementation with python bindings. Paper's code for the HNSW 200M SIFT experiment
Header-only C++ HNSW implementation with python bindings.

**NEWS:**

* **Hnswlib is now 0.5.2**. Bugfixes - thanks [@marekhanus](https://github.com/marekhanus) for fixing the missing arguments, adding support for python 3.8, 3.9 in Travis, improving python wrapper and fixing typos/code style; [@apoorv-sharma](https://github.com/apoorv-sharma) for fixing the bug in the insertion/deletion logic; [@shengjun1985](https://github.com/shengjun1985) for simplifying the memory reallocation logic; [@TakaakiFuruse](https://github.com/TakaakiFuruse) for improved description of `add_items`; [@psobot](https://github.com/psobot) for improving error handling; [@ShuAiii](https://github.com/ShuAiii) for reporting the bug in the python interface

* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**
* **Hnswlib is now 0.5.0**. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt),[@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!

* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).**

@@ -41,18 +42,18 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.

`hnswlib.Index` methods:
* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index with no elements.
* `init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100)` initializes the index with no elements.
* `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
* `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
* `M` defines the maximum number of outgoing connections in the graph ([ALGO_PARAMS.md](ALGO_PARAMS.md)).

* `add_items(data, data_labels, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
* `labels` is an optional N-size numpy array of integer labels for all elements in `data`.
* `add_items(data, ids, num_threads = -1)` - inserts the `data`(numpy array of vectors, shape:`N*dim`) into the structure.
* `num_threads` sets the number of cpu threads to use (-1 means use default).
* `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
* `ids` is an optional N-size numpy array of integer labels for all elements in `data`.
- If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
* Thread-safe with other `add_items` calls, but not with `knn_query`.

* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.
* `mark_deleted(label)` - marks the element as deleted, so it will be omitted from search results.

* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.

@@ -113,7 +114,7 @@ num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
data_labels = np.arange(num_elements)
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip
@@ -122,7 +123,7 @@ p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
p.add_items(data, data_labels)
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50) # ef should always be > k
@@ -295,4 +296,13 @@ To run test **with** updates (from `build` directory)

### References

Malkov, Yu A., and D. A. Yashunin. "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs." TPAMI, preprint: https://arxiv.org/abs/1603.09320
@article{malkov2018efficient,
title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs},
author={Malkov, Yu A and Yashunin, Dmitry A},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={42},
number={4},
pages={824--836},
year={2018},
publisher={IEEE}
}
11 changes: 5 additions & 6 deletions examples/pyw_hnswlib.py
@@ -11,8 +11,8 @@ def __init__(self, space, dim):
self.dict_labels = {}
self.cur_ind = 0

def init_index(self, max_elements, ef_construction = 200, M = 16):
self.index.init_index(max_elements = max_elements, ef_construction = ef_construction, M = M)
def init_index(self, max_elements, ef_construction=200, M=16):
self.index.init_index(max_elements=max_elements, ef_construction=ef_construction, M=M)

def add_items(self, data, ids=None):
if ids is not None:
@@ -55,8 +55,7 @@ def knn_query(self, data, k=1):
labels_int, distances = self.index.knn_query(data=data, k=k)
labels = []
for li in labels_int:
line = []
for l in li:
line.append(self.dict_labels[l])
labels.append(line)
labels.append(
[self.dict_labels[l] for l in li]
)
return labels, distances
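
The `knn_query` hunk above replaces a nested append loop with a list comprehension that translates the integer ids returned by the index back into the caller's own labels. A minimal standalone sketch of that mapping follows. It does not import hnswlib, and `map_labels` and the sample dictionary are hypothetical names used only for illustration:

```python
def map_labels(labels_int, dict_labels):
    """Translate rows of internal integer ids into user-facing labels."""
    return [[dict_labels[l] for l in row] for row in labels_int]


# Hypothetical mapping from internal ids to arbitrary hashable labels.
dict_labels = {0: "apple", 1: "banana", 2: "cherry"}

# Stands in for the integer ids a k=2 query would return for two query vectors.
labels_int = [[0, 2], [1, 0]]

print(map_labels(labels_int, dict_labels))
```

The comprehension keeps one output row per query row, so the result stays aligned with the `distances` array returned alongside it.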
22 changes: 10 additions & 12 deletions hnswlib/hnswalg.h
@@ -573,29 +573,23 @@ namespace hnswlib {
visited_list_pool_ = new VisitedListPool(1, new_max_elements);



element_levels_.resize(new_max_elements);

std::vector<std::mutex>(new_max_elements).swap(link_list_locks_);

// Reallocate base layer
char * data_level0_memory_new = (char *) malloc(new_max_elements * size_data_per_element_);
char * data_level0_memory_new = (char *) realloc(data_level0_memory_, new_max_elements * size_data_per_element_);
if (data_level0_memory_new == nullptr)
throw std::runtime_error("Not enough memory: resizeIndex failed to allocate base layer");
memcpy(data_level0_memory_new, data_level0_memory_,cur_element_count * size_data_per_element_);
free(data_level0_memory_);
data_level0_memory_=data_level0_memory_new;
data_level0_memory_ = data_level0_memory_new;

// Reallocate all other layers
char ** linkLists_new = (char **) malloc(sizeof(void *) * new_max_elements);
char ** linkLists_new = (char **) realloc(linkLists_, sizeof(void *) * new_max_elements);
if (linkLists_new == nullptr)
throw std::runtime_error("Not enough memory: resizeIndex failed to allocate other layers");
memcpy(linkLists_new, linkLists_,cur_element_count * sizeof(void *));
free(linkLists_);
linkLists_=linkLists_new;

max_elements_=new_max_elements;
linkLists_ = linkLists_new;

max_elements_ = new_max_elements;
}

void saveIndex(const std::string &location) {
@@ -987,11 +981,15 @@ namespace hnswlib {
auto search = label_lookup_.find(label);
if (search != label_lookup_.end()) {
tableint existingInternalId = search->second;

templock_curr.unlock();

std::unique_lock <std::mutex> lock_el_update(link_list_update_locks_[(existingInternalId & (max_update_element_locks - 1))]);

if (isMarkedDeleted(existingInternalId)) {
unmarkDeletedInternal(existingInternalId);
}
updatePoint(data_point, existingInternalId, 1.0);

return existingInternalId;
}
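
The update lock in the hunk above is selected with `existingInternalId & (max_update_element_locks - 1)`, which folds an unbounded internal id onto a fixed pool of mutexes. The mask behaves like a modulo only when the pool size is a power of two. A small Python sketch of the idea, not the library's code; `lock_slot`, `update_element`, and the pool size of 16 are assumptions made for illustration:

```python
import threading


def lock_slot(internal_id, num_locks):
    """Pick a lock index with a bitmask; valid only for power-of-two pool sizes."""
    if num_locks <= 0 or (num_locks & (num_locks - 1)) != 0:
        raise ValueError("num_locks must be a power of two")
    return internal_id & (num_locks - 1)


NUM_LOCKS = 16  # assumed pool size, chosen only to keep the sketch small
locks = [threading.Lock() for _ in range(NUM_LOCKS)]


def update_element(internal_id, store, value):
    """Update one element while holding the lock its id maps to."""
    with locks[lock_slot(internal_id, NUM_LOCKS)]:
        store[internal_id] = value


store = {}
update_element(21, store, "new embedding")
print(lock_slot(21, NUM_LOCKS), store)
```

Two ids that share a slot serialize their updates even though they are unrelated; that is the cost of keeping the lock pool at a fixed size instead of allocating one lock per element.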

3 changes: 3 additions & 0 deletions python_bindings/bindings.cpp
@@ -97,6 +97,8 @@ class Index {
else if(space_name=="cosine") {
l2space = new hnswlib::InnerProductSpace(dim);
normalize=true;
} else {
throw new std::runtime_error("Space name must be one of l2, ip, or cosine.");
}
appr_alg = NULL;
ep_added = true;
@@ -162,6 +164,7 @@ class Index {
}
appr_alg = new hnswlib::HierarchicalNSW<dist_t>(l2space, path_to_index, false, max_elements);
cur_l = appr_alg->cur_element_count;
index_inited = true;
}

void normalize_vector(float *data, float *norm_array){
8 changes: 4 additions & 4 deletions python_bindings/tests/bindings_test.py
@@ -18,15 +18,15 @@ def testRandomSelf(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements = num_elements, ef_construction = 100, M = 16)
p.init_index(max_elements=num_elements, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
@@ -51,7 +51,7 @@ def testRandomSelf(self):
p.save_index(index_path)
del p

# Reiniting, loading the index
# Re-initiating, loading the index
p = hnswlib.Index(space='l2', dim=dim) # you can change the sa

print("\nLoading index from '%s'\n" % index_path)
4 changes: 2 additions & 2 deletions python_bindings/tests/bindings_test_getdata.py
@@ -19,13 +19,13 @@ def testGettingItems(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

16 changes: 8 additions & 8 deletions python_bindings/tests/bindings_test_labels.py
@@ -21,13 +21,13 @@ def testRandomSelf(self):
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initiating index
# max_elements - the maximum number of elements, should be known beforehand
# (probably will be made optional in the future)
#
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data
# stronlgy affects the memory consumption
# strongly affects the memory consumption

p.init_index(max_elements=num_elements, ef_construction=100, M=16)

@@ -47,7 +47,7 @@ def testRandomSelf(self):
# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)

items=p.get_items(labels)
items = p.get_items(labels)

# Check the recall:
self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data1))), 1.0, 3)
@@ -67,8 +67,8 @@ def testRandomSelf(self):
print("Deleted")

print("\n**** Mark delete test ****\n")
# Reiniting, loading the index
print("Reiniting")
# Re-initiating, loading the index
print("Re-initiating")
p = hnswlib.Index(space='l2', dim=dim)

print("\nLoading index from '%s'\n" % index_path)
@@ -80,17 +80,17 @@ def testRandomSelf(self):

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
items=p.get_items(labels)
items = p.get_items(labels)

# Check the recall:
self.assertAlmostEqual(np.mean(labels.reshape(-1) == np.arange(len(data))), 1.0, 3)

# Check that the returned element data is correct:
diff_with_gt_labels=np.mean(np.abs(data-items))
diff_with_gt_labels = np.mean(np.abs(data-items))
self.assertAlmostEqual(diff_with_gt_labels, 0, delta=1e-4) # deleting index.

# Checking that all labels are returned correctly:
sorted_labels=sorted(p.get_ids_list())
sorted_labels = sorted(p.get_ids_list())
self.assertEqual(np.sum(~np.asarray(sorted_labels) == np.asarray(range(num_elements))), 0)

# Delete data1