Handle zero chunk size #264

Closed
wants to merge 74 commits

Conversation

moshaad7
Contributor

Backport of #263

Thejas-bhat and others added 30 commits September 12, 2024 16:46
* index sections
* faiss index section + inverted text index section

Co-authored-by: Marty Schoch <[email protected]>
Co-authored-by: Abhi Dangeti <[email protected]>
* index.Delete() to free up memory

* refactoring index creation for IVF type of indexes

* absorbing API changes from go-faiss
To absorb:
* 0ea762e Abhi Dangeti | Temporarily revert free-ing C pointers/buffers - adding a TODO for this (#7)
* e5f7515 Thejas-bhat | bug fix: copying the serialized content from C heap to go (#8)
…document (#181)

* avoiding bitmaps when the vector is present only in 1 doc

* minor refactor of the code

* code comment
* using IVF instead of flat for smaller indexes

* unit test fix

* cleanup; bug fix: track error from add_with_ids
* command line support for vector section - initial commit

* docvalue tool fixes

* bug fix: docvalue cmd returning nil result

* refactoring fields cmd tool

* code cleanup
* choosing the index types more responsibly

* code cleanup; tests fix
* bug fix: correcting the valid vector ids being tracked during merge

* fixing the decoding segment data in vector cmd tool

* bug fix: shift the docNum by 32 to account for signed score vals

* resetting vec datastructures on an opaque.Reset()

* tracking the number of vecs for future allocations

* allocateSpace -> realloc

* Update section_faiss_vector_index.go, segment.go

---------

Co-authored-by: Abhi Dangeti <[email protected]>
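The "shift the docNum by 32 to account for signed score vals" fix above can be illustrated with a small sketch. This is only the idea, not necessarily zapx's exact encoding: packing the 32-bit doc number into the upper half of an int64 keeps it recoverable regardless of the score's sign bit.

```go
package main

import (
	"fmt"
	"math"
)

// pack stores a 32-bit doc number in the upper half of an int64 and the
// raw bits of a float32 score in the lower half. Shifting the docNum up
// by 32 keeps it intact even when the score's sign bit is set.
func pack(docNum uint32, score float32) int64 {
	return int64(uint64(docNum)<<32 | uint64(math.Float32bits(score)))
}

// unpack reverses pack, recovering the doc number and score.
func unpack(code int64) (uint32, float32) {
	docNum := uint32(uint64(code) >> 32)
	score := math.Float32frombits(uint32(uint64(code) & 0xffffffff))
	return docNum, score
}

func main() {
	d, s := unpack(pack(42, -1.5)) // negative score sets the float sign bit
	fmt.Println(d, s)
}
```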
* separate functions to read and search vector indexes.

* minor update

* addressed reviews

* common func for doc values
…#190)

* Improve search path for multi kNN and also single kNN

Single kNN would've regressed a bit because of the earlier 2-API
maneuver introduced with #184.

* Add some commentary around InterpretVectorIndex
…ed (#196)

bug fixes: using seg.ErrClosed when the merge is aborted
…type. (#197)

* switch to array when deleting vectors

* minor UT improvements

* redo the merge operation

* comment and UT fixes

* added back reconstruction method for special cases

* minor improvements

* fix just for IVF indexes

* snake case to camel case

* train only IVF indexes, not flat
* support nested/chunked vectors

* improve commentary

* address review comments

* Conflict resolution

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
bug fix: fixing out of bound slice errors
…199)

* refactoring vector index metadata tracking

* tracking only the relevant segments while merging

* minor code cleanup
dropping the unnecessary masking bits; unit test fix
* reconstruction for all index types

* correct ordering of vector IDs

* remove unused code

* fixed commentary
* changing defaults for nlist and nprobe

* address review

* addressed reviews

* better func naming

* Minor optimizations, fix naming, upgrade to blevesearch/[email protected]

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
…ch (#205)

This PR addresses filtering of deleted documents by moving from post-filtering to search-time filtering.

Passes doc IDs to exclude with each search.
To account for duplicate vectors, every vector is saved with a unique hash so that the selector excludes only the specific vector ID. If identical vectors shared identical hashes, vectors would be (unnecessarily) excluded whenever other identical vectors in the segment were deleted.

* Upgrade to [email protected]

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
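The unique-hash idea above can be sketched as follows. This is an illustrative scheme, not the PR's exact hash function: mixing the doc number into the hash gives identical vectors in different documents distinct IDs, so deleting one document does not exclude the others.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// uniqueVectorHash derives an ID from the vector's contents plus its
// doc number, so two identical vectors in different documents still get
// distinct IDs. Hashing contents alone would make deleting one document
// exclude every identical vector in the segment.
func uniqueVectorHash(docNum uint32, vec []float32) uint64 {
	h := fnv.New64a()
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], docNum)
	h.Write(buf[:])
	for _, v := range vec {
		binary.LittleEndian.PutUint32(buf[:], math.Float32bits(v))
		h.Write(buf[:])
	}
	return h.Sum64()
}

func main() {
	v := []float32{0.1, 0.2}
	// Same vector, different docs: distinct IDs, so the search-time
	// selector can exclude one without excluding the other.
	fmt.Println(uniqueVectorHash(1, v) != uniqueVectorHash(2, v))
}
```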
- accommodates user choice of optimising for speed/recall when creating an index - it does this by changing the nprobe count based on the index type.
- persists this so that this is adhered to during the merge process as well.

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
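The speed/recall choice above boils down to a policy like the following sketch. The percentages are made up for illustration; the actual nprobe derivation in the PR may differ.

```go
package main

import "fmt"

// chooseNprobe is an illustrative policy: searches optimized for recall
// probe a larger fraction of the IVF index's nlist centroids than those
// optimized for latency.
func chooseNprobe(nlist int, optimizeFor string) int {
	pct := 0.01 // latency-optimized: probe ~1% of centroids
	if optimizeFor == "recall" {
		pct = 0.10 // recall-optimized: probe ~10% of centroids
	}
	nprobe := int(float64(nlist) * pct)
	if nprobe < 1 {
		nprobe = 1 // always probe at least one centroid
	}
	return nprobe
}

func main() {
	fmt.Println(chooseNprobe(1024, "latency"), chooseNprobe(1024, "recall"))
}
```

Persisting the chosen setting, as the change above does, keeps merged segments consistent with the user's original choice.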
* when the inverted index section processes vector fields as well, it can make allocations for content it is not supposed to process (for example, vector content). This PR fixes that oversight.
* during the merge process, for every field we currently allocate a couple of slices of capacity = number of segments; these allocations can be reused.
* currently, the freeing of the reconstructed vector indexes (which can be large at high volume) happens via defer. This holds the memory on the C heap until the end of the function, across the expensive train and add steps for the merged index, which is detrimental overall, particularly when introducing segments under high data-ingestion scenarios.
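The defer-vs-eager-free point can be made concrete with a toy model of the C-heap footprint during a merge (sizes are arbitrary stand-ins, not real index sizes):

```go
package main

import "fmt"

// peakWithEagerFree models freeing each reconstructed index right after
// its vectors are extracted: the peak footprint is one index at a time.
func peakWithEagerFree(sizes []int) int {
	peak, live := 0, 0
	for _, s := range sizes {
		live += s // reconstruct index on the C heap
		if live > peak {
			peak = live
		}
		live -= s // extract vectors, then free immediately
	}
	return peak
}

// peakWithDeferredFree models the old defer logic: every reconstructed
// index stays live until the function returns, so the peak footprint is
// the sum of all of them, held across the expensive train/add phase.
func peakWithDeferredFree(sizes []int) int {
	peak, live := 0, 0
	for _, s := range sizes {
		live += s
		if live > peak {
			peak = live
		}
	}
	return peak
}

func main() {
	sizes := []int{100, 100, 100}
	fmt.Println(peakWithEagerFree(sizes), peakWithDeferredFree(sizes))
}
```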
abhinavdangeti and others added 28 commits September 12, 2024 16:46
+ This change saves substantial compute by no longer iterating through
  a potentially massive vecDocIDMap on every search call.
+ Pivoting to populating vectorIDsToExclude during the segment
  interpret operation - just the one time where we build the
  vecDocIDMap.

Requires: blevesearch/scorch_segment_api#40
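The one-pass idea above can be sketched as follows (simplified types; the real segment code has more structure):

```go
package main

import "fmt"

// segmentVectors sketches the optimization: while interpreting the
// segment we already walk every (vectorID, docNum) pair to build
// vecDocIDMap, so deleted documents' vector IDs can be collected in the
// same pass, instead of re-scanning the full map on every search.
type segmentVectors struct {
	vecDocIDMap        map[int64]uint32 // vector ID -> doc number
	vectorIDsToExclude []int64          // vectors of deleted docs
}

func interpret(pairs map[int64]uint32, deleted map[uint32]bool) *segmentVectors {
	s := &segmentVectors{vecDocIDMap: make(map[int64]uint32, len(pairs))}
	for vecID, docNum := range pairs {
		s.vecDocIDMap[vecID] = docNum
		if deleted[docNum] { // collected once, not per search
			s.vectorIDsToExclude = append(s.vectorIDsToExclude, vecID)
		}
	}
	return s
}

func main() {
	s := interpret(map[int64]uint32{10: 1, 11: 2}, map[uint32]bool{2: true})
	fmt.Println(len(s.vectorIDsToExclude)) // ready for every search call
}
```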
…232)

On account of a regression showcased with MB-61470.

This reverts commit 9e2514f.
- Uses an exponentially weighted moving average of the hits on a particular field (and thereby a particular vector index). The index stays in memory while the average is above a certain threshold; below it, the index is closed and the memory is freed for reuse on the C side of things.
- This ensures we are not keeping the index in memory when there is no query workload on the field in a segment, reducing memory pressure on the C side of operations.

---------

Co-authored-by: Likith B <[email protected]>
Co-authored-by: Abhinav Dangeti <[email protected]>
* minor optimizations and bug fixes

* resolve comment
 - Generalised some of the cache function names to be inclusive of the map
 - Added the map to the cache which will behave the same as the index
 - Except bitmap logic is not part of the cache and the vecs excluded is
calculated outside of the map

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
To include:
* 693b06a Rahul Rampure | MB-61650: Release IDSelectorBatch's batchSelector to avoid memory leak
Includes:
* d8f2ddf Abhi Dangeti | MB-60697: Windows requires nprobe to be of 'long long' type (#24)
* 9bb55f8 Abhi Dangeti | Retain IDSelector's Delete() API for complete-ness
- Defer the stopping of the monitor routine's ticker to release the ticker's resources
bug fix: de-duplicating fieldsInv entry for a segment
* updated quantiser

* Upgrade bleve_index_api, scorch_segment_api

Brings in:
* f4827a8 Aditi Ahuja | MB-60943 - Add option for memory-optimised vector indexes.

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
Refactors IndexOptimizedForMemory -> IndexOptimizedForMemoryEfficient
* MB-62167: Fix windows crash

* vectorIndexIOFlags -> faissIOFlags
…of OMP Threads to be 1 (#247)

* Use the go-faiss OpenMP API for setting the default number of OMP Threads to be 1
* Upgrade to [email protected]
    * ec45499 Rahul Rampure | MB-61930: Improve threading performance

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
Brings in:
* 7531ec8 Rahul Rampure | MB-62221: Fix platform specific behaviour
* MB-61889: support search with params

* Upgrade to [email protected] & [email protected]

* Upgrade to blevesearch/[email protected]

* 88dd5e2 Mohd Shaad Khan | Mb 61889: support search with params

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
#252)

* handling buffer overflow while creating new segmentBase

* cleaning up code instrumentation

* more comments around the edge case
* Set FAISS vector metric to be `MetricInnerProduct`, if the distance metric from bleve is either `Cosine` or `InnerProduct`
* Upgrade to [email protected]

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
* 1bb080b Abhi Dangeti | MB-61889: Address corner case with ivf_nprobe_pct
* Accommodates filtered doc IDs, if required, within a kNN search over a vector index.
* Builds and caches a document to vector ID map for looking up vector IDs of the filtered doc IDs.
* Account for nested vectors
* Upgrade bleve_index_api, scorch_segment_api, go-faiss & workflows

---------

Co-authored-by: Abhinav Dangeti <[email protected]>
The existing implementation of the method
getChunkSize(...) (uint64, error) can, in some cases,
return chunkSize == 0 and err == nil.

The onus is on the caller to check for this possibility
and handle it properly.
Callers often use the returned chunkSize as a divisor, and
a zero chunkSize leads to a panic.
See #209.

This PR updates the method implementation to
always return an error when the returned chunkSize
value is 0.
That way, callers only need to check whether the error is non-nil.

Callers that are OK with a 0 chunkSize can compare the returned
error against ErrChunkSizeZero.
@moshaad7 moshaad7 closed this Sep 12, 2024
@moshaad7 moshaad7 deleted the handle_zero_chunk_size_BACKPORT branch September 12, 2024 11:19