-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index sections #165
Index sections #165
Conversation
Thejas-bhat
commented
Jul 18, 2023
•
edited
Loading
edited
- Introduces the concept of section which talks about reformating the index file as a set of sections each section corresponding to a certain type of index.
- Refactors the inverted index data i.e. the term dictionary, postings list etc. as part of a section.
work in progress
68b8c51
to
35abf14
Compare
bb80862
to
69ff4fe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple concerns around API signatures and offset expectations.
numDocs: numDocs, | ||
storedIndexOffset: storedIndexOffset, | ||
fieldsIndexOffset: sectionsIndexOffset, | ||
sectionsIndexOffset: sectionsIndexOffset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come the fieldsIndexOffset and sectionsIndexOffset are the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
essentially sections index content contains the same as fields index + sections information. so, for the fieldsIndexOffset which gives info about the fields in a segment we just consult the sectionsIndexOffset to get that info. Also, the field is needed for backward compatibility purposes as well.
@@ -74,14 +73,14 @@ func mergeSegmentBases(segmentBases []*SegmentBase, drops []*roaring.Bitmap, pat | |||
// wrap it for counting (tracking offsets) | |||
cr := NewCountHashWriterWithStatsReporter(br, s) | |||
|
|||
newDocNums, numDocs, storedIndexOffset, fieldsIndexOffset, docValueOffset, _, _, _, err := | |||
newDocNums, numDocs, storedIndexOffset, fieldsIndexOffset, docValueOffset, _, _, _, sectionsIndexOffset, err := | |||
MergeToWriter(segmentBases, drops, chunkMode, cr, closeCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try and keep the signature of InitSegmentBase and MergeToWriter similar in terms of the index offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean interm.convert()
's signature over here?
a384c5b
to
b8f406e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unit test TestMergeWithEmptySegment seems to fail for me with these new changes - must be an issue on my end since it passes in CI
// as part of the merge API, write the merged data to the writer and also track | ||
// the starting offset of this newly merged section data. | ||
Merge(opaque map[int]resetable, segments []*SegmentBase, drops []*roaring.Bitmap, fieldsInv []string, | ||
newDocNumsIn [][]uint64, w *CountHashWriter, closeCh chan struct{}) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newDocNumsIn should also have uint32 no? like newDocNums[3][4] = x will mean the new Doc Num for a document in segment 3 with docnumber 4 will be x. So x has to be uint32 only.
* initial sections code for dense vector index --------- Co-authored-by: Abhi Dangeti <[email protected]>
the main reason why I kept the docNum's type to be uint64 was to ensure that we don't break anything around the existing code - I was being conservative to make additional changes to the existing function signatures etc. do you all think its better to finalize whether to use uint32 or uint64 for docNum (after testing it better for eg) in a later PR? or do you want me to include this in the current PR? |
@Thejas-bhat I pushed that last commit up to unbreak the build. The type of the faissIndexSection's Process didn't match with the section interface's Process. We can talk about changing the type later if really needed. |
* index sections * faiss index section + inverted text index section Co-authored-by: Marty Schoch <[email protected]> Co-authored-by: Abhi Dangeti <[email protected]>