
Anserini: replace verbose json-based vector format with more compact binary encoding #31

lintool opened this issue Mar 21, 2024 · 14 comments



lintool commented Mar 21, 2024

Currently, for HNSW indexing in Anserini, we're reading a very verbose JSON text-based format, which is inefficient. We want to replace it with a more compact binary encoding.
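For anyone picking this up, here's roughly what the verbose format looks like and why it's bulky (the exact field names are an assumption for illustration; check the collection docs):

```python
import json

# One line of the current verbose format (field names assumed for illustration):
# every float is serialized as decimal text, so a 768-dim vector costs several
# kilobytes per document before compression.
line = '{"docid": "doc0", "vector": [0.0123456789, -0.0456789012, 0.0789012345]}'

record = json.loads(line)
print(record["docid"], len(record["vector"]))  # doc0 3
```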

Additional background:

safetensors seems like the best bet.

If you want to work on this task, get started by doing the BEIR regressions here: https://github.com/castorini/anserini?tab=readme-ov-file#%EF%B8%8F-end-to-end-regression-experiments

In particular, do the BGE regressions on NFcorpus, which aligns with the onboarding exercise. If your personal machine isn't big enough to run the regression, the student Linux environment should be sufficient.


lintool commented Mar 21, 2024

Here are the steps I see to accomplishing this task:

  1. Understand the problem; repro BGE/NFcorpus based on the description above.
  2. Decide on a technical solution. I'm leaning to safetensors right now but open to discussion.
  3. Take the json NFcorpus data and re-encode it in safetensors - experiment with encoding/decoding to make sure you can convert the data losslessly (see the sketch after this list). The encoding can be done in Python, but the decoding has to be done from Java (since ultimately the indexing has to be done from Java).
  4. Rewrite the current HNSW indexer to use the new binary format instead of the json text format.
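A minimal Python sketch of the step-3 round trip, assuming the JSONL lines carry `docid` and `vector` fields and that float32 precision is what we want to preserve (the Java-side decoder is a separate piece):

```python
import json
import numpy as np
from safetensors.numpy import save_file, load_file

# Read the (assumed) JSONL format: one {"docid": ..., "vector": [...]} object per line.
docids, vectors = [], []
with open("vectors.part00.jsonl") as f:
    for line in f:
        record = json.loads(line)
        docids.append(record["docid"])
        vectors.append(record["vector"])

# Stack into a single float32 matrix and write it out with safetensors.
matrix = np.array(vectors, dtype=np.float32)
save_file({"vectors": matrix}, "vectors.safetensors")

# Load it back and check the round trip is bit-identical at float32.
restored = load_file("vectors.safetensors")["vectors"]
assert np.array_equal(matrix, restored)
```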


Panizghi commented Mar 31, 2024

This is the base encoding. I am trying to do all the data conversion within Python first, and then move everything over for integration testing between Java and Python with help from @17Melissa:
https://colab.research.google.com/drive/1hgdTtRyT3NcpZolT9OIPJsmGqTqureIm#scrollTo=fQ0WlHYir8o2


lintool commented Mar 31, 2024

@Panizghi good start. IIRC, @arjenpdevries's suggestion was one safetensors file for the actual doc vectors, and another one for the docids. Put both into a single directory?
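One possible layout for that, sketched in Python; the directory name and the padded-uint8 docid encoding are just assumptions, not a settled design (safetensors only stores numeric tensors, so string docids need some byte-level representation):

```python
import os
import numpy as np
from safetensors.numpy import save_file

# Placeholder inputs: docids is a list of strings, matrix holds the float32 vectors.
docids = ["doc0", "doc1"]
matrix = np.zeros((2, 768), dtype=np.float32)

os.makedirs("nfcorpus-bge", exist_ok=True)

# One safetensors file for the doc vectors...
save_file({"vectors": matrix}, "nfcorpus-bge/vectors.safetensors")

# ...and a second one for the docids, stored as fixed-width UTF-8 bytes (uint8),
# since safetensors has no native string tensor type.
width = max(len(d.encode("utf-8")) for d in docids)
id_bytes = np.zeros((len(docids), width), dtype=np.uint8)
for i, d in enumerate(docids):
    b = d.encode("utf-8")
    id_bytes[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
save_file({"docids": id_bytes}, "nfcorpus-bge/docids.safetensors")
```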


Panizghi commented Apr 1, 2024

Definitely looking into it. In that case, do we want a mapping between the tensors or just two separate sets of tensors?


lintool commented Apr 1, 2024

Just two separate sets for now.

@17Melissa

Just tried saving the docids and vectors from vectors.part00.jsonl into safetensors, organized in one directory with help from @Panizghi:

https://colab.research.google.com/drive/1uP5PDdplQDBp_Pd7lyh4FB5qKym-E-hR#scrollTo=x-EDkEw-yaSO

@Panizghi

I have a base draft for the DocumentGenerator & Collection: castorini/anserini@ff5aea7. I'm not sure whether using Jython is the best approach, but for testing it might not be the worst for now :)

PS: There's still a bit of a roundabout with the script before it can be tested, but in the meantime it would be great to get some feedback!


lintool commented Apr 24, 2024

Hi @Panizghi, thanks for pushing this forward. Can we avoid introducing Jython as a new dependency and write this in pure Java?
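For what it's worth, the safetensors layout is simple enough that a pure-Java reader only needs standard library I/O. Here's a Python sketch of the parsing steps (filename and tensor name are placeholders) that maps directly onto DataInputStream/ByteBuffer code:

```python
import json
import struct
import numpy as np

# The safetensors layout a pure-Java reader has to handle:
#   bytes 0..7   : little-endian u64 = length N of the JSON header
#   next N bytes : UTF-8 JSON mapping tensor names to dtype/shape/data_offsets
#   remainder    : one contiguous byte buffer; data_offsets are relative to its start
with open("vectors.safetensors", "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))
    buffer = f.read()

info = header["vectors"]   # e.g. {"dtype": "F32", "shape": [n, d], "data_offsets": [begin, end]}
begin, end = info["data_offsets"]
vectors = np.frombuffer(buffer[begin:end], dtype=np.float32).reshape(info["shape"])
```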

@Panizghi

Hi, sure thing. I'm trying my best to write the interpreter myself without introducing anything new :) @17Melissa gave me new insight that I will add, and I'll keep you posted soon!


Panizghi commented Jun 2, 2024

Hi! The PR is open at castorini/anserini#2515; the test we tried was with nfcorpus.
The logic I'm trying to integrate into Anserini comes from https://github.com/Panizghi/SafeTensorDeserializer :)


lintool commented Aug 27, 2024

@valamuri2020 is also working on this.

Should we also consider https://parquet.apache.org/ as an alternative format to safetensors?

@valamuri2020

Parquet format implementation: castorini/anserini#2582

The nfcorpus data is 17.5 MB in Parquet format vs. 21.2 MB with safetensors.
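For reference, the Parquet side can be produced with a few lines of pyarrow; a sketch assuming docid/vector pairs as before (the column names and compression codec here are just choices for illustration, not necessarily what the PR does):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder inputs: docids (strings) and a float32 matrix of doc vectors.
docids = ["doc0", "doc1"]
matrix = np.zeros((2, 768), dtype=np.float32)

# Each vector becomes a list<float> cell next to its docid; Parquet then applies
# column-chunk compression, which is likely where the size difference comes from.
table = pa.table({
    "docid": pa.array(docids, type=pa.string()),
    "vector": pa.array(matrix.tolist(), type=pa.list_(pa.float32())),
})
pq.write_table(table, "vectors.parquet", compression="zstd")
```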

@arjenpdevries

That's pretty cool - is it due to much better compression happening in the parquet writer?


lintool commented Sep 12, 2024

Okay @valamuri2020 one more wrinkle. The current jsonl files you're working with were originally created from Faiss in the following pipeline by @MXueguang : Faiss -> jsonl -> parquet.

This might be lossy, so I'd like you to write a converter directly from Faiss, i.e., read from Faiss and write to Parquet.

Then feed that into the existing pipeline.

The "ground truth" Faiss indexes are here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L4437
