
Anserini: replace verbose json-based vector format with more compact binary encoding #31

lintool opened this issue Mar 21, 2024 · 14 comments



lintool commented Mar 21, 2024

Currently, for HNSW indexing in Anserini, we're reading a very verbose JSON text-based format, which is inefficient. We want to replace it with a more compact binary encoding.
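For anyone picking this up, here's roughly what the verbose format looks like and why it's bulky (the exact field names are an assumption for illustration; check the collection docs):

```python
import json

# One line of the current verbose format (field names assumed for illustration):
# every float is serialized as decimal text, so a 768-dim vector costs several
# kilobytes per document before compression.
line = '{"docid": "doc0", "vector": [0.0123456789, -0.0456789012, 0.0789012345]}'

record = json.loads(line)
print(record["docid"], len(record["vector"]))  # doc0 3
```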

Additional background:

safetensors seems like the best bet.

If you want to work on this task, get started by doing the BEIR regressions here: https://github.com/castorini/anserini?tab=readme-ov-file#%EF%B8%8F-end-to-end-regression-experiments

In particular, do the BGE regressions on NFcorpus, which aligns with the onboarding exercise. If your personal machine isn't big enough to run the regression, the student Linux environment should be sufficient.


lintool commented Mar 21, 2024

Here are the steps I see to accomplishing this task:

  1. Understand the problem; repro BGE/NFcorpus based on the description above.
  2. Decide on a technical solution. I'm leaning to safetensors right now but open to discussion.
  3. Take the json NFcorpus data and re-encode it in safetensors - experiment with encoding/decoding to make sure you can convert the data losslessly (see the sketch after this list). The encoding can be done in Python, but the decoding has to be done from Java (since ultimately the indexing has to be done from Java).
  4. Rewrite the current HNSW indexer to use the new binary format instead of the json text format.
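A minimal Python sketch of the step-3 round trip, assuming the JSONL lines carry `docid` and `vector` fields and that float32 precision is what we want to preserve (the Java-side decoder is a separate piece):

```python
import json
import numpy as np
from safetensors.numpy import save_file, load_file

# Read the (assumed) JSONL format: one {"docid": ..., "vector": [...]} object per line.
docids, vectors = [], []
with open("vectors.part00.jsonl") as f:
    for line in f:
        record = json.loads(line)
        docids.append(record["docid"])
        vectors.append(record["vector"])

# Stack into a single float32 matrix and write it out with safetensors.
matrix = np.array(vectors, dtype=np.float32)
save_file({"vectors": matrix}, "vectors.safetensors")

# Load it back and check the round trip is bit-identical at float32.
restored = load_file("vectors.safetensors")["vectors"]
assert np.array_equal(matrix, restored)
```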


Panizghi commented Mar 31, 2024

This is the base encoding. I am trying to do all the data conversion within Python first, and then move everything over for integration testing between Java and Python with help from @17Melissa:
https://colab.research.google.com/drive/1hgdTtRyT3NcpZolT9OIPJsmGqTqureIm#scrollTo=fQ0WlHYir8o2


lintool commented Mar 31, 2024

@Panizghi good start. IIRC, @arjenpdevries's suggestion was one safetensors file for the actual doc vectors, and another one for the docids. Put both into a single directory?
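One possible layout for that, sketched in Python; the directory name and the padded-uint8 docid encoding are just assumptions, not a settled design (safetensors only stores numeric tensors, so string docids need some byte-level representation):

```python
import os
import numpy as np
from safetensors.numpy import save_file

# Placeholder inputs: docids is a list of strings, matrix holds the float32 vectors.
docids = ["doc0", "doc1"]
matrix = np.zeros((2, 768), dtype=np.float32)

os.makedirs("nfcorpus-bge", exist_ok=True)

# One safetensors file for the doc vectors...
save_file({"vectors": matrix}, "nfcorpus-bge/vectors.safetensors")

# ...and a second one for the docids, stored as fixed-width UTF-8 bytes (uint8),
# since safetensors has no native string tensor type.
width = max(len(d.encode("utf-8")) for d in docids)
id_bytes = np.zeros((len(docids), width), dtype=np.uint8)
for i, d in enumerate(docids):
    b = d.encode("utf-8")
    id_bytes[i, :len(b)] = np.frombuffer(b, dtype=np.uint8)
save_file({"docids": id_bytes}, "nfcorpus-bge/docids.safetensors")
```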


Panizghi commented Apr 1, 2024

Definitely looking into it. In that case, do we want a mapping between the tensors or just two separate sets of tensors?


lintool commented Apr 1, 2024

Just two separate sets for now.

@17Melissa

Just tried saving the docids and vectors from vectors.part00.jsonl into safetensors, organized in one directory with help from @Panizghi:

https://colab.research.google.com/drive/1uP5PDdplQDBp_Pd7lyh4FB5qKym-E-hR#scrollTo=x-EDkEw-yaSO

@Panizghi

I have a base draft for the DocumentGenerator & Collection: castorini/anserini@ff5aea7. I'm not sure whether using Jython is the best approach, but for testing it might not be the worst for now :)

PS: There's still a bit of a roundabout with the script before it can be tested, but in the meantime it would be great to get some feedback!


lintool commented Apr 24, 2024

Hi @Panizghi, thanks for pushing this forward. Can we avoid introducing Jython as a new dependency and write this in pure Java?
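For what it's worth, the safetensors layout is simple enough that a pure-Java reader only needs standard library I/O. Here's a Python sketch of the parsing steps (filename and tensor name are placeholders) that maps directly onto DataInputStream/ByteBuffer code:

```python
import json
import struct
import numpy as np

# The safetensors layout a pure-Java reader has to handle:
#   bytes 0..7   : little-endian u64 = length N of the JSON header
#   next N bytes : UTF-8 JSON mapping tensor names to dtype/shape/data_offsets
#   remainder    : one contiguous byte buffer; data_offsets are relative to its start
with open("vectors.safetensors", "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))
    buffer = f.read()

info = header["vectors"]   # e.g. {"dtype": "F32", "shape": [n, d], "data_offsets": [begin, end]}
begin, end = info["data_offsets"]
vectors = np.frombuffer(buffer[begin:end], dtype=np.float32).reshape(info["shape"])
```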

@Panizghi

Hi, sure thing. I'm trying my best to write the interpreter myself without introducing anything new :) @17Melissa gave me new insight that I will add, and I'll keep you posted soon!


Panizghi commented Jun 2, 2024

Hi! The PR is open at castorini/anserini#2515; the test we tried was with nfcorpus.
The logic I'm trying to integrate into Anserini comes from https://github.com/Panizghi/SafeTensorDeserializer :)


lintool commented Aug 27, 2024

@valamuri2020 is also working on this.

Should we also consider https://parquet.apache.org/ as an alternative format to safetensors?

@valamuri2020

Parquet format implementation: castorini/anserini#2582

The nfcorpus data is 17.5 MB in Parquet format vs. 21.2 MB with safetensors.
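For reference, the Parquet side can be produced with a few lines of pyarrow; a sketch assuming docid/vector pairs as before (the column names and compression codec here are just choices for illustration, not necessarily what the PR does):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder inputs: docids (strings) and a float32 matrix of doc vectors.
docids = ["doc0", "doc1"]
matrix = np.zeros((2, 768), dtype=np.float32)

# Each vector becomes a list<float> cell next to its docid; Parquet then applies
# column-chunk compression, which is likely where the size difference comes from.
table = pa.table({
    "docid": pa.array(docids, type=pa.string()),
    "vector": pa.array(matrix.tolist(), type=pa.list_(pa.float32())),
})
pq.write_table(table, "vectors.parquet", compression="zstd")
```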

@arjenpdevries

That's pretty cool - is it due to much better compression happening in the parquet writer?


lintool commented Sep 12, 2024

Okay @valamuri2020 one more wrinkle. The current jsonl files you're working with were originally created from Faiss in the following pipeline by @MXueguang : Faiss -> jsonl -> parquet.

This might be lossy, so I'd like you to write a converter directly from Faiss, i.e., read from Faiss and write to Parquet.

Then feed that into the existing pipeline.

The "ground truth" Faiss indexes are here: https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py#L4437
