[Update][Ease of Use] Numpy everywhere #2665

Open
atroyn opened this issue Aug 14, 2024 · 3 comments · May be fixed by #2803
Comments

atroyn commented Aug 14, 2024

Numpy Everywhere

Right now we convert from numpy types and back throughout the code. Besides being messy, this also creates additional memory overhead which can significantly impact performance when ingesting data, because it may cause a user's system to start hard swapping.

From @codetheweb's profiling:

```python
import chromadb
import numpy as np

NUM_RECORDS = 4096 * 4  # 16,384 records

ids = [str(i) for i in range(NUM_RECORDS)]
# np.random.rand returns float64 values, shape (NUM_RECORDS, 384)
embeddings = np.random.rand(NUM_RECORDS, 384)
# Built here but not passed to add() below
metadatas = [{"name": f"metadata_{i}"} for i in range(NUM_RECORDS)]

client = chromadb.Client()
collection = client.create_collection("test_collection")

collection.add(ids, embeddings)
```

When adding 16,384 embeddings, the memory used at the end of Collection.add() is 594 MiB. When adding 32,768 embeddings, the memory used at the end of Collection.add() is 996 MiB. The extra 16,384 embeddings therefore cost 996 - 594 = 402 MiB, or about 25 KiB per 384-dimension embedding during inserts. The minimum byte size of a 384-dimension f32 embedding is 4 * 384 = 1,536 bytes = 1.5 KiB, so there’s roughly 16x to 17x more memory used than the theoretical minimum.
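The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation; the helper names are made up for illustration, and the only inputs are the two measurements quoted above. One detail worth noting: `np.random.rand` produces float64 values, so the input array in the script is 3 KiB per row, while the 1.5 KiB figure assumes embeddings are stored as f32.

```python
KIB = 1024
MIB = 1024 * 1024


def per_embedding_overhead_kib(mem_small_mib, mem_large_mib, n_small, n_large):
    """Marginal memory per embedding (KiB), from two measurements.

    Hypothetical helper: it just divides the memory difference by the
    difference in record counts.
    """
    extra_bytes = (mem_large_mib - mem_small_mib) * MIB
    return extra_bytes / (n_large - n_small) / KIB


def min_embedding_kib(dim, bytes_per_value=4):
    """Size of one dense embedding (KiB); 4 bytes per value means f32."""
    return dim * bytes_per_value / KIB


# Measurements quoted in this issue: 594 MiB at 16,384 and 996 MiB at 32,768.
overhead = per_embedding_overhead_kib(594, 996, 16_384, 32_768)
minimum_f32 = min_embedding_kib(384)      # f32 storage
minimum_f64 = min_embedding_kib(384, 8)   # float64, as produced by np.random.rand
ratio = overhead / minimum_f32

print(f"overhead per embedding: {overhead} KiB")
print(f"f32 minimum: {minimum_f32} KiB, float64 minimum: {minimum_f64} KiB")
print(f"overhead / f32 minimum: {ratio:.2f}x")
```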

Profiling (https://pypi.org/project/memory-profiler/) revealed that:

We stand to gain a lot and don't lose much / anything by sticking to a numpy representation of embeddings.
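To give a rough sense of why round-tripping through Python lists is expensive, here is a minimal sketch that compares the footprint of one 384-dimension f32 embedding as a numpy array and as a list of Python floats. It does not touch Chroma's code; `list_of_floats_bytes` is a hypothetical helper, and the list figure depends on the Python build (it relies on `sys.getsizeof`, which reports per-object sizes in CPython).

```python
import sys

import numpy as np


def list_of_floats_bytes(values):
    """Approximate footprint of a list of Python floats.

    Counts the list object itself (its pointer array) plus every float
    object it references. Hypothetical helper for illustration only.
    """
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


embedding = np.random.rand(384).astype(np.float32)

# The conversion this issue proposes to avoid: ndarray -> list -> ndarray.
as_list = embedding.tolist()
round_tripped = np.asarray(as_list, dtype=np.float32)

array_bytes = embedding.nbytes          # raw buffer: 384 * 4 bytes
list_bytes = list_of_floats_bytes(as_list)

print(f"ndarray buffer: {array_bytes} bytes")
print(f"list of floats: {list_bytes} bytes")
print(f"ratio: {list_bytes / array_bytes:.1f}x")
```

The values survive the round trip unchanged, so the intermediate list buys nothing except a temporary copy that is several times larger than the array it came from.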

@atroyn atroyn added this to the Local Chroma v.0.6 milestone Aug 14, 2024
@atroyn atroyn added Local Chroma An improvement to Local (single node) Chroma cleanup labels Aug 14, 2024
atroyn commented Aug 14, 2024

This is likely related to / coupled to #2292

levand commented Aug 19, 2024

This also might be a good opportunity to improve HTTP performance by using a serialization format other than JSON, which we know to be quite inefficient for large arrays of floating point numbers.

If it's possible to use a compact serialization that can be converted efficiently to Numpy arrays, it would improve performance quite a bit across the board.
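As a rough illustration of the size difference, the sketch below encodes the same batch of embeddings as JSON and as raw little-endian f32 bytes, then decodes both back to numpy. The raw-bytes layout and the function names are assumptions made for this example, not Chroma's wire format or a proposal for one; a real format would also need to carry the dimension and dtype.

```python
import json

import numpy as np


def encode_json(embeddings):
    """Encode as a JSON array of arrays (text)."""
    return json.dumps(embeddings.tolist()).encode("utf-8")


def decode_json(payload):
    """Parse JSON into Python lists, then copy into an ndarray."""
    return np.asarray(json.loads(payload), dtype=np.float32)


def encode_raw(embeddings):
    """Encode as contiguous little-endian f32 bytes (assumed layout)."""
    return np.ascontiguousarray(embeddings, dtype="<f4").tobytes()


def decode_raw(payload, dim):
    """View the bytes as f32 and reshape; no per-value Python objects."""
    return np.frombuffer(payload, dtype="<f4").reshape(-1, dim)


rng = np.random.default_rng(0)
batch = rng.random((8, 384), dtype=np.float32)

json_payload = encode_json(batch)
raw_payload = encode_raw(batch)

print(f"JSON: {len(json_payload)} bytes, raw f32: {len(raw_payload)} bytes")
```

Note that `np.frombuffer` returns a read-only view over the received bytes, so decoding avoids building intermediate Python lists; a caller that needs to mutate the result would have to copy it.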

atroyn commented Aug 19, 2024

This is a good point. I think we will punt it out of this milestone, but we should create an issue for it. @levand could you please do that and link it here?

@atroyn atroyn assigned atroyn and drewkim and unassigned atroyn Sep 9, 2024
@codetheweb codetheweb linked a pull request Sep 17, 2024 that will close this issue
@drewkim drewkim linked a pull request Sep 18, 2024 that will close this issue