Right now we convert from numpy types and back throughout the code. Besides being messy, this also creates additional memory overhead which can significantly impact performance when ingesting data, because it may cause a user's system to start hard swapping.
When adding 16,384 embeddings, the memory used at the end of Collection.add() is 594 MiB. When adding 32,768 embeddings, it is 996 MiB. Thus, the additional 16,384 embeddings cost (996 − 594) MiB ≈ 402 MiB, i.e. about 25 KiB per 384-dimension embedding during inserts. The minimum byte size of a 384-dimension f32 embedding is 4 bytes × 384 = 1,536 bytes ≈ 1.5 KiB, so about 16x more memory is used than the theoretical minimum.
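As a rough illustration of where per-element overhead like this comes from (a minimal sketch, independent of Chroma's actual code), converting a single 384-dimension float32 embedding to a list of Python floats inflates it roughly eightfold before any further copies are made:

```python
import sys

import numpy as np

emb = np.random.rand(384).astype(np.float32)
print(emb.nbytes)  # 1536 bytes: 4 bytes x 384, the theoretical minimum

as_list = emb.tolist()
# Each element is now a standalone Python float object (~24 bytes on 64-bit
# CPython) plus an 8-byte pointer slot in the list.
print(sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list))  # ~12 KiB
```

Every additional live copy of such lists (for validation, batching, serialization, and so on) multiplies the overhead further, which would be consistent with the ~16x figure measured above.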
This also might be a good opportunity to improve HTTP performance by using a serialization format other than JSON, which we know to be quite inefficient for large arrays of floating point numbers.
If it's possible to use a compact serialization that can be converted efficiently to Numpy arrays, it would improve performance quite a bit across the board.
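For a sense of the gap (a sketch, not a proposal for any particular wire format), compare JSON text against raw float32 bytes that deserialize straight into a numpy array:

```python
import json

import numpy as np

emb = np.random.rand(384).astype(np.float32)

# JSON renders every float as decimal text, typically ~20 characters each.
json_payload = json.dumps(emb.tolist()).encode()
print(len(json_payload))  # roughly 7-8 KiB for a single 384-dim embedding

# Raw native-endian float32 bytes are minimum-size and round-trip into
# numpy without creating any per-element Python objects.
raw = emb.tobytes()
print(len(raw))  # exactly 4 * 384 = 1536 bytes
assert np.array_equal(np.frombuffer(raw, dtype=np.float32), emb)
```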
This is a good point, I think we will punt it out of this milestone, but we should create an issue for it. @levand could you please do that and link it here?
Numpy Everywhere
Right now we convert from numpy types and back throughout the code. Besides being messy, this also creates additional memory overhead which can significantly impact performance when ingesting data, because it may cause a user's system to start hard swapping.
From @codetheweb's profiling:
> When adding 16,384 embeddings, the memory used at the end of `Collection.add()` is 594 MiB. When adding 32,768 embeddings, the memory used at the end of `Collection.add()` is 996 MiB. Thus, an additional 25 KiB is used per 384-dimension embedding during inserts. The minimum byte size of a 384-dimension f32 embedding is 4 bytes × 384 = 1,536 bytes ≈ 1.5 KiB, so there's about 16x more memory used than the theoretical minimum.

Profiling (https://pypi.org/project/memory-profiler/) traced the overhead to two call sites:
- `chroma/chromadb/api/models/CollectionCommon.py`, line 559 (at commit `f66b47d`)
- `chroma/chromadb/api/segment.py`, line 358 (at commit `f66b47d`)
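For readers without the code in front of them, the pattern at issue looks roughly like this (a hypothetical illustration, not the actual Chroma code at those lines):

```python
import numpy as np

batch = np.random.rand(1_000, 384).astype(np.float32)

as_lists = batch.tolist()  # numpy -> 384,000 individual Python float objects
back = np.array(as_lists, dtype=np.float32)  # and back again before indexing

assert np.array_equal(batch, back)  # lossless, but with large transient overhead
```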
We stand to gain a lot and lose little, if anything, by sticking to a numpy representation of embeddings throughout.
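As a hedged sketch of what that could look like (a hypothetical helper, not Chroma's actual API), validation can operate on the array directly instead of round-tripping through Python lists:

```python
import numpy as np

def validate_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical numpy-native validator: checks shape, dtype, and
    finiteness without materializing per-element Python objects."""
    arr = np.ascontiguousarray(embeddings, dtype=np.float32)
    if arr.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_embeddings, dim)")
    if not np.isfinite(arr).all():
        raise ValueError("embeddings contain NaN or Inf")
    return arr
```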