[PERF] Convert embeddings representation to numpy #2803

drewkim · 2024-09-16T16:30:06Z

Description of changes

Summarize the changes made by this PR.
Improvements & Bug fixes

Standardizes our representation of embedding vectors to numpy arrays instead of python lists

Notes

The conversion of a numpy array to a python list via .tolist() consumes an excess amount of memory, and so the focus of this PR is to reduce the number of these operations by representing embeddings as numpy arrays across the board.
- The main concern is consuming too much memory on our users' machines. thus the focus is to reduce meory usage within Local Chroma. This naturally will necessitate changes to Distributed, the approach is to only make minimal, necessary changes to the latter.
The format for Embeddings changes from List[List[Union[float, int]]] to List[NDArray[Union[int32, float32]]].
- Using a 2D numpy array instead of a Python list of 1D numpy arrays would be slightly more performant. However, it would require deep, fundamental, and widespread changes to not only our code but also testing frameworks. I have a commit that starts to go down this path (if you're interested in seeing), and it's not pretty. The changes introduce an incredible amount of risk across the board for little to no gain on the vector we are trying to improve and thus is not worth it.
  - We tested that when creating lists of numpy arrays, Python stores pointers to the arrays and doesn't consume excess memory—using a 2D array thus doesn't help us further achieve our goal of reduced memory usage.
FastAPI does not support serialization of numpy arrays. Thus, upon encoding/decoding embeddings through our middleware, we have no option but to convert back to python lists. This is acceptable, however, because these operations only happen within Distributed.
This PR introduces a breaking change to the API in that embeddings will now be returned as 2D numpy arrays. It makes sense to me both from a memory usage issue perspective (not calling .tolist() upon returning a response) and also generally provides better performance within users.
Embeddings will now be internally standardized to float32/int32 upon serialization (for RPC and byte encoding), which will could cause a little bit of noise within distance functions depending on what the user provides. We should communicate this change in the next version release.
Because of the aforementioned FastAPI serialization issue and the fact that structures like QueryResult and GetResult are threaded all the way across the stack through our middleware, there is a new risk introduced. We could fork Local/Distributed Chroma and create new structures for the former that use numpy arrays. However, this is unideal to me. I decided instead to let these structures support both numpy arrays and pythonic lists for embeddings. When were are on either side of the middleware, we must convert between numpy arrays and lists. This isn’t an issue in of itself, but makes it such that we always have to remember to perform the conversion when using the structures, lest we risk returning embeddings as pythonic lists within Distributed Chroma. New development of clients seems unlikely for now, so this feels like an acceptable risk.
Using the same profiling method, over 10 runs the average memory used was originally 362 MB, going to 264 MB with changes (representing a 28% decrease in memory usage).

Performance details

I ran the following code 10 times each on main and on my branch and averaged the results to test the performance. This was the same code as used in the original [issue](https://github.com//issues/2665)

import chromadb
import numpy as np

NUM_RECORDS = 4096 * 4 

ids = [str(i) for i in range(NUM_RECORDS)]
embeddings = np.random.rand(NUM_RECORDS, 384)
metadatas = [{"name": f"metadata_{i}"} for i in range(NUM_RECORDS)]

client = chromadb.Client()
collection = client.create_collection("test_collection")

collection.add(ids, embeddings)

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

github-actions · 2024-09-16T16:30:18Z

github-actions · 2024-09-16T16:30:20Z

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

github-actions · 2024-09-16T17:12:40Z

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

github-actions · 2024-09-16T17:45:57Z

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

codetheweb · 2024-09-17T22:36:17Z

Using the same profiling method, over 10 runs the average memory used was originally 362 MB, going to 264 MB with changes (representing a 28% decrease in memory usage).

not sure what this referencing; the profiling methodology included in a

Details

would be really helpful :)

chromadb/api/types.py

codetheweb · 2024-09-17T22:42:37Z

chromadb/ingest/__init__.py

    elif encoding == ScalarEncoding.INT32:
-        return array.array("i", vector).tolist()
+        return np.frombuffer(vector, dtype=np.float32)


maybe we should update chromadb/test/property/test_cross_version_persist.py to leave some entries in the WAL between version changes to assert that these serialization formats are compatible

I like the idea but am having trouble getting the test to pass without purging the log between runs—would spot checking manually that we can serialize as an list and deserialize as a numpy array work instead?

hmm yeah probably
also happy to pair to try to get the test working

chromadb/test/property/test_embeddings.py

chromadb/db/mixins/embeddings_queue.py

chromadb/api/segment.py

chromadb/test/property/invariants.py

codetheweb · 2024-09-17T22:55:22Z

chromadb/api/types.py

+PyEmbedding = PyVector
 Embedding = Vector


may be out of scope but the proliferation of types and indirection here is kinda confusing, wondering if just having PyVector and NpVector instead would work

I think out of scope for this PR—it would require changing quite a bit more code across the board and introduce more risk. I also think there's a good reason to keep the idea of an Embedding separate from that of a Vector. While the former is always the latter, there are usages of Vector across the code base where it is not an Embedding.

Yeah agree this is weird but I guess the distinction being made is "vector" is a array of numbers vs "embedding" is a domain specific input to chroma that may change. Which makes sense to separate.

HammadB · 2024-09-19T17:32:44Z

chromadb/api/async_fastapi.py

@@ -449,7 +449,7 @@ async def _add(
        documents: Optional[Documents] = None,
        uris: Optional[URIs] = None,
    ) -> bool:
-        batch = (ids, embeddings, metadatas, documents, uris)
+        batch = (ids, [embedding.tolist() for embedding in embeddings] if embeddings is not None else None, metadatas, documents, uris)


When is embeddings ever None here? I don't understand the defensive check because the input type is not None?

HammadB · 2024-09-19T17:32:53Z

chromadb/api/async_fastapi.py

@@ -465,7 +465,7 @@ async def _update(
        documents: Optional[Documents] = None,
        uris: Optional[URIs] = None,
    ) -> bool:
-        batch = (ids, embeddings, metadatas, documents, uris)
+        batch = (ids, [embedding.tolist() for embedding in embeddings] if embeddings is not None else None, metadatas, documents, uris)


Same comment

embeddings can be None here—both given the method definition and if you trace through the codepath. In _validate_and_prepare_update_request, we call _validate_embedding_set with require_embeddings_or_data set to False, which means we can end up sending self._client._update a None for embeddings.

HammadB · 2024-09-19T17:33:26Z

chromadb/api/fastapi.py

@@ -418,7 +418,7 @@ def _add(
        Adds a batch of embeddings to the database
        - pass in column oriented data lists
        """
-        batch = (ids, embeddings, metadatas, documents, uris)
+        batch = (ids, [embedding.tolist() for embedding in embeddings] if embeddings is not None else None, metadatas, documents, uris)


nit: maybe move the [np] conversion out into a convenience method

HammadB · 2024-09-19T17:52:29Z

chromadb/api/models/Collection.py

@@ -136,14 +137,14 @@ def peek(self, limit: int = 10) -> GetResult:
        Returns:
            GetResult: A GetResult object containing the results.
        """
-        return self._client._peek(self.id, limit)
+        return self._transform_peek_response(self._client._peek(self.id, limit))


Why does async collection not need this change?

Good catch!

HammadB · 2024-09-19T17:57:02Z

chromadb/api/models/CollectionCommon.py

@@ -327,6 +334,9 @@ def _transform_get_response(
        ):
            response["data"] = self._data_loader(response["uris"])

+        if ("embeddings" in include and response["embeddings"] is not None):
+            response["embeddings"] = np.array(response["embeddings"])


we should note: this is a breaking API change and needs documentation

HammadB · 2024-09-19T17:58:02Z

chromadb/api/models/CollectionCommon.py

-        if isinstance(embeddings, np.ndarray):
-            return embeddings.tolist()  # type: ignore
-        return embeddings  # type: ignore
+        return cast(Embeddings, [np.array(embedding) for embedding in embeddings])


question: do we validate the shape of PyEmbedding anywhere?

Do you mean do we validate that the embeddings the user provide have the same dimensionality? If so, we do in chromadb/api/segment.py on line 828 in _validate_dimension.

drewkim changed the title ~~Convert embeddings representation to numpy~~ [Update][Ease of Use] Convert embeddings representation to numpy Sep 16, 2024

drewkim changed the title ~~[Update][Ease of Use] Convert embeddings representation to numpy~~ [PERF] Convert embeddings representation to numpy Sep 16, 2024

drewkim requested review from HammadB, codetheweb and sanketkedia September 16, 2024 22:26

codetheweb linked an issue Sep 17, 2024 that may be closed by this pull request

[Update][Ease of Use] Numpy everywhere #2665

Open

codetheweb reviewed Sep 17, 2024

View reviewed changes

drewkim added 7 commits September 17, 2024 17:13

Convert embeddings representation to numpy

94b9c2b

cleanup

bd0eb42

Clean up some test failures due to np array float conversion

29671ec

Trigger build - updated title

a406ca3

Trigger build - updated title

5a34573

Fix conversion from list back to numpy post deserialization

da25a73

Address comments

36a3bfc

drewkim force-pushed the 09-15-convert_embeddings_representation_to_numpy branch from 21630d8 to 36a3bfc Compare September 18, 2024 01:05

Lint

31eb55b

HammadB reviewed Sep 19, 2024

View reviewed changes

drewkim added 2 commits September 19, 2024 16:40

Addressing comments

514ffbd

lint

07c7e48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PERF] Convert embeddings representation to numpy #2803

[PERF] Convert embeddings representation to numpy #2803

drewkim commented Sep 16, 2024 •

edited

Loading

github-actions bot commented Sep 16, 2024

github-actions bot commented Sep 16, 2024

github-actions bot commented Sep 16, 2024

github-actions bot commented Sep 16, 2024

codetheweb commented Sep 17, 2024

codetheweb Sep 17, 2024

drewkim Sep 19, 2024

codetheweb Sep 19, 2024

codetheweb Sep 17, 2024

drewkim Sep 18, 2024

HammadB Sep 19, 2024

HammadB Sep 19, 2024

drewkim Sep 19, 2024

HammadB Sep 19, 2024

drewkim Sep 19, 2024

HammadB Sep 19, 2024

HammadB Sep 19, 2024

drewkim Sep 19, 2024

HammadB Sep 19, 2024 •

edited

Loading

HammadB Sep 19, 2024

drewkim Sep 19, 2024

[PERF] Convert embeddings representation to numpy #2803

Are you sure you want to change the base?

[PERF] Convert embeddings representation to numpy #2803

Conversation

drewkim commented Sep 16, 2024 • edited Loading

Description of changes

Test plan

github-actions bot commented Sep 16, 2024

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality

github-actions bot commented Sep 16, 2024

github-actions bot commented Sep 16, 2024

github-actions bot commented Sep 16, 2024

codetheweb commented Sep 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HammadB Sep 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

drewkim commented Sep 16, 2024 •

edited

Loading

HammadB Sep 19, 2024 •

edited

Loading