[ENH] Generate IDs when not given in add #2699

Open
wants to merge 60 commits into
base: spike/generate_ids_move_validation

Conversation

spikechroma

@spikechroma spikechroma commented Aug 21, 2024

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • When a user calls add on a collection, they are no longer required to pass in an array of IDs. If none are given, IDs are generated automatically on the server side.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

Author

spikechroma commented Aug 21, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@spikechroma spikechroma changed the title from "generate ids" to "[ENH] Refactor code to extract out unpacking embeddings set from existing validation logic" Aug 21, 2024
@spikechroma spikechroma marked this pull request as ready for review August 21, 2024 21:34
@spikechroma spikechroma changed the title from "[ENH] Refactor code to extract out unpacking embeddings set from existing validation logic" to "[ENH] Generate IDs when not given in upsert and add" Aug 21, 2024
Contributor

@atroyn atroyn left a comment

We discussed offline about moving the ID generation down into the server logic, and whether that would mean having to move the validators further down.

I think the right thing is to move both down into the server; we may want to make the validation logic change before making this one. Let's discuss further what the right answer might be.

@@ -292,6 +296,10 @@ def validate_ids(ids: IDs) -> IDs:
    for id_ in ids:
        if not isinstance(id_, str):
            raise ValueError(f"Expected ID to be a str, got {id_}")

        if len(id_) == 0:
Contributor

On the fence about this still.

Author

we can discuss offline

Contributor

The empty string is a valid identifier, and would still be unique - I have no idea why anyone would want to do this, but let's let them.
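
The relaxed check discussed in this thread could look roughly like the sketch below. It is a standalone illustration, not the PR's code: `validate_ids_sketch` is a hypothetical name, and the duplicate check is an assumption about what else the validator enforces. The only point is that `""` passes the type check and is treated like any other string.

```python
from typing import List


def validate_ids_sketch(ids: List[str]) -> List[str]:
    """Hypothetical stand-in for validate_ids; not Chroma's implementation."""
    if not isinstance(ids, list):
        raise ValueError(f"Expected IDs to be a list, got {ids!r}")
    seen = set()
    for id_ in ids:
        if not isinstance(id_, str):
            raise ValueError(f"Expected ID to be a str, got {id_!r}")
        # No length check: the empty string is accepted like any other ID.
        if id_ in seen:
            # Assumption: IDs must be unique within a single call.
            raise ValueError(f"Expected IDs to be unique, found duplicate {id_!r}")
        seen.add(id_)
    return ids
```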

@@ -181,7 +181,7 @@ def _unpack_embedding_set(

     def _validate_embedding_set(
         self,
-        ids: IDs,
+        ids: Optional[IDs],
Contributor

I don't think ids can be Optional here, since validate_ids fails on None

Author

True. However, if we are generating IDs at the server level, ids should be allowed to be optional here.

chromadb/api/models/Collection.py (outdated, resolved)
chromadb/api/models/AsyncCollection.py (outdated, resolved)
chromadb/api/models/AsyncCollection.py (outdated, resolved)

@@ -458,17 +458,26 @@ def recordsets(
    num_unique_metadata: Optional[int] = None,
    min_metadata_size: int = 0,
    max_metadata_size: Optional[int] = None,
    # ids can only be optional for add operations
    for_add: bool = False,
Collaborator

I think this parameter name is confusing and underspecified - can we clarify it or somehow make this less option-heavy?

chromadb/test/test_api.py (outdated, resolved)
    @staticmethod
    def _generate_ids_when_not_present(
        ids: Optional[List[str]],
        documents: Optional[List[Optional[str]]],
Collaborator

nit: why do you need to pass docs/uris/embeddings into this? It makes more sense to me that this would take the value N that it needs.
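
A count-based helper along the lines suggested here might look like the following. This is a sketch: `generate_ids` is a hypothetical name, and UUID4 strings are an assumption, since the thread does not say what format the generated IDs take.

```python
import uuid
from typing import List


def generate_ids(n: int) -> List[str]:
    """Return n fresh string IDs (assumed format: random UUID4)."""
    if n < 0:
        raise ValueError(f"Expected a non-negative count, got {n}")
    return [str(uuid.uuid4()) for _ in range(n)]
```

The caller would work out N from whichever of documents, URIs, or embeddings is present and pass only that number.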

Collaborator

@HammadB HammadB left a comment

I am not sure if this change is only partway done (or if I have missed something), but as is, this change can't be merged because it's broken on local (not client/server or "single-node") chroma. Putting the ID generation in server/fastapi is not the right way to make this uniform across local and deployed single-node. We should put the change at the level of "ServerAPI" so that it runs at the conceptual server layer. I am happy to go over the code architecture tomorrow to make this clearer / discuss more.

I also think we need more testing against this functionality, since it's additive and doesn't break existing tests. The property test modifications are good, and I think basic point unit tests are warranted too.
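
To illustrate the layering argument, here is a toy model, not Chroma's architecture: `ToyServerAPI` stands in for the shared server-level interface, and `ToyLocalAPI` and `ToyHttpFrontend` are hypothetical entry points for local and client/server use. Because generation lives in the shared class, both paths behave the same, and the functions at the bottom are the kind of point unit tests the review asks for.

```python
import uuid
from typing import Dict, List, Optional


class ToyServerAPI:
    """Hypothetical shared layer; every entry point funnels through add()."""

    def __init__(self) -> None:
        self.records: Dict[str, str] = {}

    def add(self, documents: List[str], ids: Optional[List[str]] = None) -> List[str]:
        if ids is None:
            # Assumption: generated IDs are UUID4 strings.
            ids = [str(uuid.uuid4()) for _ in documents]
        if len(ids) != len(documents):
            raise ValueError("ids and documents must have the same length")
        for id_, doc in zip(ids, documents):
            self.records[id_] = doc
        return ids


class ToyLocalAPI(ToyServerAPI):
    """Hypothetical in-process use: calls the shared layer directly."""


class ToyHttpFrontend:
    """Hypothetical HTTP layer: it only forwards and never generates IDs."""

    def __init__(self, api: ToyServerAPI) -> None:
        self.api = api

    def handle_add(self, payload: dict) -> List[str]:
        return self.api.add(payload["documents"], payload.get("ids"))


def test_ids_generated_on_both_paths() -> None:
    local_ids = ToyLocalAPI().add(["a", "b"])
    http_ids = ToyHttpFrontend(ToyServerAPI()).handle_add({"documents": ["a", "b"]})
    assert len(local_ids) == len(http_ids) == 2
    assert len(set(local_ids)) == 2 and len(set(http_ids)) == 2


def test_explicit_ids_are_kept() -> None:
    assert ToyLocalAPI().add(["a"], ids=["x"]) == ["x"]
```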

Collaborator

HammadB commented Aug 22, 2024

This stack addresses #164 in part, where we agreed validation duplication is preferred.

cc @levand @jeffchuber
