feat(similarity-embedding): Create bulk insert record endpoint #480
base: main
Conversation
:param batch_size: The batch size used for the computation.
:return: The embeddings of the stacktraces.
"""
return self.model.encode(sentences=stacktraces, batch_size=batch_size)
Attempt at bulk encoding stacktraces
Yeah, this should work fine for a single GPU. We may want to modify this later, but let me sync with SRE first.
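For reference on the single-GPU point: if `self.model` here is a sentence-transformers `SentenceTransformer` (an assumption based on the `encode(sentences=..., batch_size=...)` call above), a multi-GPU variant could lean on its multi-process API. The helper name and signature below are hypothetical; this is a sketch, not the PR's implementation.

```python
from sentence_transformers import SentenceTransformer

def bulk_encode_multi_gpu(
    model: SentenceTransformer, stacktraces: list[str], batch_size: int = 32
):
    """Sketch: encode stacktraces across all visible GPUs via a worker pool."""
    pool = model.start_multi_process_pool()  # one worker process per visible device
    try:
        return model.encode_multi_process(stacktraces, pool, batch_size=batch_size)
    finally:
        model.stop_multi_process_pool(pool)
```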
:param records: List of records to be inserted
"""
with Session() as session:
    session.bulk_save_objects(records)
do you happen to know if/how this might fail?
LGTM @corps any thoughts/concerns?
The most likely failure would be a timeout if the bulk insert is too big. One alternative is to write in discretely sized chunks. It's a halting problem, though, so there is never any guarantee that you won't hit a timeout, but as long as we have a lever to adjust (batch size) we'd be good.
To do batch processing you'd essentially break your input up into sized chunks and do something like this:

```python
with Session() as session:
    for chunk in chunks:
        session.bulk_save_objects(chunk)
        session.flush()  # writes to the network but doesn't commit
    session.commit()  # still atomic with regard to the full set of records
```
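To make the batch-size lever concrete, here is a minimal sketch building on the pattern above; `Session` is the session factory already used in this PR, while `chunked`, `insert_in_chunks`, and the chunk-size constant are hypothetical names.

```python
from itertools import islice

BULK_INSERT_CHUNK_SIZE = 100  # the tunable lever mentioned above

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def insert_in_chunks(records, chunk_size=BULK_INSERT_CHUNK_SIZE):
    """Flush per chunk, commit once, so the full set of records stays atomic."""
    with Session() as session:
        for chunk in chunked(records, chunk_size):
            session.bulk_save_objects(chunk)
            session.flush()
        session.commit()
```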
I had a discussion with Tillman about this (after he posted his comment). It will probably be better to either make the bulk insert an asynchronous operation or to move it out of this service into the component that handles doc indexing.
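To illustrate the "asynchronous operation" option only: a minimal sketch using an in-process executor and the chunked-insert helper from the sketch above. A real version would more likely use a task queue or the doc-indexing component, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

_bulk_insert_executor = ThreadPoolExecutor(max_workers=1)  # hypothetical background worker

def bulk_insert_async(records):
    """Accept the request immediately and run the insert in the background."""
    _bulk_insert_executor.submit(insert_in_chunks, records)
    return {"success": True, "status": "pending"}
```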
Thanks for the feedback @ram-senth! Which component handles doc indexing?
Thanks @corps! I'll have a process on the Sentry side that breaks the input into chunks before calling this.
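A sketch of what that Sentry-side chunking process might look like; the endpoint URL, chunk size, and payload shape are assumptions for illustration only.

```python
import requests

SEER_GROUPING_RECORD_URL = "http://seer:9091/v0/issues/similar-issues/grouping-record"  # assumed
CALLER_CHUNK_SIZE = 100  # tune against observed per-batch latency

def post_grouping_records(records):
    """Break the input into chunks on the caller side and POST each chunk."""
    for start in range(0, len(records), CALLER_CHUNK_SIZE):
        chunk = records[start:start + CALLER_CHUNK_SIZE]
        resp = requests.post(SEER_GROUPING_RECORD_URL, json={"data": chunk}, timeout=60)
        resp.raise_for_status()
```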
@jangjodi I'm setting up a call with Jenn for later today to ramp up on the indexing component. I'll add you as optional; if you can join, that would be great.
    data: CreateGroupingRecordsRequest,
) -> BulkCreateGroupingRecordsResponse:
    with sentry_sdk.start_span(
        op="seer.grouping-record", description="grouping record bulk insert"
Random thought: should we just put the sentry_sdk.start_span logic into json_api and derive an op from the URL? Tiny savings of effort.
Not a PR blocker, just a thought.
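To sketch the "derive an op from the URL" idea, assuming json_api wraps Flask-style view functions; the decorator name and op-naming scheme are hypothetical.

```python
import functools

import sentry_sdk
from flask import request

def json_api_span(view_func):
    """Open a span per request and derive the op from the request path."""
    @functools.wraps(view_func)
    def wrapper(*args, **kwargs):
        op = "seer." + request.path.strip("/").replace("/", ".")
        with sentry_sdk.start_span(op=op, description=view_func.__name__):
            return view_func(*args, **kwargs)
    return wrapper
```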
No blockers, just nits.
I have a high-level question and a potential concern. With the current implementation, both the long-running bulk insert calls and the shorter online transactional calls are routed to the same system. Should we be concerned about one interfering with the other? Another point from Tillman was that the current implementation runs synchronously and can take about 30 seconds for a batch of 1000 events.
closes #476