feat(similarity-embedding): Create bulk insert record endpoint #480
base: main
Conversation
:param batch_size: The batch size used for the computation.
:return: The embeddings of the stacktraces.
"""
return self.model.encode(sentences=stacktraces, batch_size=batch_size)
Attempt at bulk encoding stacktraces
Yeah, this should work fine for a single GPU. We may want to modify this later, but let me sync with SRE first.
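For reference on the single-GPU point: if `self.model` here is a sentence-transformers `SentenceTransformer` (an assumption based on the `encode(sentences=..., batch_size=...)` call above), a multi-GPU variant could lean on its multi-process API. The helper name and signature below are hypothetical; this is a sketch, not the PR's implementation.

```python
from sentence_transformers import SentenceTransformer

def bulk_encode_multi_gpu(
    model: SentenceTransformer, stacktraces: list[str], batch_size: int = 32
):
    """Sketch: encode stacktraces across all visible GPUs via a worker pool."""
    pool = model.start_multi_process_pool()  # one worker process per visible device
    try:
        return model.encode_multi_process(stacktraces, pool, batch_size=batch_size)
    finally:
        model.stop_multi_process_pool(pool)
```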
:param records: List of records to be inserted
"""
with Session() as session:
    session.bulk_save_objects(records)
do you happen to know if/how this might fail?
LGTM @corps any thoughts/concerns?
The most likely failure would be a timeout if the bulk insert is too big. One alternative is to write in discretely sized chunks. It's a halting problem, though, so there is never any guarantee that you won't hit a timeout, but as long as we have a lever to adjust (batch size) we'd be good.
To do batch processing you'd essentially break your input up into sized chunks and do something like this:

```python
with Session() as session:
    for chunk in chunks:
        session.bulk_save_objects(chunk)
        session.flush()  # writes to the network but doesn't commit
    session.commit()  # still atomic with regard to the full set of records
```
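To make the batch-size lever concrete, here is a minimal sketch building on the pattern above; `Session` is the session factory already used in this PR, while `chunked`, `insert_in_chunks`, and the chunk-size constant are hypothetical names.

```python
from itertools import islice

BULK_INSERT_CHUNK_SIZE = 100  # the tunable lever mentioned above

def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

def insert_in_chunks(records, chunk_size=BULK_INSERT_CHUNK_SIZE):
    """Flush per chunk, commit once, so the full set of records stays atomic."""
    with Session() as session:
        for chunk in chunked(records, chunk_size):
            session.bulk_save_objects(chunk)
            session.flush()
        session.commit()
```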
I had a discussion with Tillman about this (after he posted his comment). It will probably be better to either make the bulk insert an asynchronous operation or to move it out of this service into the component that handles doc indexing.
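To illustrate the "asynchronous operation" option only: a minimal sketch using an in-process executor and the chunked-insert helper from the sketch above. A real version would more likely use a task queue or the doc-indexing component, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

_bulk_insert_executor = ThreadPoolExecutor(max_workers=1)  # hypothetical background worker

def bulk_insert_async(records):
    """Accept the request immediately and run the insert in the background."""
    _bulk_insert_executor.submit(insert_in_chunks, records)
    return {"success": True, "status": "pending"}
```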
Thanks for the feedback @ram-senth! Which component handles doc indexing?
Thanks @corps! I'll have a process on the Sentry side that breaks the input into chunks before calling this.
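A sketch of what that Sentry-side chunking process might look like; the endpoint URL, chunk size, and payload shape are assumptions for illustration only.

```python
import requests

SEER_GROUPING_RECORD_URL = "http://seer:9091/v0/issues/similar-issues/grouping-record"  # assumed
CALLER_CHUNK_SIZE = 100  # tune against observed per-batch latency

def post_grouping_records(records):
    """Break the input into chunks on the caller side and POST each chunk."""
    for start in range(0, len(records), CALLER_CHUNK_SIZE):
        chunk = records[start:start + CALLER_CHUNK_SIZE]
        resp = requests.post(SEER_GROUPING_RECORD_URL, json={"data": chunk}, timeout=60)
        resp.raise_for_status()
```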
@jangjodi I'm setting up a call with Jenn for later today to ramp up on the indexing component. I'll add you as optional; if you can join, that would be great.
    data: CreateGroupingRecordsRequest,
) -> BulkCreateGroupingRecordsResponse:
    with sentry_sdk.start_span(
        op="seer.grouping-record", description="grouping record bulk insert"
Random thought: should we just put the sentry_sdk.start_span logic into json_api and derive an op from the URL? Tiny savings of effort.
Not a PR blocker, just a thought.
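To sketch the "derive an op from the URL" idea, assuming json_api wraps Flask-style view functions; the decorator name and op-naming scheme are hypothetical.

```python
import functools

import sentry_sdk
from flask import request

def json_api_span(view_func):
    """Open a span per request and derive the op from the request path."""
    @functools.wraps(view_func)
    def wrapper(*args, **kwargs):
        op = "seer." + request.path.strip("/").replace("/", ".")
        with sentry_sdk.start_span(op=op, description=view_func.__name__):
            return view_func(*args, **kwargs)
    return wrapper
```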
No blockers, just nits.
I have a high-level question and a potential concern. With the current implementation, both the long-running bulk insert calls and the shorter online transactional calls are routed to the same system. Should we be concerned about one interfering with the other? Another point from Tillman was that the current implementation runs synchronously and can take about 30 seconds for a batch of 1000 events.
closes #476