Redshift: Add deduplication #50

Open
colmsnowplow opened this issue Feb 23, 2021 · 0 comments

A rare edge case can occur with the current 'exclude all' duplicates strategy: an event is processed, and a subsequent run contains duplicates of that event along with other, legitimate events. For example:

A run contains a page view with event ID 123. This event is assigned a page view in session index of 1.

A subsequent run contains a duplicate of that event, along with another, legitimate page view event in the same session. The data from that session in this run will be:

page view event - event ID: 123
page view event - event ID: 123
page view event - event ID: 456

In this second run, both copies of the already processed event 123 will be removed by deduplication, and the new event 456 will be assigned a page view in session index of 1.

Page view 123 won't be removed from the table, so we will end up with a session containing two page views that both have a page view in session index of 1.
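
The scenario above can be reproduced with a small simulation. This is only a sketch in Python, not the model's actual SQL: `exclude_all_duplicates`, `process_run`, the `session_id` value and the timestamps are hypothetical, and it assumes the derived table is upserted by event ID.

```python
from collections import Counter


def exclude_all_duplicates(events):
    """Drop every event whose event_id appears more than once in the run."""
    counts = Counter(e["event_id"] for e in events)
    return [e for e in events if counts[e["event_id"]] == 1]


def process_run(table, run_events):
    """Deduplicate a run, index page views per session, upsert by event_id."""
    kept = exclude_all_duplicates(run_events)
    per_session = Counter()
    for e in sorted(kept, key=lambda e: e["collector_tstamp"]):
        per_session[e["session_id"]] += 1
        index = per_session[e["session_id"]]
        # Upsert: rows for event IDs absent from this run are left untouched.
        table[e["event_id"]] = {**e, "page_view_in_session_index": index}
    return table


table = {}
run_1 = [{"event_id": "123", "session_id": "s1", "collector_tstamp": 1}]
run_2 = [
    {"event_id": "123", "session_id": "s1", "collector_tstamp": 1},
    {"event_id": "123", "session_id": "s1", "collector_tstamp": 2},
    {"event_id": "456", "session_id": "s1", "collector_tstamp": 3},
]
process_run(table, run_1)
process_run(table, run_2)

indexes = sorted(
    (event_id, row["page_view_in_session_index"])
    for event_id, row in table.items()
)
print(indexes)  # [('123', 1), ('456', 1)]
```

Both rows for session `s1` end up with an index of 1, which is the collision described above.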


We might solve this by using session_id to update the table, but this feels somewhat fragile.

We can also solve it by implementing better deduplication logic: keep only the first event for each event_id (ordered by collector_tstamp).

The tricky part is that ideally we only keep the first event IF its collector_tstamp is not also duplicated, and remove both otherwise (to avoid a Cartesian join). However, if we remove both, we still have a chance of hitting this issue.
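
One possible reading of that rule, sketched in Python rather than Redshift SQL: the function name `keep_first_unless_tied` and the sample data are hypothetical, and it assumes "duplicated collector_tstamp" means that more than one copy shares the earliest timestamp for an event_id.

```python
from collections import defaultdict


def keep_first_unless_tied(events):
    """Keep the earliest row per event_id; drop the event_id entirely if the
    earliest collector_tstamp is shared by more than one row."""
    groups = defaultdict(list)
    for e in events:
        groups[e["event_id"]].append(e)

    kept = []
    for rows in groups.values():
        earliest = min(r["collector_tstamp"] for r in rows)
        firsts = [r for r in rows if r["collector_tstamp"] == earliest]
        if len(firsts) == 1:
            kept.append(firsts[0])
        # Otherwise the first row is ambiguous, so every copy is removed.
    return kept


run = [
    {"event_id": "123", "collector_tstamp": 1},
    {"event_id": "123", "collector_tstamp": 2},  # later duplicate, dropped
    {"event_id": "456", "collector_tstamp": 3},
    {"event_id": "789", "collector_tstamp": 4},
    {"event_id": "789", "collector_tstamp": 4},  # tied duplicate, both dropped
]
result = [(e["event_id"], e["collector_tstamp"]) for e in keep_first_unless_tied(run)]
print(result)  # [('123', 1), ('456', 3)]
```

Event 789 disappears from the run entirely, so if it had been processed in an earlier run, the index collision could still occur.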

One way out of that is to implement a mechanism that applies the incremental logic to all relevant atomic tables, thereby creating deduplicated _staged tables for every join that might be involved in a customisation.
