
BigQuery: Provide handler for schema evolution for custom events and contexts #54

Open
colmsnowplow opened this issue Mar 8, 2021 · 2 comments

@colmsnowplow
Collaborator

New schema versions create new columns in BigQuery. These columns need to be coalesced, and they also pose the problem that some versions might not exist in the database at all.

We solve this issue for the core enrichment contexts in #52 by using a stored procedure to extract the relevant data into a scratch table.
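To illustrate the coalescing pattern (the context name `com_acme_product`, its fields, and the table path are hypothetical; the versioned column names follow the BigQuery loader's naming convention):

```sql
-- Sketch only: each schema version gets its own repeated-record column,
-- and later versions take priority when coalescing.
SELECT
  event_id,
  COALESCE(
    contexts_com_acme_product_1_0_1[SAFE_OFFSET(0)].sku,  -- newest version first
    contexts_com_acme_product_1_0_0[SAFE_OFFSET(0)].sku   -- older version fills gaps
  ) AS sku
FROM `my_project.my_dataset.events`;
```

`SAFE_OFFSET(0)` returns NULL rather than erroring when the repeated column is empty, which matches the "first item in the array" behaviour described below.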

For the sake of solving the problem at hand, the initial implementation handles only top-level fields that aren't arrays or structs, and uses only the first element of each repeated context column.

This or a similar pattern could be amended to handle those more complicated cases, and offer a generic means of handling schema evolution for any custom BQ column.

The trickiest part is that a changed datatype in a struct, or in an array of structs, makes the column incompatible with its previous form. If we solve this problem, we solve the single biggest pain point of working with Snowplow data in BigQuery.
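One way to work around a datatype change, sketched here with hypothetical names: suppose a `price` field was INT64 in version 1-0-0 but STRING in 1-0-1. Both columns can be cast to a common type before coalescing:

```sql
-- Illustrative only: SAFE_CAST yields NULL instead of failing on
-- unparseable values, so a bad row doesn't break the whole query.
SELECT
  event_id,
  COALESCE(
    SAFE_CAST(contexts_com_acme_product_1_0_1[SAFE_OFFSET(0)].price AS NUMERIC),
    SAFE_CAST(contexts_com_acme_product_1_0_0[SAFE_OFFSET(0)].price AS NUMERIC)
  ) AS price
FROM `my_project.my_dataset.events`;
```

This handles top-level scalar fields; the nested struct/array cases discussed below are harder because the whole column type, not just one leaf, becomes incompatible.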

@colmsnowplow
Collaborator Author

It makes most sense to my mind to structure this as two separate stored procedures, one for events and one for contexts.

Structs and arrays can be handled either by creating new objects of the same type or by flattening - not sure which makes more sense at the moment.

I think it's acceptable to pare it down to a minimal implementation first (e.g. only top-level structs and arrays), since the vast majority of users don't use structs or arrays heavily. Fields that aren't handled can either be omitted, or included but not coalesced - the latter probably makes most sense, since it allows people to handle those cases themselves.

@colmsnowplow
Collaborator Author

A possible approach:

  • A JavaScript UDF takes an array of struct or array field paths as input and outputs, as a struct, the superset of key-value pairs it finds at those paths (prioritising the latest version).
  • Call that UDF within the stored procedure on any array/struct fields found.

We need to check whether a key is always present in the output even when a given row has no value for it. If not, we may need something more complex in the UDF.

Also, the output schema can change across runs, so this could only be used to produce scratch tables (but I don't think there's any way to avoid that).
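The merge step of such a UDF might look like the sketch below (the function name and input shape are assumptions, not an actual implementation): given the same object extracted from several schema versions, ordered oldest to newest, build the superset of keys and let the latest version win.

```javascript
// Sketch of the UDF's merge logic: `versions` is an array of objects,
// one per schema version, ordered oldest first. Missing versions for a
// given row arrive as null and are skipped.
function mergeVersions(versions) {
  const merged = {};
  for (const obj of versions) {
    if (obj == null) continue; // this version's column was empty for the row
    for (const [key, value] of Object.entries(obj)) {
      if (value !== null && value !== undefined) {
        merged[key] = value; // later versions overwrite earlier ones
      }
    }
  }
  return merged;
}
```

In BigQuery this body would be wrapped in `CREATE TEMP FUNCTION ... LANGUAGE js`, with the return type declared as a STRUCT whose fields are the superset across versions - which is exactly why the output can change across runs as the superset grows.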
