
BigQuery: Provide handler for schema evolution for custom events and contexts #54

Open
colmsnowplow opened this issue Mar 8, 2021 · 2 comments

@colmsnowplow
Collaborator

New schema versions create new columns in BigQuery. These columns need to be coalesced, and they also pose the problem that some versions might not exist in the database at all.

We solve this issue for the core enrichment contexts in #52 by using a stored procedure to extract the relevant data into a scratch table.
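To illustrate the coalescing pattern (the context name `com_acme_product`, its fields, and the table path are hypothetical; the versioned column names follow the BigQuery loader's naming convention):

```sql
-- Sketch only: each schema version gets its own repeated-record column,
-- and later versions take priority when coalescing.
SELECT
  event_id,
  COALESCE(
    contexts_com_acme_product_1_0_1[SAFE_OFFSET(0)].sku,  -- newest version first
    contexts_com_acme_product_1_0_0[SAFE_OFFSET(0)].sku   -- older version fills gaps
  ) AS sku
FROM `my_project.my_dataset.events`;
```

`SAFE_OFFSET(0)` returns NULL rather than erroring when the repeated column is empty, which matches the "first item in the array" behaviour described below.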

For the sake of solving the problem at hand, the initial implementation handles only top-level fields that aren't arrays or structs, and uses only the first element of each repeated context column.

This or a similar pattern could be amended to handle those more complicated cases, and offer a generic means of handling schema evolution for any custom BQ column.

The trickiest part is that a changed datatype in a struct, or in an array of structs, makes the column incompatible with its previous form. If we solve this problem, we solve the single biggest pain point of working with Snowplow data in BigQuery.
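One way to work around a datatype change, sketched here with hypothetical names: suppose a `price` field was INT64 in version 1-0-0 but STRING in 1-0-1. Both columns can be cast to a common type before coalescing:

```sql
-- Illustrative only: SAFE_CAST yields NULL instead of failing on
-- unparseable values, so a bad row doesn't break the whole query.
SELECT
  event_id,
  COALESCE(
    SAFE_CAST(contexts_com_acme_product_1_0_1[SAFE_OFFSET(0)].price AS NUMERIC),
    SAFE_CAST(contexts_com_acme_product_1_0_0[SAFE_OFFSET(0)].price AS NUMERIC)
  ) AS price
FROM `my_project.my_dataset.events`;
```

This handles top-level scalar fields; the nested struct/array cases discussed below are harder because the whole column type, not just one leaf, becomes incompatible.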

@colmsnowplow
Collaborator Author

It makes most sense to my mind to structure this as two separate stored procedures, one for events and one for contexts.

Structs and arrays can be handled either by creating new objects of the same type or by flattening - not sure which makes more sense at the moment.

I think it's acceptable to pare it down to a minimal implementation first (e.g. only top-level structs and arrays), since the vast majority of users don't use structs or arrays heavily. Fields that aren't handled can either be omitted, or included but not coalesced - the latter probably makes most sense, since it allows people to handle those cases themselves.

@colmsnowplow
Collaborator Author

A possible approach:

  • A JavaScript UDF takes an array of struct or array field paths as input and outputs, as a struct, the superset of key-value pairs it finds at those paths (prioritising the latest version).
  • Call that UDF within the stored procedure on any array/struct fields found.

We need to check whether a key is always present in the output even when a given row has no value for it. If not, we may need something more complex in the UDF.

Also, the output schema can change across runs, so this could only be used to produce scratch tables (but I don't think there's any way to avoid that).
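The merge step of such a UDF might look like the sketch below (the function name and input shape are assumptions, not an actual implementation): given the same object extracted from several schema versions, ordered oldest to newest, build the superset of keys and let the latest version win.

```javascript
// Sketch of the UDF's merge logic: `versions` is an array of objects,
// one per schema version, ordered oldest first. Missing versions for a
// given row arrive as null and are skipped.
function mergeVersions(versions) {
  const merged = {};
  for (const obj of versions) {
    if (obj == null) continue; // this version's column was empty for the row
    for (const [key, value] of Object.entries(obj)) {
      if (value !== null && value !== undefined) {
        merged[key] = value; // later versions overwrite earlier ones
      }
    }
  }
  return merged;
}
```

In BigQuery this body would be wrapped in `CREATE TEMP FUNCTION ... LANGUAGE js`, with the return type declared as a STRUCT whose fields are the superset across versions - which is exactly why the output can change across runs as the superset grows.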
