add projection extractors and use them consistently & add `flow_published_at` #1098

jgraettinger · 2023-06-29T04:38:52Z

This is a series of refactors building off of the recent shuffle-service refactor, which introduces a doc::Extractor type that manages the details of querying / extracting / comparing the application of a projection to a document.

doc::Extractor implements handling for default values and, via the last commit, support for extracting date-time values from a document UUID.

The whole runtime is updated to use doc::Extractor consistently:

when extracting from documents for shuffling
when comparing over document keys for combining
when extracting fields from documents while draining combiners.
when using $parameters within derive-sqlite.

The final commit adds a new flow_published_at projection (alongside flow_document) which extracts an RFC3339 date-time contained within the /_meta/uuid Gazette UUID.

The field name is currently over-rideable by creating an explicit projection of my_field_name: /_meta/uuid/date-time. This could change in the future 🤔 : alternatively, a projection of any uuid having JSON schema keywords format:"date-time", contentEncoding:"uuid" could potentially be made to work, but I didn't take it that far and, for the moment, I think we should consider this an undocumented escape hatch.

Workflow steps:

All collections will have a new, available flow_published_at projection.
Collections which don't require their keys will now combine correctly with respect to defined default value annotations.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)

This change is

jshearer

I love the Extractor concept, it certainly feels like the right abstraction here. I had a couple of questions, but other than that this LGTM

jshearer · 2023-07-10T16:23:28Z

crates/doc/src/extractor.rs

+                )
+                .ok()
+            }) {
+                return Err(Cow::Owned(serde_json::Value::String(


Is this unnecessarily overloading the meaning of Result? It feels like this should be an enum instead

enum QueryResult<'s, 'n, N: AsNode> { Resolved(&'n N), Computed(Cow<'s, serde_json::Value>) Default(Cow<'s, serde_json::Value>), }

It certainly could be a new enum, but using Result to represent a found / not-found result is in keeping with how the standard library uses it. For example https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search

In this sense the Result built-ins are nice because you can do things like result.ok() to discard an alternative value and use it strictly as an Option over "found" status.

But unlike binary_search, this function always returns "something", specifically something that has an implementation of AsNode.

What would it mean to use query(..)?, for example? Or even query(..).ok()? The whole point is to infallibly extract a value from a document, no matter whether that location is actually present or not. The abstraction of Extractor works so well for exactly this reason: you can hide how a location is extracted, so long as you get some kind of AsNode value out of query().

I'm losing the thread here. query(..) returns an extracted value, which can either be a literal document node (Ok) or an alternative value (Err).

jshearer · 2023-07-10T16:28:49Z

crates/extractors/src/lib.rs

+        .iter()
+        .map(AsRef::as_ref)
+        .map(
+            |field| match projections.binary_search_by_key(&field, |p| &p.field) {


Is this the same as https://github.com/estuary/flow/pull/1098/files#diff-bf5f2daed95a35cac89611b90ad1e76e41c9ca3ea3ba85e46c8132d749d93912R27? Or rather, why do a binary search here but an .iter().find() up there?

projections are guaranteed to be in lexicographic field order.

https://github.com/estuary/flow/blob/master/go/protocols/flow/flow.proto#L161-L162

Aren't we searching through a list of projections in both places though? Or is it that the call to projections.sort_by_key(|p| !p.explicit); breaks the guarantee of lexicographic ordering?

We're searching using different predicates. Projections are in field order but not ptr order, so searching for a key pointer has no ordering guarantee.

crates/doc/src/extractor.rs

Extractor is a new type which will encapsulate the means by which document locations are extracted. It will wrap up details like default values, content encoding, UUID timestamp extraction, and other behaviors that are ultimately configured from the collection JSON schema. doc::combine is switched over to use Extractor. Schema reductions are *not* done in terms of projections, but use only JSON pointers, so it's necessary to have a differentiated mechanism for comparing JSON values in the context of document reductions. For reductions, there are no default values and we instead order `undefined` as _before_ any other JSON value type.

Update all uses of key extraction and doc::Combiner to use the doc::Extractor pattern. Generally this means threading through projections so that a doc::Extractor can be built from either key or field lookups of those projections. Add an `extractors` crate which encapsulates building extractors from an array of key pointers or field names.

Create a `flow_published_at` projection for the /_meta/uuid, which is over-rideable through a user-provided projection of /_meta/uuid/date-time Use inference of format: "date-time", contentEncoding: "uuid" to communicate that the timestamp contained within the UUID is to be extracted and formatted as an RFC3339 date-time. Update doc::Extractor to handle this case. Issue #934

jgraettinger force-pushed the johnny/uuid-timestamp branch 2 times, most recently from 3ec20d8 to 94fc0d4 Compare June 29, 2023 19:48

jgraettinger changed the title ~~add flow_document projection which extracts date-time from document UUID~~ add projection extractors and use them consistently & add flow_published_at Jun 29, 2023

jgraettinger requested a review from jshearer June 29, 2023 19:56

jgraettinger marked this pull request as ready for review June 29, 2023 19:56

jshearer approved these changes Jul 10, 2023

View reviewed changes

jgraettinger added 3 commits July 10, 2023 22:34

jgraettinger force-pushed the johnny/uuid-timestamp branch from 94fc0d4 to 9a58b22 Compare July 10, 2023 22:37

jgraettinger merged commit b27d36c into master Jul 11, 2023
5 checks passed

jgraettinger deleted the johnny/uuid-timestamp branch July 11, 2023 02:29

jshearer mentioned this pull request Jul 12, 2023

Feature/UUID timestamp extraction #1070

Closed

williamhbaker mentioned this pull request Apr 23, 2024

materializations: _meta/capture_time estuary/connectors#579

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add projection extractors and use them consistently & add `flow_published_at` #1098

add projection extractors and use them consistently & add `flow_published_at` #1098

jgraettinger commented Jun 29, 2023 •

edited

Loading

jshearer left a comment

jshearer Jul 10, 2023

jgraettinger Jul 10, 2023

jshearer Jul 10, 2023 •

edited

Loading

jgraettinger Jul 11, 2023

jshearer Jul 10, 2023

jgraettinger Jul 10, 2023

jgraettinger Jul 10, 2023

jshearer Jul 10, 2023

jgraettinger Jul 11, 2023

add projection extractors and use them consistently & add flow_published_at #1098

add projection extractors and use them consistently & add flow_published_at #1098

Conversation

jgraettinger commented Jun 29, 2023 • edited Loading

jshearer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jshearer Jul 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

add projection extractors and use them consistently & add `flow_published_at` #1098

add projection extractors and use them consistently & add `flow_published_at` #1098

jgraettinger commented Jun 29, 2023 •

edited

Loading

jshearer Jul 10, 2023 •

edited

Loading