Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add projection extractors and use them consistently & add flow_published_at #1098

Merged
merged 3 commits into from
Jul 11, 2023

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Jun 29, 2023

This is a series of refactors building off of the recent shuffle-service refactor, which introduces a doc::Extractor type that manages the details of querying / extracting / comparing the application of a projection to a document.

doc::Extractor implements handling for default values and, via the last commit, support for extracting date-time values from a document UUID.

The whole runtime is updated to use doc::Extractor consistently:

  • when extracting from documents for shuffling
  • when comparing over document keys for combining
  • when extracting fields from documents while draining combiners.
  • when using $parameters within derive-sqlite.

The final commit adds a new flow_published_at projection (alongside flow_document) which extracts an RFC3339 date-time contained within the /_meta/uuid Gazette UUID.

The field name is currently over-rideable by creating an explicit projection of my_field_name: /_meta/uuid/date-time. This could change in the future 🤔 : alternatively, a projection of any uuid having JSON schema keywords format:"date-time", contentEncoding:"uuid" could potentially be made to work, but I didn't take it that far and, for the moment, I think we should consider this an undocumented escape hatch.

Workflow steps:

  • All collections will have a new, available flow_published_at projection.
  • Collections which don't require their keys will now combine correctly with respect to defined default value annotations.

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@jgraettinger jgraettinger force-pushed the johnny/uuid-timestamp branch 2 times, most recently from 3ec20d8 to 94fc0d4 Compare June 29, 2023 19:48
@jgraettinger jgraettinger changed the title add flow_document projection which extracts date-time from document UUID add projection extractors and use them consistently & add flow_published_at Jun 29, 2023
@jgraettinger jgraettinger marked this pull request as ready for review June 29, 2023 19:56
Copy link
Contributor

@jshearer jshearer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the Extractor concept, it certainly feels like the right abstraction here. I had a couple of questions, but other than that this LGTM

)
.ok()
}) {
return Err(Cow::Owned(serde_json::Value::String(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this unnecessarily overloading the meaning of Result? It feels like this should be an enum instead

enum QueryResult<'s, 'n, N: AsNode> {
  Resolved(&'n N),
  Computed(Cow<'s, serde_json::Value>)
  Default(Cow<'s, serde_json::Value>),
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It certainly could be a new enum, but using Result to represent a found / not-found result is in keeping with how the standard library uses it. For example https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search

In this sense the Result built-ins are nice because you can do things like result.ok() to discard an alternative value and use it strictly as an Option over "found" status.

Copy link
Contributor

@jshearer jshearer Jul 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unlike binary_search, this function always returns "something", specifically something that has an implementation of AsNode.

What would it mean to use query(..)?, for example? Or even query(..).ok()? The whole point is to infallibly extract a value from a document, no matter whether that location is actually present or not. The abstraction of Extractor works so well for exactly this reason: you can hide how a location is extracted, so long as you get some kind of AsNode value out of query().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm losing the thread here. query(..) returns an extracted value, which can either be a literal document node (Ok) or an alternative value (Err).

.iter()
.map(AsRef::as_ref)
.map(
|field| match projections.binary_search_by_key(&field, |p| &p.field) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as https://github.com/estuary/flow/pull/1098/files#diff-bf5f2daed95a35cac89611b90ad1e76e41c9ca3ea3ba85e46c8132d749d93912R27? Or rather, why do a binary search here but an .iter().find() up there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projections are guaranteed to be in lexicographic field order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't we searching through a list of projections in both places though? Or is it that the call to projections.sort_by_key(|p| !p.explicit); breaks the guarantee of lexicographic ordering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're searching using different predicates. Projections are in field order but not ptr order, so searching for a key pointer has no ordering guarantee.

crates/doc/src/extractor.rs Show resolved Hide resolved
Extractor is a new type which will encapsulate the means by which
document locations are extracted. It will wrap up details like default
values, content encoding, UUID timestamp extraction, and other behaviors
that are ultimately configured from the collection JSON schema.

doc::combine is switched over to use Extractor.

Schema reductions are *not* done in terms of projections, but use only
JSON pointers, so it's necessary to have a differentiated mechanism for
comparing JSON values in the context of document reductions. For
reductions, there are no default values and we instead order `undefined`
as _before_ any other JSON value type.
Update all uses of key extraction and doc::Combiner to use the
doc::Extractor pattern. Generally this means threading through
projections so that a doc::Extractor can be built from either key or
field lookups of those projections.

Add an `extractors` crate which encapsulates building extractors from an
array of key pointers or field names.
Create a `flow_published_at` projection for the /_meta/uuid, which is
over-rideable through a user-provided projection of
/_meta/uuid/date-time

Use inference of format: "date-time", contentEncoding: "uuid" to
communicate that the timestamp contained within the UUID is to be
extracted and formatted as an RFC3339 date-time.

Update doc::Extractor to handle this case.

Issue #934
@jgraettinger jgraettinger merged commit b27d36c into master Jul 11, 2023
5 checks passed
@jgraettinger jgraettinger deleted the johnny/uuid-timestamp branch July 11, 2023 02:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants