-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add projection extractors and use them consistently & add flow_published_at
#1098
Conversation
3ec20d8
to
94fc0d4
Compare
flow_document
projection which extracts date-time from document UUIDflow_published_at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the Extractor
concept, it certainly feels like the right abstraction here. I had a couple of questions, but other than that this LGTM
) | ||
.ok() | ||
}) { | ||
return Err(Cow::Owned(serde_json::Value::String( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this unnecessarily overloading the meaning of Result
? It feels like this should be an enum instead
enum QueryResult<'s, 'n, N: AsNode> {
Resolved(&'n N),
Computed(Cow<'s, serde_json::Value>)
Default(Cow<'s, serde_json::Value>),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It certainly could be a new enum, but using Result
to represent a found / not-found result is in keeping with how the standard library uses it. For example https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search
In this sense the Result
built-ins are nice because you can do things like result.ok()
to discard an alternative value and use it strictly as an Option
over "found" status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But unlike binary_search, this function always returns "something", specifically something that has an implementation of AsNode
.
What would it mean to use query(..)?
, for example? Or even query(..).ok()
? The whole point is to infallibly extract a value from a document, no matter whether that location is actually present or not. The abstraction of Extractor
works so well for exactly this reason: you can hide how a location is extracted, so long as you get some kind of AsNode
value out of query()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm losing the thread here. query(..)
returns an extracted value, which can either be a literal document node (Ok
) or an alternative value (Err
).
.iter() | ||
.map(AsRef::as_ref) | ||
.map( | ||
|field| match projections.binary_search_by_key(&field, |p| &p.field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same as https://github.com/estuary/flow/pull/1098/files#diff-bf5f2daed95a35cac89611b90ad1e76e41c9ca3ea3ba85e46c8132d749d93912R27? Or rather, why do a binary search here but an .iter().find()
up there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
projections are guaranteed to be in lexicographic field order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't we searching through a list of projections in both places though? Or is it that the call to projections.sort_by_key(|p| !p.explicit);
breaks the guarantee of lexicographic ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're searching using different predicates. Projections are in field
order but not ptr
order, so searching for a key pointer has no ordering guarantee.
Extractor is a new type which will encapsulate the means by which document locations are extracted. It will wrap up details like default values, content encoding, UUID timestamp extraction, and other behaviors that are ultimately configured from the collection JSON schema. doc::combine is switched over to use Extractor. Schema reductions are *not* done in terms of projections, but use only JSON pointers, so it's necessary to have a differentiated mechanism for comparing JSON values in the context of document reductions. For reductions, there are no default values and we instead order `undefined` as _before_ any other JSON value type.
Update all uses of key extraction and doc::Combiner to use the doc::Extractor pattern. Generally this means threading through projections so that a doc::Extractor can be built from either key or field lookups of those projections. Add an `extractors` crate which encapsulates building extractors from an array of key pointers or field names.
Create a `flow_published_at` projection for the /_meta/uuid, which is over-rideable through a user-provided projection of /_meta/uuid/date-time Use inference of format: "date-time", contentEncoding: "uuid" to communicate that the timestamp contained within the UUID is to be extracted and formatted as an RFC3339 date-time. Update doc::Extractor to handle this case. Issue #934
94fc0d4
to
9a58b22
Compare
This is a series of refactors building off of the recent shuffle-service refactor, which introduces a
doc::Extractor
type that manages the details of querying / extracting / comparing the application of a projection to a document.doc::Extractor
implements handling for default values and, via the last commit, support for extracting date-time values from a document UUID.The whole runtime is updated to use
doc::Extractor
consistently:$parameters
withinderive-sqlite
.The final commit adds a new
flow_published_at
projection (alongsideflow_document
) which extracts an RFC3339 date-time contained within the/_meta/uuid
Gazette UUID.The field name is currently over-rideable by creating an explicit projection of
my_field_name: /_meta/uuid/date-time
. This could change in the future 🤔 : alternatively, a projection of any uuid having JSON schema keywordsformat:"date-time", contentEncoding:"uuid"
could potentially be made to work, but I didn't take it that far and, for the moment, I think we should consider this an undocumented escape hatch.Workflow steps:
flow_published_at
projection.default
value annotations.Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is