-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
end-to-end continuous schema inference #1178
Conversation
1798ce0
to
76722bf
Compare
Remove the migration shim which dynamically builds collections from their user-level specs. Update flowctl to directly retrieve built specs rather than the user-level spec.
76722bf
to
28b021e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this is looking pretty good to me. I left a few minor questions and comments which would be good to resolve prior to merging.
@@ -62,7 +62,7 @@ impl BuildOutput { | |||
.iter() | |||
.map(|e| Error { | |||
scope: Some(e.scope.to_string()), | |||
detail: e.error.to_string(), | |||
detail: format!("{:#}", e.error), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL the alternate format will include causes 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment too.
crates/validation/src/collection.rs
Outdated
// Add a definition for an inferred schema if it's provided. | ||
// Note that we previously filtered the set of retrieved schemas to those | ||
// having a read schema matching super::REF_INFERRED_SCHEMA_PATTERN. | ||
if let Some(inferred_bundle) = inferred_bundle { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if there is no inferred schema yet because this is a new collection? For example, what would happen if I published:
collections:
acmeCo/some/new/collection:
writeSchema:
properties:
id: { type: string}
required: [id]
readSchema:
allOf:
- $ref: flow://write-schema
- $ref: flow://inferred-schema
key: [/id]
It seems like it'd fail to resolve the inferred schema ref. I'm wondering if we should use some default placeholder (true
?) if there isn't an inferred schema yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent is definitely to use a placeholder {}
(equivalent to true
). I definitely had it wired up that way, and confirmed it works as intended in end-to-end testing, but now I'm doubting myself and wondering if i screwed up a refactor. I'll double-check.
ETA: Yeeeep, I goofed this in a refactor. Fixing and adding more test cases...
crates/validation/src/lib.rs
Outdated
// string would be quote-escaped. | ||
// * It must be a schema keyword ($ref cannot be, say, a property) because | ||
// "flow://inferred-schema" is not a valid JSON schema and would error at build time. | ||
const REF_INFERRED_SCHEMA_PATTERN: &str = "\"$ref\":\"flow://inferred-schema\""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is sensitive to whitespace around the :
, and it took me a while to prove to myself that there aren't any code paths that could result in a "$ref" : "flow://inferred-schema"
. I think this works because Loader::load_resource_content
will always re-serialize the schema without whitespace. Assuming that's correct, I think it at least warrants a comment.
I'll suggest going just a little further and factoring this out into fn uses_*_schema_ref(schema: &str) -> bool
instead of using schema.contains(REF_*)
, which can carry a comment about schema serialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it felt brittle. I've moved this into members of models::Schema
, using whitespace-invarant regex's under the hood.
28b021e
to
bed8485
Compare
I'm concurrently doing another round of end-to-end testing, but I think this is right. PTAL |
|
Introduce well-known URIs `flow://write-schema` and `flow://inferred-schema` which are available to collection read schemas. These URIs may be `$ref`'d from within a read schema, and having done so, their definitions are inlined into the effective read schema of the built collection with every publication. Extend the `validation::ControlPlane` trait to enable resolution of collections to their current inferred schemas. Implement within `flowctl` through our API, and within `agent` via out-of-transaction SQL lookups done using the agent's Postgres pool. Update the `agent` discovers handler to map the existing `x-infer-schema: true` annotation into a synthesized initial read schema, that composes the write & inferred schemas. Issue #1103
It's been deprecated a while now, and no specs in production use it any longer.
Completed another pass of E2E testing using the source-postgres-batch connector, and it worked as expected (creating all inferred columns from the data captured thus-far). |
bed8485
to
543c258
Compare
Introduce new soon-to-be-well-known URIs
flow://write-schema
andflow://inferred-schema
which are available to collection read schemas.These URIs may be
$ref
'd from within a read schema, and having doneso, their definitions are inlined into the effective read schema of the
built collection with every publication.
Extend the
validation::ControlPlane
trait to enable resolution ofcollections to their current inferred schemas. Implement within
flowctl
through our API, and withinagent
via out-of-transaction SQLlookups done using the agent's Postgres pool.
Update the
agent
discovers handler to map the existingx-infer-schema: true
annotation into a synthesized initial readschema, that composes the write & inferred schemas.
Workflow steps:
Use a connector that has the
x-infer-schema: true
annotation.Then, every time its resulting collection is published (through a capture or materialization change), it's effective read schema is updated to the most-recent inferred schema of the collection.
If the nature of data changes, then the inferred schema also updates and a downstream use case (such as a materialization) fails with a schema violation. However, by next re-publishing the materialization (really, the collection, but we don't have an easy way to do this in the UI right now) the materialization is automatically fixed, and new columns are first added to the bound table before the first usage of that column is materialized.
Documentation links affected:
It doesn't appear we've got current docs on our janky schema inference service (which will be removed)? So it's not clear we have existing docs to update, but will need new docs on this feature.
Notes for reviewers:
Needs to be rebased on the "rust connectors: phase one" branch.
This change is