Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

end-to-end continuous schema inference #1178

Merged
merged 4 commits into from
Sep 12, 2023
Merged

Conversation

jgraettinger
Copy link
Member

@jgraettinger jgraettinger commented Sep 6, 2023

Introduce new soon-to-be-well-known URIs flow://write-schema and flow://inferred-schema which are available to collection read schemas.

These URIs may be $ref'd from within a read schema, and having done
so, their definitions are inlined into the effective read schema of the
built collection with every publication.

Extend the validation::ControlPlane trait to enable resolution of
collections to their current inferred schemas. Implement within
flowctl through our API, and within agent via out-of-transaction SQL
lookups done using the agent's Postgres pool.

Update the agent discovers handler to map the existing
x-infer-schema: true annotation into a synthesized initial read
schema, that composes the write & inferred schemas.

Workflow steps:

Use a connector that has the x-infer-schema: true annotation.
Then, every time its resulting collection is published (through a capture or materialization change), it's effective read schema is updated to the most-recent inferred schema of the collection.

If the nature of data changes, then the inferred schema also updates and a downstream use case (such as a materialization) fails with a schema violation. However, by next re-publishing the materialization (really, the collection, but we don't have an easy way to do this in the UI right now) the materialization is automatically fixed, and new columns are first added to the bound table before the first usage of that column is materialized.

Documentation links affected:

It doesn't appear we've got current docs on our janky schema inference service (which will be removed)? So it's not clear we have existing docs to update, but will need new docs on this feature.

Notes for reviewers:

Needs to be rebased on the "rust connectors: phase one" branch.


This change is Reviewable

@jgraettinger jgraettinger force-pushed the johnny/infra-val branch 3 times, most recently from 1798ce0 to 76722bf Compare September 6, 2023 17:08
@jgraettinger jgraettinger marked this pull request as ready for review September 6, 2023 17:12
Remove the migration shim which dynamically builds collections from
their user-level specs.

Update flowctl to directly retrieve built specs rather than the
user-level spec.
Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this is looking pretty good to me. I left a few minor questions and comments which would be good to resolve prior to merging.

crates/validation/src/lib.rs Outdated Show resolved Hide resolved
@@ -62,7 +62,7 @@ impl BuildOutput {
.iter()
.map(|e| Error {
scope: Some(e.scope.to_string()),
detail: e.error.to_string(),
detail: format!("{:#}", e.error),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL the alternate format will include causes 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment too.

// Add a definition for an inferred schema if it's provided.
// Note that we previously filtered the set of retrieved schemas to those
// having a read schema matching super::REF_INFERRED_SCHEMA_PATTERN.
if let Some(inferred_bundle) = inferred_bundle {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if there is no inferred schema yet because this is a new collection? For example, what would happen if I published:

collections:
  acmeCo/some/new/collection:
    writeSchema:
      properties:
        id: { type: string}
      required: [id]
    readSchema:
      allOf:
        - $ref: flow://write-schema
        - $ref: flow://inferred-schema
    key: [/id]

It seems like it'd fail to resolve the inferred schema ref. I'm wondering if we should use some default placeholder (true?) if there isn't an inferred schema yet?

Copy link
Member Author

@jgraettinger jgraettinger Sep 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is definitely to use a placeholder {} (equivalent to true). I definitely had it wired up that way, and confirmed it works as intended in end-to-end testing, but now I'm doubting myself and wondering if i screwed up a refactor. I'll double-check.

ETA: Yeeeep, I goofed this in a refactor. Fixing and adding more test cases...

// string would be quote-escaped.
// * It must be a schema keyword ($ref cannot be, say, a property) because
// "flow://inferred-schema" is not a valid JSON schema and would error at build time.
const REF_INFERRED_SCHEMA_PATTERN: &str = "\"$ref\":\"flow://inferred-schema\"";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sensitive to whitespace around the :, and it took me a while to prove to myself that there aren't any code paths that could result in a "$ref" : "flow://inferred-schema". I think this works because Loader::load_resource_content will always re-serialize the schema without whitespace. Assuming that's correct, I think it at least warrants a comment.

I'll suggest going just a little further and factoring this out into fn uses_*_schema_ref(schema: &str) -> bool instead of using schema.contains(REF_*), which can carry a comment about schema serialization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it felt brittle. I've moved this into members of models::Schema, using whitespace-invarant regex's under the hood.

crates/validation/src/lib.rs Show resolved Hide resolved
@jgraettinger
Copy link
Member Author

I'm concurrently doing another round of end-to-end testing, but I think this is right. PTAL

@github-actions
Copy link

github-actions bot commented Sep 12, 2023

PR Preview Action v1.4.4
🚀 Deployed preview to https://estuary.github.io/flow/pr-preview/pr-1178/
on branch gh-pages at 2023-09-12 20:49 UTC

Introduce well-known URIs `flow://write-schema` and
`flow://inferred-schema` which are available to collection read schemas.

These URIs may be `$ref`'d from within a read schema, and having done
so, their definitions are inlined into the effective read schema of the
built collection with every publication.

Extend the `validation::ControlPlane` trait to enable resolution of
collections to their current inferred schemas. Implement within
`flowctl` through our API, and within `agent` via out-of-transaction SQL
lookups done using the agent's Postgres pool.

Update the `agent` discovers handler to map the existing
`x-infer-schema: true` annotation into a synthesized initial read
schema, that composes the write & inferred schemas.

Issue #1103
It's been deprecated a while now, and no specs in production use it any
longer.
@jgraettinger
Copy link
Member Author

Completed another pass of E2E testing using the source-postgres-batch connector, and it worked as expected (creating all inferred columns from the data captured thus-far).

@jgraettinger jgraettinger merged commit aa94b7b into master Sep 12, 2023
3 checks passed
@jgraettinger jgraettinger deleted the johnny/infra-val branch September 12, 2023 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants