end-to-end continuous schema inference #1178

jgraettinger · 2023-09-06T05:49:20Z

Introduce new soon-to-be-well-known URIs flow://write-schema and flow://inferred-schema which are available to collection read schemas.

These URIs may be $ref'd from within a read schema, and having done
so, their definitions are inlined into the effective read schema of the
built collection with every publication.

Extend the validation::ControlPlane trait to enable resolution of
collections to their current inferred schemas. Implement within
flowctl through our API, and within agent via out-of-transaction SQL
lookups done using the agent's Postgres pool.

Update the agent discovers handler to map the existing
x-infer-schema: true annotation into a synthesized initial read
schema that composes the write & inferred schemas.

Workflow steps:

Use a connector that has the x-infer-schema: true annotation.
Then, every time its resulting collection is published (through a capture or materialization change), its effective read schema is updated to the most-recent inferred schema of the collection.

If the nature of the data changes, the inferred schema also updates, and a downstream use case (such as a materialization) fails with a schema violation. However, re-publishing the materialization (really the collection, but we don't have an easy way to do this in the UI right now) fixes it automatically: new columns are added to the bound table before the first usage of each column is materialized.

Documentation links affected:

It doesn't appear we've got current docs on our janky schema inference service (which will be removed)? So it's not clear we have existing docs to update, but will need new docs on this feature.

Notes for reviewers:

Needs to be rebased on the "rust connectors: phase one" branch.

psFried

Overall, this is looking pretty good to me. I left a few minor questions and comments which would be good to resolve prior to merging.

crates/validation/src/lib.rs

psFried · 2023-09-12T16:47:19Z

crates/agent/src/publications/builds.rs

@@ -62,7 +62,7 @@ impl BuildOutput {
            .iter()
            .map(|e| Error {
                scope: Some(e.scope.to_string()),
-                detail: e.error.to_string(),
+                detail: format!("{:#}", e.error),


TIL the alternate format will include causes 👍

Added a comment too.
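
For readers unfamiliar with the `{:#}` behavior discussed above: assuming `e.error` is an `anyhow::Error` (the diff does not show its type), the alternate format prints the outer message followed by each cause, joined by `: `. The std-only sketch below imitates that mechanism with a hypothetical `Wrapped` type and `example` helper; it is an illustration, not the code from this PR.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type carrying a message and an optional cause.
#[derive(Debug)]
struct Wrapped {
    msg: String,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.msg)?;
        // `{:#}` sets the alternate flag; only then append the cause chain.
        if f.alternate() {
            let mut cause = self.source.as_deref();
            while let Some(err) = cause {
                write!(f, ": {}", err)?;
                cause = err.source();
            }
        }
        Ok(())
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Builds a two-level error chain with made-up messages.
fn example() -> Wrapped {
    let root = Wrapped {
        msg: "connection refused".to_string(),
        source: None,
    };
    Wrapped {
        msg: "failed to fetch inferred schema".to_string(),
        source: Some(Box::new(root)),
    }
}
```

With this sketch, `format!("{}", example())` yields only the outer message, while `format!("{:#}", example())` also includes the cause, which is the difference the diff relies on for the `detail` field.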

psFried · 2023-09-12T17:09:19Z

crates/validation/src/collection.rs

+    // Add a definition for an inferred schema if it's provided.
+    // Note that we previously filtered the set of retrieved schemas to those
+    // having a read schema matching super::REF_INFERRED_SCHEMA_PATTERN.
+    if let Some(inferred_bundle) = inferred_bundle {


What happens if there is no inferred schema yet because this is a new collection? For example, what would happen if I published:

    collections:
      acmeCo/some/new/collection:
        writeSchema:
          properties:
            id: { type: string }
          required: [id]
        readSchema:
          allOf:
            - $ref: flow://write-schema
            - $ref: flow://inferred-schema
        key: [/id]

It seems like it'd fail to resolve the inferred schema ref. I'm wondering if we should use some default placeholder (true?) if there isn't an inferred schema yet?

The intent is definitely to use a placeholder {} (equivalent to true). I definitely had it wired up that way, and confirmed it works as intended in end-to-end testing, but now I'm doubting myself and wondering if I screwed up a refactor. I'll double-check.

ETA: Yeeeep, I goofed this in a refactor. Fixing and adding more test cases...
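
The placeholder behavior described in this thread could be sketched roughly as follows. Both function names and the shape of the definitions object are hypothetical and chosen only for illustration; the actual fix lives in `crates/validation` and may be structured quite differently.

```rust
/// Returns the JSON text to inline for `flow://inferred-schema`.
/// When no schema has been inferred yet (for example, a brand-new
/// collection), fall back to `{}`, which accepts every document.
fn inferred_or_placeholder(inferred: Option<&str>) -> &str {
    inferred.unwrap_or("{}")
}

/// Hypothetical helper: builds a JSON object mapping each well-known URI
/// to the schema text that would be inlined for it. Inputs are assumed to
/// be valid, already-serialized JSON schemas.
fn build_definitions(write_schema: &str, inferred: Option<&str>) -> String {
    format!(
        "{{\"flow://write-schema\":{},\"flow://inferred-schema\":{}}}",
        write_schema,
        inferred_or_placeholder(inferred),
    )
}
```

The point of the fallback is that a read schema such as `allOf: [write-schema, inferred-schema]` stays resolvable on the first publication and simply reduces to the write schema until an inferred schema exists.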

psFried · 2023-09-12T17:31:13Z

crates/validation/src/lib.rs

+//   string would be quote-escaped.
+// * It must be a schema keyword ($ref cannot be, say, a property) because
+//   "flow://inferred-schema" is not a valid JSON schema and would error at build time.
+const REF_INFERRED_SCHEMA_PATTERN: &str = "\"$ref\":\"flow://inferred-schema\"";


This is sensitive to whitespace around the :, and it took me a while to prove to myself that there aren't any code paths that could result in a "$ref" : "flow://inferred-schema". I think this works because Loader::load_resource_content will always re-serialize the schema without whitespace. Assuming that's correct, I think it at least warrants a comment.

I'll suggest going just a little further and factoring this out into fn uses_*_schema_ref(schema: &str) -> bool instead of using schema.contains(REF_*), which can carry a comment about schema serialization.

You're right, it felt brittle. I've moved this into members of models::Schema, using whitespace-invariant regexes under the hood.
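
As an illustration of the whitespace concern raised in this thread, here is a minimal std-only sketch of a whitespace-tolerant check. The function name `references_uri` is hypothetical, and it uses hand-rolled scanning rather than the regexes mentioned above, so it should not be read as the `models::Schema` implementation.

```rust
/// Reports whether `schema` (serialized JSON text) contains a `"$ref"`
/// keyword whose value is exactly `uri`, allowing any whitespace around
/// the colon. Note that `trim_start` skips all Unicode whitespace, a
/// superset of what JSON permits.
fn references_uri(schema: &str, uri: &str) -> bool {
    let key = "\"$ref\"";
    let value = format!("\"{}\"", uri);
    let mut rest = schema;
    while let Some(pos) = rest.find(key) {
        let after_key = &rest[pos + key.len()..];
        // Accept `"$ref":`, `"$ref" :`, `"$ref"\n:` and so on.
        if let Some(after_colon) = after_key.trim_start().strip_prefix(':') {
            if after_colon.trim_start().starts_with(value.as_str()) {
                return true;
            }
        }
        // Keep scanning past this occurrence of the key.
        rest = after_key;
    }
    false
}
```

Because the search key includes the unescaped quotes, a string value that merely contains the text `\"$ref\":\"flow://inferred-schema\"` does not match, which mirrors the quote-escaping argument in the code comment under review.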

crates/validation/src/lib.rs

jgraettinger · 2023-09-12T20:03:08Z

I'm concurrently doing another round of end-to-end testing, but I think this is right. PTAL

github-actions · 2023-09-12T20:04:04Z

PR Preview Action v1.4.4
🚀 Deployed preview to https://estuary.github.io/flow/pr-preview/pr-1178/
on branch `gh-pages` at 2023-09-12 20:49 UTC

Introduce well-known URIs `flow://write-schema` and `flow://inferred-schema` which are available to collection read schemas. These URIs may be `$ref`'d from within a read schema, and having done so, their definitions are inlined into the effective read schema of the built collection with every publication. Extend the `validation::ControlPlane` trait to enable resolution of collections to their current inferred schemas. Implement within `flowctl` through our API, and within `agent` via out-of-transaction SQL lookups done using the agent's Postgres pool. Update the `agent` discovers handler to map the existing `x-infer-schema: true` annotation into a synthesized initial read schema that composes the write & inferred schemas. Issue #1103

jgraettinger · 2023-09-12T20:43:19Z

Completed another pass of E2E testing using the source-postgres-batch connector, and it worked as expected (creating all inferred columns from the data captured thus far).

jgraettinger force-pushed the johnny/infra-val branch 3 times, most recently from 1798ce0 to 76722bf Compare September 6, 2023 17:08

jgraettinger marked this pull request as ready for review September 6, 2023 17:12

jgraettinger added 2 commits September 12, 2023 04:00

rustfmt: remove because group_imports is not stable

efb3761

validation & flowctl: directly fetch built specs from the control-plane

2e36b11

Remove the migration shim which dynamically builds collections from their user-level specs. Update flowctl to directly retrieve built specs rather than the user-level spec.

jgraettinger force-pushed the johnny/infra-val branch from 76722bf to 28b021e Compare September 12, 2023 04:01

jgraettinger requested a review from psFried September 12, 2023 04:10

travjenkins mentioned this pull request Sep 12, 2023

Implement schema inference in the UI estuary/ui#740

Closed

psFried requested changes Sep 12, 2023

psFried approved these changes Sep 12, 2023

jgraettinger force-pushed the johnny/infra-val branch from 28b021e to bed8485 Compare September 12, 2023 20:01

jgraettinger added 2 commits September 12, 2023 20:21

cleanup: remove deprecated derivation property of collection spec

543c258

It's been deprecated a while now, and no specs in production use it any longer.

jgraettinger force-pushed the johnny/infra-val branch from bed8485 to 543c258 Compare September 12, 2023 20:46

jgraettinger merged commit aa94b7b into master Sep 12, 2023
3 checks passed

jgraettinger deleted the johnny/infra-val branch September 12, 2023 20:46
