Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/UUID timestamp extraction #1070

Closed
wants to merge 7 commits into from

Conversation

jshearer
Copy link
Contributor

@jshearer jshearer commented Jun 5, 2023

Description:

Customers were asking for the ability to include a timestamp column in their materializations that represents the point at which Flow saw that piece of data, effectively a "last modified" or "valid as of" field. We actually always had this data encoded as part of each document's UUID, and this PR allows you to get access to it via a "virtual projection" at the location /_meta/uuid/timestamp.

Workflow steps:

In order to materialize this timestamp, you have to:

  1. Create a projection referencing /_meta/uuid/timestamp
  2. Include that projection in your materialization's field selection

Unfortunately the UI does not support projections yet, so there is still some CLI work involved. It does technically support field selection by way of the "Advanced Specification Editor".

Given that, the workflow to materialize the meta uuid timestamp is:

  1. Pull down an existing capture: flowctl catalog pull-specs
  2. Add the projection:
    projections:
      meta_uuid_timestamp: /_meta/uuid/timestamp
  3. Publish your change: flowctl catalog publish
  4. Create a materialization for that collection
  5. Include your projection in the field selection inside the advanced spec editor:
    "bindings": [
        {
          "source": "my/collection/name",
          "resource": {
            "foo": "bar"
          },
          "fields": {
            "recommended": true,
            "include": {
              "meta_uuid_timestamp": {}
            }
          }
        }
    ],

If everything works, you should that field materialized! Example of materializing a hello world capture to postgres with the meta timestamp projection:

Screen Shot 2023-06-05 at 11 51 16

This change is Reviewable

@jshearer
Copy link
Contributor Author

jshearer commented Jun 5, 2023

@jgraettinger see here for my explanation of uuid_ptr and inject_uuid_placeholder:

// NOTE about UUID pointers:
// As a practical matter, materializations don’t ask the combiner to set a UUID
// only captures and derivations do. Since `inject_uuid_placeholder` implies that the
// UUID does not actually exist in the document yet, only when it's false (and `uuid_ptr` is Some)
// can we reasonably expect to resolve the virtual field <uuid_ptr>/timestamp,
// otherwise we'll just return null for that field.

@jgraettinger
Copy link
Member

FYI you can update Go snapshots by running UPDATE_SNAPSHOTS=1 make go-test-fast

@@ -1,3 +1,5 @@
use std::iter;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid extra space here. Have a single block of use statements, so that they're automatically sorted by the formatter.

@@ -11,6 +11,7 @@ license.workspace = true
[dependencies]
doc = { path = "../doc" }
json = { path = "../json" }
derive = { path = "../derive" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not thrilled with adding new dependency on derive. That crate is slatted to be taken out back to the woodshed...

..
}: &Param,
document: &N,
) -> rusqlite::Result<()> {
use derive::PointerExt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't love the introduced dependence on derive ... that crate is slated to be taken out behind the woodshed.

@@ -7,6 +7,8 @@ use proto_flow::flow;
pub struct Param {
// Projection which is being related as a SQLite parameter.
pub projection: flow::Projection,
// Collection which contains the projection that is being related as a SQLite parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the reason for Param is to provide separation from the details of CollectionSpec representation and the way we actually use parameters for SQLite.

I'm thinking we need to introduce a is_uuid_timestamp boolean on flow::Projection.
Then, there would be projections like field: _meta/uuid/timestamp (or something custom) ptr: /_meta/uuid is_uuid_timestamp: true

That allows resolution of all this stuff from only the flow::Projection instance, and we don't need to put logic for UUID_PTR+'/timestamp' anywhere except the validation logic that makes a flow::Projection.

For this crate specifically, derive::PointerExt is backed out and instead, within the existing :

 Some(Node::String(s)) => {
           if *is_format_integer {

block, you also test for is_uuid_timestamp and extract it accordingly.

return None
};

match v_uuid.as_node() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I'm thinking the bulk of this should live in flow-proto -- perhaps an impl of flow::UuidParts::from_uuid(uuid: uuid::Uuid) -> Self or even impl From<uuid::Uuid> for flow::UuidParts etc would definitely make sense there.

@@ -93,13 +104,15 @@ impl cgo::Service for API {
schema_json,
key_ptrs,
field_ptrs,
uuid_placeholder_ptr,
uuid_ptr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extrapolating from my comment above that is_uuid_timestamp belongs in flow::Projection, that implies that we need to pass flow::Projection instances in for key_ptrs and field_ptrs instead of raw JSON pointers.

That's a shift, but I do think that's probably the right call. if there are other circumstances that arise where we want to condition extraction behavior then standardizing on the passed-around flow::Projection greatly simplifies things.

come to think of it, default values are another good example! drain_chunk uses it's own parsed Shape below, but the flow::Projection also bakes in the default value. And there was another circumstance this came up, i think the extract API, where the source schema shape couldn't easily be plumbed through...

@jshearer
Copy link
Contributor Author

Superseded by #1098

@jshearer jshearer closed this Jul 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants