Merge remote-tracking branch 'origin/master' into feature/jsonSchemaMerge
jshearer committed Aug 7, 2023
2 parents 32399a6 + f63cf2c commit 79fd07b
Showing 4 changed files with 33 additions and 6 deletions.
3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -33,5 +33,6 @@
   "deno.config": "./deno.jsonc",
   "deno.importMap": "./supabase/functions/import-map.json",
   "deno.lint": true,
-  "deno.path": ".build/package/bin/deno"
+  "deno.path": ".build/package/bin/deno",
+  "rust-analyzer.imports.group.enable": false
 }
8 changes: 3 additions & 5 deletions crates/doc/src/schema.rs
@@ -1,16 +1,14 @@
-use std::collections::{BTreeMap, BTreeSet};
-
+use crate::inference::Shape;
 use json::schema::{
-    self, keywords,
+    keywords,
     types::{self, Set},
 };
 use schemars::{
     gen::SchemaGenerator,
     schema::{InstanceType, RootSchema, Schema, SchemaObject, SingleOrVec},
 };
 use serde_json::json;
-
-use crate::inference::Shape;
+use std::collections::{BTreeMap, BTreeSet};

 #[derive(Debug, Default)]
 pub struct SchemaBuilder {
1 change: 1 addition & 0 deletions rustfmt.toml
@@ -0,0 +1 @@
group_imports = "One"
@@ -0,0 +1,27 @@
# PostgreSQL Batch Query Connector

This connector captures data from Postgres into Flow collections by periodically
executing queries and translating the results into JSON documents.

We recommend using our [PostgreSQL CDC Connector](http://go.estuary.dev/source-postgres) instead
if possible. CDC provides lower-latency data capture, captures delete and update events, and
usually has a smaller impact on the source database.

However, there are some circumstances where CDC might not be feasible. Perhaps you need
to capture from a managed PostgreSQL instance which doesn't support logical replication,
or perhaps you need to capture the contents of a view or the result of an ad-hoc query.
Those are the situations this connector is intended for.

The most important caveat when using this connector is that **it will periodically
execute its update query over and over**. At the default polling interval of 5 minutes,
a naive `SELECT * FROM foo` query against a 100 MiB view runs 288 times a day and will
produce roughly 28 GiB/day of ingested data, most of it duplicated.
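
The daily figure above is simple arithmetic: a 5 minute interval means 288 polls per day,
and 288 polls of 100 MiB each come to 28,800 MiB, a little over 28 GiB. The sketch below
reproduces that calculation. The function name `daily_ingest_bytes` and the constant `MIB`
are hypothetical helpers for illustration, not part of the connector.

```rust
/// One mebibyte in bytes (hypothetical helper constant).
const MIB: u64 = 1024 * 1024;

/// Bytes ingested per day when a query whose full result is `result_bytes`
/// is re-executed every `poll_interval_secs` seconds.
/// Assumes every poll returns the whole result set and that
/// `poll_interval_secs` is nonzero (a zero interval would divide by zero).
fn daily_ingest_bytes(result_bytes: u64, poll_interval_secs: u64) -> u64 {
    const SECONDS_PER_DAY: u64 = 24 * 60 * 60;
    // Number of complete polls that fit into one day.
    let polls_per_day = SECONDS_PER_DAY / poll_interval_secs;
    result_bytes * polls_per_day
}
```

Calling `daily_ingest_bytes(100 * MIB, 300)` gives `28_800 * MIB` bytes, which matches the
estimate for a 100 MiB view polled every 5 minutes.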

This is why the connector's autodiscovery logic only returns ordinary tables: for those,
we can use the `xmin` system column as a cursor and ask the database to
`SELECT xmin, * FROM foo WHERE xmin::text::bigint > $1;`, so each poll returns only rows
written since the previous one.
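
To illustrate the cursor idea, here is a minimal in-memory sketch. It is not the connector's
implementation: `Row`, `poll_since`, and the `u64` cursor are hypothetical names chosen for
illustration, and a slice stands in for the table `foo`. The filter mirrors the
`WHERE xmin::text::bigint > $1` clause, with the cursor playing the role of `$1`.

```rust
/// A row in a hypothetical in-memory stand-in for table `foo`.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    /// Stand-in for the `xmin` system column of this row version.
    xmin: u64,
    payload: String,
}

/// Mimics `SELECT xmin, * FROM foo WHERE xmin::text::bigint > $1`.
/// Returns the rows whose `xmin` is greater than `cursor`, together with
/// the cursor value to pass to the next poll.
fn poll_since(table: &[Row], cursor: u64) -> (Vec<Row>, u64) {
    let new_rows: Vec<Row> = table
        .iter()
        .filter(|row| row.xmin > cursor)
        .cloned()
        .collect();
    // Advance the cursor to the largest xmin seen; keep it unchanged
    // when the poll returned nothing.
    let next_cursor = new_rows.iter().map(|row| row.xmin).max().unwrap_or(cursor);
    (new_rows, next_cursor)
}
```

With this shape, a poll over an unchanged table returns no rows, so repeated polling does
not re-ingest data that was already captured.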

If you edit these queries, or manually add capture bindings for views or ad-hoc queries,
you need either a way to restrict the query to "just the new rows since last time" or a
polling interval long enough that the data rate `<DatasetSize> / <PollingInterval>` is an
amount of data you're willing to deal with.
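
One way to pick a polling interval is to invert the data-rate formula: decide how much data
per day is acceptable, then solve for the interval. The sketch below does this under the
assumption that every poll re-reads the whole dataset. `min_poll_interval_secs` and `MIB`
are hypothetical names for illustration, not connector settings.

```rust
/// One mebibyte in bytes (hypothetical helper constant).
const MIB: u64 = 1024 * 1024;

/// Smallest polling interval, in whole seconds, that keeps full re-reads of
/// `dataset_bytes` within `daily_budget_bytes` per day.
/// Returns `None` if the budget is zero or the intermediate product overflows.
fn min_poll_interval_secs(dataset_bytes: u64, daily_budget_bytes: u64) -> Option<u64> {
    const SECONDS_PER_DAY: u64 = 24 * 60 * 60;
    if daily_budget_bytes == 0 {
        return None;
    }
    // dataset / interval * 86400 <= budget  =>  interval >= dataset * 86400 / budget
    let numerator = dataset_bytes.checked_mul(SECONDS_PER_DAY)?;
    // Round up so the daily budget is never exceeded.
    let remainder_exists = numerator % daily_budget_bytes != 0;
    Some(numerator / daily_budget_bytes + u64::from(remainder_exists))
}
```

For example, keeping a 100 MiB dataset under 1 GiB/day gives
`min_poll_interval_secs(100 * MIB, 1024 * MIB) == Some(8438)`, or a little over 2 hours
and 20 minutes between polls.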
