source-postgres-batch: Add very minimal documentation

This should be improved soon, but since we auto-open the page in a sidebar in our dashboard there needs to be *something* there immediately, and this is the bare minimum IMO.
estuary · Aug 4, 2023 · 7a75356 · 7a75356
1 parent d9b7826
commit 7a75356
Showing 1 changed file with 27 additions and 0 deletions.
diff --git a/site/docs/reference/Connectors/capture-connectors/postgres-batch.md b/site/docs/reference/Connectors/capture-connectors/postgres-batch.md
@@ -0,0 +1,27 @@
+# PostgreSQL Batch Query Connector
+
+This connector captures data from Postgres into Flow collections by periodically
+executing queries and translating the results into JSON documents.
+
+We recommend using our [PostgreSQL CDC Connector](http://go.estuary.dev/source-postgres) instead
+if possible. Using CDC provides lower latency data capture, delete and update events, and usually
+has a smaller impact on the source database.
+
+However there are some circumstances where this might not be feasible. Perhaps you need
+to capture from a managed PostgreSQL instance which doesn't support logical replication.
+Or perhaps you need to capture the contents of a view or the result of an ad-hoc query.
+That's the sort of situation this connector is intended for.
+
+The number one caveat you need to be aware of when using this connector is that **it will
+periodically execute its update query over and over**. At the default polling interval of
+5 minutes, a naive `SELECT * FROM foo` query against a 100 MiB view will produce 30 GiB/day
+of ingested data, most of it duplicated.
+
+This is why the connector's autodiscovery logic only returns ordinary tables of data, because
+in that particular case we can use the `xmin` system column as a cursor and ask the database
+to `SELECT xmin, * FROM foo WHERE xmin::text::bigint > $1;`.
+
+If you start editing these queries or manually adding capture bindings for views or to run
+ad-hoc queries, you need to either have some way of restricting the query to "just the new
+rows since last time" or else have your polling interval set high enough that the data rate
+`<DatasetSize> / <PollingInterval>` is an amount of data you're willing to deal with.