Using Airbyte for many connections on an incremental basis #312
stevenmurphy12 started this conversation in General
Hi folks, I'm in the process of investigating PyAirbyte to create several hundred very similar pipelines.
We currently use Airbyte in conjunction with Dagster, with around 50 jobs at the moment. The process of creating those Airbyte connections manually and wiring them into Dagster jobs works, but I expect it will become overly cumbersome when we onboard several hundred similar pipelines.
The diagram below describes how I envisage we use PyAirbyte: calling `read` on a pre-existing custom connector we've built into a Docker image, parameterised.

Having tested this on one PyAirbyte connection (up to the dataframe transform at least), it appears to work fairly well. What I'd like to determine is how I can maintain state across separate invocations of the pipeline, so that the next time the external API is called it's called with the appropriate `last_modified_from` parameter.

With the code below I've used the default DuckDB cache, which I assume is ephemeral to a run, so the state wouldn't be reused the next time the connector is called.
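Roughly, the read looks like the sketch below. The connector name, Docker image, config values, and stream name are placeholders, and I'm assuming `get_source` accepts a `docker_image` argument pointing at our image:

```python
import airbyte as ab

# Placeholder connector name, image, and config -- our real connector and
# credentials differ; this only illustrates the shape of the call.
source = ab.get_source(
    "source-custom-api",
    docker_image="our-registry/source-custom-api:latest",
    config={
        "api_key": "...",
        # Ideally this would come from the state saved by the previous run.
        "last_modified_from": "2024-01-01T00:00:00Z",
    },
)
source.check()
source.select_all_streams()

# No cache passed, so this uses the default DuckDB cache, which I assume is
# ephemeral to this run rather than persisted for the next invocation.
result = source.read()

# Downstream we transform the records as dataframes ("items" is a placeholder
# stream name).
df = result["items"].to_pandas()
```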
I've tried using the `BigQueryCache` (this was my first port of call, as BigQuery happened to be my destination). This appears to work OK for one connection (per dataset at least), but then the contents of `_airbyte_state` get overridden by other PyAirbyte syncs.

I'm wondering whether I'm looking at this the wrong way, and whether I'm actually responsible for maintaining the state between runs myself (in Dagster or some other mechanism, in lieu of using the Airbyte Platform).
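For context, one thing I've considered (but not verified) is giving each connection its own cache so the state rows don't collide, e.g. a dedicated BigQuery dataset per connection. A sketch of what I mean; the parameter names are my assumptions about `BigQueryCache`, and the project, paths, and connector name are placeholders:

```python
import airbyte as ab
from airbyte.caches import BigQueryCache


def cache_for(connection_name: str) -> BigQueryCache:
    """Hypothetical helper: one cache (and BigQuery dataset) per connection,
    so each pipeline's incremental state is stored separately rather than
    sharing a single _airbyte_state table."""
    return BigQueryCache(
        project_name="our-gcp-project",                        # placeholder
        dataset_name=f"pyairbyte_{connection_name}",           # one dataset per connection
        credentials_path="/secrets/bq-service-account.json",   # placeholder
    )


source = ab.get_source(
    "source-custom-api",                                       # placeholder connector
    docker_image="our-registry/source-custom-api:latest",
    config={"api_key": "..."},
)
source.select_all_streams()

# Reusing the same cache on the next run should (if I understand it correctly)
# let PyAirbyte resume from the persisted state instead of re-reading everything.
source.read(cache=cache_for("customer_a"))
```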
Thank you!