Using Airbyte for many connections on an incremental basis #312
stevenmurphy12 started this conversation in General
Hi folks, I'm in the process of investigating PyAirbyte to create several hundred very similar pipelines.
We currently use Airbyte in conjunction with Dagster, with around 50 jobs at the moment. The process of creating those Airbyte connections manually and wiring them into Dagster jobs works, but I expect it will become overly cumbersome when we onboard several hundred similar pipelines.
The diagram below describes how I envisage we use PyAirbyte: calling `read` on a pre-existing custom connector we've built into a Docker image, parameterised.

Having tested this on one PyAirbyte connection (up to the dataframe transform at least), it appears to work fairly well. What I'd like to determine is how I can maintain state across separate invocations of the pipeline, so that the next time the external API is called it's called with the appropriate `last_modified_from` parameter.

With the code below I've used the default DuckDB cache, which I assume is ephemeral to a run, so the state wouldn't be reused the next time the connector is called.
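Roughly, the read looks like the sketch below. The connector name, Docker image, config values, and stream name are placeholders, and I'm assuming `get_source` accepts a `docker_image` argument pointing at our image:

```python
import airbyte as ab

# Placeholder connector name, image, and config -- our real connector and
# credentials differ; this only illustrates the shape of the call.
source = ab.get_source(
    "source-custom-api",
    docker_image="our-registry/source-custom-api:latest",
    config={
        "api_key": "...",
        # Ideally this would come from the state saved by the previous run.
        "last_modified_from": "2024-01-01T00:00:00Z",
    },
)
source.check()
source.select_all_streams()

# No cache passed, so this uses the default DuckDB cache, which I assume is
# ephemeral to this run rather than persisted for the next invocation.
result = source.read()

# Downstream we transform the records as dataframes ("items" is a placeholder
# stream name).
df = result["items"].to_pandas()
```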
I've tried using the `BigQueryCache` (this was my first port of call, as BigQuery happened to be my destination). This appears to work OK for one connection (per dataset at least), but then the contents of `_airbyte_state` get overridden by other PyAirbyte syncs.

I'm wondering whether I'm looking at this the wrong way, and whether I'm actually responsible for maintaining the state between runs myself (in Dagster or some other mechanism, in lieu of using the Airbyte Platform).
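For context, one thing I've considered (but not verified) is giving each connection its own cache so the state rows don't collide, e.g. a dedicated BigQuery dataset per connection. A sketch of what I mean; the parameter names are my assumptions about `BigQueryCache`, and the project, paths, and connector name are placeholders:

```python
import airbyte as ab
from airbyte.caches import BigQueryCache


def cache_for(connection_name: str) -> BigQueryCache:
    """Hypothetical helper: one cache (and BigQuery dataset) per connection,
    so each pipeline's incremental state is stored separately rather than
    sharing a single _airbyte_state table."""
    return BigQueryCache(
        project_name="our-gcp-project",                        # placeholder
        dataset_name=f"pyairbyte_{connection_name}",           # one dataset per connection
        credentials_path="/secrets/bq-service-account.json",   # placeholder
    )


source = ab.get_source(
    "source-custom-api",                                       # placeholder connector
    docker_image="our-registry/source-custom-api:latest",
    config={"api_key": "..."},
)
source.select_all_streams()

# Reusing the same cache on the next run should (if I understand it correctly)
# let PyAirbyte resume from the persisted state instead of re-reading everything.
source.read(cache=cache_for("customer_a"))
```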
Thank you!