
💡 Feature request: Postgres source #82

Open
WangCHEN9 opened this issue Feb 29, 2024 · 5 comments

Comments

@WangCHEN9

It would be really nice if we could support more database source connectors :)

@aaronsteers
Contributor

aaronsteers commented Feb 29, 2024

@WangCHEN9 - Thanks for logging this. We're interested in learning more about your use case. Specifically:

  1. Do you want to replicate data from Postgres to another cache/destination, like Snowflake or a different Postgres DB? Or do you just want to get that data locally so it is available to your Python code, in pandas/AI/etc.?
  2. For your use case, do you want to take advantage of built-in Postgres-native CDC features, such as auto-detecting new records from the WAL log (described here)? The alternative would be column-based incremental sync, for instance using an updated_at column or similar to detect new records (see the sketch after this list).
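
For illustration only, a minimal sketch of the column-based incremental pattern; the table, columns, connection string, and cursor handling below are placeholders, not an actual PyAirbyte API:

```python
import psycopg2  # assumed Postgres client; connection details are placeholders

# Column-based incremental sync: remember the highest updated_at value seen so
# far and, on the next run, fetch only the rows that are newer than that cursor.
last_cursor = "2024-02-28 00:00:00"  # would normally be persisted between runs

conn = psycopg2.connect("dbname=appdb host=localhost user=app")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, payload, updated_at FROM public.orders "
        "WHERE updated_at > %s ORDER BY updated_at",
        (last_cursor,),
    )
    rows = cur.fetchall()
    if rows:
        last_cursor = rows[-1][2]  # highest updated_at; cursor for the next run
```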

@aaronsteers changed the title from "Support postgres source" to "Feature request: Postgres source" on Feb 29, 2024
@WangCHEN9
Author

Hi @aaronsteers,

I have two main use cases in mind:

  • A seamless local workflow using PyAirbyte/DBT/DuckDB for quick POCs. This would put ELT power in the hands of data analysts (they may not be as strong at Python as data scientists, but strong enough to use PyAirbyte/DBT).
  • Using PyAirbyte as a lightweight EL tool. (Here we would probably replicate to files in S3 first, before ingesting them into Snowflake, so that we can switch to Airbyte OSS/Enterprise later on.)

For your questions:

  1. Yes, I am interested in replicating data from Postgres to S3 (with the help of the DuckDB COPY function); see the sketch after this list.
  2. I would prefer to use an updated_at column for incrementally loading new records (it is easier for ingestion later on when you want to load it as a file).
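
For illustration, roughly the kind of export I have in mind, using plain DuckDB rather than any existing PyAirbyte API; the connection string, table, bucket, and credentials are placeholders:

```python
import duckdb

con = duckdb.connect()

# httpfs enables s3:// paths; the postgres extension reads directly from Postgres.
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("SET s3_region='eu-west-1';")
con.execute("SET s3_access_key_id='...'; SET s3_secret_access_key='...';")

# Attach the source database read-only (connection string is a placeholder).
con.execute(
    "ATTACH 'dbname=appdb host=localhost user=app' AS pg (TYPE POSTGRES, READ_ONLY);"
)

# Copy only the rows newer than the last cursor straight to Parquet on S3.
last_cursor = "2024-02-28 00:00:00"
con.execute(
    f"""
    COPY (
        SELECT * FROM pg.public.orders
        WHERE updated_at > TIMESTAMP '{last_cursor}'
    ) TO 's3://my-bucket/raw/orders/orders_2024-02-29.parquet' (FORMAT PARQUET);
    """
)
```

From there, Snowflake can ingest the Parquet files directly, which is what should make a later switch to Airbyte OSS/Enterprise easier.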

Thanks,
Wang

@aaronsteers changed the title from "Feature request: Postgres source" to "💡 Feature request: Postgres source" on Mar 1, 2024
@aaronsteers
Contributor

aaronsteers commented Mar 1, 2024

@WangCHEN9 - Thanks very much for this explanation.

I've logged a couple of different paths forward. Unfortunately, none of these approaches is trivial...

The most direct/obvious solution would be #87, but there are some technical barriers to us implementing it. There's another path forward in #85, which might be smoother for your use case. This 'cache-to-cache' implementation also has its own challenges, but those are more about us designing a good developer experience and less about actual technical hurdles.

In #87 I noted a workaround, which would be to pre-install the Java connector. Would love your thoughts and upvotes on any of those approaches. Thanks! 🙏

@WangCHEN9
Author

Hi @aaronsteers,

I will definitely upvote #85, because it will unlock many more use cases, especially with the power of DuckDB.

As for #87, personally I don't like it. Asking users to install Java or Docker is too much work; we would kind of lose the advantage of PyAirbyte.

Wang

@aaronsteers
Contributor

aaronsteers commented Mar 4, 2024

@WangCHEN9 - This feedback is very helpful. Thank you!

Will keep you posted.
