Support pyarrow #1058
Comments
In my own testing I've seen nearly an order of magnitude performance improvement when reading large datasets from SQL Server / Oracle into pyarrow.
Hello, thank you for the use-cases input. I guess let's start from the pandas dataframes. I'm no pandas expert: what do people want to store, a binary blob, or does the data frame become a table? If you can provide some docs links that would be helpful. In general, adding specific adapters or cursor types should be supported (as it is for psycopg2 already). I think the biggest win here would be to pass binary data to/from the db, but I'd first like an idea of what data.
The DataFrame (pyarrow.Table) would be inserted as rows into a Postgres table (with a compatible schema), the benefit being improved serialisation performance. For me, a huge drawcard is the performance improvement when reading data from a database into an Arrow table. Some more background on the performance improvements:
Documentation on the Arrow columnar format and its C data interface:
I have working code that uses this approach.
That's very exciting news @xhochy! If @dvarrazzo does decide to support pyarrow.
Would love to see this! I did some experimentation over a year ago (https://github.com/mangecoeur/pgarrow - no guarantee that anything still works today, plus I'm rubbish at C/Cython) and there's at least a 3x performance improvement to be had. I think there is value in building this in. There is also https://github.com/heterodb/pg2arrow (in C), which converts between Postgres and Arrow.
FYI: the Arrow C interface is a perfect match for a libpq bridge that doesn't require taking on any C++ library dependencies: http://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/. Some other database engines are starting to look at this as a way to pass simple datasets to pyarrow at C call sites.
@xhochy did you end up open sourcing that code? |
Hi @xhochy, I would be interested in your code as well. Any news on that? |
Current methods to get a pandas/polars/arrow dataframe from a PostgreSQL query result.
Problems with the typical conventional approach:
What psycopg can greatly help with in this case is skipping the C-value-to-Python-object conversion. The basic approach would be:
FYI: I'm working on a Parquet -> SQLAlchemy tool that also uses pyarrow to batch-insert data.
The Postgres ADBC driver might also be useful in this space. |
Looks nice. I'll wait for it to have JSON, timezone-aware datetime, and array support before switching, though. In the meantime I'm using my own repo, which under the hood uses PostgreSQL's binary protocol by way of https://github.com/altaurog/pgcopy
pyarrow is a high performance (zero copy) serialisation library designed for multi-language interop which has very fast conversion to/from pandas DataFrames. It would be great if psycopg3 could also support returning pyarrow.Table objects as well as Python tuples.