-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Format] Add Opaque canonical extension type #41823
Conversation
Updated NA -> Null, add "vendor_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for proposing this and putting together all of the language! I am excited to implement this in any drivers if/when it is available.
Is there any desire to include any other information that the producer has available? If some future Parquet standard allows for extension types it might make sense for the Parquet reader to pass on the extension metadata somehow (or maybe that should only be considered when the feature is first requested, which might be never?)
- The PostgreSQL ``polygon`` type may be represented as Other[binary] with | ||
metadata ``{"type_name": "polygon", "vendor_name": "PostgreSQL"}``. | ||
- The PostgreSQL ``point`` type may be represented as | ||
Other[fixed_size_binary[16]] with metadata | ||
``{"type_name": "point", "vendor_name": "PostgreSQL"}``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a more realistic example might be using the name "geometry"
, since this is what the PostGIS extension type would give you here (I'm not aware of widespread usage of the point or polygon types, although they are certainly fine for illustration purposes). apache/arrow-adbc#546 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the example.
In that case, wouldn't it just get mapped as a different extension type? I suppose I can see this being used as a generic "Parquet gave an extension type that I don't know how to deal with". I think we can add a new field at that point. |
Thoughts on using "Unknown" instead of "Other" ? It's not entirely clear to me what the motivation is to bucket those other types together in a single |
The set of unknown types isn't a closed set. |
I recently was looking into this concept in Parquet and apparently they use the term "undefined" for this there. I think "unknown" best communicates this concept (as in, this is not a type that is known to the implementation); however, I don't have strong feelings about the name.
For something like nanoarrow this might work; however, in Arrow C++ the extension type information is aggressively dropped, so it's very possible that that a pyarrow consumer would never be able to query the extension name to get the relevant information (what the type name is and where it came from). One could argue that also maybe Arrow C++ should stop aggressively dropping extension information, but even with that I think that this extension type is a better solution than an extension name that a database might not even know that it has until it receives a query. |
But also this extension type is not a closed set, as you can put whatever you want in the type_name and vendor_name metadata?
Then we should maybe consider if that behaviour in C++ is practical, otherwise this "arrow.other" type feels like a workaround for a limitation in the C++ API. |
To give a concrete example, to try to clarify:
In both cases, a producer of Arrow data is returning data with a data type that is not natively supported. So from an Arrow implementation (like Arrow C++) point of view, both cases seem equivalent. But so why would we treat the one case differently (assuming we add explicit support for "arrow.other" extension type) than the other? Maybe what I am after is some better guidelines for when data producers should use the "arrow.other" type vs a custom extension type name. |
Then you risk colliding with an actual extension type, and you aren't communicating that the intermediate system doesn't know how to interpret the type. Part of the use case is just in catalogs like Flight SQL where there isn't necessarily concrete data in play. GDAL wouldn't use it for types it knows about. I would argue the typical extension type is to claim support for a type that is not part of Arrow's type system with some particular semantics (UUID, or geometry, etc.). "Other" is different, it is to explicitly disclaim support while also avoiding a hard error or silent data loss. See the examples already written in the proposal. |
I think they are considering two different scenarios: an "extension type" is perhaps something that has a definition or specification somewhere, whereas when the Postgres driver encounters a type that it didn't know about at compile time, all ADBC has is I would also like to see pyarrow flag unknown extension types, but using that system for this use-case feels very hack-y to me. |
While I certainly understand the distinction of it being the intermediate system not knowing how to interpret the type (where in the example of GDAL the type is of course definitly known to GDAL), the "not knowing" part is bit fluid. One of the examples given is the PostgreSQL polygon type, which AFAIK is a built-in postgres type. The driver not knowing about this type seems to be a choice you make in the implementation (although disclaimer I am not familiar with the actual implementation). The postgres driver could perfectly know about the polygon type if you wanted? But I assume that is not done right now because there is no matching type on the Arrow side to map it to. Something else: since it seems this mostly comes from ADBC use cases, it would also be an option to use an "adbc.other" extension type with the described semantics? That avoids requiring a specific implementation for this on the Arrow side. |
Probably a better example would be a user-defined type (type alias for a built-in or extension type) or a type from a loaded extension (which the ADBC driver would never have an opportunity to know). I think the example was changed to
If extension type information wasn't dropped in multiple implementations this would probably work. Ensuring extension type registration in the presence of shared/static libraries and multiple languages is not trivial (as we know from considering this for geoarrow 🙂 ).
I think this issue comes up at any place data is converted to Arrow from something else. In R we (I) invented the undocumented |
|
|
||
An example of the extension metadata would be:: | ||
|
||
{"type_name": "geometry", "vendor_name": "PostGIS"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: JSON MUST use UTF-8: https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8.
Co-authored-by: Sutou Kouhei <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lidavidm for this proposal and Java component LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look through the C++ and Python implementations (as well as the definition in CanonicalExtensions.rst). Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format spec addition looks good to me!
For the Python bindings, looks good as well, just a few minor comments:
- We probably want to add
OpaqueScalar
topyarrow/__init__.py
as well? (although I see we forgot this for the tensor type, which I assume you have been imitating) - Can you add the three new classes to
test_extension_type_constructor_errors
intest_misc.py
? - Can you list the new type in the python API docs (docs/source/python/api/datatypes.rst, docs/source/python/api/arrays.rst
)? See eg #39652 for a previous PR adding a new type where this is added to the docs.
(FWIW I think we ideally would have a separate PR(s) for the implementations, so this Format PR (and what we vote on) is just the format change)
o.VendorName == rhs.VendorName | ||
} | ||
|
||
type OpaqueArray struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: godoc comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General LGTM!
std::string OpaqueType::ToString(bool show_metadata) const { | ||
std::stringstream ss; | ||
ss << "extension<" << this->extension_name() | ||
<< "[storage_type=" << storage_type_->ToString() << ", type_name=" << type_name_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
show_metadata
passing into storage_type_->ToString()
?
@@ -22,6 +22,7 @@ | |||
#include "arrow/compute/kernels/scalar_cast_internal.h" | |||
#include "arrow/compute/kernels/util_internal.h" | |||
#include "arrow/scalar.h" | |||
#include "arrow/type_fwd.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why type_fwd is included?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits from a review of the python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java LGTM!
Co-authored-by: Weston Pace <[email protected]>
I've split this into four issues/PRs: #43453 Thanks all for the feedback! |
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?