-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++ #41480
Comments
The main thing I don't understand about our current setup is in which case you would actually encounter the situation of one the cython extension modules not being built without it erroring during the build. Looking into when this logic was originally added (#132 and #194), it seems that at the time we were using |
Just a note that I think we have been doing this in the R package for quite some time (with apologies if I missed part of the nuance here): Lines 387 to 390 in 22f88fa
|
I think that we can remove the logic from The logic was borrowed from dynd-python: https://github.com/apache/arrow/pull/17/files#diff-eb8b42d9346d0a5d371facf21a8bfa2d16fb49e213ae7c21f03863accebe0fcfR83-R84 The check was for moving built extensions to proper locations: https://github.com/libdynd/dynd-python/blob/741d9d16c437f778d8c0d7259d4634c9af43962b/setup.py#L250-L266 The moving logic was already moved to I should have removed the logic in the PR. Sorry... |
…onents by default based on availability in Arrow C++
@kou I don't think it is about the moving logic, there was some additional logic to check if a certain cython extension module output existed (and this logic was only added after the initial implementation that was taken from dynd-python). Anyway, I put up a PR for this: #41494. Your review would be very welcome! |
The idea was probably to let people build a light-weight PyArrow even if a full-blown Arrow C++ is installed. That assumes people are interested in building PyArrow themselves while using a pre-built Arrow C++. I'm skeptical that this is a widespread use case. |
And that is still possible by setting |
Ah, ok. Well, that clears things up then. |
I think that pretty much all pyarrow developers have to do this? (Although pyarrow developers also might have a more limited C++ build that they use as a default). |
No (at least not me), I think typically pyarrow developers just build similar features sets for Arrow C++ and pyarrow (my default feature set I enable for Arrow C++ is indeed relatively limited). |
I was worried that adding the extra modules by default would impact the iteration time (e.g., edit some cython file, rebuild to see if it worked), but I just tested and it doesn't make much of a difference (compute is the only module that dominates the rebuild time time). |
The check logic also exists in dynd-python: https://github.com/libdynd/dynd-python/blob/741d9d16c437f778d8c0d7259d4634c9af43962b/setup.py#L265-L266 Anyway, I think that we don't need to do it here. We should do it as a general test. And we already has it. For example: arrow/ci/scripts/python_wheel_unix_test.sh Lines 64 to 88 in 49bf3d9
|
…row-build-config
Ah, you're right, the logic to raise an error when a module wasn't build already existed (and was indeed related to the moving). But I was talking just about the feature to allow suppressing this error. Sorry for the confusion ;) In any case, the PR is removing it. |
…row-build-config
…row-build-config
…row-build-config
… by default based on availability in Arrow C++ (#41494) ### Rationale for this change Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build. ### What changes are included in this PR? Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default. ### Are there any user-facing changes? No * GitHub Issue: #41480 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Issue resolved by pull request 41494 |
Thanks for doing this @jorisvandenbossche! |
…ents being enabled by default based on Arrow C++
Yes, I had started on that yesterday, but didn't yet open a PR. Did that now -> #41705 |
… itself is not enabled (fix Java integration build) (#41776) ### Rationale for this change Because of refactoring in #41480, explicitly enabling `PYARROW_WITH_PARQUET_ENCRYPTION` without enabling `PYARROW_WITH_PARQUET` (and without Arrow C++ being built with Parquet support) now raises an error, while before we checked in `setup.py` that both were enabled for enabling encryption support. This patch mimics that logic in CMakeLists.txt with a warning added. ### What changes are included in this PR? When PyArrow with Parquet Encryption is enabled but PyArrow with Parquet itself is not, ignore the encryption setting, but warn about it. ### Are these changes tested? Yes * GitHub Issue: #41725 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…onents by default based on availability in Arrow C++ (apache#41494) ### Rationale for this change Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build. ### What changes are included in this PR? Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default. ### Are there any user-facing changes? No * GitHub Issue: apache#41480 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…arquet itself is not enabled (fix Java integration build) (apache#41776) ### Rationale for this change Because of refactoring in apache#41480, explicitly enabling `PYARROW_WITH_PARQUET_ENCRYPTION` without enabling `PYARROW_WITH_PARQUET` (and without Arrow C++ being built with Parquet support) now raises an error, while before we checked in `setup.py` that both were enabled for enabling encryption support. This patch mimics that logic in CMakeLists.txt with a warning added. ### What changes are included in this PR? When PyArrow with Parquet Encryption is enabled but PyArrow with Parquet itself is not, ignore the encryption setting, but warn about it. ### Are these changes tested? Yes * GitHub Issue: apache#41725 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…eing enabled by default based on Arrow C++ (#41705) ### Rationale for this change Follow-up on #41494 to update the Python development guide to reflect the change in how PyArrow is build (defaults for the various `PYARROW_BUILD_<component>` are now set based on the `ARROW_<component>` setting. The current `PYARROW_WITH_<component>` environment variables are kept working to allow to override this default) * GitHub Issue: #41480 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
Currently, when building pyarrow from source, one needs to manually enable the optional components through setting
PYARROW_WITH_...
environment variables. The example from the docs (https://arrow.apache.org/docs/dev/developers/python.html#build-and-test):Nowadays, if you enable something that isn't actually built in the Arrow C++ library you are building against, our python CMake will give an error. For example, I have build Arrow C++ without ARROW_ORC enabled, so when building pyarrow with
PYARROW_WITH_ORC=1
, I get:For development, I think it would be a lot more convenient to by default just enable those python components for which the equivalent Arrow C++ component was built. And we can still keep the option to override this default with the existing environment variables.
In #37822, @joemarshall implemented a solution for this because needing it for building pyarrow for emscripten / pyodide (although I would also just want this for the convenience).
See #37822 (comment) for some details, but the current solution there is to parse a json file generated by cmake to get those C++ build options into
setup.py
. @kou brought up that we could also do this insidepython/CMakeLists.txt
, however that is right now not possible because we use those variables insetup.py
.Looking a bit in more detail, it seems we only use those variables 1) to pass those to cmake, and 2) to raise a custom error if a certain cython extension module failed building.
_failure_permitted
function is using this, which is called from:arrow/python/setup.py
Lines 330 to 338 in 22f88fa
However, as I mentioned above, nowadays we already check for the equivalent Arrow C++ component being built from CMake, so I am not sure it is very common that a certain extension module would fail building when it should have been built (without this build error actually bubbling up already, because a failure from cython will just fail your build, without getting to this "failure permitted" check step).
If we remove this "failure_permitted" logic from setup.py, I think we can easily consolidate all component handling in
python/CMakeLists.txt
.Does somebody know the history behind the "failure_permitted" logic? Does it still have its use today? (cc @pitrou)
The text was updated successfully, but these errors were encountered: