Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++ #41480

Closed
jorisvandenbossche opened this issue May 1, 2024 · 15 comments
Assignees
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Currently, when building pyarrow from source, one needs to manually enable the optional components through setting PYARROW_WITH_... environment variables. The example from the docs (https://arrow.apache.org/docs/dev/developers/python.html#build-and-test):

export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=1
export PYARROW_PARALLEL=4
python setup.py build_ext --inplace

Nowadays, if you enable something that isn't actually built in the Arrow C++ library you are building against, our python CMake will give an error. For example, I have build Arrow C++ without ARROW_ORC enabled, so when building pyarrow with PYARROW_WITH_ORC=1, I get:

CMake Error at CMakeLists.txt:652 (message):
  You must build Arrow C++ with ARROW_ORC=ON

For development, I think it would be a lot more convenient to by default just enable those python components for which the equivalent Arrow C++ component was built. And we can still keep the option to override this default with the existing environment variables.

In #37822, @joemarshall implemented a solution for this because needing it for building pyarrow for emscripten / pyodide (although I would also just want this for the convenience).
See #37822 (comment) for some details, but the current solution there is to parse a json file generated by cmake to get those C++ build options into setup.py. @kou brought up that we could also do this inside python/CMakeLists.txt, however that is right now not possible because we use those variables in setup.py.

Looking a bit in more detail, it seems we only use those variables 1) to pass those to cmake, and 2) to raise a custom error if a certain cython extension module failed building. _failure_permitted function is using this, which is called from:

arrow/python/setup.py

Lines 330 to 338 in 22f88fa

for name in self.CYTHON_MODULE_NAMES:
built_path = pjoin(install_prefix, name + ext_suffix)
if not os.path.exists(built_path):
print(f'Did not find {built_path}')
if self._failure_permitted(name):
print(f'Cython module {name} failure permitted')
continue
raise RuntimeError('PyArrow C-extension failed to build:',
os.path.abspath(built_path))

However, as I mentioned above, nowadays we already check for the equivalent Arrow C++ component being built from CMake, so I am not sure it is very common that a certain extension module would fail building when it should have been built (without this build error actually bubbling up already, because a failure from cython will just fail your build, without getting to this "failure permitted" check step).

If we remove this "failure_permitted" logic from setup.py, I think we can easily consolidate all component handling in python/CMakeLists.txt.

Does somebody know the history behind the "failure_permitted" logic? Does it still have its use today? (cc @pitrou)

@jorisvandenbossche
Copy link
Member Author

Does somebody know the history behind the "failure_permitted" logic? Does it still have its use today?

The main thing I don't understand about our current setup is in which case you would actually encounter the situation of one the cython extension modules not being built without it erroring during the build.
(maybe in the past things worked differently that required this logic?)

Looking into when this logic was originally added (#132 and #194), it seems that at the time we were using PARQUET_FOUND to decide to build pyarrow with Parquet or not, so when something went wrong there, we were just silently not building pyarrow.parquet.
But now we build this module opt-in based on PYARROW_BUILD_PARQUET, and in that case require that cmake finds Parquet lib and Arrow C++ is built with Parquet (ARROW_PARQUET is ON), and only then build the cython module. So my understanding is that any error of not having/finding libparquet as a failure in building the cython module will already fail loudly. Making this "failure_permitted" logic superfluous?

@paleolimbot
Copy link
Member

Just a note that I think we have been doing this in the R package for quite some time (with apologies if I missed part of the nuance here):

arrow/r/configure

Lines 387 to 390 in 22f88fa

arrow_built_with() {
# Function to check cmake options for features
grep -i 'set('"$1"' "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
}

@kou
Copy link
Member

kou commented May 1, 2024

I think that we can remove the logic from python/setup.py.

The logic was borrowed from dynd-python: https://github.com/apache/arrow/pull/17/files#diff-eb8b42d9346d0a5d371facf21a8bfa2d16fb49e213ae7c21f03863accebe0fcfR83-R84

The check was for moving built extensions to proper locations: https://github.com/libdynd/dynd-python/blob/741d9d16c437f778d8c0d7259d4634c9af43962b/setup.py#L250-L266

The moving logic was already moved to python/CMakeLists.txt: #14925

I should have removed the logic in the PR. Sorry...

@jorisvandenbossche
Copy link
Member Author

@kou I don't think it is about the moving logic, there was some additional logic to check if a certain cython extension module output existed (and this logic was only added after the initial implementation that was taken from dynd-python).

Anyway, I put up a PR for this: #41494. Your review would be very welcome!

@pitrou
Copy link
Member

pitrou commented May 2, 2024

Does somebody know the history behind the "failure_permitted" logic? Does it still have its use today? (cc @pitrou)

The idea was probably to let people build a light-weight PyArrow even if a full-blown Arrow C++ is installed. That assumes people are interested in building PyArrow themselves while using a pre-built Arrow C++. I'm skeptical that this is a widespread use case.

@jorisvandenbossche
Copy link
Member Author

And that is still possible by setting PYARROW_WITH_...=0 regardless of the "failure_permitted" logic being removed

@pitrou
Copy link
Member

pitrou commented May 2, 2024

Ah, ok. Well, that clears things up then.

@paleolimbot
Copy link
Member

I'm skeptical that this is a widespread use case.

I think that pretty much all pyarrow developers have to do this? (Although pyarrow developers also might have a more limited C++ build that they use as a default).

@jorisvandenbossche
Copy link
Member Author

No (at least not me), I think typically pyarrow developers just build similar features sets for Arrow C++ and pyarrow (my default feature set I enable for Arrow C++ is indeed relatively limited).
But anyway, you don't "have" to do this, because even if you have an Arrow C++ with all features, pyarrow will still build just fine regardless of which pyarrow features you disable/enable (maybe I misunderstood your comment, though).

@paleolimbot
Copy link
Member

I was worried that adding the extra modules by default would impact the iteration time (e.g., edit some cython file, rebuild to see if it worked), but I just tested and it doesn't make much of a difference (compute is the only module that dominates the rebuild time time).

@kou
Copy link
Member

kou commented May 2, 2024

I don't think it is about the moving logic, there was some additional logic to check if a certain cython extension module output existed (and this logic was only added after the initial implementation that was taken from dynd-python).

The check logic also exists in dynd-python: https://github.com/libdynd/dynd-python/blob/741d9d16c437f778d8c0d7259d4634c9af43962b/setup.py#L265-L266

Anyway, I think that we don't need to do it here. We should do it as a general test. And we already has it. For example:

if [ "${CHECK_IMPORTS}" == "ON" ]; then
# Test that the modules are importable
python -c "
import pyarrow
import pyarrow._hdfs
import pyarrow.csv
import pyarrow.dataset
import pyarrow.fs
import pyarrow.json
import pyarrow.orc
import pyarrow.parquet
"
if [ "${PYARROW_TEST_GCS}" == "ON" ]; then
python -c "import pyarrow._gcsfs"
fi
if [ "${PYARROW_TEST_S3}" == "ON" ]; then
python -c "import pyarrow._s3fs"
fi
if [ "${PYARROW_TEST_FLIGHT}" == "ON" ]; then
python -c "import pyarrow.flight"
fi
if [ "${PYARROW_TEST_SUBSTRAIT}" == "ON" ]; then
python -c "import pyarrow.substrait"
fi
fi

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 3, 2024
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented May 3, 2024

Ah, you're right, the logic to raise an error when a module wasn't build already existed (and was indeed related to the moving). But I was talking just about the feature to allow suppressing this error. Sorry for the confusion ;)

In any case, the PR is removing it.

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 7, 2024
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 8, 2024
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue May 16, 2024
jorisvandenbossche added a commit that referenced this issue May 16, 2024
… by default based on availability in Arrow C++ (#41494)

### Rationale for this change

Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.

### What changes are included in this PR?

Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default.

### Are there any user-facing changes?

No
* GitHub Issue: #41480

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
@jorisvandenbossche jorisvandenbossche added this to the 17.0.0 milestone May 16, 2024
@jorisvandenbossche
Copy link
Member Author

Issue resolved by pull request 41494
#41494

@rok
Copy link
Member

rok commented May 16, 2024

Thanks for doing this @jorisvandenbossche!
Should development docs be updated to reflect the new behavior?

@jorisvandenbossche
Copy link
Member Author

Yes, I had started on that yesterday, but didn't yet open a PR. Did that now -> #41705

kou pushed a commit that referenced this issue May 22, 2024
… itself is not enabled (fix Java integration build) (#41776)

### Rationale for this change

Because of refactoring in #41480, explicitly enabling `PYARROW_WITH_PARQUET_ENCRYPTION` without enabling `PYARROW_WITH_PARQUET` (and without Arrow C++ being built with Parquet support) now raises an error, while before we checked in `setup.py` that both were enabled for enabling encryption support. This patch mimics that logic in CMakeLists.txt with a warning added.

### What changes are included in this PR?

When PyArrow with Parquet Encryption is enabled but PyArrow with Parquet itself is not, ignore the encryption setting, but warn about it.

### Are these changes tested?

Yes

* GitHub Issue: #41725

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…onents by default based on availability in Arrow C++ (apache#41494)

### Rationale for this change

Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.

### What changes are included in this PR?

Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41480

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…arquet itself is not enabled (fix Java integration build) (apache#41776)

### Rationale for this change

Because of refactoring in apache#41480, explicitly enabling `PYARROW_WITH_PARQUET_ENCRYPTION` without enabling `PYARROW_WITH_PARQUET` (and without Arrow C++ being built with Parquet support) now raises an error, while before we checked in `setup.py` that both were enabled for enabling encryption support. This patch mimics that logic in CMakeLists.txt with a warning added.

### What changes are included in this PR?

When PyArrow with Parquet Encryption is enabled but PyArrow with Parquet itself is not, ignore the encryption setting, but warn about it.

### Are these changes tested?

Yes

* GitHub Issue: apache#41725

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
jorisvandenbossche added a commit that referenced this issue Jun 13, 2024
…eing enabled by default based on Arrow C++ (#41705)

### Rationale for this change

Follow-up on #41494 to update the Python development guide to reflect the change in how PyArrow is build (defaults for the various `PYARROW_BUILD_<component>` are now set based on the `ARROW_<component>` setting. The current `PYARROW_WITH_<component>` environment variables are kept working to allow to override this default)

* GitHub Issue: #41480

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants