Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41480: [Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++ #41494

Merged

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 2, 2024

Rationale for this change

Currently, when building pyarrow from source, one needs to manually enable the optional components through setting PYARROW_WITH_... environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.

What changes are included in this PR?

Set defaults for the various PYARROW_BUILD_<component> based on the ARROW_<component> setting. Keep the current PYARROW_WITH_<component> environment variables working to allow to override this default.

Are there any user-facing changes?

No

…onents by default based on availability in Arrow C++
Copy link

github-actions bot commented May 2, 2024

⚠️ GitHub issue #41480 has been automatically assigned in GitHub to PR creator.

@jorisvandenbossche
Copy link
Member Author

I have not yet cleaned up all our CI scripts that contain things like export PYARROW_WITH_ACERO=${ARROW_ACERO:-OFF}, which is now in theory no longer necessary. Although we might also want to keep some of those ones that are explicitly set to 1 to ensure we don't accidentally turn it of because of an Arrow C++ config change.

find_package(Arrow REQUIRED)

# Top level cmake dir
if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou for my education, could you explain what this if check is exactly doing or why we need it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for avoiding defining our options when this is used via add_subdirectory() (or FetchContent).

We don't need this because:

  • This is never used via add_subdirectory()
  • Our option definitions are harmless with add_subdirectory() use

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 2, 2024
Comment on lines +278 to +284
# enforce module dependencies
if(PYARROW_BUILD_SUBSTRAIT)
set(PYARROW_BUILD_DATASET ON)
endif()
if(PYARROW_BUILD_DATASET)
set(PYARROW_BUILD_ACERO ON)
endif()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to ensure we keep the same logic as we have now in setup.py, but we could in theory also leave this out (and ensure the scripts don't explicitly disable Acero when enabling Dataset)

@jorisvandenbossche
Copy link
Member Author

@github-actions submit -g python

ci/scripts/python_build.sh Outdated Show resolved Hide resolved
python/CMakeLists.txt Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented May 2, 2024

One minor unfortunate consequence is that if you change your Arrow C++ build (eg enable a new component), you need to do a clean python build, otherwise CMake is still getting the old options (those option defaults are cached? Not sure if there is a way to solve this)

It would be nice if this could be avoided. @kou does that sound possible?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a try and it works great!

python/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor unfortunate consequence is that if you change your Arrow C++ build (eg enable a new component), you need to do a clean python build, otherwise CMake is still getting the old options (those option defaults are cached? Not sure if there is a way to solve this)

Cache variables defined by option() are written in ${build_directory}/CMakeCache.txt. If we remove only the file (not entire build directory), the next ninja re-generates it automatically with new default values based on ARROW_*.

Or how about accepting ON/OFF/auto(default) as PYARROW_BUILD_XXX? If auto is specified, we use the default value:

set(PYARROW_BUILD_ACERO "auto" CACHE STRING "Build the PyArrow Acero integration")

if("${PYARROW_BUILD_ACERO}" STREQUAL "auto")
  set(PYARROW_BUILD_ACERO_REAL ${ARROW_ACERO})
else()
  if(PYARROW_BUILD_ACERO)
    set(PYARROW_BUILD_ACERO_REAL ON)
  else()
    set(PYARROW_BUILD_ACERO_REAL OFF)
  endif()
endif()

find_package(Arrow REQUIRED)

# Top level cmake dir
if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for avoiding defining our options when this is used via add_subdirectory() (or FetchContent).

We don't need this because:

  • This is never used via add_subdirectory()
  • Our option definitions are harmless with add_subdirectory() use

python/setup.py Outdated
Comment on lines 238 to 242
def maybe_append_cmake_bool(varname):
env_var = varname.replace("PYARROW_BUILD", "PYARROW_WITH")
if env_var in os.environ:
value = strtobool(os.environ.get(env_var))
append_cmake_bool(value, varname)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about doing this in CMakeLists.txt too?
We can refer an environment variable by "$ENV{PYARROW_WITH_CUDA} in CMakeLists.txt.
See also: https://cmake.org/cmake/help/latest/variable/ENV.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that initially in a first version locally, but then moved it again to setup.py to avoid the same problem with it being cached. But with you suggestion above to use an AUTO default, then I can indeed move it back to the cmake.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 3, 2024
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit -g python

Copy link

github-actions bot commented May 3, 2024

Revision: f096999

Submitted crossbow builds: ursacomputing/crossbow @ actions-47aab0586a

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-20.04-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions

@jorisvandenbossche
Copy link
Member Author

One "feature" of our setup.py that I forgot about is that you actually can pass command line flags to setup.py's build_ext like python setup.py build_ext --with-parquet, and that also gets passed through to cmake right now.

Is that something we think we should keep working? It's nowhere used in our own codebase, and it is also not documented, except if you run python setup.py build_ext --help)

@jorisvandenbossche
Copy link
Member Author

It's nowhere used in our own codebase, and it is also not documented, except if you run python setup.py build_ext --help)

We did document it in the past, though (

Now, build pyarrow:
.. code-block:: shell
pushd arrow/python
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
--with-parquet --with-plasma --inplace
popd
), and so this probably is still used quite a bit in the wild (one of the first google searches eg gives this gist that uses it: https://gist.github.com/cpcloud/9702e1de6e234989d215ff00f3ab728b

I will add it back here for now, and then we can still see if we want to deprecate this separately or not.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 8, 2024
@jorisvandenbossche
Copy link
Member Author

@github-actions crossbow submit wheel-* python-sdist -python-

Copy link

github-actions bot commented May 8, 2024

Revision: f9b2b3c

Submitted crossbow builds: ursacomputing/crossbow @ actions-92fbc43a42

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
python-sdist GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-20.04-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
verify-rc-source-python-linux-almalinux-8-amd64 GitHub Actions
verify-rc-source-python-linux-conda-latest-amd64 GitHub Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 GitHub Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 GitHub Actions
verify-rc-source-python-macos-amd64 GitHub Actions
verify-rc-source-python-macos-arm64 GitHub Actions
verify-rc-source-python-macos-conda-amd64 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@jorisvandenbossche
Copy link
Member Author

@kou any other comments?
(I still need to remove some of the STATUS messages, but only want to do that when it is ready for merging, as that has been helpful while fixing/refactoring things)

@@ -123,6 +123,7 @@ repos:
?^go/.*/CMakeLists\.txt$|
?^java/.*/CMakeLists\.txt$|
?^matlab/.*/CMakeLists\.txt$|
?^python/CMakeLists\.txt$|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh... These patterns are wrong...

Could you use ^XXX/.*CMakeLists\.txt$| for all existing patterns too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #41689 specifically for that

return True
return False
if os.path.exists(built_path):
self._found_names.append(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to prepare self._found_names?
It seems that nobody uses it.

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in get_names -> get_outputs methods. We don't use those ourselves, but I am assuming this might be part of the setuptools build_ext class interface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It seems that the setuptools build_ext doesn't have get_names interface:

>>> import setuptools.command.build_ext
>>> setuptools.command.build_ext.build_ext
<class 'setuptools.command.build_ext.build_ext'>
>>> 'get_names' in dir(setuptools.command.build_ext.build_ext)
False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does have get_outputs:

In [9]: 'get_outputs' in dir(setuptools.command.build_ext.build_ext)
Out[9]: True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I checked wrong name...

os.environ.get('PYARROW_WITH_ORC', '0'))
self.with_gandiva = strtobool(
os.environ.get('PYARROW_WITH_GANDIVA', '0'))
self.with_azure = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need self.with_XXX? It seems that nobody sets them because we remove os.environ.get(...).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are still used in case someone is passing that to setup.py. See #41494 (comment)

I think that's something we should deprecate, though. But planning to do a separate follow-up PR to that, since it's not actually related to this PR (it's just another way that someone can right now override the default, just as setting an environment variable)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we need them for --with-XXX options!
I see.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 14, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels May 15, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting merge Awaiting merge labels May 16, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 16, 2024
@jorisvandenbossche jorisvandenbossche merged commit 1c546fb into apache:main May 16, 2024
12 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label May 16, 2024
@jorisvandenbossche jorisvandenbossche deleted the gh-41480-pyarrow-build-config branch May 16, 2024 12:16
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1c546fb.

There were 2 benchmark results with an error:

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…onents by default based on availability in Arrow C++ (apache#41494)

### Rationale for this change

Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.

### What changes are included in this PR?

Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default.

### Are there any user-facing changes?

No
* GitHub Issue: apache#41480

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
jorisvandenbossche added a commit that referenced this pull request Jun 13, 2024
…eing enabled by default based on Arrow C++ (#41705)

### Rationale for this change

Follow-up on #41494 to update the Python development guide to reflect the change in how PyArrow is build (defaults for the various `PYARROW_BUILD_<component>` are now set based on the `ARROW_<component>` setting. The current `PYARROW_WITH_<component>` environment variables are kept working to allow to override this default)

* GitHub Issue: #41480

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
raulcd added a commit that referenced this pull request Aug 1, 2024
### Rationale for this change

As mentioned in #41494 (comment) (while refactoring how to specify to the pyarrow build which components to build, i.e. to let it follow the Arrow C++ components by default), we do have a "feature" that you can specify which components to build directly to setup.py, like `python setup.py build_ext --with-parquet`. 

This is currently not used in our own codebase, and is also not documented anymore, but we did document it in the past.

In general calling setup.py directly is not recommended (although for development installs, it is still useful), furthermore there are alternatives to those flags (relying on Arrow C++ or setting an environment variable), and this would go away anyhow in case we would move away from setuptools at some point. 
So I think it is better to deprecate those options.

### What changes are included in this PR?

Whenever a user passes such a `--with-` flag, a warning is raised.

### Are these changes tested?

Tested it locally

### Are there any user-facing changes?

Only for developers building pyarrow from source, they have to potentially update their build instructions.
* GitHub Issue: #43514

Lead-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants