-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41480: [Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++ #41494
GH-41480: [Python] Building PyArrow: enable/disable python components by default based on availability in Arrow C++ #41494
Conversation
…onents by default based on availability in Arrow C++
|
I have not yet cleaned up all our CI scripts that contain things like |
python/CMakeLists.txt
Outdated
find_package(Arrow REQUIRED) | ||
|
||
# Top level cmake dir | ||
if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou for my education, could you explain what this if
check is exactly doing or why we need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for avoiding defining our options when this is used via add_subdirectory()
(or FetchContent
).
We don't need this because:
- This is never used via
add_subdirectory()
- Our option definitions are harmless with
add_subdirectory()
use
# enforce module dependencies | ||
if(PYARROW_BUILD_SUBSTRAIT) | ||
set(PYARROW_BUILD_DATASET ON) | ||
endif() | ||
if(PYARROW_BUILD_DATASET) | ||
set(PYARROW_BUILD_ACERO ON) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to ensure we keep the same logic as we have now in setup.py, but we could in theory also leave this out (and ensure the scripts don't explicitly disable Acero when enabling Dataset)
@github-actions submit -g python |
It would be nice if this could be avoided. @kou does that sound possible? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a try and it works great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor unfortunate consequence is that if you change your Arrow C++ build (eg enable a new component), you need to do a clean python build, otherwise CMake is still getting the old options (those option defaults are cached? Not sure if there is a way to solve this)
Cache variables defined by option()
are written in ${build_directory}/CMakeCache.txt
. If we remove only the file (not entire build directory), the next ninja
re-generates it automatically with new default values based on ARROW_*
.
Or how about accepting ON
/OFF
/auto
(default) as PYARROW_BUILD_XXX
? If auto
is specified, we use the default value:
set(PYARROW_BUILD_ACERO "auto" CACHE STRING "Build the PyArrow Acero integration")
if("${PYARROW_BUILD_ACERO}" STREQUAL "auto")
set(PYARROW_BUILD_ACERO_REAL ${ARROW_ACERO})
else()
if(PYARROW_BUILD_ACERO)
set(PYARROW_BUILD_ACERO_REAL ON)
else()
set(PYARROW_BUILD_ACERO_REAL OFF)
endif()
endif()
python/CMakeLists.txt
Outdated
find_package(Arrow REQUIRED) | ||
|
||
# Top level cmake dir | ||
if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for avoiding defining our options when this is used via add_subdirectory()
(or FetchContent
).
We don't need this because:
- This is never used via
add_subdirectory()
- Our option definitions are harmless with
add_subdirectory()
use
python/setup.py
Outdated
def maybe_append_cmake_bool(varname): | ||
env_var = varname.replace("PYARROW_BUILD", "PYARROW_WITH") | ||
if env_var in os.environ: | ||
value = strtobool(os.environ.get(env_var)) | ||
append_cmake_bool(value, varname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about doing this in CMakeLists.txt
too?
We can refer an environment variable by "$ENV{PYARROW_WITH_CUDA}
in CMakeLists.txt
.
See also: https://cmake.org/cmake/help/latest/variable/ENV.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that initially in a first version locally, but then moved it again to setup.py to avoid the same problem with it being cached. But with you suggestion above to use an AUTO default, then I can indeed move it back to the cmake.
@github-actions crossbow submit -g python |
Revision: f096999 Submitted crossbow builds: ursacomputing/crossbow @ actions-47aab0586a |
One "feature" of our setup.py that I forgot about is that you actually can pass command line flags to setup.py's build_ext like Is that something we think we should keep working? It's nowhere used in our own codebase, and it is also not documented, except if you run |
We did document it in the past, though ( arrow/docs/source/python/development.rst Lines 208 to 215 in 8ca4138
I will add it back here for now, and then we can still see if we want to deprecate this separately or not. |
…row-build-config
Co-authored-by: Sutou Kouhei <[email protected]>
…row-build-config
@github-actions crossbow submit wheel-* python-sdist -python- |
Revision: f9b2b3c Submitted crossbow builds: ursacomputing/crossbow @ actions-92fbc43a42 |
@kou any other comments? |
.pre-commit-config.yaml
Outdated
@@ -123,6 +123,7 @@ repos: | |||
?^go/.*/CMakeLists\.txt$| | |||
?^java/.*/CMakeLists\.txt$| | |||
?^matlab/.*/CMakeLists\.txt$| | |||
?^python/CMakeLists\.txt$| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... These patterns are wrong...
Could you use ^XXX/.*CMakeLists\.txt$|
for all existing patterns too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #41689 specifically for that
return True | ||
return False | ||
if os.path.exists(built_path): | ||
self._found_names.append(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to prepare self._found_names
?
It seems that nobody uses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in get_names
-> get_outputs
methods. We don't use those ourselves, but I am assuming this might be part of the setuptools build_ext class interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. It seems that the setuptools build_ext
doesn't have get_names
interface:
>>> import setuptools.command.build_ext
>>> setuptools.command.build_ext.build_ext
<class 'setuptools.command.build_ext.build_ext'>
>>> 'get_names' in dir(setuptools.command.build_ext.build_ext)
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does have get_outputs
:
In [9]: 'get_outputs' in dir(setuptools.command.build_ext.build_ext)
Out[9]: True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. I checked wrong name...
os.environ.get('PYARROW_WITH_ORC', '0')) | ||
self.with_gandiva = strtobool( | ||
os.environ.get('PYARROW_WITH_GANDIVA', '0')) | ||
self.with_azure = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need self.with_XXX
? It seems that nobody sets them because we remove os.environ.get(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are still used in case someone is passing that to setup.py. See #41494 (comment)
I think that's something we should deprecate, though. But planning to do a separate follow-up PR to that, since it's not actually related to this PR (it's just another way that someone can right now override the default, just as setting an environment variable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we need them for --with-XXX
options!
I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1c546fb. There were 2 benchmark results with an error:
There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…onents by default based on availability in Arrow C++ (apache#41494) ### Rationale for this change Currently, when building pyarrow from source, one needs to manually enable the optional components through setting `PYARROW_WITH_...` environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build. ### What changes are included in this PR? Set defaults for the various `PYARROW_BUILD_<component>` based on the `ARROW_<component>` setting. Keep the current `PYARROW_WITH_<component>` environment variables working to allow to override this default. ### Are there any user-facing changes? No * GitHub Issue: apache#41480 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…eing enabled by default based on Arrow C++ (#41705) ### Rationale for this change Follow-up on #41494 to update the Python development guide to reflect the change in how PyArrow is build (defaults for the various `PYARROW_BUILD_<component>` are now set based on the `ARROW_<component>` setting. The current `PYARROW_WITH_<component>` environment variables are kept working to allow to override this default) * GitHub Issue: #41480 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
### Rationale for this change As mentioned in #41494 (comment) (while refactoring how to specify to the pyarrow build which components to build, i.e. to let it follow the Arrow C++ components by default), we do have a "feature" that you can specify which components to build directly to setup.py, like `python setup.py build_ext --with-parquet`. This is currently not used in our own codebase, and is also not documented anymore, but we did document it in the past. In general calling setup.py directly is not recommended (although for development installs, it is still useful), furthermore there are alternatives to those flags (relying on Arrow C++ or setting an environment variable), and this would go away anyhow in case we would move away from setuptools at some point. So I think it is better to deprecate those options. ### What changes are included in this PR? Whenever a user passes such a `--with-` flag, a warning is raised. ### Are these changes tested? Tested it locally ### Are there any user-facing changes? Only for developers building pyarrow from source, they have to potentially update their build instructions. * GitHub Issue: #43514 Lead-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
Rationale for this change
Currently, when building pyarrow from source, one needs to manually enable the optional components through setting
PYARROW_WITH_...
environment variables. However, we could also make a default choice of components based on which ones where enabled in the Arrow C++ build.What changes are included in this PR?
Set defaults for the various
PYARROW_BUILD_<component>
based on theARROW_<component>
setting. Keep the currentPYARROW_WITH_<component>
environment variables working to allow to override this default.Are there any user-facing changes?
No