-
-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate for CUDA 12.x (x to be determined) #6630
Comments
I've seen that some packages are somewhat named differently from 12.0 to 12.6. to help the transition. As I understand it, buliding with cuda 12.6 would allow running with 12.0 if all the package names match correct? |
If you meant the CUDA version of the enduser, then IIUC we need
Yeah, there was some reshuffling of various infra bits throughout the rollout, which is why I wanted to make this a migration and not just add that version to the pinning (i.e. why I was against #5613). |
is cupy special for some reason? |
If you look at the linked issue and its PR, this was added at a time when there was only 12.0 being built on the feedstock. I believe it is a bug in the metadata. AFAIU, using a build against 12.5 on a system with CUDA 12.0 is likely (though not guaranteed) to fail. CC @conda-forge/cupy |
There is “CUDA Enhanced Compatibility” which can allow building with 12.y and running with 12.x, (x < y) if the library and conda packaging support it. We wrote up a guide on this in the CUDA conda docs, with lots of links and sample code snippets: https://github.com/conda-forge/cuda-feedstock/blob/main/recipe/doc/recipe_guide.md#cuda-enhanced-compatibility By default, we do not know if a library is designed to work with earlier CUDA versions (based on what symbols / APIs it uses) so That can then be ignored if the maintainer knows their library’s compatibility is wider. This is how RAPIDS is packaged: we build with CUDA 12.5 (and CUDA 11.8) and allow runtimes of CUDA 12.0-12.5 (and CUDA 11.4-11.8). CuPy is the same (it ignores the compiler’s run exports). https://github.com/conda-forge/cupy-feedstock/blob/564d47cb293a8584d86294b9285ecbec2bffabdb/recipe/meta.yaml#L120 tl;dr migrating to building with CUDA 12.x would not necessarily break compatibility for all packages with earlier CUDA runtimes. By default, it would, for packages that do not have these run-exports ignored and replaced by their true set of compatible versions. But we can help make package maintainers aware of how to opt out of the stricter run exports and leverage CUDA Enhanced Compatibility where needed. |
Concretely, @h-vetinari is correct that the CUDA runtime provides sufficient ABI stability guarantees that if you build with 12.x and you run on 12.y for x < y, that will work. The converse is conditionally true: if you build with 12.y and run on 12.x (again x < y), that will work as long as you do not use any newer symbols. cudart symbols are stable across minor versions, but newer minor versions may introduce new symbols so as @bdice mentions it is incumbent on the library to ensure that no newer symbols are used. CUDA libraries (e.g. cuBlas) should provide similar guarantees to the runtime. Driver compatibility is separate: the driver promises binary compatibility across major releases as well as minor releases, again with the caveat that callers are responsible for putting in suitable guards to avoid using features only available in newer drivers when running on an older one. |
Thanks for confirming, this is what I had in mind (we're in the same situation with other ABI-stable libraries, which may still add new symbols in otherwise fully backward-compatible releases). Since I'm not aware if or how the use of newer symbols is actually controlled during CUDA builds my assumption was that use of newer symbols may happen (and relatedly, I believe it comes into play when compiling for new GPU arches that older CUDA versions don't support), and therefore that allowing "older versions than at build time" is fundamentally a risky proposition. I'd much rather we have separate builds for 11.8, 12.0 and 12.x, with respective constraints on CUDA like |
I think that typically makes sense. The migration doesn't constrain packages from choosing to opt in to building a library that is compatible across CUDA minor versions, nor does it prevent packages from setting up their runtime constraints so as to allow users to install those libraries on all compatible CUDA versions, right? Assuming both of those are true I would be fine with your plan. I would still want to preserve each library's ability to choose to build for greater compatibility, but I assume that can be done by overriding the CBC and .ci_support files on a per-feedstock basis regardless of what is pinned here. AFAICT the only thing piece of CUDA that is actually pinned here is the compiler. |
Yeah, pretty much everything can be overridden. The default should be safe, but if you know what you're doing, you can turn it off (e.g. ignore certain run-exports and set up your own constraints) |
Ubuntu 22.04 only reports I think Ubuntu 20.04 is stuck on nvidia driver 535. WIth the new info (thanks for the thoroughness), it might be good to have:
|
That's not needed. If you build with |
So the fact that we can install all the cuda stuff can be installed via conda, then we can simply have 12.0 and 12.max? and users on Ubuntu 22.04 will be fine? |
I don't see a reason to have both 12.0 and 12.max. Just 12.max will be fine.
yes |
For Ubuntu 20.04, newer NVIDIA driver versions (currently up to 565 branch) are available in the CUDA repository. |
That's an aspect of the whole thing that does make things confusing for sure. We do now have packages installable for newer CUDA pretty much everywhere, but there is a relationship to the driver, and I don't think we're capturing that information explicitly anywhere (it's probably implicit in the CUDA version?). It's actually more complicated still because NVIDA also offers a forward-looking I certainly don't pretend to understand the full set of constraints and relationships here...
In any case, if it works to only build for 12.max, all the better! |
I can summarize the "relationship to the driver" portion. It's not a significant constraint here. For CUDA 12, driver 525 was the first CUDA 12 driver released. So essentially all this means is "to run any CUDA 12.x, you need a driver that was released with 12.0 or later."
I agree it will probably suffice to build only |
I don't think a migration is necessary. Some of these feedstocks are just pinning the cuda-version at runtime unnecessarily, when they should just be trusting the strong run_exports from the cuda compiler package. For example, This pinning in nnpops is incorrect. If they want the cuda-version at test time to match the build environment, they should add the cuda-version package to the test/requires section instead of constraining all of their end users to use CTK 12.0 libraries. petsc also probably has redundant pinnings. These pinning are very similar and maybe the same as what is already exported by the compiler package. I think the only reason a package should need to build with cuda-version >12.0 is there is a feature in the cuda compiler that the library needs, but that's a library specific and not channel wide issue. If there is a bug fix in a CTK component library, then that library can be pinned separately from the cuda-version. |
It does become a wider issue if more and more packages require the newer compiler (whether due to features or bugs, e.g. with C++20 as mentioned in the pytorch issue), in the sense that we should have an easy solution for feedstocks to use a CUDA >12.0, rather than playing around with trying to override things correctly in a local CBC.
That's great! Thanks for the confirmation!
If only due to the build changes between 12.0, 12.1, 12.2 on the side of the conda-forge infrastructure (some reshuffles happened in the beginning, e.g. in which package certain binaries live), I think it would be a good idea. The migration could add 12.6 now, and once we're ready to close it, we could drop 12.0 and just keep 11.8 & 12.6. |
The thing is, that for many package, cuda builds are just flaky, long, and hard to get right (see pytorch work), and having 11.8 + 12.0 + 12.6 might just be more churn than the migration doing a full swap of 12.0 for 12.6. If the "migration" could "drop 12.0 and add 12.6" that would be great since we don't need to troubleshoot a "little value 12.0 build". |
This has been discussed several times in various places already for example, e.g. @jakirkham had proposed adding 12.4 in #5613 a while ago already. More and more, feedstocks are starting to depend on features (or bug fixes) from CUDA >12.0, so IMOwe should pick a version
x>=4
and migrate. Examples I'm aware of (+ some searching):The text was updated successfully, but these errors were encountered: