Makes UR cuda backend compatible with MPI #2077
Conversation
Without changing UR/oneapi specs. Signed-off-by: JackAKirk <[email protected]>
oneapi-src/unified-runtime#2077 Signed-off-by: JackAKirk <[email protected]>
tested here:
Here is some more information on why this design choice was made, and a comparison with alternatives.

Firstly, we start with the requirement that we fix intel/llvm#15251. We then identify that the fix is to no longer instantiate every device's primary CUcontext at platform scope. The complication is the following timestamp requirement: "Each profiling descriptor returns a 64-bit timestamp that represents the number of nanoseconds that have elapsed since some implementation-defined timebase. All events that share the same backend are guaranteed to share the same timebase, therefore the difference between two timestamps from the same backend yields the number of nanoseconds that have elapsed between those events."

This is simply not implementable word for word (unless I am mistaken) in the cuda backend under the constraint that we are MPI compliant (in the following, for brevity, I will assume this constraint holds), for the following reasons:
Therefore we cannot create a "backend global" CUevent timestamp at backend/platform initialization. However, in practice I don't think this can break any SYCL code with the change in this PR, because people can only perform profiling once they have created a sycl::context (at least implicitly, via a sycl::queue).

At this point it is worth considering an alternative: imagine that we instead tie CUcontext instantiation to sycl::queue creation. In addition to sycl::queue instantiating the CUcontext, we would also have to make sycl::kernel_bundle instantiate a CUcontext, since it is not tied to the queue. Now if you did this, at least for the dpc++ implementation, I think that all core SYCL functionality would be valid, since SYCL operations require either a sycl::queue or a sycl::kernel_bundle.

However, even ignoring the above issues, the oneapi virtual memory extension does not take a queue argument, but its implementation requires setting a CUcontext. This means we would have to add a failsafe CUcontext retention call, as we have already had to do for this single device query case: https://github.com/oneapi-src/unified-runtime/pull/2077/files#diff-641b75ae8137280ac68523353cbb6eb8059f8581b35261d7a96d179a478229bcR810. We would then face a similar possibility to that described above, whereby users use the virtual memory extension for a given device while no queue for that same device is in scope, leading to CUcontext initialization/destruction costs.

Summary

So there are 3 main reasons I see to prefer instantiation at context scope rather than queue/kernel_bundle scope:
A final general reason is that it would require more refactoring of the ur implementation compared to the suggested fix, beyond what I have already outlined would be required for reasons 1-3.

In practice this doesn't change how users interact with sycl outside of MPI (or outside of cases where they want to prevent devices being used for some other reason). Single-process users can continue to use the default sycl::context and ignore sycl::context entirely. If they want to use MPI with only specific devices visible, then they just need to manually create the context as described in codeplaysoftware/SYCL-samples#33.

Semantics wrt CUDA Runtime

Essentially we swap the functionality of
This also all fits in with the description of
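To make the timebase constraint above concrete, here is a minimal sketch (not adapter code; `nsSinceBase` and its setup are hypothetical) of how a profiling timestamp has to be derived with the CUDA driver API: elapsed time only exists between two CUevents via `cuEventElapsedTime`, and both events must have been created in the same CUcontext, so any base/timebase event is necessarily tied to some CUcontext rather than to the backend as a whole.

```cpp
#include <cuda.h>
#include <cstdint>

// Hypothetical helper: nanoseconds elapsed since a per-context base event.
// Base must have been created and recorded in the currently active CUcontext.
uint64_t nsSinceBase(CUevent Base, CUstream Stream) {
  CUevent Now = nullptr;
  cuEventCreate(&Now, CU_EVENT_DEFAULT);
  cuEventRecord(Now, Stream);
  cuEventSynchronize(Now);
  float Ms = 0.0f;
  // Only valid if Base and Now belong to the same CUcontext.
  cuEventElapsedTime(&Ms, Base, Now);
  cuEventDestroy(Now);
  return static_cast<uint64_t>(Ms * 1.0e6);
}
```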
Taking into account that CUevents are no longer valid if the CUcontext that created them is destroyed. - Fix format. - Instantiate dummy vars with nullptr. - Pass EvBase by ref. Signed-off-by: JackAKirk <[email protected]>
Force-pushed from 332b461 to 351edba
I think that the device_num.cpp failure must be unrelated. I see it happening sometimes in other PRs. See:
for (auto &Dev : Devices) {
  urDeviceRetain(Dev);
  Dev->retainNativeContext();
I would personally prefer to see all of this logic for contexts happen in `ur_queue_handle_t_`s. This avoids giving `sycl::context`s extra semantics for the CUDA backend. Within `urQueueCreate` you could call something like `ur_device_handle_t_::init_device()`, which would retain the primary ctx and then set the base event, which would then be cached in the device, so if another queue is created for the same device, it doesn't need to do the same base event getting, info querying, etc.
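A rough sketch of what that suggestion could look like (hypothetical member names, not the actual adapter code; error handling omitted): lazily retain the primary context and record a base event the first time a queue is created for a device, and cache both on the device handle so later queues reuse them.

```cpp
#include <cuda.h>
#include <mutex>

struct ur_device_handle_t_ {
  CUdevice CuDevice = 0;
  CUcontext CuContext = nullptr; // primary context, retained lazily
  CUevent EvBase = nullptr;      // base timestamp event, created lazily
  std::once_flag InitFlag;

  // Hypothetical init_device(): called from urQueueCreate before the first
  // queue for this device is built; subsequent calls are no-ops.
  void init_device() {
    std::call_once(InitFlag, [&] {
      cuDevicePrimaryCtxRetain(&CuContext, CuDevice);
      cuCtxPushCurrent(CuContext);
      cuEventCreate(&EvBase, CU_EVENT_DEFAULT);
      cuEventRecord(EvBase, /*hStream=*/nullptr); // default stream
      CUcontext Popped = nullptr;
      cuCtxPopCurrent(&Popped);
    });
  }
};
```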
Let's see what @npmiller thinks.
Doing it in the queue would be nicer; it means we wouldn't need the user code changes dealing with `sycl::context`, since the queue already works with just one device. Not 100% sure if it would work well with the event timing though.
OK, I'll close this PR. I've described how I think it would work with event timings, and any other edge-case limitations when using the queue/kernel_bundle/etc. to instantiate the CUcontext, here: #2077 (comment)

A third alternative is creating the CUcontext on demand, allowing the number of CUcontext references to grow without bound, and then relying on the dpc++ runtime to reset the CUcontext at final teardown. I think this has similar edge-case limitations, in addition to the question of whether CUcontext references can grow without bound without consequences.
detail::ur::assertion(cuMemGetInfo(&FreeMemory, &TotalMemory) ==
                          CUDA_SUCCESS,
                      "failed cuMemGetInfo() API.");
UR_CHECK_ERROR(cuDevicePrimaryCtxRelease(hDevice->get()));
return ReturnValue(FreeMemory);
I seem to remember there were multiple properties that needed a context; are you sure this is enough?
Yeah, I'm pretty sure: sycl-ls tests all aspects supported by the test device that are exposed in dpc++. This includes examples of:

- `cuDeviceGetPCIBusId`
- `cuDeviceGetUuid_v2` or `cuDeviceGetUuid`, depending on toolkit version
- `cuDriverGetVersion`
- `cuDeviceGetAttribute`: this covers most cases
- `cuDeviceGetName`
- `cuDeviceGetPCIBusId`

`cuDeviceTotalMem` is tested in another test without creating a context explicitly or implicitly: https://github.com/intel/llvm/blob/5cd9de100f8df3692b492e22d056e88798873ceb/sycl/test-e2e/syclcompat/device/device.cpp#L167

This leaves `cuMemGetInfo` as the only one from the sycl-ls test that fails without a CUcontext set. It is the only API here that takes a value depending on a CUcontext being set: I think the API tells you how much memory is left for the CUcontext to use.

We should be able to add an adapter test to check that the context isn't active when doing just device queries.
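For reference, a minimal sketch of what such an adapter-level check could look like (assumed shape, not an existing test; error handling omitted): the free-memory query temporarily retains the primary context just for `cuMemGetInfo`, and afterwards `cuDevicePrimaryCtxGetState` should report the primary context as inactive, so plain device queries never make the device visible to MPI.

```cpp
#include <cuda.h>
#include <cassert>
#include <cstddef>

// Query free memory the way the adapter now has to: retain the primary
// context only for the duration of the cuMemGetInfo call.
size_t queryFreeMemory(CUdevice Dev) {
  CUcontext Ctx = nullptr;
  cuDevicePrimaryCtxRetain(&Ctx, Dev);
  cuCtxPushCurrent(Ctx); // cuMemGetInfo needs a current context
  size_t FreeMem = 0, TotalMem = 0;
  cuMemGetInfo(&FreeMem, &TotalMem);
  cuCtxPopCurrent(&Ctx);
  cuDevicePrimaryCtxRelease(Dev);
  return FreeMem;
}

// The test idea: after device queries, the primary context must not be
// left active (i.e. instantiated) on this device.
void checkPrimaryContextInactive(CUdevice Dev) {
  unsigned int Flags = 0;
  int Active = 0;
  cuDevicePrimaryCtxGetState(Dev, &Flags, &Active);
  assert(Active == 0 && "device queries must not leave the primary context active");
}
```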
Fixes intel/llvm#15251
provided that MPI/SYCL codes are updated as in codeplaysoftware/SYCL-samples#33.
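For context, a minimal sketch of the usage pattern that update describes (assumptions: one MPI rank per visible GPU, device chosen by rank; this is not a copy of the sample code): each rank creates a sycl::context containing only its own device, so only that device's primary CUcontext is instantiated in the process and CUDA-aware MPI only ever sees that one device.

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int Rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &Rank);

  // Assumes at least one GPU is visible; pick one device per rank.
  auto Gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  sycl::device Dev = Gpus[Rank % Gpus.size()];

  // Context containing only this rank's device, instead of the default
  // context that would include (and instantiate) every GPU.
  sycl::context Ctx{Dev};
  sycl::queue Q{Ctx, Dev};

  // ... CUDA-aware MPI communication using allocations from Q ...

  MPI_Finalize();
  return 0;
}
```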
I will follow up with a corresponding fix for hip that is pretty much copy-paste; I can add it to this PR, but that would require #1830 to be merged first.
Background
The origin of the above issue is that, since we no longer have a single cuda device per platform, we currently have all devices initialize their primary CUcontexts at platform instantiation, even when this is not required by the runtime.
The MPI interface can work with cuda/rocm awareness because it is assumed that the user will set only a single hip/cuda device per process prior to each MPI call. The problem with having multiple cuda devices set (i.e. their primary CUcontexts instantiated), as is the case if you e.g. use the default sycl::context with multiple devices included, is that MPI calls will operate on every instantiated device, in the case that at least one active process has such a device's primary CUcontext set.
This can lead to memory leaks as described in intel/llvm#15251
This PR fixes the issue by removing this platform-scope `CUcontext` instantiation, and promoting `sycl::context` and the ur context in the cuda backend to be responsible for instantiating and releasing the native `CUcontext`s associated with all devices included in that `sycl::context`. This works due to the universal usage of `sycl::context` and `ur_context_handle_t_` in SYCL/UR APIs. I think that this is the most natural solution for SYCL compatibility, and it leads to only negligible one-time overhead in platform instantiation, but it has the important benefit of supporting application-critical technologies, MPI and *CCL, with SYCL.

Without considering MPI compatibility, `sycl::context` has been a free parameter (so to speak) that has not been functionally used in the cuda/hip backends; it turns out that it is fortunate that it does exist, since otherwise there would not have been a straightforward way to incorporate MPI with SYCL, as far as I can see.
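A rough sketch of the ownership model this describes (hypothetical member names, mirroring the `retainNativeContext()` call shown in the diff above; not the actual adapter code): the UR context retains each contained device's primary CUcontext when it is created and releases it when it is destroyed, so devices only stay instantiated for as long as some sycl::context includes them.

```cpp
#include <cuda.h>
#include <utility>
#include <vector>

struct ur_device_handle_t_ {
  CUdevice CuDevice = 0;
  void retainNativeContext() {
    CUcontext Ctx = nullptr;
    cuDevicePrimaryCtxRetain(&Ctx, CuDevice); // instantiate primary context
  }
  void releaseNativeContext() { cuDevicePrimaryCtxRelease(CuDevice); }
};

struct ur_context_handle_t_ {
  std::vector<ur_device_handle_t_ *> Devices;

  explicit ur_context_handle_t_(std::vector<ur_device_handle_t_ *> Devs)
      : Devices(std::move(Devs)) {
    for (auto *Dev : Devices)
      Dev->retainNativeContext(); // device becomes visible to CUDA-aware MPI
  }

  ~ur_context_handle_t_() {
    for (auto *Dev : Devices)
      Dev->releaseNativeContext(); // device disappears from MPI's view again
  }
};
```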