-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CUDAX] Add modernized simpleP2P sample #2696
Conversation
05d5cb1
to
494459e
Compare
494459e
to
5174101
Compare
🟨 CI finished in 55m 42s: Pass: 99%/400 | Total: 2d 07h | Avg: 8m 22s | Max: 41m 59s | Hits: 87%/25829
|
Project | |
---|---|
+/- | CCCL Infrastructure |
libcu++ | |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
+/- | CCCL Infrastructure |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 400)
# | Runner |
---|---|
326 | linux-amd64-cpu16 |
31 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sample belongs in cudax/examples
, not examples/cudax
. see #2697. cc @alliepiper
// Disable peer access | ||
printf("Disabling peer access...\n"); | ||
dev0_resource.disable_peer_access(peers[1]); | ||
dev1_resource.disable_peer_access(peers[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary? what happens if it's omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not necessary, it was in the original sample so I added it to just show how to do it. But we can remove it if its confusing why its there
Co-authored-by: Eric Niebler <[email protected]>
🟨 CI finished in 1h 38m: Pass: 99%/400 | Total: 1d 23h | Avg: 7m 10s | Max: 44m 19s | Hits: 90%/25829
|
Project | |
---|---|
+/- | CCCL Infrastructure |
libcu++ | |
CUB | |
Thrust | |
CUDA Experimental | |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
+/- | CCCL Infrastructure |
+/- | libcu++ |
+/- | CUB |
+/- | Thrust |
+/- | CUDA Experimental |
+/- | python |
+/- | CCCL C Parallel Library |
+/- | Catch2Helper |
🏃 Runner counts (total jobs: 400)
# | Runner |
---|---|
326 | linux-amd64-cpu16 |
31 | linux-amd64-gpu-v100-latest-1 |
28 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
std::vector<cudax::device_ref> find_peers_group() | ||
{ | ||
// Check possibility for peer access | ||
printf("\nChecking GPU(s) for support of peer to peer memory access...\n"); | ||
|
||
std::vector<cudax::device_ref> peers; | ||
for (auto& dev_i : cudax::devices) | ||
{ | ||
for (auto& dev_j : cudax::devices) | ||
{ | ||
if (dev_i != dev_j) | ||
{ | ||
bool can_access_peer = dev_i.is_peer_accessible_from(dev_j); | ||
// Save all peers of a first device found with a peer | ||
if (can_access_peer && peers.size() == 0) | ||
{ | ||
peers = dev_i.get_peers(); | ||
peers.insert(peers.begin(), dev_i); | ||
} | ||
printf("> Peer access from %s (GPU%d) -> %s (GPU%d) : %s\n", | ||
dev_i.get_name().c_str(), | ||
dev_i.get(), | ||
dev_j.get_name().c_str(), | ||
dev_j.get(), | ||
can_access_peer ? "Yes" : "No"); | ||
} | ||
} | ||
} | ||
|
||
return peers; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this could reuse device_ref::get_peers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is using get_peers
in line 75 to get the actual peers.
The original sample was printing out for every pair of devices if peer access is available, so I thought I should do it too. So I added the printing loop to show how you can access individual pairs, but also used get_peers
as a way to construct the resulting vector to show that API as well.
But if this is too obscure, I could also get rid of the printing loop or separate it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
definitely separate it out. i'll also note that the printing loops are doing a cross-product, so this could be made linear with a cartesian_product
view. but i don't think we have such a thing yet.
🟨 CI finished in 51m 25s: Pass: 87%/54 | Total: 4h 41m | Avg: 5m 12s | Max: 18m 48s
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
🟨 CI finished in 48m 26s: Pass: 96%/54 | Total: 4h 23m | Avg: 4m 52s | Max: 19m 52s
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
🟩 CI finished in 26m 31s: Pass: 100%/54 | Total: 4h 21m | Avg: 4m 50s | Max: 20m 57s | Hits: 89%/240
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
🟩 CI finished in 23m 05s: Pass: 100%/54 | Total: 4h 33m | Avg: 5m 04s | Max: 18m 03s | Hits: 80%/240
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
* Add cuda::minimum and cuda::maximum
* Workaround non-copyable iterators * Use a named constant for SMEM * Cast to raw reference 2 * Fix passing non-copy-assignable iterators to transform_kernel via kernel_arg
* Add transform benchmark requiring a stable address * Make thrust::transform use cub::DeviceTransform * Introduces address stability detection and opt-in in libcu++ * Mark lambdas in Thrust BabelStream benchmark address oblivious * Optimize prefetch cub::DeviceTransform for small problems Fixes: NVIDIA#2263
… available (NVIDIA#2712) * Ensure that we only use the inline variable trait when it is actually available * Use the right define for internal traits
…VIDIA#2710) * Rename the type * Update tests * Rename async memory pool * Rename the tests * Change name in the docs * Generalise the memory_pool_properties name * Fix docs --------- Co-authored-by: Michael Schellenberger Costa <[email protected]>
🟩 CI finished in 24m 57s: Pass: 100%/54 | Total: 4h 36m | Avg: 5m 07s | Max: 18m 52s | Hits: 78%/240
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
CUB | |
Thrust | |
+/- | CUDA Experimental |
python | |
CCCL C Parallel Library | |
Catch2Helper |
🏃 Runner counts (total jobs: 54)
# | Runner |
---|---|
43 | linux-amd64-cpu16 |
5 | linux-amd64-gpu-v100-latest-1 |
4 | linux-arm64-cpu16 |
2 | windows-amd64-cpu16 |
* copy pasted sample * First draft * Kernel functor and some other things * Clean up and break up long main function * Needs launch fix * Switch to copy_bytes and cleanups * Missing include * Add exception print and waive value * Adjust copy count * Add license and switch benchmark streams * Remove a function left as a mistake * Update copyright date Co-authored-by: Eric Niebler <[email protected]> * Setup cudax examples. (NVIDIA#2697) * Move the sample to new location and fix warning * build fixes and 0 return code on waive * Some new MSVC errors * explicit cast * Rename enable/disable peer access and separate the sample loop * Add `cuda::minimum` and `cuda::maximum` (NVIDIA#2681) * Add cuda::minimum and cuda::maximum * Various fixes to cub::DeviceTransform (NVIDIA#2709) * Workaround non-copyable iterators * Use a named constant for SMEM * Cast to raw reference 2 * Fix passing non-copy-assignable iterators to transform_kernel via kernel_arg * Make `thrust::transform` use `cub::DeviceTransform` (NVIDIA#2389) * Add transform benchmark requiring a stable address * Make thrust::transform use cub::DeviceTransform * Introduces address stability detection and opt-in in libcu++ * Mark lambdas in Thrust BabelStream benchmark address oblivious * Optimize prefetch cub::DeviceTransform for small problems Fixes: NVIDIA#2263 * Ensure that we only use the inline variable trait when it is actually available (NVIDIA#2712) * Ensure that we only use the inline variable trait when it is actually available * Use the right define for internal traits * [CUDAX] Rename memory resource and memory pool from async to device (NVIDIA#2710) * Rename the type * Update tests * Rename async memory pool * Rename the tests * Change name in the docs * Generalise the memory_pool_properties name * Fix docs --------- Co-authored-by: Michael Schellenberger Costa <[email protected]> * Update memory resource name --------- Co-authored-by: Eric Niebler <[email protected]> Co-authored-by: Allison Piper <[email protected]> Co-authored-by: Jacob Faibussowitsch <[email protected]> Co-authored-by: Bernhard Manfred Gruber <[email protected]> Co-authored-by: Michael Schellenberger Costa <[email protected]>
This is a sample with a very simple P2P memory transfer benchmark and showing P2P access from the kernel. This version is modernized to use P2P access management using memory resources and new
copy_bytes
to do the copies.There is a weird performance bug that I need to investigate, for now a workaround is to use stream on device 1 in the benchmark code.
Original sample link: https://github.com/NVIDIA/cuda-samples/blob/master/Samples/0_Introduction/simpleP2P/simpleP2P.cu