diff --git a/.wordlist.txt b/.wordlist.txt index 1819229c25..5063349d09 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -80,6 +80,7 @@ ROCm's rocTX RTC RTTI +SAXPY scalarizing sceneries SIMT diff --git a/docs/how-to/cooperative_groups.rst b/docs/how-to/cooperative_groups.rst index feda2a6149..038b68a782 100644 --- a/docs/how-to/cooperative_groups.rst +++ b/docs/how-to/cooperative_groups.rst @@ -125,7 +125,8 @@ has to be used. Group Types ============= -There are different group types based on different levels of grouping. +There are different group types based on different level of synchronization and +data sharing among threads. Thread Block Group -------------------- diff --git a/docs/index.md b/docs/index.md index 094f29758c..c4dcf69fb0 100644 --- a/docs/index.md +++ b/docs/index.md @@ -58,6 +58,7 @@ The CUDA enabled NVIDIA GPUs are supported by HIP. For more information, see [GP * [HIP examples](https://github.com/ROCm/HIP-Examples) * [HIP test samples](https://github.com/ROCm/hip-tests/tree/develop/samples) +* [Cooperative groups tutorial](./tutorials/cooperative_groups_tutorial) ::: diff --git a/docs/reference/cooperative_groups_reference.rst b/docs/reference/cooperative_groups_reference.rst index 0480ec488e..f0068a7ddf 100644 --- a/docs/reference/cooperative_groups_reference.rst +++ b/docs/reference/cooperative_groups_reference.rst @@ -5,10 +5,10 @@ :keywords: AMD, ROCm, HIP, cooperative groups ******************************************************************************* -Cooperative Groups +Cooperative Groups API ******************************************************************************* -The following functions are located in the https://github.com/ROCm/clr repository. +The following functions and classes are located in the https://github.com/ROCm/clr repository. .. doxygenfunction:: cooperative_groups::this_multi_grid diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index ea1454250b..6045e505c3 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -51,6 +51,7 @@ subtrees: title: HIP examples - url: https://github.com/ROCm/hip-tests/tree/develop/samples title: HIP test samples + - file: tutorials/cooperative_groups_tutorial - caption: About entries: diff --git a/docs/tutorials/cooperative_groups_tutorial.rst b/docs/tutorials/cooperative_groups_tutorial.rst new file mode 100644 index 0000000000..3448d15af5 --- /dev/null +++ b/docs/tutorials/cooperative_groups_tutorial.rst @@ -0,0 +1,312 @@ +.. meta:: + :description: The tutorial of cooperative groups in HIP + :keywords: AMD, ROCm, HIP, cooperative groups, tutorial + +******************************************************************************* +Tutorial: Cooperative Groups +******************************************************************************* + +This tutorial will show you the basic concepts of the cooperative groups in HIP +programming model, the most essential tooling around it and briefly rehash some +commonalities of heterogenous APIs in general. Mild familiarity with the C/C++ +compilation model and the language is assumed throughout this article. + +Prerequisites +============= + +In order to follow this tutorial you will need properly installed drivers and a +HIP compiler toolchain to compile your code. Because HIP provided by ROCm +supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs +alike, the combination of install instructions are more then worth covering as +part of this tutorial. Please refer to :doc:`/install/install` on how to +install HIP development packages. + +Simple HIP Code +=============== + +.. TODO: Add link here to SAXPY and subsections + +To get familiar with the Heterogenous programming, you should check the SAXPY +tutorial and the first HIP code subsection. The compiling is also well described +in that tutorial. + +Tiled partition +=============== + +The tiled partition can be used to calculate the sum of ``partition_size`` +length sequences and sum of ``result_size``/ ``BlockSize`` length sequences. The +host side reference implementation is the following: + +.. code-block:: cpp + + // Host side function to perform the same reductions as executed on the GPU + std::vector ref_reduced(const unsigned int partition_size, + std::vector input) + { + const unsigned int input_size = input.size(); + const unsigned int result_size = input_size / partition_size; + std::vector result(result_size); + + for(unsigned int i = 0; i < result_size; i++) + { + unsigned int partition_result = 0; + for(unsigned int j = 0; j < partition_size; j++) + { + partition_result += input[partition_size * i + j]; + } + result[i] = partition_result; + } + + return result; + } + +Device side code +---------------- + +.. TODO: Add link here to reduction tutorial and subsections + +To be able to calculate the sum of the sets of numbers, the tutorial using the +shared memory based reduction at device side. In this tutorial the warp level +intrinsics usage not covered like in the reduction tutorial. The ``x`` input +variable is a shared pointer, which needs to be synchronized after every value +changes. The ``thread_group`` input parameter can be ``thread_block_tile`` or +``thread_block`` also, because the ``thread_group`` is the parent class of these +types. The ``val`` is the numbers, which we wants to calculate the sum of. The +returned results of this function return the final results of the reduction on +thread ID 0 of the ``thread_group`` and at every other threads the function +results are 0. + +.. code-block:: cpp + + /// \brief Summation of `unsigned int val`'s in `thread_group g` using shared memory `x` + __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val) + { + // Rank of this thread in the group + const unsigned int group_thread_id = g.thread_rank(); + + // We start with half the group size as active threads + // Every iteration the number of active threads halves, until we processed all values + for(unsigned int i = g.size() / 2; i > 0; i /= 2) + { + // Store value for this thread in a shared, temporary array + x[group_thread_id] = val; + + // Synchronize all threads in the group + g.sync(); + + // If our thread is still active, sum with its counterpart in the other half + if(group_thread_id < i) + { + val += x[group_thread_id + i]; + } + + // Synchronize all threads in the group + g.sync(); + } + + // Only the first thread returns a valid value + if(g.thread_rank() == 0) + return val; + else + return 0; + } + +The ``reduce_sum`` device function reused to calculate the block and custom +partition sum of the input numbers. The kernel has three different sections: +1. The reduction function variables initialization. +2. The reduction of thread block and store the results in global memory. +3. The reduction of custom partition and store the results in global memory. + + +1. The reduction function variables initialization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this code section the shared memory is declared, the thread_block_group and +custom_partition are defined and the input variable are loaded from global +memory. + +.. code-block:: cpp + + // threadBlockGroup consists of all threads in the block + thread_block thread_block_group = this_thread_block(); + + // Workspace array in shared memory required for reduction + __shared__ unsigned int workspace[2048]; + + unsigned int output; + + // Input to reduce + const unsigned int input = d_vector[thread_block_group.thread_rank()]; + + // ... + + // Every custom_partition group consists of 16 threads + thread_block_tile custom_partition + = tiled_partition(thread_block_group); + + +2. The reduction of thread block +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this code section the sum is calculate on thread_block_group level, then the +results are stored in global memory. + +.. code-block:: cpp + + // Perform reduction + output = reduce_sum(thread_block_group, workspace, input); + + // Only the first thread returns a valid value + if(thread_block_group.thread_rank() == 0) + { + d_block_reduced_vector[0] = output; + } + +3. The reduction of custom partition +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In this code section the sum is calculate on thread_block_group level, then the +results are stored in global memory. + +.. code-block:: cpp + + // Perform reduction + output = reduce_sum(custom_partition, &workspace[group_offset], input); + + // Only the first thread in each partition returns a valid value + if(custom_partition.thread_rank() == 0) + { + const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize; + d_partition_reduced_vector[partition_id] = output; + } + +Host side code +-------------- + +At the host side, you have to do the following steps: +1. Check the cooperative group support on AMD GPUs. +2. Initialize the cooperative group configuration. +3. Allocate and copy input to global memory. +4. Launch the cooperative kernel. +5. Save the results from global memory. +6. Free the global memory. + +1. Check the cooperative group support on AMD GPUs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Not all the GPU supports cooparative groups, to make sure you should check with +the following code: + +.. code-block:: cpp + + #ifdef __HIP_PLATFORM_AMD__ + int device = 0; + int supports_coop_launch = 0; + // Check support + // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices + HIP_CHECK(hipGetDevice(&device)); + HIP_CHECK( + hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device)); + if(!supports_coop_launch) + { + std::cout << "Skipping, device " << device << " does not support cooperative groups" + << std::endl; + return 0; + } + #endif + +2. Initialize the cooperative group configuration +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Number of blocks to launch. + constexpr unsigned int num_blocks = 1; + + // Number of threads in each kernel block. + constexpr unsigned int threads_per_block = 64; + + // Total element count of the input vector. + constexpr unsigned int size = num_blocks * threads_per_block; + + // Total elements count of a tiled_partition. + constexpr unsigned int partition_size = 16; + + // Total size (in bytes) of the input vector. + constexpr size_t size_bytes = sizeof(unsigned int) * size; + + static_assert(threads_per_block % partition_size == 0, + "threads_per_block must be a multiple of partition_size"); + +3. Allocate and copy input to global memory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: cpp + + // Allocate device memory for the input and output matrices. + unsigned int* d_vector{}; + unsigned int* d_block_reduced{}; + unsigned int* d_partition_reduced{}; + HIP_CHECK(hipMalloc(&d_vector, size_bytes)); + HIP_CHECK(hipMalloc(&d_block_reduced, sizeof(unsigned int) * h_block_reduced.size())); + HIP_CHECK(hipMalloc(&d_partition_reduced, sizeof(unsigned int) * h_partition_reduced.size())); + + // Transfer the input vector to the device memory. + HIP_CHECK(hipMemcpy(d_vector, h_vector.data(), size_bytes, hipMemcpyHostToDevice)); + +4. Launch the kernel +~~~~~~~~~~~~~~~~~~~~ + +Use the ``hipLaunchCooperativeKernel`` to be able to use cooperative groups. + +.. code-block:: cpp + + void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced}; + // Launching kernel from host. + HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel, + dim3(num_blocks), + dim3(threads_per_block), + params, + 0, + hipStreamDefault));\ + + // Check if the kernel launch was successful. + HIP_CHECK(hipGetLastError()); + +5. Save the results from global memory +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +You can save the device result to host side ``std::vector`` with ``hipMemcpy()``. + +.. code-block:: cpp + + // Transfer the result back to the host. + HIP_CHECK(hipMemcpy(h_block_reduced.data(), + d_block_reduced, + sizeof(unsigned int) * h_block_reduced.size(), + hipMemcpyDeviceToHost)); + + HIP_CHECK(hipMemcpy(h_partition_reduced.data(), + d_partition_reduced, + sizeof(unsigned int) * h_partition_reduced.size(), + hipMemcpyDeviceToHost)); + +6. Free the global memory +~~~~~~~~~~~~~~~~~~~~~~~~~ + +Always clean up the global memory at the end of the application. + +.. code-block:: cpp + + // Free the resources on the device. + HIP_CHECK(hipFree(d_vector)); + HIP_CHECK(hipFree(d_block_reduced)); + HIP_CHECK(hipFree(d_partition_reduced)); + + +Conclusion +---------- + +With cooperative groups you can use custom partition easily to be able to create +custom tiles for custom solutions. The code in full length can be find at +`cooperative groups ROCm example. `_