.. meta::
   :description: The tutorial of cooperative groups in HIP
   :keywords: AMD, ROCm, HIP, cooperative groups, tutorial

*******************************************************************************
Tutorial: Cooperative Groups
*******************************************************************************

This tutorial shows you the basic concepts of cooperative groups in the HIP
programming model, the most essential tooling around it, and briefly rehashes
some commonalities of heterogeneous APIs in general. Mild familiarity with the
C/C++ compilation model and the language is assumed throughout this article.

Prerequisites
=============

To follow this tutorial, you need properly installed drivers and a HIP
compiler toolchain to compile your code. Because HIP provided by ROCm supports
compiling and running on Linux and Windows with AMD and NVIDIA GPUs alike, the
install instructions for every combination are beyond the scope of this
tutorial. Please refer to :doc:`/install/install` on how to install the HIP
development packages.

Simple HIP Code
===============

.. TODO: Add link here to SAXPY and subsections

To get familiar with heterogeneous programming, check out the SAXPY tutorial
and its first HIP code subsection. Compilation is also well described in that
tutorial.
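
For orientation, a SAXPY kernel looks roughly like the following minimal
sketch; this is an illustration only, not the code of the referenced tutorial:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Minimal SAXPY kernel: y = a * x + y, one element per thread.
   __global__ void saxpy_kernel(const float        a,
                                const float*       x,
                                float*             y,
                                const unsigned int size)
   {
       const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
       // Guard against threads beyond the end of the vectors.
       if(i < size)
       {
           y[i] = a * x[i] + y[i];
       }
   }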

Tiled partition
===============

The tiled partition can be used to calculate the sum of sequences of
``partition_size`` length, as well as the sum of whole blocks of ``BlockSize``
length. The host-side reference implementation is the following:

.. code-block:: cpp

   // Host side function to perform the same reductions as executed on the GPU
   std::vector<unsigned int> ref_reduced(const unsigned int        partition_size,
                                         std::vector<unsigned int> input)
   {
       const unsigned int input_size  = input.size();
       const unsigned int result_size = input_size / partition_size;
       std::vector<unsigned int> result(result_size);
       for(unsigned int i = 0; i < result_size; i++)
       {
           unsigned int partition_result = 0;
           for(unsigned int j = 0; j < partition_size; j++)
           {
               partition_result += input[partition_size * i + j];
           }
           result[i] = partition_result;
       }
       return result;
   }
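
The reference results for both granularities can be obtained by calling
``ref_reduced`` with the two different partition sizes. A possible usage
sketch follows; the variable names match the later fragments, but the exact
calls are an assumption, not part of the original listing:

.. code-block:: cpp

   // Expected sum of each partition_size-length sequence.
   const std::vector<unsigned int> expected_partition_reduced
       = ref_reduced(partition_size, h_vector);
   // Expected sum of each threads_per_block-length block.
   const std::vector<unsigned int> expected_block_reduced
       = ref_reduced(threads_per_block, h_vector);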

Device side code
----------------

.. TODO: Add link here to reduction tutorial and subsections

To calculate the sum of the sets of numbers, the tutorial uses a shared memory
based reduction on the device side. Unlike the reduction tutorial, warp-level
intrinsics are not covered here. The ``x`` input parameter is a pointer to
shared memory, which has to be synchronized after every value change. The
``thread_group`` input parameter can also be a ``thread_block_tile`` or a
``thread_block``, because ``thread_group`` is the parent class of these types.
``val`` holds the numbers whose sum is calculated. The function returns the
final result of the reduction on the thread with ID 0 of the ``thread_group``;
on every other thread the returned value is 0.

.. code-block:: cpp

   /// \brief Summation of `unsigned int val`'s in `thread_group g` using shared memory `x`
   __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val)
   {
       // Rank of this thread in the group
       const unsigned int group_thread_id = g.thread_rank();
       // We start with half the group size as active threads
       // Every iteration the number of active threads halves, until we processed all values
       for(unsigned int i = g.size() / 2; i > 0; i /= 2)
       {
           // Store value for this thread in a shared, temporary array
           x[group_thread_id] = val;
           // Synchronize all threads in the group
           g.sync();
           // If our thread is still active, sum with its counterpart in the other half
           if(group_thread_id < i)
           {
               val += x[group_thread_id + i];
           }
           // Synchronize all threads in the group
           g.sync();
       }
       // Only the first thread returns a valid value
       if(g.thread_rank() == 0)
           return val;
       else
           return 0;
   }

The ``reduce_sum`` device function is reused to calculate both the block and
the custom partition level sums of the input numbers. The kernel has three
different sections:

1. Initialization of the reduction function variables.
2. The reduction of the thread block and storing the results in global memory.
3. The reduction of the custom partition and storing the results in global memory.

1. Initialization of the reduction function variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the shared memory is declared, the
``thread_block_group`` and ``custom_partition`` are defined, and the input
variables are loaded from global memory.

.. code-block:: cpp

   // threadBlockGroup consists of all threads in the block
   thread_block thread_block_group = this_thread_block();

   // Workspace array in shared memory required for reduction
   __shared__ unsigned int workspace[2048];

   unsigned int output;

   // Input to reduce
   const unsigned int input = d_vector[thread_block_group.thread_rank()];

   // ...

   // Every custom_partition group consists of 16 threads
   thread_block_tile<PartitionSize> custom_partition
       = tiled_partition<PartitionSize>(thread_block_group);

2. The reduction of the thread block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``thread_block_group``
level, then the result is stored in global memory.

.. code-block:: cpp

   // Perform reduction
   output = reduce_sum(thread_block_group, workspace, input);

   // Only the first thread returns a valid value
   if(thread_block_group.thread_rank() == 0)
   {
       d_block_reduced_vector[0] = output;
   }

3. The reduction of the custom partition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``custom_partition`` level,
then the results are stored in global memory.

.. code-block:: cpp

   // Perform reduction
   output = reduce_sum(custom_partition, &workspace[group_offset], input);

   // Only the first thread in each partition returns a valid value
   if(custom_partition.thread_rank() == 0)
   {
       const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize;
       d_partition_reduced_vector[partition_id] = output;
   }
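
Putting the three sections together, the whole kernel might look like the
following sketch. The kernel signature and the ``group_offset`` computation,
which gives every partition its own slice of the shared ``workspace``, are
assumptions inferred from the fragments above:

.. code-block:: cpp

   // Assumes `using namespace cooperative_groups;` and the reduce_sum
   // function shown earlier.
   template<unsigned int PartitionSize>
   __global__ void vector_reduce_kernel(const unsigned int* d_vector,
                                        unsigned int*       d_block_reduced_vector,
                                        unsigned int*       d_partition_reduced_vector)
   {
       // 1. Initialization of the reduction function variables.
       thread_block thread_block_group = this_thread_block();
       __shared__ unsigned int workspace[2048];
       unsigned int output;
       const unsigned int input = d_vector[thread_block_group.thread_rank()];
       thread_block_tile<PartitionSize> custom_partition
           = tiled_partition<PartitionSize>(thread_block_group);

       // 2. Block-level reduction: only thread 0 stores a valid result.
       output = reduce_sum(thread_block_group, workspace, input);
       if(thread_block_group.thread_rank() == 0)
       {
           d_block_reduced_vector[0] = output;
       }

       // Assumed offset so that each partition works on its own slice of the
       // shared workspace instead of overwriting the other partitions.
       const unsigned int group_offset
           = thread_block_group.thread_rank() - custom_partition.thread_rank();

       // 3. Partition-level reduction: thread 0 of each partition stores a result.
       output = reduce_sum(custom_partition, &workspace[group_offset], input);
       if(custom_partition.thread_rank() == 0)
       {
           const unsigned int partition_id
               = thread_block_group.thread_rank() / PartitionSize;
           d_partition_reduced_vector[partition_id] = output;
       }
   }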

Host side code
--------------

On the host side, you have to take the following steps:

1. Check the cooperative group support on AMD GPUs.
2. Initialize the cooperative group configuration.
3. Allocate and copy the input to global memory.
4. Launch the cooperative kernel.
5. Save the results from global memory.
6. Free the global memory.

1. Check the cooperative group support on AMD GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not every GPU supports cooperative groups; to make sure, check for support
with the following code:

.. code-block:: cpp

   #ifdef __HIP_PLATFORM_AMD__
       int device               = 0;
       int supports_coop_launch = 0;
       // Check support
       // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices
       HIP_CHECK(hipGetDevice(&device));
       HIP_CHECK(
           hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device));
       if(!supports_coop_launch)
       {
           std::cout << "Skipping, device " << device << " does not support cooperative groups"
                     << std::endl;
           return 0;
       }
   #endif

2. Initialize the cooperative group configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

   // Number of blocks to launch.
   constexpr unsigned int num_blocks = 1;

   // Number of threads in each kernel block.
   constexpr unsigned int threads_per_block = 64;

   // Total element count of the input vector.
   constexpr unsigned int size = num_blocks * threads_per_block;

   // Total elements count of a tiled_partition.
   constexpr unsigned int partition_size = 16;

   // Total size (in bytes) of the input vector.
   constexpr size_t size_bytes = sizeof(unsigned int) * size;

   static_assert(threads_per_block % partition_size == 0,
                 "threads_per_block must be a multiple of partition_size");

3. Allocate and copy the input to global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

   // Allocate device memory for the input and output vectors.
   unsigned int* d_vector{};
   unsigned int* d_block_reduced{};
   unsigned int* d_partition_reduced{};
   HIP_CHECK(hipMalloc(&d_vector, size_bytes));
   HIP_CHECK(hipMalloc(&d_block_reduced, sizeof(unsigned int) * h_block_reduced.size()));
   HIP_CHECK(hipMalloc(&d_partition_reduced, sizeof(unsigned int) * h_partition_reduced.size()));

   // Transfer the input vector to the device memory.
   HIP_CHECK(hipMemcpy(d_vector, h_vector.data(), size_bytes, hipMemcpyHostToDevice));

4. Launch the kernel
~~~~~~~~~~~~~~~~~~~~

Use ``hipLaunchCooperativeKernel`` to be able to use cooperative groups:

.. code-block:: cpp

   void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced};
   // Launching kernel from host.
   HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel<partition_size>,
                                        dim3(num_blocks),
                                        dim3(threads_per_block),
                                        params,
                                        0,
                                        hipStreamDefault));

   // Check if the kernel launch was successful.
   HIP_CHECK(hipGetLastError());

5. Save the results from global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can save the device results to host-side ``std::vector`` objects with
``hipMemcpy()``:

.. code-block:: cpp

   // Transfer the result back to the host.
   HIP_CHECK(hipMemcpy(h_block_reduced.data(),
                       d_block_reduced,
                       sizeof(unsigned int) * h_block_reduced.size(),
                       hipMemcpyDeviceToHost));
   HIP_CHECK(hipMemcpy(h_partition_reduced.data(),
                       d_partition_reduced,
                       sizeof(unsigned int) * h_partition_reduced.size(),
                       hipMemcpyDeviceToHost));
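
At this point the device results can be compared with the reference values
computed by ``ref_reduced`` earlier. A possible check, shown here only as an
illustration (the ``expected_partition_reduced`` name comes from the usage
sketch above and is an assumption):

.. code-block:: cpp

   // Compare the partition-level device results with the host reference.
   unsigned int errors = 0;
   for(std::size_t i = 0; i < h_partition_reduced.size(); i++)
   {
       if(h_partition_reduced[i] != expected_partition_reduced[i])
       {
           errors++;
       }
   }
   std::cout << (errors == 0 ? "Validation passed." : "Validation failed.") << std::endl;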

6. Free the global memory
~~~~~~~~~~~~~~~~~~~~~~~~~

Always clean up the global memory at the end of the application:
.. code-block:: cpp | ||
// Free the resources on the device. | ||
HIP_CHECK(hipFree(d_vector)); | ||
HIP_CHECK(hipFree(d_block_reduced)); | ||
HIP_CHECK(hipFree(d_partition_reduced)); | ||

Conclusion
----------

With cooperative groups, you can easily create custom partitions to build
custom tiles for your solutions. The full code listing can be found at the
`cooperative groups ROCm example <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/cooperative_groups>`_.