
Commit

Add cooperative groups tutorial
neon60 committed Jun 3, 2024
1 parent af4f735 commit 8c2b350
Showing 6 changed files with 319 additions and 3 deletions.
1 change: 1 addition & 0 deletions .wordlist.txt
@@ -80,6 +80,7 @@ ROCm's
rocTX
RTC
RTTI
SAXPY
scalarizing
sceneries
SIMT
3 changes: 2 additions & 1 deletion docs/how-to/cooperative_groups.rst
@@ -125,7 +125,8 @@ has to be used.
Group Types
=============

There are different group types based on different levels of grouping.
There are different group types based on different levels of synchronization
and data sharing among threads.

Thread Block Group
--------------------
1 change: 1 addition & 0 deletions docs/index.md
@@ -58,6 +58,7 @@ The CUDA enabled NVIDIA GPUs are supported by HIP. For more information, see [GP

* [HIP examples](https://github.com/ROCm/HIP-Examples)
* [HIP test samples](https://github.com/ROCm/hip-tests/tree/develop/samples)
* [Cooperative groups tutorial](./tutorials/cooperative_groups_tutorial)

:::

4 changes: 2 additions & 2 deletions docs/reference/cooperative_groups_reference.rst
@@ -5,10 +5,10 @@
:keywords: AMD, ROCm, HIP, cooperative groups

*******************************************************************************
Cooperative Groups
Cooperative Groups API
*******************************************************************************

The following functions are located in the https://github.com/ROCm/clr repository.
The following functions and classes are located in the https://github.com/ROCm/clr repository.

.. doxygenfunction:: cooperative_groups::this_multi_grid

1 change: 1 addition & 0 deletions docs/sphinx/_toc.yml.in
@@ -51,6 +51,7 @@ subtrees:
title: HIP examples
- url: https://github.com/ROCm/hip-tests/tree/develop/samples
title: HIP test samples
- file: tutorials/cooperative_groups_tutorial

- caption: About
entries:
312 changes: 312 additions & 0 deletions docs/tutorials/cooperative_groups_tutorial.rst
@@ -0,0 +1,312 @@
.. meta::
:description: The tutorial of cooperative groups in HIP
:keywords: AMD, ROCm, HIP, cooperative groups, tutorial

*******************************************************************************
Tutorial: Cooperative Groups
*******************************************************************************

This tutorial demonstrates the basic concepts of cooperative groups in the HIP
programming model, the most essential tooling around them, and briefly reviews
some commonalities of heterogeneous APIs in general. Mild familiarity with the
C/C++ compilation model and the language is assumed throughout this article.

Prerequisites
=============

To follow this tutorial, you will need properly installed drivers and a HIP
compiler toolchain to compile your code. Because HIP provided by ROCm supports
compiling and running on Linux and Windows with AMD and NVIDIA GPUs alike, the
various combinations of install instructions are worth covering on their own.
Please refer to :doc:`/install/install` on how to install the HIP development
packages.

Simple HIP Code
===============

.. TODO: Add link here to SAXPY and subsections

To get familiar with heterogeneous programming, check the SAXPY tutorial and
its first HIP code subsection. Compilation is also described in detail in that
tutorial.

Tiled partition
===============

The tiled partition can be used to calculate the sum of ``partition_size``
length sequences, as well as the sum of ``result_size``/``BlockSize`` length
sequences. The host-side reference implementation is the following:

.. code-block:: cpp

    // Host side function to perform the same reductions as executed on the GPU
    std::vector<unsigned int> ref_reduced(const unsigned int partition_size,
                                          std::vector<unsigned int> input)
    {
        const unsigned int input_size  = input.size();
        const unsigned int result_size = input_size / partition_size;
        std::vector<unsigned int> result(result_size);
        for(unsigned int i = 0; i < result_size; i++)
        {
            unsigned int partition_result = 0;
            for(unsigned int j = 0; j < partition_size; j++)
            {
                partition_result += input[partition_size * i + j];
            }
            result[i] = partition_result;
        }
        return result;
    }
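
As a quick sanity check (this example is not part of the original sample):
with a partition size of 4, an 8-element input collapses to two partial sums.

.. code-block:: cpp

    // {0, 1, 2, 3, 4, 5, 6, 7} reduced in partitions of 4:
    // {0+1+2+3, 4+5+6+7} == {6, 22}
    const std::vector<unsigned int> sums = ref_reduced(4, {0, 1, 2, 3, 4, 5, 6, 7});
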
Device side code
----------------

.. TODO: Add link here to reduction tutorial and subsections

To calculate the sum of the sets of numbers, the tutorial uses a shared memory
based reduction on the device side. Unlike the reduction tutorial, warp-level
intrinsics are not covered here. The ``x`` input variable is a pointer to
shared memory, which has to be synchronized after every value change. The
``thread_group`` input parameter can also be a ``thread_block_tile`` or a
``thread_block``, because ``thread_group`` is the parent class of these types.
``val`` holds the numbers whose sum is calculated. The function returns the
final result of the reduction on the thread with rank 0 of the
``thread_group``, while on every other thread the function returns 0.

.. code-block:: cpp

    /// \brief Summation of `unsigned int val`'s in `thread_group g` using shared memory `x`
    __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val)
    {
        // Rank of this thread in the group
        const unsigned int group_thread_id = g.thread_rank();
        // We start with half the group size as active threads
        // Every iteration the number of active threads halves, until we processed all values
        for(unsigned int i = g.size() / 2; i > 0; i /= 2)
        {
            // Store value for this thread in a shared, temporary array
            x[group_thread_id] = val;
            // Synchronize all threads in the group
            g.sync();
            // If our thread is still active, sum with its counterpart in the other half
            if(group_thread_id < i)
            {
                val += x[group_thread_id + i];
            }
            // Synchronize all threads in the group
            g.sync();
        }
        // Only the first thread returns a valid value
        if(g.thread_rank() == 0)
            return val;
        else
            return 0;
    }
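
Note the two ``g.sync()`` calls per iteration: the first guarantees that every
thread has written its value to ``x`` before any thread reads it, and the
second keeps faster threads from overwriting entries of ``x`` in the next
iteration while slower threads are still reading them.
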
The ``reduce_sum`` device function is reused to calculate both the block-level
and the custom-partition-level sums of the input numbers. The kernel has three
sections:

1. Initialization of the reduction function variables.
2. Reduction of the thread block, storing the results in global memory.
3. Reduction of the custom partition, storing the results in global memory.


1. Initialization of the reduction function variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the shared memory is declared, the
``thread_block_group`` and the ``custom_partition`` are defined, and the input
variables are loaded from global memory.

.. code-block:: cpp

    // thread_block_group consists of all threads in the block
    thread_block thread_block_group = this_thread_block();
    // Workspace array in shared memory required for reduction
    __shared__ unsigned int workspace[2048];
    unsigned int output;
    // Input to reduce
    const unsigned int input = d_vector[thread_block_group.thread_rank()];
    // ...
    // Every custom_partition group consists of 16 threads
    thread_block_tile<PartitionSize> custom_partition
        = tiled_partition<PartitionSize>(thread_block_group);

2. Reduction of the thread block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``thread_block_group``
level, then the result is stored in global memory.

.. code-block:: cpp

    // Perform reduction
    output = reduce_sum(thread_block_group, workspace, input);
    // Only the first thread returns a valid value
    if(thread_block_group.thread_rank() == 0)
    {
        d_block_reduced_vector[0] = output;
    }

3. Reduction of the custom partition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``custom_partition`` level,
then the results are stored in global memory.

.. code-block:: cpp

    // Perform reduction
    output = reduce_sum(custom_partition, &workspace[group_offset], input);
    // Only the first thread in each partition returns a valid value
    if(custom_partition.thread_rank() == 0)
    {
        const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize;
        d_partition_reduced_vector[partition_id] = output;
    }
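
The ``group_offset`` used above is elided from the initialization snippet
shown earlier. A minimal sketch of how it could be derived, assuming each
partition reduces into its own ``PartitionSize``-long slice of the shared
``workspace`` array:

.. code-block:: cpp

    // Assumption: each group of PartitionSize consecutive threads gets its
    // own slice of the shared workspace, starting at group_offset.
    const unsigned int group_offset
        = (thread_block_group.thread_rank() / PartitionSize) * PartitionSize;
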
Host side code
--------------

On the host side, the following steps are performed:

1. Check the cooperative group support on AMD GPUs.
2. Initialize the cooperative group configuration.
3. Allocate and copy input to global memory.
4. Launch the cooperative kernel.
5. Save the results from global memory.
6. Free the global memory.

1. Check the cooperative group support on AMD GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not all GPUs support cooperative groups. To make sure, check for support with
the following code:

.. code-block:: cpp

    #ifdef __HIP_PLATFORM_AMD__
        int device               = 0;
        int supports_coop_launch = 0;
        // Check support
        // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices
        HIP_CHECK(hipGetDevice(&device));
        HIP_CHECK(
            hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device));
        if(!supports_coop_launch)
        {
            std::cout << "Skipping, device " << device << " does not support cooperative groups"
                      << std::endl;
            return 0;
        }
    #endif

2. Initialize the cooperative group configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

    // Number of blocks to launch.
    constexpr unsigned int num_blocks = 1;
    // Number of threads in each kernel block.
    constexpr unsigned int threads_per_block = 64;
    // Total element count of the input vector.
    constexpr unsigned int size = num_blocks * threads_per_block;
    // Total element count of a tiled_partition.
    constexpr unsigned int partition_size = 16;
    // Total size (in bytes) of the input vector.
    constexpr size_t size_bytes = sizeof(unsigned int) * size;
    static_assert(threads_per_block % partition_size == 0,
                  "threads_per_block must be a multiple of partition_size");
3. Allocate and copy input to global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

    // Allocate device memory for the input and output vectors.
    unsigned int* d_vector{};
    unsigned int* d_block_reduced{};
    unsigned int* d_partition_reduced{};
    HIP_CHECK(hipMalloc(&d_vector, size_bytes));
    HIP_CHECK(hipMalloc(&d_block_reduced, sizeof(unsigned int) * h_block_reduced.size()));
    HIP_CHECK(hipMalloc(&d_partition_reduced, sizeof(unsigned int) * h_partition_reduced.size()));
    // Transfer the input vector to the device memory.
    HIP_CHECK(hipMemcpy(d_vector, h_vector.data(), size_bytes, hipMemcpyHostToDevice));

4. Launch the kernel
~~~~~~~~~~~~~~~~~~~~

Use ``hipLaunchCooperativeKernel`` to launch a kernel that uses cooperative
groups:

.. code-block:: cpp

    void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced};
    // Launching kernel from host.
    HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel<partition_size>,
                                         dim3(num_blocks),
                                         dim3(threads_per_block),
                                         params,
                                         0,
                                         hipStreamDefault));
    // Check if the kernel launch was successful.
    HIP_CHECK(hipGetLastError());
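
Unlike the triple-chevron launch syntax, ``hipLaunchCooperativeKernel`` takes
the kernel arguments as an array of pointers (``params``); the last two
arguments specify the dynamic shared memory size (``0`` here) and the stream
to launch on.
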
5. Save the results from global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can copy the device results back into host-side ``std::vector`` containers
with ``hipMemcpy()``:

.. code-block:: cpp

    // Transfer the result back to the host.
    HIP_CHECK(hipMemcpy(h_block_reduced.data(),
                        d_block_reduced,
                        sizeof(unsigned int) * h_block_reduced.size(),
                        hipMemcpyDeviceToHost));
    HIP_CHECK(hipMemcpy(h_partition_reduced.data(),
                        d_partition_reduced,
                        sizeof(unsigned int) * h_partition_reduced.size(),
                        hipMemcpyDeviceToHost));
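
At this point, the device results can be compared against the host reference
implementation from the beginning of this tutorial. A sketch of such a
validation (not part of the original sample):

.. code-block:: cpp

    // Recompute both reductions on the host. With num_blocks == 1, the
    // block-level reduction spans the entire input vector.
    const std::vector<unsigned int> expected_block     = ref_reduced(size, h_vector);
    const std::vector<unsigned int> expected_partition = ref_reduced(partition_size, h_vector);
    if(h_block_reduced == expected_block && h_partition_reduced == expected_partition)
    {
        std::cout << "Validation passed." << std::endl;
    }
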
6. Free the global memory
~~~~~~~~~~~~~~~~~~~~~~~~~

Always clean up the global memory at the end of the application.

.. code-block:: cpp

    // Free the resources on the device.
    HIP_CHECK(hipFree(d_vector));
    HIP_CHECK(hipFree(d_block_reduced));
    HIP_CHECK(hipFree(d_partition_reduced));

Conclusion
----------

With cooperative groups, you can easily create custom partitions and tile
threads to fit your own solutions. The full source code can be found in the
`cooperative groups ROCm example <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/cooperative_groups>`_.
