.. meta::
   :description: The tutorial of cooperative groups in HIP
   :keywords: AMD, ROCm, HIP, cooperative groups, tutorial

*******************************************************************************
Tutorial: Cooperative Groups
*******************************************************************************

This tutorial shows you the basic concepts of cooperative groups in the HIP
programming model, the most essential tooling around it, and briefly rehashes
some commonalities of heterogeneous APIs in general. Mild familiarity with the
C/C++ compilation model and the language is assumed throughout this article.

Prerequisites
=============

To follow this tutorial, you need properly installed drivers and a HIP
compiler toolchain to compile your code. Because HIP provided by ROCm supports
compiling and running on Linux and Windows with AMD and NVIDIA GPUs alike, the
install instructions for every combination are beyond the scope of this
tutorial. Please refer to :doc:`/install/install` on how to install the HIP
development packages.

Simple HIP Code
===============

.. TODO: Add link here to SAXPY and subsections

To get familiar with heterogeneous programming, check out the SAXPY tutorial
and its first HIP code subsection. Compilation is also well described in that
tutorial.
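
For orientation, a SAXPY kernel looks roughly like the following minimal
sketch; this is an illustration only, not the code of the referenced tutorial:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Minimal SAXPY kernel: y = a * x + y, one element per thread.
   __global__ void saxpy_kernel(const float        a,
                                const float*       x,
                                float*             y,
                                const unsigned int size)
   {
       const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
       // Guard against threads beyond the end of the vectors.
       if(i < size)
       {
           y[i] = a * x[i] + y[i];
       }
   }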

Tiled partition
===============

The tiled partition can be used to calculate the sum of sequences of
``partition_size`` length, as well as the sum of whole blocks of ``BlockSize``
length. The host-side reference implementation is the following:

.. code-block:: cpp

   // Host side function to perform the same reductions as executed on the GPU
   std::vector<unsigned int> ref_reduced(const unsigned int        partition_size,
                                         std::vector<unsigned int> input)
   {
       const unsigned int input_size  = input.size();
       const unsigned int result_size = input_size / partition_size;
       std::vector<unsigned int> result(result_size);
       for(unsigned int i = 0; i < result_size; i++)
       {
           unsigned int partition_result = 0;
           for(unsigned int j = 0; j < partition_size; j++)
           {
               partition_result += input[partition_size * i + j];
           }
           result[i] = partition_result;
       }
       return result;
   }
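
The reference results for both granularities can be obtained by calling
``ref_reduced`` with the two different partition sizes. A possible usage
sketch follows; the variable names match the later fragments, but the exact
calls are an assumption, not part of the original listing:

.. code-block:: cpp

   // Expected sum of each partition_size-length sequence.
   const std::vector<unsigned int> expected_partition_reduced
       = ref_reduced(partition_size, h_vector);
   // Expected sum of each threads_per_block-length block.
   const std::vector<unsigned int> expected_block_reduced
       = ref_reduced(threads_per_block, h_vector);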

Device side code
----------------

.. TODO: Add link here to reduction tutorial and subsections

To calculate the sum of the sets of numbers, the tutorial uses a shared memory
based reduction on the device side. Unlike the reduction tutorial, warp-level
intrinsics are not covered here. The ``x`` input parameter is a pointer to
shared memory, which has to be synchronized after every value change. The
``thread_group`` input parameter can also be a ``thread_block_tile`` or a
``thread_block``, because ``thread_group`` is the parent class of these types.
``val`` holds the numbers whose sum is calculated. The function returns the
final result of the reduction on the thread with ID 0 of the ``thread_group``;
on every other thread the returned value is 0.

.. code-block:: cpp

   /// \brief Summation of `unsigned int val`'s in `thread_group g` using shared memory `x`
   __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val)
   {
       // Rank of this thread in the group
       const unsigned int group_thread_id = g.thread_rank();
       // We start with half the group size as active threads
       // Every iteration the number of active threads halves, until we processed all values
       for(unsigned int i = g.size() / 2; i > 0; i /= 2)
       {
           // Store value for this thread in a shared, temporary array
           x[group_thread_id] = val;
           // Synchronize all threads in the group
           g.sync();
           // If our thread is still active, sum with its counterpart in the other half
           if(group_thread_id < i)
           {
               val += x[group_thread_id + i];
           }
           // Synchronize all threads in the group
           g.sync();
       }
       // Only the first thread returns a valid value
       if(g.thread_rank() == 0)
           return val;
       else
           return 0;
   }

The ``reduce_sum`` device function is reused to calculate both the block and
the custom partition level sums of the input numbers. The kernel has three
different sections:

1. Initialization of the reduction function variables.
2. The reduction of the thread block and storing the results in global memory.
3. The reduction of the custom partition and storing the results in global memory.

1. Initialization of the reduction function variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the shared memory is declared, the
``thread_block_group`` and ``custom_partition`` are defined, and the input
variables are loaded from global memory.

.. code-block:: cpp

   // threadBlockGroup consists of all threads in the block
   thread_block thread_block_group = this_thread_block();

   // Workspace array in shared memory required for reduction
   __shared__ unsigned int workspace[2048];

   unsigned int output;

   // Input to reduce
   const unsigned int input = d_vector[thread_block_group.thread_rank()];

   // ...

   // Every custom_partition group consists of 16 threads
   thread_block_tile<PartitionSize> custom_partition
       = tiled_partition<PartitionSize>(thread_block_group);

2. The reduction of the thread block
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``thread_block_group``
level, then the result is stored in global memory.

.. code-block:: cpp

   // Perform reduction
   output = reduce_sum(thread_block_group, workspace, input);

   // Only the first thread returns a valid value
   if(thread_block_group.thread_rank() == 0)
   {
       d_block_reduced_vector[0] = output;
   }

3. The reduction of the custom partition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this code section, the sum is calculated at the ``custom_partition`` level,
then the results are stored in global memory.

.. code-block:: cpp

   // Perform reduction
   output = reduce_sum(custom_partition, &workspace[group_offset], input);

   // Only the first thread in each partition returns a valid value
   if(custom_partition.thread_rank() == 0)
   {
       const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize;
       d_partition_reduced_vector[partition_id] = output;
   }
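
Putting the three sections together, the whole kernel might look like the
following sketch. The kernel signature and the ``group_offset`` computation,
which gives every partition its own slice of the shared ``workspace``, are
assumptions inferred from the fragments above:

.. code-block:: cpp

   // Assumes `using namespace cooperative_groups;` and the reduce_sum
   // function shown earlier.
   template<unsigned int PartitionSize>
   __global__ void vector_reduce_kernel(const unsigned int* d_vector,
                                        unsigned int*       d_block_reduced_vector,
                                        unsigned int*       d_partition_reduced_vector)
   {
       // 1. Initialization of the reduction function variables.
       thread_block thread_block_group = this_thread_block();
       __shared__ unsigned int workspace[2048];
       unsigned int output;
       const unsigned int input = d_vector[thread_block_group.thread_rank()];
       thread_block_tile<PartitionSize> custom_partition
           = tiled_partition<PartitionSize>(thread_block_group);

       // 2. Block-level reduction: only thread 0 stores a valid result.
       output = reduce_sum(thread_block_group, workspace, input);
       if(thread_block_group.thread_rank() == 0)
       {
           d_block_reduced_vector[0] = output;
       }

       // Assumed offset so that each partition works on its own slice of the
       // shared workspace instead of overwriting the other partitions.
       const unsigned int group_offset
           = thread_block_group.thread_rank() - custom_partition.thread_rank();

       // 3. Partition-level reduction: thread 0 of each partition stores a result.
       output = reduce_sum(custom_partition, &workspace[group_offset], input);
       if(custom_partition.thread_rank() == 0)
       {
           const unsigned int partition_id
               = thread_block_group.thread_rank() / PartitionSize;
           d_partition_reduced_vector[partition_id] = output;
       }
   }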

Host side code
--------------

On the host side, you have to take the following steps:

1. Check the cooperative group support on AMD GPUs.
2. Initialize the cooperative group configuration.
3. Allocate and copy the input to global memory.
4. Launch the cooperative kernel.
5. Save the results from global memory.
6. Free the global memory.

1. Check the cooperative group support on AMD GPUs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Not every GPU supports cooperative groups; to make sure, check for support
with the following code:

.. code-block:: cpp

   #ifdef __HIP_PLATFORM_AMD__
       int device               = 0;
       int supports_coop_launch = 0;
       // Check support
       // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices
       HIP_CHECK(hipGetDevice(&device));
       HIP_CHECK(
           hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device));
       if(!supports_coop_launch)
       {
           std::cout << "Skipping, device " << device << " does not support cooperative groups"
                     << std::endl;
           return 0;
       }
   #endif

2. Initialize the cooperative group configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

   // Number of blocks to launch.
   constexpr unsigned int num_blocks = 1;

   // Number of threads in each kernel block.
   constexpr unsigned int threads_per_block = 64;

   // Total element count of the input vector.
   constexpr unsigned int size = num_blocks * threads_per_block;

   // Total elements count of a tiled_partition.
   constexpr unsigned int partition_size = 16;

   // Total size (in bytes) of the input vector.
   constexpr size_t size_bytes = sizeof(unsigned int) * size;

   static_assert(threads_per_block % partition_size == 0,
                 "threads_per_block must be a multiple of partition_size");

3. Allocate and copy the input to global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: cpp

   // Allocate device memory for the input and output vectors.
   unsigned int* d_vector{};
   unsigned int* d_block_reduced{};
   unsigned int* d_partition_reduced{};
   HIP_CHECK(hipMalloc(&d_vector, size_bytes));
   HIP_CHECK(hipMalloc(&d_block_reduced, sizeof(unsigned int) * h_block_reduced.size()));
   HIP_CHECK(hipMalloc(&d_partition_reduced, sizeof(unsigned int) * h_partition_reduced.size()));

   // Transfer the input vector to the device memory.
   HIP_CHECK(hipMemcpy(d_vector, h_vector.data(), size_bytes, hipMemcpyHostToDevice));

4. Launch the kernel
~~~~~~~~~~~~~~~~~~~~

Use ``hipLaunchCooperativeKernel`` to be able to use cooperative groups:

.. code-block:: cpp

   void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced};
   // Launching kernel from host.
   HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel<partition_size>,
                                        dim3(num_blocks),
                                        dim3(threads_per_block),
                                        params,
                                        0,
                                        hipStreamDefault));

   // Check if the kernel launch was successful.
   HIP_CHECK(hipGetLastError());

5. Save the results from global memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can save the device results to host-side ``std::vector`` objects with
``hipMemcpy()``:

.. code-block:: cpp

   // Transfer the result back to the host.
   HIP_CHECK(hipMemcpy(h_block_reduced.data(),
                       d_block_reduced,
                       sizeof(unsigned int) * h_block_reduced.size(),
                       hipMemcpyDeviceToHost));
   HIP_CHECK(hipMemcpy(h_partition_reduced.data(),
                       d_partition_reduced,
                       sizeof(unsigned int) * h_partition_reduced.size(),
                       hipMemcpyDeviceToHost));
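
At this point the device results can be compared with the reference values
computed by ``ref_reduced`` earlier. A possible check, shown here only as an
illustration (the ``expected_partition_reduced`` name comes from the usage
sketch above and is an assumption):

.. code-block:: cpp

   // Compare the partition-level device results with the host reference.
   unsigned int errors = 0;
   for(std::size_t i = 0; i < h_partition_reduced.size(); i++)
   {
       if(h_partition_reduced[i] != expected_partition_reduced[i])
       {
           errors++;
       }
   }
   std::cout << (errors == 0 ? "Validation passed." : "Validation failed.") << std::endl;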

6. Free the global memory
~~~~~~~~~~~~~~~~~~~~~~~~~

Always clean up the global memory at the end of the application:
.. code-block:: cpp | ||
// Free the resources on the device. | ||
HIP_CHECK(hipFree(d_vector)); | ||
HIP_CHECK(hipFree(d_block_reduced)); | ||
HIP_CHECK(hipFree(d_partition_reduced)); | ||

Conclusion
----------

With cooperative groups, you can easily create custom partitions to build
custom tiles for your solutions. The full code listing can be found at the
`cooperative groups ROCm example <https://github.com/ROCm/rocm-examples/tree/develop/HIP-Basic/cooperative_groups>`_.