
Commit

WIP
neon60 committed May 25, 2024
1 parent c4a1df5 commit 6710ec4
Showing 3 changed files with 18 additions and 14 deletions.
4 changes: 4 additions & 0 deletions .wordlist.txt
Original file line number Diff line number Diff line change
@@ -10,6 +10,7 @@ bitcode
bitcodes
builtins
Builtins
CAS
clr
cuBLASLt
cuCtx
@@ -40,6 +41,7 @@ hcBLAS
icc
Interoperation
interoperate
Intrinsics
IPC
Lapack
latencies
@@ -66,9 +68,11 @@ preprocessor
PTX
queryable
representable
RMW
ROCm's
rocTX
RTC
RTTI
scalarizing
SIMT
SPMV
22 changes: 11 additions & 11 deletions docs/reference/kernel_language.rst
@@ -2096,7 +2096,7 @@ HIP does not support this type of scheduling.
Profiler Counter Function
============================================================

The CUDA `__prof_trigger()` instruction is not supported.
The CUDA ``__prof_trigger()`` instruction is not supported.

Assert
============================================================
@@ -2144,20 +2144,20 @@ Device-Side Dynamic Global Memory Allocation
Device-side dynamic global memory allocation is under development. HIP now includes a preliminary
implementation of malloc and free that can be called from device functions.

`__launch_bounds__`
``__launch_bounds__``
============================================================

GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. Thus GPUs have a complex relationship between resource usage and performance.

__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:
``__launch_bounds__`` allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:

.. code-block:: cpp

   __global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT)
   MyKernel(hipGridLaunch lp, ...)
   ...

__launch_bounds__ supports two parameters:
``__launch_bounds__`` supports two parameters:
- MAX_THREADS_PER_BLOCK - The programmer guarantees that the kernel will be launched with no more than MAX_THREADS_PER_BLOCK threads (on NVCC this maps to the .maxntid PTX directive). If no launch bounds are specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time.
  The threads-per-block is the product of (blockDim.x * blockDim.y * blockDim.z).
- MIN_WARPS_PER_EXECUTION_UNIT - Directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multiprocessor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler's resource usage.
@@ -2184,16 +2184,16 @@ CU and EU Definitions

A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to symmetric multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.

Porting from CUDA `__launch_bounds`
Porting from CUDA ``__launch_bounds``
--------------------------------------------------------------------------------------------

CUDA defines a __launch_bounds which is also designed to control occupancy:
CUDA defines a ``__launch_bounds`` which is also designed to control occupancy:

.. code-block:: cpp

   __launch_bounds(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)

- The second __launch_bounds parameter must be converted to the format used by __hip_launch_bounds, which uses warps and execution units rather than blocks and multiprocessors (this conversion is performed automatically by HIPIFY tools).
- The second ``__launch_bounds`` parameter must be converted to the format used by __hip_launch_bounds, which uses warps and execution units rather than blocks and multiprocessors (this conversion is performed automatically by HIPIFY tools).

.. code-block:: cpp
@@ -2276,9 +2276,9 @@ GCN ISA In-line assembly, is supported. For example:
asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
We insert the GCN ISA into the kernel using the `asm()` assembler statement.
The `volatile` keyword is used so that the optimizers do not change the number of volatile operations or their order of execution relative to other volatile operations.
`v_mac_f32_e32` is the GCN instruction; for more information, please refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/)
We insert the GCN ISA into the kernel using the ``asm()`` assembler statement.
The ``volatile`` keyword is used so that the optimizers do not change the number of volatile operations or their order of execution relative to other volatile operations.
``v_mac_f32_e32`` is the GCN instruction; for more information, please refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/)
The index for each operand is given by `%` followed by its position in the ordered list of operands.
`"v"` is the constraint code (for target-specific AMDGPU) for a 32-bit VGPR register; for more information, please refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list)
Output constraints are specified by an `"="` prefix, as shown above ("=v"). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix - just the constraint code. The constraint string of `"0"` says to use the assigned register for output as an input as well (it being the 0'th constraint).
@@ -2293,7 +2293,7 @@ Virtual functions are not supported if objects containing virtual function tables
Kernel Compilation
============================================================
hipcc now supports compiling C++/HIP kernels to binary code objects.
The file format for binary is `.co` which means Code Object. The following command builds the code object using `hipcc`.
The file format for binary is ``.co`` which means Code Object. The following command builds the code object using ``hipcc``.

.. code-block:: bash
6 changes: 3 additions & 3 deletions docs/understand/glossary.md
@@ -1,6 +1,6 @@
# Glossary of terms

* **host**, **host cpu** : Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices.
* **host**, **host CPU** : Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices.
* **default device** : Each host thread maintains a default device.
Most HIP runtime APIs (including memory allocation, copy commands, kernel launches) do not accept an explicit device
argument but instead implicitly use the default device.
@@ -19,6 +19,6 @@ clr (https://github.com/ROCm/clr) contains the following three parts,

* **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY).

* **hipconfig** - tool to report various configuration properties of the target platform.
* **`hipconfig`** - tool to report various configuration properties of the target platform.

* **nvcc** - NVIDIA CUDA `nvcc` compiler, do not capitalize.
* **`nvcc`** - NVIDIA CUDA `nvcc` compiler, do not capitalize.
