From 6710ec4c58b011b2aa5d63224dda8fcc2c06dd5a Mon Sep 17 00:00:00 2001 From: Istvan Kiss Date: Sat, 25 May 2024 20:26:24 +0200 Subject: [PATCH] WIP --- .wordlist.txt | 4 ++++ docs/reference/kernel_language.rst | 22 +++++++++++----------- docs/understand/glossary.md | 6 +++--- 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index 1fc20183b5..48393a2031 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -10,6 +10,7 @@ bitcode bitcodes builtins Builtins +CAS clr cuBLASLt cuCtx @@ -40,6 +41,7 @@ hcBLAS icc Interoperation interoperate +Intrinsics IPC Lapack latencies @@ -66,9 +68,11 @@ preprocessor PTX queryable representable +RMW ROCm's rocTX RTC +RTTI scalarizing SIMT SPMV diff --git a/docs/reference/kernel_language.rst b/docs/reference/kernel_language.rst index ab47a7eec3..d09bc0f79a 100644 --- a/docs/reference/kernel_language.rst +++ b/docs/reference/kernel_language.rst @@ -2096,7 +2096,7 @@ HIP does not support this type of scheduling. Profiler Counter Function ============================================================ -The CUDA `__prof_trigger()` instruction is not supported. +The CUDA ``__prof_trigger()`` instruction is not supported. Assert ============================================================ @@ -2144,12 +2144,12 @@ Device-Side Dynamic Global Memory Allocation Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions. -`__launch_bounds__` +``__launch_bounds__`` ============================================================ GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. 
Thus GPUs have a complex relationship between resource usage and performance. -__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function: +``__launch_bounds__`` allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a ``__global__`` function: .. code-block:: cpp __global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT) MyKernel(hipGridLaunch lp, ...) ... -__launch_bounds__ supports two parameters: +``__launch_bounds__`` supports two parameters: - MAX_THREADS_PER_BLOCK - The programmer guarantees that the kernel will be launched with no more than MAX_THREADS_PER_BLOCK threads (on NVCC this maps to the ``.maxntid`` PTX directive). If no ``__launch_bounds__`` is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time. The threads-per-block is the product blockDim.x * blockDim.y * blockDim.z. - MIN_WARPS_PER_EXECUTION_UNIT - Directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multiprocessor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default of 1 effectively constrains the compiler's resource usage. 
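As a sketch of how the two parameters described above might be used in practice: the kernel name and the specific bounds below are hypothetical, and the preprocessor guards are only there so the snippet also builds as plain C++ for illustration (under hipcc, ``__global__`` and ``__launch_bounds__`` come from the HIP headers).

```cpp
#include <cassert>

// Hypothetical sketch: when this file is NOT compiled by hipcc, stub out the
// HIP keywords so the snippet still builds as ordinary C++.
#ifndef __HIPCC__
#define __global__
#define __launch_bounds__(...)
#endif

// Guarantee the kernel is never launched with more than 256 threads per
// block, and ask the compiler to keep per-warp resource usage low enough
// that at least 2 warps can be simultaneously active per execution unit.
__global__ void __launch_bounds__(256, 2)
my_kernel(float* out, const float* in, int n) {
    (void)out; (void)in; (void)n;  // body elided in this sketch
}

// Launch-side sanity check mirroring the MAX_THREADS_PER_BLOCK guarantee:
// threads-per-block is the product blockDim.x * blockDim.y * blockDim.z.
inline int threads_per_block(int x, int y, int z) { return x * y * z; }
```

For example, a block of (16, 16, 1) gives 16 * 16 * 1 = 256 threads, which satisfies the declared bound; launching with a larger product would violate the programmer's guarantee.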
@@ -2184,16 +2184,16 @@ CU and EU Definitions A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EUs), which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to symmetric multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit, to help hide latency, and can also be used to ensure no more than a certain number will fit, to limit cache thrashing. -Porting from CUDA `__launch_bounds` +Porting from CUDA ``__launch_bounds__`` -------------------------------------------------------------------------------------------- -CUDA defines a __launch_bounds which is also designed to control occupancy: +CUDA defines a ``__launch_bounds__`` qualifier, which is also designed to control occupancy: .. code-block:: cpp __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR) -- The second parameter __launch_bounds parameters must be converted to the format used __hip_launch_bounds, which uses warps and execution-units rather than blocks and multi-processors (this conversion is performed automatically by HIPIFY tools). +- The second ``__launch_bounds__`` parameter, MIN_BLOCKS_PER_MULTIPROCESSOR, must be converted to the HIP format, which uses warps and execution units rather than blocks and multiprocessors (this conversion is performed automatically by HIPIFY tools). .. code-block:: cpp @@ -2276,9 +2276,9 @@ GCN ISA In-line assembly, is supported. 
For example: asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i])); -We insert the GCN isa into the kernel using `asm()` Assembler statement. -`volatile` keyword is used so that the optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations. -`v_mac_f32_e32` is the GCN instruction, for more information please refer - [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/) +We insert the GCN ISA into the kernel using the ``asm()`` assembler statement. +The ``volatile`` keyword is used so that the optimizer does not change the number of volatile operations or change their order of execution relative to other volatile operations. +``v_mac_f32_e32`` is the GCN instruction; for more information, refer to the `AMD GCN3 ISA architecture manual <http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/>`_. The index for each operand is provided, in order, by ``%`` followed by its position in the list of operands. ``"v"`` is the AMDGPU target-specific constraint code for a 32-bit VGPR register; for more information, refer to the `Supported Constraint Code List for AMDGPU <https://llvm.org/docs/LangRef.html#supported-constraint-code-list>`_. Output constraints are specified by an ``"="`` prefix, as shown above (``"=v"``). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the ``asm`` expression. Input constraints do not have a prefix - just the constraint code. The constraint string ``"0"`` says to use the register assigned to the output as an input as well (it being the 0th constraint). @@ -2293,7 +2293,7 @@ Virtual functions are not supported if objects containing virtual function table Kernel Compilation ============================================================ hipcc now supports compiling C++/HIP kernels to binary code objects. 
-The file format for binary is `.co` which means Code Object. The following command builds the code object using `hipcc`. +The binary file format is ``.co``, which stands for Code Object. The following command builds the code object using ``hipcc``. .. code-block:: bash diff --git a/docs/understand/glossary.md b/docs/understand/glossary.md index 021ac0efce..272acd4beb 100644 --- a/docs/understand/glossary.md +++ b/docs/understand/glossary.md @@ -1,6 +1,6 @@ # Glossary of terms -* **host**, **host cpu** : Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices. +* **host**, **host CPU** : Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices. * **default device** : Each host thread maintains a default device. Most HIP runtime APIs (including memory allocation, copy commands, kernel launches) do not accept an explicit device argument but instead implicitly use the default device. @@ -19,6 +19,6 @@ clr (https://github.com/ROCm/clr) contains the following three parts, * **hipify tools** - tools to convert CUDA code to portable C++ code (https://github.com/ROCm/HIPIFY). -* **hipconfig** - tool to report various configuration properties of the target platform. +* **`hipconfig`** - tool to report various configuration properties of the target platform. -* **nvcc** - NVIDIA CUDA `nvcc` compiler, do not capitalize. +* **`nvcc`** - NVIDIA CUDA `nvcc` compiler; do not capitalize.