Fix the documentation problems
neon60 committed Sep 6, 2024
1 parent da35eb4 commit b5e3e46
Showing 6 changed files with 136 additions and 15 deletions.
9 changes: 7 additions & 2 deletions docs/how-to/performance_guidelines.rst
@@ -22,6 +22,7 @@ optimization potential:
This document discusses the usage and benefits of these cornerstones in detail.

.. _parallel execution:

Parallel execution
====================

@@ -67,6 +68,7 @@ GPU resources, ranging from individual multiprocessors to the device as a
whole.

.. _memory optimization:

Memory throughput optimization
===============================

@@ -94,6 +96,7 @@ impact on performance.
The memory throughput optimization techniques are further discussed in detail in the following sections.

.. _data transfer:

Data transfer
---------------

@@ -112,6 +115,7 @@ memory accesses. The process where threads in a warp access sequential memory lo
On integrated systems where device and host memory are physically the same, no copy operation between host and device memory is required and hence mapped page-locked memory should be used instead. To check if the device is integrated, applications can query the integrated device property.
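
To check the property and switch to mapped page-locked memory, a minimal sketch (device 0 and the buffer size are illustrative):

.. code-block:: cpp

   hipDeviceProp_t prop;
   hipGetDeviceProperties(&prop, 0);

   if (prop.integrated) {
       // Device and host share physical memory: use mapped page-locked
       // memory instead of explicit host-to-device copies.
       float* hostPtr = nullptr;
       hipHostMalloc(&hostPtr, 1024 * sizeof(float), hipHostMallocMapped);

       float* devicePtr = nullptr;
       hipHostGetDevicePointer(reinterpret_cast<void**>(&devicePtr), hostPtr, 0);
       // Kernels can now access the buffer through devicePtr; no hipMemcpy needed.
   }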

.. _device memory access:

Device memory access
---------------------

@@ -121,8 +125,7 @@ and is generally reduced when addresses are more scattered, especially in
global memory.

Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned.
Maximizing memory throughput involves:
naturally aligned. Maximizing memory throughput involves:

- Coalescing memory accesses of threads within a warp into minimal transactions.
- Following optimal access patterns.
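
To illustrate the coalescing point above, a minimal sketch (kernel names are
illustrative) contrasting a coalesced access pattern with a strided one:

.. code-block:: cpp

   // Coalesced: thread i accesses element i, so consecutive threads in a
   // warp touch consecutive words and the loads combine into few transactions.
   __global__ void coalescedCopy(const float* in, float* out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           out[i] = in[i];
   }

   // Strided: thread i accesses element i * stride, scattering a warp's
   // accesses over many transactions and wasting bandwidth.
   __global__ void stridedCopy(const float* in, float* out, int n, int stride)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i * stride < n)
           out[i] = in[i * stride];
   }
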
@@ -158,6 +161,7 @@ Reading device memory through texture or surface fetching provides the following
- Optional conversion of 8-bit and 16-bit integer input data to 32-bit floating-point values on the fly.

.. _instruction optimization:

Optimization for maximum instruction throughput
=================================================

@@ -185,6 +189,7 @@ Leverage intrinsic functions: Intrinsic functions are predefined functions avail
Optimize memory access: The memory access efficiency can impact the speed of arithmetic operations. See: :ref:`device memory access`.
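
As a hedged illustration of the intrinsic-functions point above (assuming the
fast-math intrinsics ``__expf`` and ``__fdividef``, which trade accuracy for
speed):

.. code-block:: cpp

   __global__ void fastMath(const float* x, float* y, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           float precise = expf(x[i]) / (x[i] + 1.0f);          // standard math
           float fast = __fdividef(__expf(x[i]), x[i] + 1.0f);  // intrinsics
           y[i] = fast;  // the difference to `precise` is the accuracy cost
       }
   }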

.. _control flow instructions:

Control flow instructions
---------------------------

2 changes: 2 additions & 0 deletions docs/how-to/stream_ordered_allocator.rst
@@ -216,6 +216,7 @@ Trim pools
The memory allocator allows you to allocate and free memory in stream order. To control memory usage, set the release threshold attribute using ``hipMemPoolAttrReleaseThreshold``. This threshold specifies the amount of reserved memory in bytes to hold onto.

.. code-block:: cpp

   uint64_t threshold = UINT64_MAX;
   hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold);
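
Memory held in the pool can also be trimmed explicitly. A minimal sketch,
assuming the ``memPool`` handle from above (``hipMemPoolTrimTo`` releases
unused reserved memory down to the given floor):

.. code-block:: cpp

   // Keep at most 1 MiB of unused memory reserved in the pool; memory that
   // is still in use is not affected.
   hipMemPoolTrimTo(memPool, 1024 * 1024);
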
@@ -466,6 +467,7 @@ Here is how to read the pool exported in the preceding example:
}
.. _shareable-handle:

Shareable handle
----------------

5 changes: 4 additions & 1 deletion docs/understand/hardware_capabilities.rst
@@ -6,7 +6,10 @@
Hardware features
*******************************************************************************

This page gives an overview of the different hardware architectures and the features they implement. Hardware features do not imply performance, that depends on the specifications found in the `Accelerator and GPU hardware specifications`_ page.
This page gives an overview of the different hardware architectures and the
features they implement. Hardware features do not imply performance; that
depends on the specifications found in the :doc:`rocm:reference/gpu-arch-specs`
page.

.. list-table::
:header-rows: 1
123 changes: 115 additions & 8 deletions docs/understand/hip_runtime_api.rst
@@ -136,7 +136,7 @@ versions of global memory with different usage semantics which are typically
backed by the same hardware storing global.

Constant memory
^^^^^^^^^^^^^
^^^^^^^^^^^^^^^

Read-only storage visible to all threads in a given grid. It is a limited
segment of global with queryable size.
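
A minimal sketch of declaring and filling constant memory (the symbol name and
size are illustrative):

.. code-block:: cpp

   __constant__ float coeffs[16];  // read-only for all threads in the grid

   void uploadCoeffs(const float* hostCoeffs)
   {
       // Copy host data into the constant-memory symbol before kernel launch.
       hipMemcpyToSymbol(HIP_SYMBOL(coeffs), hostCoeffs, 16 * sizeof(float));
   }
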
@@ -209,11 +209,8 @@ performance in GPU-accelerated applications that require data modification.

For further details, check `HIP Runtime API Reference <doxygen/html/index.html>`_.

Execution control
=================

Stream management
-----------------
=================

Stream management refers to the mechanisms that allow developers to control the
order and concurrency of kernel executions and memory transfers on the GPU.
@@ -230,10 +227,120 @@ The stream management concept is represented in the following figure.

.. figure:: ../data/understand/hip_runtime_api/stream_management.svg
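
A hedged sketch of the typical stream pattern (kernel and buffer names are
illustrative):

.. code-block:: cpp

   hipStream_t stream;
   hipStreamCreate(&stream);

   // Work queued on the same stream executes in order; work on different
   // streams may overlap.
   hipMemcpyAsync(d_in, h_in, bytes, hipMemcpyHostToDevice, stream);
   myKernel<<<grid, block, 0, stream>>>(d_in, d_out);
   hipMemcpyAsync(h_out, d_out, bytes, hipMemcpyDeviceToHost, stream);

   hipStreamSynchronize(stream);  // block until all queued work has finished
   hipStreamDestroy(stream);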

Graph management
----------------
HIP graph
================================================================================

HIP graphs are an alternative way of executing work on a GPU. They can provide
performance benefits over repeatedly launching the same kernels in the standard
way via streams.

.. Copy here the HIP Graph understand page
.. note::
   The HIP graph API is currently in Beta. Some features can change and might
   have outstanding issues. Not all features supported by CUDA graphs are yet
   supported. For a list of all currently supported functions, see the
   :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.

Setting up HIP graphs
--------------------------------------------------------------------------------

HIP graphs can be created by explicitly defining them, or by using stream capture.
For further information on how to use HIP graphs, see :ref:`the how-to chapter about HIP graphs<how_to_HIP_graph>`.
For the available functions see the
:doc:`HIP graph API documentation<../doxygen/html/group___graph>`.
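
A hedged sketch of the stream-capture route (the kernels and stream are
illustrative):

.. code-block:: cpp

   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);

   // Work queued on the stream is recorded into the graph instead of
   // being executed immediately.
   kernelA<<<grid, block, 0, stream>>>(data);
   kernelB<<<grid, block, 0, stream>>>(data);

   hipStreamEndCapture(stream, &graph);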

Graph format
--------------------------------------------------------------------------------

A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent
the operations performed, while the edges mark dependencies between those
operations.

The nodes can be one of the following:

- empty nodes
- nested graphs
- kernel launches
- host-side function calls
- HIP memory functions (copy, memset, ...)
- HIP events
- signalling or waiting on external semaphores
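
A hedged sketch of defining nodes and edges explicitly (``devPtr`` is assumed
to be a valid device allocation of ``n`` integers):

.. code-block:: cpp

   hipGraph_t graph;
   hipGraphCreate(&graph, 0);

   // An empty node can serve as a pure synchronization point.
   hipGraphNode_t emptyNode;
   hipGraphAddEmptyNode(&emptyNode, graph, nullptr, 0);

   // A memset node that depends on the empty node; the edge is expressed
   // through the dependency array.
   hipMemsetParams params = {};
   params.dst = devPtr;
   params.value = 0;
   params.elementSize = sizeof(int);  // must be 1, 2, or 4
   params.width = n;
   params.height = 1;
   hipGraphNode_t memsetNode;
   hipGraphAddMemsetNode(&memsetNode, graph, &emptyNode, 1, &params);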

The following figure visualizes the concept of graphs, compared to using streams.

.. figure:: ../data/understand/hipgraph/hip_graph.svg
   :alt: Diagram depicting the difference between using streams to execute
         kernels with dependencies, resolved by explicitly calling
         hipDeviceSynchronize, or using graphs, where the edges denote the
         dependencies.

Node types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenenum:: hipGraphNodeType

Memory management nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Memory management nodes handle the allocation and freeing of memory within a
graph. They can be created with the explicit API functions, or by capturing
:cpp:func:`hipMallocAsync` or :cpp:func:`hipFreeAsync`.
Unlike the normal memory management API, which is controlled by host-side
execution, this enables HIP to take care of memory reuse and optimizations.
The lifetime of memory allocated in a graph begins when execution reaches the
allocating node, and ends when the corresponding free node within the graph is
reached, or after graph execution with a corresponding
:cpp:func:`hipFreeAsync` or :cpp:func:`hipFree` call.
The memory can also be freed with a free node in a different graph that is
associated with the same memory address.

The same rules as for normal memory allocations apply for memory allocated and
freed by nodes, meaning that the nodes that access memory allocated in a graph
must be ordered after the allocation node and before the freeing node.

These memory allocations can also be set up to allow access from multiple GPUs,
just like normal allocations. HIP then takes care of allocating and mapping the
memory to the GPUs. When capturing a graph from a stream, the node sets the
accessibility according to :cpp:func:`hipMemPoolSetAccess` at the time of
capturing.
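
A hedged sketch of capturing memory management nodes from a stream (names are
illustrative):

.. code-block:: cpp

   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);

   int* devBuf = nullptr;
   hipMallocAsync(reinterpret_cast<void**>(&devBuf), n * sizeof(int),
                  stream);                          // becomes an allocation node
   myKernel<<<grid, block, 0, stream>>>(devBuf, n); // ordered after the allocation
   hipFreeAsync(devBuf, stream);                    // becomes a free node

   hipStreamEndCapture(stream, &graph);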

HIP graph advantages
--------------------------------------------------------------------------------

The standard way of launching work on GPUs via streams incurs a small overhead
for each iteration of the operation involved. For kernels that perform large
operations during an iteration, this overhead is usually negligible. However,
in many workloads, including scientific simulations and AI, a kernel performs a
small operation for many iterations, so the overhead of launching kernels can
be a significant cost to performance.

HIP graphs have been specifically designed to tackle this problem by requiring
only one launch from the host per iteration, and by minimizing that overhead
through performing most of the initialization beforehand. Graphs can provide
additional performance benefits by enabling optimizations that are only
possible when the dependencies between the operations are known.

.. figure:: ../data/understand/hipgraph/hip_graph_speedup.svg
   :alt: Diagram depicting the speedup achievable with HIP graphs compared to
         HIP streams when launching many short-running kernels.

   Qualitative presentation of the execution time of many short-running kernels
   when launched using HIP streams versus HIP graphs. This does not include the
   time needed to set up the graph.

HIP graph usage
--------------------------------------------------------------------------------

Using HIP graphs to execute your work requires three different steps, of which
the first two form the initial setup and only need to be executed once. The
first step is the definition of the operations (nodes) and the dependencies
(edges) between them. The second step is the instantiation of the graph, which
validates and initializes the graph to reduce the overhead when executing it.

The third step is the actual execution of the graph, which then takes care of
launching all the kernels and executing the operations while respecting their
dependencies and necessary synchronizations as specified.

As HIP graphs require some setup and initialization overhead before their
first execution, they only provide a benefit for workloads that require many
iterations to complete.
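
A hedged end-to-end sketch of the three steps (the captured kernel is
illustrative):

.. code-block:: cpp

   // Step 1: define the graph, here via stream capture.
   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
   myKernel<<<grid, block, 0, stream>>>(data);
   hipStreamEndCapture(stream, &graph);

   // Step 2: instantiate once; validation and initialization happen here.
   hipGraphExec_t graphExec;
   hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

   // Step 3: launch the executable graph many times at low per-launch cost.
   for (int i = 0; i < iterations; ++i) {
       hipGraphLaunch(graphExec, stream);
   }
   hipStreamSynchronize(stream);

   hipGraphExecDestroy(graphExec);
   hipGraphDestroy(graph);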

Error handling
==============
4 changes: 1 addition & 3 deletions docs/understand/hipgraph.rst
@@ -54,9 +54,7 @@ The following figure visualizes the concept of graphs, compared to using streams
Node types
--------------------------------------------------------------------------------

The node types are specified by `hipGraphNodeType`:

:cpp:enum-class:`hipGraphNodeType`
.. doxygenenum:: hipGraphNodeType

Memory management nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8 changes: 7 additions & 1 deletion docs/understand/texture_fetching.rst
@@ -1,6 +1,6 @@
.. meta::
:description: This chapter describes the texture fetching modes of the HIP ecosystem
ROCm software.
ROCm software.
:keywords: AMD, ROCm, HIP, Texture, Texture Fetching

*******************************************************************************
@@ -36,6 +36,7 @@ Texture sampling handles the usage of fractional indices. It is the method that
The various texture sampling methods are discussed in the following sections.

.. _texture_fetching_nearest:

Nearest point sampling
-------------------------------------------------------------------------------

@@ -57,6 +58,7 @@ The following image shows a texture stretched to a 4x4 pixel quad but still inde
Texture upscaled with nearest point sampling

.. _texture_fetching_linear:

Linear filtering
-------------------------------------------------------------------------------

@@ -87,6 +89,7 @@ Texture addressing mode handles the index that is out of bounds of the texture.
The following sections describe the various texture addressing methods.
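
A hedged sketch of where the addressing (and filtering) modes are configured
when creating a texture object (``resDesc`` is assumed to be a prepared
``hipResourceDesc``):

.. code-block:: cpp

   hipTextureDesc texDesc = {};
   texDesc.addressMode[0] = hipAddressModeWrap;   // out-of-bounds x: wrap
   texDesc.addressMode[1] = hipAddressModeClamp;  // out-of-bounds y: clamp
   texDesc.filterMode = hipFilterModeLinear;      // linear filtering
   texDesc.normalizedCoords = 1;                  // index with [0.0, 1.0]

   hipTextureObject_t texObj;
   hipCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);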

.. _texture_fetching_border:

Address mode border
-------------------------------------------------------------------------------

@@ -104,6 +107,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_wrap:

Address mode wrap
-------------------------------------------------------------------------------

@@ -125,6 +129,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_mirror:

Address mode mirror
-------------------------------------------------------------------------------

@@ -142,6 +147,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_clamp:

Address mode clamp
-------------------------------------------------------------------------------

