Fix the documentation problems
neon60 committed Sep 6, 2024
1 parent da35eb4 commit b5e3e46
Showing 6 changed files with 136 additions and 15 deletions.
9 changes: 7 additions & 2 deletions docs/how-to/performance_guidelines.rst
@@ -22,6 +22,7 @@ optimization potential:
This document discusses the usage and benefits of these cornerstones in detail.

.. _parallel execution:

Parallel execution
====================

@@ -67,6 +68,7 @@ GPU resources, ranging from individual multiprocessors to the device as a
whole.

.. _memory optimization:

Memory throughput optimization
===============================

@@ -94,6 +96,7 @@ impact on performance.
The memory throughput optimization techniques are further discussed in detail in the following sections.

.. _data transfer:

Data transfer
---------------

@@ -112,6 +115,7 @@ memory accesses. The process where threads in a warp access sequential memory lo
On integrated systems where device and host memory are physically the same, no copy operation between host and device memory is required and hence mapped page-locked memory should be used instead. To check if the device is integrated, applications can query the integrated device property.
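
To check the property and switch to mapped page-locked memory, a minimal sketch (device 0 and the buffer size are illustrative):

.. code-block:: cpp

   hipDeviceProp_t prop;
   hipGetDeviceProperties(&prop, 0);

   if (prop.integrated) {
       // Device and host share physical memory: use mapped page-locked
       // memory instead of explicit host-to-device copies.
       float* hostPtr = nullptr;
       hipHostMalloc(&hostPtr, 1024 * sizeof(float), hipHostMallocMapped);

       float* devicePtr = nullptr;
       hipHostGetDevicePointer(reinterpret_cast<void**>(&devicePtr), hostPtr, 0);
       // Kernels can now access the buffer through devicePtr; no hipMemcpy needed.
   }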

.. _device memory access:

Device memory access
---------------------

@@ -121,8 +125,7 @@ and is generally reduced when addresses are more scattered, especially in
global memory.

Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned.
Maximizing memory throughput involves:
naturally aligned. Maximizing memory throughput involves:

- Coalescing memory accesses of threads within a warp into minimal transactions.
- Following optimal access patterns.
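
To illustrate the coalescing point above, a minimal sketch (kernel names are
illustrative) contrasting a coalesced access pattern with a strided one:

.. code-block:: cpp

   // Coalesced: thread i accesses element i, so consecutive threads in a
   // warp touch consecutive words and the loads combine into few transactions.
   __global__ void coalescedCopy(const float* in, float* out, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
           out[i] = in[i];
   }

   // Strided: thread i accesses element i * stride, scattering a warp's
   // accesses over many transactions and wasting bandwidth.
   __global__ void stridedCopy(const float* in, float* out, int n, int stride)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i * stride < n)
           out[i] = in[i * stride];
   }
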
@@ -158,6 +161,7 @@ Reading device memory through texture or surface fetching provides the following
- Optional conversion of 8-bit and 16-bit integer input data to 32-bit floating-point values on the fly.

.. _instruction optimization:

Optimization for maximum instruction throughput
=================================================

@@ -185,6 +189,7 @@ Leverage intrinsic functions: Intrinsic functions are predefined functions avail
Optimize memory access: The memory access efficiency can impact the speed of arithmetic operations. See: :ref:`device memory access`.
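
As a hedged illustration of the intrinsic-functions point above (assuming the
fast-math intrinsics ``__expf`` and ``__fdividef``, which trade accuracy for
speed):

.. code-block:: cpp

   __global__ void fastMath(const float* x, float* y, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n) {
           float precise = expf(x[i]) / (x[i] + 1.0f);          // standard math
           float fast = __fdividef(__expf(x[i]), x[i] + 1.0f);  // intrinsics
           y[i] = fast;  // the difference to `precise` is the accuracy cost
       }
   }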

.. _control flow instructions:

Control flow instructions
---------------------------

2 changes: 2 additions & 0 deletions docs/how-to/stream_ordered_allocator.rst
@@ -216,6 +216,7 @@ Trim pools
The memory allocator allows you to allocate and free memory in stream order. To control memory usage, set the release threshold attribute using ``hipMemPoolAttrReleaseThreshold``. This threshold specifies the amount of reserved memory in bytes to hold onto.

.. code-block:: cpp

   uint64_t threshold = UINT64_MAX;
   hipMemPoolSetAttribute(memPool, hipMemPoolAttrReleaseThreshold, &threshold);
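
Memory held in the pool can also be trimmed explicitly. A minimal sketch,
assuming the ``memPool`` handle from above (``hipMemPoolTrimTo`` releases
unused reserved memory down to the given floor):

.. code-block:: cpp

   // Keep at most 1 MiB of unused memory reserved in the pool; memory that
   // is still in use is not affected.
   hipMemPoolTrimTo(memPool, 1024 * 1024);
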
@@ -466,6 +467,7 @@ Here is how to read the pool exported in the preceding example:
}
.. _shareable-handle:

Shareable handle
----------------

5 changes: 4 additions & 1 deletion docs/understand/hardware_capabilities.rst
@@ -6,7 +6,10 @@
Hardware features
*******************************************************************************

This page gives an overview of the different hardware architectures and the features they implement. Hardware features do not imply performance, that depends on the specifications found in the `Accelerator and GPU hardware specifications`_ page.
This page gives an overview of the different hardware architectures and the
features they implement. Hardware features do not imply performance; that
depends on the specifications found in the :doc:`rocm:reference/gpu-arch-specs`
page.

.. list-table::
:header-rows: 1
123 changes: 115 additions & 8 deletions docs/understand/hip_runtime_api.rst
@@ -136,7 +136,7 @@ versions of global memory with different usage semantics which are typically
backed by the same hardware storing global.

Constant memory
^^^^^^^^^^^^^
^^^^^^^^^^^^^^^

Read-only storage visible to all threads in a given grid. It is a limited
segment of global with queryable size.
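
A minimal sketch of declaring and filling constant memory (the symbol name and
size are illustrative):

.. code-block:: cpp

   __constant__ float coeffs[16];  // read-only for all threads in the grid

   void uploadCoeffs(const float* hostCoeffs)
   {
       // Copy host data into the constant-memory symbol before kernel launch.
       hipMemcpyToSymbol(HIP_SYMBOL(coeffs), hostCoeffs, 16 * sizeof(float));
   }
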
@@ -209,11 +209,8 @@ performance in GPU-accelerated applications that require data modification.

For further details, check `HIP Runtime API Reference <doxygen/html/index.html>`_.

Execution control
=================

Stream management
-----------------
=================

Stream management refers to the mechanisms that allow developers to control the
order and concurrency of kernel executions and memory transfers on the GPU.
@@ -230,10 +227,120 @@ The stream management concept is represented in the following figure.

.. figure:: ../data/understand/hip_runtime_api/stream_management.svg
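
A hedged sketch of the typical stream pattern (kernel and buffer names are
illustrative):

.. code-block:: cpp

   hipStream_t stream;
   hipStreamCreate(&stream);

   // Work queued on the same stream executes in order; work on different
   // streams may overlap.
   hipMemcpyAsync(d_in, h_in, bytes, hipMemcpyHostToDevice, stream);
   myKernel<<<grid, block, 0, stream>>>(d_in, d_out);
   hipMemcpyAsync(h_out, d_out, bytes, hipMemcpyDeviceToHost, stream);

   hipStreamSynchronize(stream);  // block until all queued work has finished
   hipStreamDestroy(stream);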

Graph management
----------------
HIP graph
================================================================================

HIP graphs are an alternative way of executing work on a GPU. They can provide
performance benefits over repeatedly launching the same kernels in the standard
way via streams.

.. Copy here the HIP Graph understand page
.. note::
   The HIP graph API is currently in Beta. Some features can change and might
   have outstanding issues. Not all features supported by CUDA graphs are yet
   supported. For a list of all currently supported functions, see the
   :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.

Setting up HIP graphs
--------------------------------------------------------------------------------

HIP graphs can be created by explicitly defining them, or by using stream capture.
For further information on how to use HIP graphs, see :ref:`the how-to chapter about HIP graphs<how_to_HIP_graph>`.
For the available functions see the
:doc:`HIP graph API documentation<../doxygen/html/group___graph>`.
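
A hedged sketch of the stream-capture route (the kernels and stream are
illustrative):

.. code-block:: cpp

   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);

   // Work queued on the stream is recorded into the graph instead of
   // being executed immediately.
   kernelA<<<grid, block, 0, stream>>>(data);
   kernelB<<<grid, block, 0, stream>>>(data);

   hipStreamEndCapture(stream, &graph);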

Graph format
--------------------------------------------------------------------------------

A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent
the operations performed, while the edges mark dependencies between those
operations.

The nodes can be one of the following:

- empty nodes
- nested graphs
- kernel launches
- host-side function calls
- HIP memory functions (copy, memset, ...)
- HIP events
- signalling or waiting on external semaphores
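
A hedged sketch of defining nodes and edges explicitly (``devPtr`` is assumed
to be a valid device allocation of ``n`` integers):

.. code-block:: cpp

   hipGraph_t graph;
   hipGraphCreate(&graph, 0);

   // An empty node can serve as a pure synchronization point.
   hipGraphNode_t emptyNode;
   hipGraphAddEmptyNode(&emptyNode, graph, nullptr, 0);

   // A memset node that depends on the empty node; the edge is expressed
   // through the dependency array.
   hipMemsetParams params = {};
   params.dst = devPtr;
   params.value = 0;
   params.elementSize = sizeof(int);  // must be 1, 2, or 4
   params.width = n;
   params.height = 1;
   hipGraphNode_t memsetNode;
   hipGraphAddMemsetNode(&memsetNode, graph, &emptyNode, 1, &params);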

The following figure visualizes the concept of graphs, compared to using streams.

.. figure:: ../data/understand/hipgraph/hip_graph.svg
   :alt: Diagram depicting the difference between using streams to execute
         kernels with dependencies, resolved by explicitly calling
         hipDeviceSynchronize, or using graphs, where the edges denote the
         dependencies.

Node types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. doxygenenum:: hipGraphNodeType

Memory management nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Memory management nodes handle the allocation and freeing of memory within a
graph. They can be created with the explicit API functions, or by capturing
:cpp:func:`hipMallocAsync` or :cpp:func:`hipFreeAsync`.
Unlike the normal memory management API, which is controlled by host-side
execution, this enables HIP to take care of memory reuse and optimizations.
The lifetime of memory allocated in a graph begins when execution reaches the
allocating node, and ends when the corresponding free node within the graph is
reached, or after graph execution with a corresponding
:cpp:func:`hipFreeAsync` or :cpp:func:`hipFree` call.
The memory can also be freed with a free node in a different graph that is
associated with the same memory address.

The same rules as for normal memory allocations apply for memory allocated and
freed by nodes, meaning that the nodes that access memory allocated in a graph
must be ordered after the allocation node and before the freeing node.

These memory allocations can also be set up to allow access from multiple GPUs,
just like normal allocations. HIP then takes care of allocating and mapping the
memory to the GPUs. When capturing a graph from a stream, the node sets the
accessibility according to :cpp:func:`hipMemPoolSetAccess` at the time of
capturing.
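
A hedged sketch of capturing memory management nodes from a stream (names are
illustrative):

.. code-block:: cpp

   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);

   int* devBuf = nullptr;
   hipMallocAsync(reinterpret_cast<void**>(&devBuf), n * sizeof(int),
                  stream);                          // becomes an allocation node
   myKernel<<<grid, block, 0, stream>>>(devBuf, n); // ordered after the allocation
   hipFreeAsync(devBuf, stream);                    // becomes a free node

   hipStreamEndCapture(stream, &graph);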

HIP graph advantages
--------------------------------------------------------------------------------

The standard way of launching work on GPUs via streams incurs a small overhead
for each iteration of the operation involved. For kernels that perform large
operations during an iteration, this overhead is usually negligible. However,
in many workloads, including scientific simulations and AI, a kernel performs a
small operation for many iterations, so the overhead of launching kernels can
be a significant cost to performance.

HIP graphs have been specifically designed to tackle this problem by requiring
only one launch from the host per iteration, and by minimizing that overhead
through performing most of the initialization beforehand. Graphs can provide
additional performance benefits by enabling optimizations that are only
possible when the dependencies between the operations are known.

.. figure:: ../data/understand/hipgraph/hip_graph_speedup.svg
   :alt: Diagram depicting the speedup achievable with HIP graphs compared to
         HIP streams when launching many short-running kernels.

   Qualitative presentation of the execution time of many short-running kernels
   when launched using HIP streams versus HIP graphs. This does not include the
   time needed to set up the graph.

HIP graph usage
--------------------------------------------------------------------------------

Using HIP graphs to execute your work requires three different steps, of which
the first two form the initial setup and only need to be executed once. The
first step is the definition of the operations (nodes) and the dependencies
(edges) between them. The second step is the instantiation of the graph, which
validates and initializes the graph to reduce the overhead when executing it.

The third step is the actual execution of the graph, which then takes care of
launching all the kernels and executing the operations while respecting their
dependencies and necessary synchronizations as specified.

As HIP graphs require some setup and initialization overhead before their
first execution, they only provide a benefit for workloads that require many
iterations to complete.
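
A hedged end-to-end sketch of the three steps (the captured kernel is
illustrative):

.. code-block:: cpp

   // Step 1: define the graph, here via stream capture.
   hipGraph_t graph;
   hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
   myKernel<<<grid, block, 0, stream>>>(data);
   hipStreamEndCapture(stream, &graph);

   // Step 2: instantiate once; validation and initialization happen here.
   hipGraphExec_t graphExec;
   hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

   // Step 3: launch the executable graph many times at low per-launch cost.
   for (int i = 0; i < iterations; ++i) {
       hipGraphLaunch(graphExec, stream);
   }
   hipStreamSynchronize(stream);

   hipGraphExecDestroy(graphExec);
   hipGraphDestroy(graph);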

Error handling
==============
4 changes: 1 addition & 3 deletions docs/understand/hipgraph.rst
@@ -54,9 +54,7 @@ The following figure visualizes the concept of graphs, compared to using streams
Node types
--------------------------------------------------------------------------------

The node types are specified by `hipGraphNodeType`:

:cpp:enum-class:`hipGraphNodeType`
.. doxygenenum:: hipGraphNodeType

Memory management nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8 changes: 7 additions & 1 deletion docs/understand/texture_fetching.rst
@@ -1,6 +1,6 @@
.. meta::
:description: This chapter describes the texture fetching modes of the HIP ecosystem
ROCm software.
ROCm software.
:keywords: AMD, ROCm, HIP, Texture, Texture Fetching

*******************************************************************************
@@ -36,6 +36,7 @@ Texture sampling handles the usage of fractional indices. It is the method that
The various texture sampling methods are discussed in the following sections.

.. _texture_fetching_nearest:

Nearest point sampling
-------------------------------------------------------------------------------

@@ -57,6 +58,7 @@ The following image shows a texture stretched to a 4x4 pixel quad but still inde
Texture upscaled with nearest point sampling

.. _texture_fetching_linear:

Linear filtering
-------------------------------------------------------------------------------

@@ -87,6 +89,7 @@ Texture addressing mode handles the index that is out of bounds of the texture.
The following sections describe the various texture addressing methods.
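
A hedged sketch of where the addressing (and filtering) modes are configured
when creating a texture object (``resDesc`` is assumed to be a prepared
``hipResourceDesc``):

.. code-block:: cpp

   hipTextureDesc texDesc = {};
   texDesc.addressMode[0] = hipAddressModeWrap;   // out-of-bounds x: wrap
   texDesc.addressMode[1] = hipAddressModeClamp;  // out-of-bounds y: clamp
   texDesc.filterMode = hipFilterModeLinear;      // linear filtering
   texDesc.normalizedCoords = 1;                  // index with [0.0, 1.0]

   hipTextureObject_t texObj;
   hipCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);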

.. _texture_fetching_border:

Address mode border
-------------------------------------------------------------------------------

@@ -104,6 +107,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_wrap:

Address mode wrap
-------------------------------------------------------------------------------

@@ -125,6 +129,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_mirror:

Address mode mirror
-------------------------------------------------------------------------------

@@ -142,6 +147,7 @@ The following image shows the texture on a 4x4 pixel quad, indexed in the [0 to
The purple lines are not part of the texture. They only denote the edge where the addressing begins.

.. _texture_fetching_clamp:

Address mode clamp
-------------------------------------------------------------------------------

