Add how-to chapter for HIP graphs

ROCm · Sep 12, 2024 · 28d1384 · 28d1384
1 parent 5f27a8a
commit 28d1384
Show file tree

Hide file tree

Showing 5 changed files with 469 additions and 226 deletions.
diff --git a/docs/how-to/hipgraph.rst b/docs/how-to/hipgraph.rst
@@ -0,0 +1,210 @@
+.. meta::
+    :description: This chapter describes how to use HIP graphs.
+    :keywords: ROCm, HIP, graph, stream
+
+.. _how_to_HIP_graph:
+
+********************************************************************************
+Using HIP graphs
+********************************************************************************
+
+This chapter explains how to create and use HIP graphs. To get a better understanding of
+HIP graphs see :ref:`the understand-chapter about HIP graphs<understand_HIP_graph>`.
+
+There are two different ways of creating graphs: Capturing kernel launches from a stream, or explicitly creating graphs.
+
+Either way ends up with a ``hipGraph_t``, which is a template for a graph.
+In order to actually launch a graph, the template needs to be instantiated using ``hipGraphInstantiate``,
+which results in an executable graph of type ``hipGraphExec_t``.
+This executable graph can then be launched with ``hipGraphLaunch``,
+replaying the operations within the graph.
+
+Stream capture
+=================================
+
+The easy way to integrate graphs into already existing code is to use stream capture.
+
+When starting to capture operations for a graph using ``hipStreamBeginCapture``,
+the operations assigned to the stream are captured into a graph instead of being
+executed. That graph is returned when calling ``hipStreamEndCapture``, which
+also stops capturing operations.
+
+The following code is an example of how to use the HIP graph API to capture a graph from a stream
+
+.. code-block:: cpp
+
+    #include <hip/hip_runtime.h>
+    #include <vector>
+
+    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
+
+    __global__ void kernelA(double* arrayA, size_t size);
+    __global__ void kernelB(int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+
+    int main(){
+
+        size_t array_size = 1U << 20;
+        int numOfBlocks = 1024;
+        int threadsPerBlock = 1024;
+
+        double* d_arrayA;
+        int* d_arrayB;
+        std::vector<double> h_array(array_size);
+
+        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
+        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+
+        hipStream_t captureStream;
+        HIP_CHECK(hipStreamCreate(&captureStream));
+        // Start capturing the operations
+        HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
+
+        HIP_CHECK(hipMemcpy(d_arrayA, &h_array, array_size * sizeof(*d_arrayA), hipMemcpyHostToDevice));
+
+        kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, array_size);
+        kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, array_size);
+
+        HIP_CHECK(hipDeviceSynchronize());
+
+        kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, array_size);
+
+        HIP_CHECK(hipMemcpy(&h_array, d_arrayA, array_size * sizeof(*d_arrayA), hipMemcpyDeviceToHost));
+
+        hipGraph_t graph;
+        HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
+
+        // Create an executable graph from the captured graph.
+        hipGraphExec_t graphExec;
+        HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+
+        // Actually launch the graph. The stream does not have
+        // to be the same as the one used for capturing.
+        HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
+
+        HIP_CHECK(hipGraphExecDestroy(graphExec));
+        HIP_CHECK(hipFree(d_arrayA));
+        HIP_CHECK(hipFree(d_arrayB));
+        HIP_CHECK(hipGraphDestroy(graph));
+        HIP_CHECK(hipStreamDestroy(captureStream));
+    }
+
+Direct graph creation
+=================================
+
+Graphs can also be created directly using the HIP graph API, giving more fine-grained control over the graph.
+The nodes are represented by ``hipGraphNode_t``, and the specific parameters
+have a separate type each, e.g. ``hipKernelNodeParams``. Depending on the
+operation, the function to call for adding the node varies. For kernel nodes
+it is ``hipGraphAddKernelNode``, or for memory copies it is ``hipGraphAddMemcpyNode``.
+For a full list see the :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.
+
+.. code-block:: cpp
+
+    #include <hip/hip_runtime.h>
+    #include <vector>
+
+    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
+
+    __global__ void kernelA(double* arrayA, size_t size);
+    __global__ void kernelB(int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+
+    int main(){
+
+        size_t array_size = 1U << 20;
+        int numberOfBlocks = 1024;
+        int threadsPerBlock = 1024;
+
+        double* d_arrayA;
+        int* d_arrayB;
+        std::vector<double> h_array(array_size);
+
+        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
+        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+
+        // Set up parameters for kernel and copy nodes
+        hipKernelNodeParams kernelAParams, kernelBParams, kernelCParams;
+        hipMemcpy3DParms cpyToDevParams, cpyToHostParams;
+
+        void* kernelAArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&array_size)};
+        kernelAParams.func = reinterpret_cast<void*>(kernelA);
+        kernelAParams.gridDim = numberOfBlocks;
+        kernelAParams.blockDim = threadsPerBlock;
+        kernelAParams.sharedMemBytes = 0;
+        kernelAParams.kernelParams = kernelAArgs;
+        kernelAParams.extra = nullptr;
+
+        void* kernelBArgs[] = {static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
+        kernelBParams.func = reinterpret_cast<void*>(kernelB);
+        kernelAParams.gridDim = numberOfBlocks;
+        kernelAParams.blockDim = threadsPerBlock;
+        kernelAParams.sharedMemBytes = 0;
+        kernelAParams.kernelParams = kernelBArgs;
+        kernelAParams.extra = nullptr;
+
+        void* kernelCArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
+        kernelCParams.func = reinterpret_cast<void*>(kernelC);
+        kernelAParams.gridDim = numberOfBlocks;
+        kernelAParams.blockDim = threadsPerBlock;
+        kernelAParams.sharedMemBytes = 0;
+        kernelAParams.kernelParams = kernelCArgs;
+        kernelAParams.extra = nullptr;
+
+        cpyToDevParams.srcArray = nullptr;
+        cpyToDevParams.srcPos = make_hipPos(0, 0, 0);
+        cpyToDevParams.srcPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
+        cpyToDevParams.dstArray = nullptr;
+        cpyToDevParams.dstPos = make_hipPos(0, 0, 0);
+        cpyToDevParams.dstPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
+        cpyToDevParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
+        cpyToDevParams.kind = hipMemcpyHostToDevice;
+
+        cpyToHostParams.srcArray = nullptr;
+        cpyToHostParams.srcPos = make_hipPos(0, 0, 0);
+        cpyToHostParams.srcPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
+        cpyToHostParams.dstArray = nullptr;
+        cpyToHostParams.dstPos = make_hipPos(0, 0, 0);
+        cpyToHostParams.dstPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
+        cpyToHostParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
+        cpyToHostParams.kind = hipMemcpyDeviceToHost;
+
+        // Create graph and add nodes with their respective parameters
+        hipGraph_t graph;
+        hipGraphNode_t kernelANode, kernelBNode, kernelCNode, cpyToDevNode, cpyToHostNode;
+        HIP_CHECK(hipGraphCreate(&graph, 0));
+
+        // Add copy operations
+        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToDevNode, graph, nullptr, 0, &cpyToDevParams));
+        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToHostNode, graph, nullptr, 0, &cpyToHostParams));
+
+        // Add kernels to graph
+        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, nullptr, 0, &kernelAParams));
+        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, nullptr, 0, &kernelBParams));
+        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, nullptr, 0, &kernelCParams));
+    
+        // Add dependencies between nodes
+        // kernels A and B have to wait for the copy operation
+        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelANode, 1));
+        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelBNode, 1));
+        // kernel C is dependent on kernels A and B
+        HIP_CHECK(hipGraphAddDependencies(graph, &kernelANode, &kernelCNode, 1));
+        HIP_CHECK(hipGraphAddDependencies(graph, &kernelBNode, &kernelCNode, 1));
+        // The copy back to the host has to wait for kernel C to finish
+        HIP_CHECK(hipGraphAddDependencies(graph, &kernelCNode, &cpyToHostNode, 1));
+    
+        // Instantiate graph the just created graph in order to execute it
+        hipGraphExec_t graphExec;
+        HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+
+        // Launch the executable graph
+        hipStream_t graphStream;
+        HIP_CHECK(hipStreamCreate(&graphStream));
+        HIP_CHECK(hipGraphLaunch(graphExec, graphStream));
+    
+        HIP_CHECK(hipGraphExecDestroy(graphExec));
+        HIP_CHECK(hipFree(d_arrayA));
+        HIP_CHECK(hipFree(d_arrayB));
+        HIP_CHECK(hipGraphDestroy(graph));
+        HIP_CHECK(hipStreamDestroy(graphStream));
+    }
diff --git a/docs/index.md b/docs/index.md
@@ -48,6 +48,7 @@ On non-AMD platforms, like NVIDIA, HIP provides header files required to support
 * [Unified memory](./how-to/unified_memory)
 * [Virtual memory](./how-to/virtual_memory)
 * [Cooperative groups](./how-to/cooperative_groups)
+* [HIP graphs](./how-to/hipgraph)
 * {doc}`./how-to/faq`
 
 :::

diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
@@ -34,6 +34,8 @@ subtrees:
     title: Unified memory
   - file: how-to/virtual_memory
     title: Virtual memory
+  - file: how-to/hipgraph
+    title: HIP graphs
   - file: how-to/faq
 
 - caption: Reference

diff --git a/docs/understand/hipgraph.rst b/docs/understand/hipgraph.rst
@@ -2,7 +2,7 @@
     :description: This chapter provides an overview over the usage of HIP graph.
     :keywords: ROCm, HIP, graph, stream
 
-.. understand_HIP_graph:
+.. _understand_HIP_graph:
 
 ********************************************************************************
 HIP graph
@@ -18,7 +18,7 @@ A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent
 the operations performed, while the edges mark dependencies between those
 operations.
 
-The nodes can consist of:
+The nodes can be one of the following:
 
 - empty nodes
 - nested graphs
@@ -36,6 +36,35 @@ The following figure visualizes the concept of graphs, compared to using streams
           hipDeviceSynchronize, or using graphs, where the edges denote the
           dependencies.
 
+Node types
+--------------------------------------------------------------------------------
+
+The available node types are specified by :cpp:enumerator:`hipGraphNodeType`.
+
+Memory management nodes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Memory management nodes handle allocating and freeing of memory of a graph.
+Memory management nodes can be created by using the explicit API functions, or
+by capturing :cpp:func:`hipMallocAsync` or :cpp:func:`hipFreeAsync`.
+Unlike the normal memory management API, which is controlled by host-side execution,
+this enables HIP to take care of memory reuse and optimizations.
+The lifetime of memory allocated in a graph begins when the execution reaches the
+node allocating the memory, and ends when either reaching the corresponding
+free node within the graph, or after graph execution with a corresponding
+:cpp:func:`hipFreeAsync` call, or a corresponding :cpp:func:`hipFree` call.
+The memory can also be freed with a free node in a different graph that is
+associated with the same memory address.
+
+The same rules as for normal memory allocations apply for memory allocated and
+freed by nodes, meaning that the nodes that access memory allocated in a graph
+must be ordered after the allocation node and before the freeing node.
+
+These memory allocations can also be set up to allow access from multiple GPUs,
+just like normal allocations. HIP then takes care of allocating and mapping the
+memory to the GPUs. When capturing a graph from a stream, the node sets the
+accessibility according to hipMemPoolSetAccess at the time of capturing.
+
 HIP graph advantages
 ================================================================================
 
@@ -80,5 +109,6 @@ Setting up HIP graphs
 ================================================================================
 
 HIP graphs can be created by explicitly defining them, or using stream capture.
+For further information on how to use HIP graphs see :ref:`the how-to-chapter about HIP graphs<how_to_HIP_graph>`.
 For the available functions see the
 :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.