diff --git a/.wordlist.txt b/.wordlist.txt index 170801a383..ab8bbe859d 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -89,7 +89,9 @@ overindexing oversubscription pixelated pragmas +preallocated preconditioners +predefining prefetched preprocessor PTX diff --git a/docs/data/understand/hipgraph/hip_graph.drawio b/docs/data/how-to/hipgraph/hip_graph.drawio similarity index 100% rename from docs/data/understand/hipgraph/hip_graph.drawio rename to docs/data/how-to/hipgraph/hip_graph.drawio diff --git a/docs/data/understand/hipgraph/hip_graph.svg b/docs/data/how-to/hipgraph/hip_graph.svg similarity index 100% rename from docs/data/understand/hipgraph/hip_graph.svg rename to docs/data/how-to/hipgraph/hip_graph.svg diff --git a/docs/data/understand/hipgraph/hip_graph_speedup.drawio b/docs/data/how-to/hipgraph/hip_graph_speedup.drawio similarity index 100% rename from docs/data/understand/hipgraph/hip_graph_speedup.drawio rename to docs/data/how-to/hipgraph/hip_graph_speedup.drawio diff --git a/docs/data/understand/hipgraph/hip_graph_speedup.svg b/docs/data/how-to/hipgraph/hip_graph_speedup.svg similarity index 100% rename from docs/data/understand/hipgraph/hip_graph_speedup.svg rename to docs/data/how-to/hipgraph/hip_graph_speedup.svg diff --git a/docs/how-to/hipgraph.rst b/docs/how-to/hipgraph.rst index 46293f6aeb..c583a3de56 100644 --- a/docs/how-to/hipgraph.rst +++ b/docs/how-to/hipgraph.rst @@ -1,210 +1,427 @@ .. meta:: - :description: This chapter describes how to use HIP graphs. + :description: This chapter describes how to use HIP graphs and highlights their use cases. :keywords: ROCm, HIP, graph, stream .. _how_to_HIP_graph: ******************************************************************************** -Using HIP graphs +HIP graphs ******************************************************************************** -This chapter explains how to create and use HIP graphs. To get a better understanding of -HIP graphs see :ref:`the understand-chapter about HIP graphs`. - -There are two different ways of creating graphs: Capturing kernel launches from a stream, or explicitly creating graphs. +.. note:: + The HIP graph API is currently in Beta. Some features can change and might + have outstanding issues. Not all features supported by CUDA graphs are yet + supported. For a list of all currently supported functions see the + :doc:`HIP graph API documentation<../doxygen/html/group___graph>`. + +HIP graphs are an alternative way of executing tasks on a GPU that can provide +performance benefits over launching kernels using the standard +method via streams. A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent +the operations performed, while the edges mark dependencies between those +operations. + +The nodes can be one of the following: + +- empty nodes +- nested graphs +- kernel launches +- host-side function calls +- HIP memory functions (copy, memset, ...) +- HIP events +- signalling or waiting on external semaphores + +.. note:: + The available node types are specified by ``hipGraphNodeType``. + +The following figure visualizes the concept of graphs, compared to using streams. + +.. figure:: ../data/how-to/hipgraph/hip_graph.svg + :alt: Diagram depicting the difference between using streams to execute + kernels with dependencies, resolved by explicitly calling + hipDeviceSynchronize, or using graphs, where the edges denote the + dependencies. + +The standard method of launching kernels incurs a small overhead +for each iteration of the operation involved. 
For kernels that perform large
+operations during an iteration, this overhead is usually negligible. However,
+in many workloads, such as scientific simulations and AI, a kernel might perform a
+small operation over a great number of iterations, and so the overhead of repeatedly
+launching kernels can have a significant impact on performance.
+
+HIP graphs are designed to address this issue by predefining the HIP API calls
+and their dependencies in a graph and performing most of the initialization
+beforehand. Launching a graph requires only a single call, after which the
+driver takes care of executing the operations within the graph.
+Graphs can provide additional performance benefits by enabling optimizations
+that are only possible when the dependencies between the operations are known.
+
+.. figure:: ../data/how-to/hipgraph/hip_graph_speedup.svg
+   :alt: Diagram depicting the speed up achievable with HIP graphs compared to
+         HIP streams when launching many short-running kernels.
+
+   Qualitative presentation of the execution time of many short-running kernels
+   when launched using HIP streams versus HIP graphs. This does not include the
+   time needed to set up the graph.

-Either way ends up with a ``hipGraph_t``, which is a template for a graph.
-In order to actually launch a graph, the template needs to be instantiated using ``hipGraphInstantiate``,
-which results in an executable graph of type ``hipGraphExec_t``.
-This executable graph can then be launched with ``hipGraphLaunch``,
-replaying the operations within the graph.

-Stream capture
-=================================
+********************************************************************************
+Using HIP graphs
+********************************************************************************

-The easy way to integrate graphs into already existing code is to use stream capture.
+There are two different ways of creating graphs: capturing kernel launches from
+a stream, or explicitly creating graphs. The difference between the two
+approaches is explained later in this chapter.
+
+The general flow for using HIP graphs includes the following steps.
+
+#. Create a ``hipGraph_t`` graph template using one of the two approaches described in this chapter
+#. Create a ``hipGraphExec_t`` executable instance of the graph template using ``hipGraphInstantiate``
+#. Use ``hipGraphLaunch`` to launch the executable graph on a stream
+#. After execution completes, free and destroy the graph resources
+
+The first two steps are the initial setup and only need to be executed once. The first
+step is the definition of the operations (nodes) and the dependencies (edges)
+between them. The second step is the instantiation of the graph. This takes care
+of validating and initializing the graph, to reduce the overhead when executing
+the graph. The third step is the execution of the graph, which takes care of
+launching all the kernels and executing the operations while respecting their
+dependencies and the necessary synchronizations, as specified.
+
+Because HIP graphs require some setup and initialization overhead before their
+first execution, graphs only provide a benefit for workloads that require
+many iterations to complete.
+
+In both methods, the ``hipGraph_t`` graph template is used to define the graph.
+In order to actually launch a graph, the template first needs to be instantiated using
+``hipGraphInstantiate``, which results in an executable graph of type ``hipGraphExec_t``.
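+
+The following is a minimal sketch of steps 2 to 4. It assumes that a graph
+template ``graph`` has already been defined using one of the two approaches
+described in this chapter, and that a stream ``stream`` exists; the function
+name and the ``HIP_CHECK`` error-checking macro are only illustrative and are
+not part of the HIP API.
+
+.. code-block:: cpp
+
+    // Sketch: instantiate a previously defined graph template and launch it.
+    void launch_graph_template(hipGraph_t graph, hipStream_t stream){
+        hipGraphExec_t graphExec;
+        HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
+
+        // The template is no longer needed once the executable graph exists.
+        HIP_CHECK(hipGraphDestroy(graph));
+
+        // The executable graph can be launched on a stream, possibly many times.
+        HIP_CHECK(hipGraphLaunch(graphExec, stream));
+        HIP_CHECK(hipStreamSynchronize(stream));
+
+        // Free the executable graph once it is no longer needed.
+        HIP_CHECK(hipGraphExecDestroy(graphExec));
+    }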
+This executable graph can then be launched with ``hipGraphLaunch``, replaying the
+operations within the graph. Note that launching graphs is fundamentally no
+different from executing other HIP functions on a stream, except that
+scheduling the operations within the graph involves less overhead and can
+enable some optimizations; the graph still needs to be associated with a
+stream for execution.
+
+Memory management
+-----------------
+
+Memory that is used by operations in graphs can either be preallocated or
+managed within the graph. Graphs can contain nodes that take care of allocating
+memory on the device or copying memory between the host and the device.
+Whether you want to preallocate the memory or manage it within the graph
+depends on the use case. If the graph is executed in a tight loop, the
+performance is usually better when the memory is preallocated, so that it
+does not need to be reallocated in every iteration.
+
+The same rules as for normal memory allocations apply for memory allocated and
+freed by nodes, meaning that the nodes that access memory allocated in a graph
+must be ordered after the allocation and before the freeing.
+
+Unlike the normal memory management API, which is controlled by host-side
+execution, graph-managed memory enables the driver to take care of memory reuse
+and other optimizations.
+The lifetime of memory allocated in a graph begins when the execution reaches the
+node allocating the memory, and ends either when the corresponding free node
+within the graph is reached, or, after graph execution, when a corresponding
+:cpp:func:`hipFreeAsync` or :cpp:func:`hipFree` call is made.
+The memory can also be freed with a free node in a different graph that is
+associated with the same memory address.
+
+These memory allocations can also be set up to allow access from multiple GPUs,
+just like normal allocations. HIP then takes care of allocating and mapping the
+memory to the GPUs. When capturing a graph from a stream, the node sets the
+accessibility according to ``hipMemPoolSetAccess`` at the time of capturing.
+
+
+Capture graphs from a stream
+================================================================================
+
+The easiest way to integrate HIP graphs into existing code is to use
+``hipStreamBeginCapture`` and ``hipStreamEndCapture`` to obtain a ``hipGraph_t``
+graph template that includes the captured operations.
 
 When starting to capture operations for a graph using ``hipStreamBeginCapture``,
 the operations assigned to the stream are captured into a graph instead of being
-executed. That graph is returned when calling ``hipStreamEndCapture``, which
+executed. The associated graph is returned when calling ``hipStreamEndCapture``, which
 also stops capturing operations.
+In order to capture to an already existing graph, use ``hipStreamBeginCaptureToGraph``.
+
+The functions assigned to the capturing stream are not executed, but instead are
+captured and defined as nodes in the graph, to be run when the graph is
+instantiated and launched.
+
+Only functions that are launched to the stream are captured, meaning that
+host instructions that the GPU operations depend on need to be scheduled
+using ``hipLaunchHostFunc``, as shown in the following code examples.
+Host functions that are not executed using ``hipLaunchHostFunc`` will be executed
+as encountered and are not captured.
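+
+The following is a brief sketch of this pattern. The callback must have the
+``void(void*)`` signature that ``hipLaunchHostFunc`` expects; the function and
+variable names used here are illustrative only.
+
+.. code-block:: cpp
+
+    // Host callback with the signature expected by hipLaunchHostFunc.
+    void host_step(void* userData){
+        int* hostFlag = static_cast<int*>(userData);
+        *hostFlag = 1; // Host-side work that later GPU operations rely on.
+    }
+
+    void capture_host_function(hipStream_t captureStream, int* hostFlag){
+        // While captureStream is being captured, this call is recorded as a
+        // host node in the graph instead of being executed immediately.
+        HIP_CHECK(hipLaunchHostFunc(captureStream, host_step, hostFlag));
+    }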
-The following code is an example of how to use the HIP graph API to capture a graph from a stream
+Synchronous HIP API calls that are implicitly assigned to the default stream are
+not permitted while capturing a stream, and will return an error. This is
+because they implicitly synchronize and cause a dependency that cannot be
+captured within the stream. This includes functions like ``hipMalloc``,
+``hipMemcpy`` and ``hipFree``. In order to capture these operations in the stream,
+replace them with the corresponding asynchronous calls like ``hipMallocAsync``,
+``hipMemcpyAsync`` or ``hipFreeAsync``.
+
+The general flow for using stream capture in order to get a graph template is:
+
+#. Create a stream from which to capture the operations
+#. Call ``hipStreamBeginCapture`` before the first operation to be captured
+#. Call ``hipStreamEndCapture`` after the last operation to be captured
+
+   #. Define a ``hipGraph_t`` graph template to which ``hipStreamEndCapture``
+      passes the captured graph
+
+The following code is an example of how to use the HIP graph API to capture a
+graph from a stream.
 
 .. code-block:: cpp
 
     #include <hip/hip_runtime.h>
     #include <vector>
 
-    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
-
+    // HIP_CHECK is assumed to be an error-handling macro that checks the
+    // hipError_t returned by HIP API calls.
+
     __global__ void kernelA(double* arrayA, size_t size);
     __global__ void kernelB(int* arrayB, size_t size);
-    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size);
+
+    struct set_vector_args{
+        std::vector<double>& h_array;
+        double value;
+    };
 
-    int main(){
+    void set_vector(void* args){
+        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
 
-        size_t array_size = 1U << 20;
-        int numOfBlocks = 1024;
-        int threadsPerBlock = 1024;
+        std::vector<double>& vec{h_args.h_array};
+        vec.assign(vec.size(), h_args.value);
+    }
+
+    void stream_capture_example(){
+        constexpr int numOfBlocks = 1024;
+        constexpr int threadsPerBlock = 1024;
+        constexpr size_t arraySize = 1U << 20;
+
+        // This example assumes that kernelA operates on data that needs to be initialized on
+        // and copied from the host, while kernelB initializes the array that is passed to it.
+        // Both arrays are then used as input to kernelC, where arrayA is also used as
+        // output that is copied back to the host, while arrayB is only read from and not modified.
         double* d_arrayA;
         int* d_arrayB;
-        std::vector<double> h_array(array_size);
-
-        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
-        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+        std::vector<double> h_array(arraySize);
+        constexpr double initValue = 2.0;
 
         hipStream_t captureStream;
         HIP_CHECK(hipStreamCreate(&captureStream));
 
-        // Start capturing the operations
+
+        // Start capturing the operations assigned to the stream
         HIP_CHECK(hipStreamBeginCapture(captureStream, hipStreamCaptureModeGlobal));
 
-        HIP_CHECK(hipMemcpy(d_arrayA, &h_array, array_size * sizeof(*d_arrayA), hipMemcpyHostToDevice));
+        // hipMallocAsync and hipMemcpyAsync are needed to be able to assign the operations to a stream
+        HIP_CHECK(hipMallocAsync(&d_arrayA, arraySize*sizeof(double), captureStream));
+        HIP_CHECK(hipMallocAsync(&d_arrayB, arraySize*sizeof(int), captureStream));
+
+        // Assign the host function to the stream
+        // Needs a custom struct to pass the arguments
+        set_vector_args args{h_array, initValue};
+        HIP_CHECK(hipLaunchHostFunc(captureStream, set_vector, &args));
 
-        kernelA<<<numOfBlocks, threadsPerBlock>>>(d_arrayA, array_size);
-        kernelB<<<numOfBlocks, threadsPerBlock>>>(d_arrayB, array_size);
+        HIP_CHECK(hipMemcpyAsync(d_arrayA, h_array.data(), arraySize*sizeof(double), hipMemcpyHostToDevice, captureStream));
 
-        HIP_CHECK(hipDeviceSynchronize());
+        kernelA<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, arraySize);
+        kernelB<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayB, arraySize);
+        kernelC<<<numOfBlocks, threadsPerBlock, 0, captureStream>>>(d_arrayA, d_arrayB, arraySize);
 
-        kernelC<<<numOfBlocks, threadsPerBlock>>>(d_arrayA, d_arrayB, array_size);
+        HIP_CHECK(hipMemcpyAsync(h_array.data(), d_arrayA, arraySize*sizeof(*d_arrayA), hipMemcpyDeviceToHost, captureStream));
 
-        HIP_CHECK(hipMemcpy(&h_array, d_arrayA, array_size * sizeof(*d_arrayA), hipMemcpyDeviceToHost));
+        HIP_CHECK(hipFreeAsync(d_arrayA, captureStream));
+        HIP_CHECK(hipFreeAsync(d_arrayB, captureStream));
 
+        // Stop capturing
         hipGraph_t graph;
         HIP_CHECK(hipStreamEndCapture(captureStream, &graph));
 
-        // Create an executable graph from the captured graph.
+        // Create an executable graph from the captured graph
         hipGraphExec_t graphExec;
         HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0));
 
+        // The graph template can be deleted after the instantiation if it's not needed for later use
+        HIP_CHECK(hipGraphDestroy(graph));
+
         // Actually launch the graph. The stream does not have
         // to be the same as the one used for capturing.
         HIP_CHECK(hipGraphLaunch(graphExec, captureStream));
 
+        // Free graph and stream resources after usage
         HIP_CHECK(hipGraphExecDestroy(graphExec));
-        HIP_CHECK(hipFree(d_arrayA));
-        HIP_CHECK(hipFree(d_arrayB));
-        HIP_CHECK(hipGraphDestroy(graph));
         HIP_CHECK(hipStreamDestroy(captureStream));
     }
 
-Direct graph creation
-=================================
 
-Graphs can also be created directly using the HIP graph API, giving more fine-grained control over the graph.
-The nodes are represented by ``hipGraphNode_t``, and the specific parameters
-have a separate type each, e.g. ``hipKernelNodeParams``. Depending on the
-operation, the function to call for adding the node varies. For kernel nodes
-it is ``hipGraphAddKernelNode``, or for memory copies it is ``hipGraphAddMemcpyNode``.
-For a full list see the :doc:`HIP graph API documentation<../doxygen/html/group___graph>`.
+Explicit graph creation
+================================================================================
+
+Graphs can also be created directly using the HIP graph API, giving more
+fine-grained control over the graph.
+In this case, the graph nodes are created explicitly, together with their
+parameters and dependencies, which specify the edges of the graph, thereby
+forming the graph structure.
+
+The nodes are represented by the generic ``hipGraphNode_t`` type. The actual
+node type is implicitly defined by the specific function used to add the node to
+the graph, for example ``hipGraphAddKernelNode``. See the
+:doc:`HIP graph API documentation<../doxygen/html/group___graph>` for the
+available functions; they are of the form ``hipGraphAdd{Type}Node``. Each type of
+node also has a predefined set of parameters depending on the operation, for
+example ``hipKernelNodeParams`` for a kernel launch. See the
+:doc:`documentation for the general hipGraphNodeParams type<../doxygen/html/structhip_graph_node_params>`
+for a list of available parameter types and their members.
+
+The general flow for explicitly creating a graph is usually:
+
+#. Create a ``hipGraph_t`` graph
+#. Create the nodes and their parameters and add them to the graph
+
+   #. Define a ``hipGraphNode_t``
+   #. Define the parameter struct for the desired operation, by explicitly setting the appropriate struct's members
+   #. Use the appropriate ``hipGraphAdd{Type}Node`` function to add the node to the graph
+
+      #. The dependencies can be defined when adding the node to the graph, or afterwards by using ``hipGraphAddDependencies``
+
+The following code example demonstrates how to explicitly create nodes in order to create a graph.
 
 .. code-block:: cpp
 
     #include <hip/hip_runtime.h>
     #include <vector>
 
-    #define HIP_CHECK(c){if(c != hipSuccess) return -1;}
-
     __global__ void kernelA(double* arrayA, size_t size);
     __global__ void kernelB(int* arrayB, size_t size);
-    __global__ void kernelC(double* arrayA, int* arrayB, size_t size);
+    __global__ void kernelC(double* arrayA, const int* arrayB, size_t size);
 
-    int main(){
+    struct set_vector_args{
+        std::vector<double>& h_array;
+        double value;
+    };
 
-        size_t array_size = 1U << 20;
-        int numberOfBlocks = 1024;
-        int threadsPerBlock = 1024;
+    void set_vector(void* args){
+        set_vector_args h_args{*(reinterpret_cast<set_vector_args*>(args))};
 
-        double* d_arrayA;
-        int* d_arrayB;
-        std::vector<double> h_array(array_size);
+        std::vector<double>& vec{h_args.h_array};
+        vec.assign(vec.size(), h_args.value);
+    }
 
-        HIP_CHECK(hipMalloc(&d_arrayA, array_size * sizeof(*d_arrayA)));
-        HIP_CHECK(hipMalloc(&d_arrayB, array_size * sizeof(*d_arrayB)));
+    void explicit_graph_example(){
+        constexpr int numOfBlocks = 1024;
+        constexpr int threadsPerBlock = 1024;
+        size_t arraySize = 1U << 20;
 
-        // Set up parameters for kernel and copy nodes
-        hipKernelNodeParams kernelAParams, kernelBParams, kernelCParams;
-        hipMemcpy3DParms cpyToDevParams, cpyToHostParams;
+        // The pointers to the device memory don't need to be declared here,
+        // they are contained within the hipMemAllocNodeParams as the dptr member
+        std::vector<double> h_array(arraySize);
+        constexpr double initValue = 2.0;
 
-        void* kernelAArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&array_size)};
+        // Create an empty graph
+        hipGraph_t graph;
+        HIP_CHECK(hipGraphCreate(&graph, 0));
+
+        // Parameters to allocate the arrays
+        hipMemAllocNodeParams allocArrayAParams{};
+        allocArrayAParams.poolProps.allocType = hipMemAllocationTypePinned;
+        allocArrayAParams.poolProps.location.type = hipMemLocationTypeDevice;
+        allocArrayAParams.poolProps.location.id = 0; // GPU on which memory resides
+        allocArrayAParams.bytesize = arraySize * sizeof(double);
+
+        hipMemAllocNodeParams allocArrayBParams{};
+        allocArrayBParams.poolProps.allocType = hipMemAllocationTypePinned;
+        allocArrayBParams.poolProps.location.type = hipMemLocationTypeDevice;
+        allocArrayBParams.poolProps.location.id = 0; // GPU on which memory resides
+        allocArrayBParams.bytesize = arraySize * sizeof(int);
+
+        // Add the allocation nodes to the graph. They don't have any dependencies
+        hipGraphNode_t allocNodeA, allocNodeB;
+        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeA, graph, nullptr, 0, &allocArrayAParams));
+        HIP_CHECK(hipGraphAddMemAllocNode(&allocNodeB, graph, nullptr, 0, &allocArrayBParams));
+
+        // Parameters for the host function
+        // Needs a custom struct to pass the arguments
+        set_vector_args args{h_array, initValue};
+        hipHostNodeParams hostParams{};
+        hostParams.fn = set_vector;
+        hostParams.userData = static_cast<void*>(&args);
+
+        // Add the host node that initializes the host array. It also doesn't have any dependencies
+        hipGraphNode_t hostNode;
+        HIP_CHECK(hipGraphAddHostNode(&hostNode, graph, nullptr, 0, &hostParams));
+
+        // Add the memory copy node that copies the initialized host array to the device.
+        // It has to wait for the host array to be initialized and the device memory to be allocated
+        hipGraphNode_t cpyNodeDependencies[] = {allocNodeA, hostNode};
+        hipGraphNode_t cpyToDevNode;
+        HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToDevNode, graph, cpyNodeDependencies, 2, allocArrayAParams.dptr, h_array.data(), arraySize * sizeof(double), hipMemcpyHostToDevice));
+
+        // Parameters for kernelA
+        hipKernelNodeParams kernelAParams;
+        void* kernelAArgs[] = {&allocArrayAParams.dptr, static_cast<void*>(&arraySize)};
         kernelAParams.func = reinterpret_cast<void*>(kernelA);
-        kernelAParams.gridDim = numberOfBlocks;
+        kernelAParams.gridDim = numOfBlocks;
         kernelAParams.blockDim = threadsPerBlock;
         kernelAParams.sharedMemBytes = 0;
         kernelAParams.kernelParams = kernelAArgs;
         kernelAParams.extra = nullptr;
 
-        void* kernelBArgs[] = {static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
-        kernelBParams.func = reinterpret_cast<void*>(kernelB);
-        kernelAParams.gridDim = numberOfBlocks;
-        kernelAParams.blockDim = threadsPerBlock;
-        kernelAParams.sharedMemBytes = 0;
-        kernelAParams.kernelParams = kernelBArgs;
-        kernelAParams.extra = nullptr;
+        // Add the node for kernelA. It has to wait for the memory copy to finish, as it depends on the values from the host array.
+        hipGraphNode_t kernelANode;
+        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, &cpyToDevNode, 1, &kernelAParams));
 
-        void* kernelCArgs[] = {static_cast<void*>(&d_arrayA), static_cast<void*>(&d_arrayB), static_cast<void*>(&array_size)};
+        // Parameters for kernelB
+        hipKernelNodeParams kernelBParams;
+        void* kernelBArgs[] = {&allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
+        kernelBParams.func = reinterpret_cast<void*>(kernelB);
+        kernelBParams.gridDim = numOfBlocks;
+        kernelBParams.blockDim = threadsPerBlock;
+        kernelBParams.sharedMemBytes = 0;
+        kernelBParams.kernelParams = kernelBArgs;
+        kernelBParams.extra = nullptr;
+
+        // Add the node for kernelB. It only has to wait for the memory to be allocated, as it initializes the array.
+        hipGraphNode_t kernelBNode;
+        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, &allocNodeB, 1, &kernelBParams));
+
+        // Parameters for kernelC
+        hipKernelNodeParams kernelCParams;
+        void* kernelCArgs[] = {&allocArrayAParams.dptr, &allocArrayBParams.dptr, static_cast<void*>(&arraySize)};
         kernelCParams.func = reinterpret_cast<void*>(kernelC);
-        kernelAParams.gridDim = numberOfBlocks;
-        kernelAParams.blockDim = threadsPerBlock;
-        kernelAParams.sharedMemBytes = 0;
-        kernelAParams.kernelParams = kernelCArgs;
-        kernelAParams.extra = nullptr;
-
-        cpyToDevParams.srcArray = nullptr;
-        cpyToDevParams.srcPos = make_hipPos(0, 0, 0);
-        cpyToDevParams.srcPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
-        cpyToDevParams.dstArray = nullptr;
-        cpyToDevParams.dstPos = make_hipPos(0, 0, 0);
-        cpyToDevParams.dstPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
-        cpyToDevParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
-        cpyToDevParams.kind = hipMemcpyHostToDevice;
-
-        cpyToHostParams.srcArray = nullptr;
-        cpyToHostParams.srcPos = make_hipPos(0, 0, 0);
-        cpyToHostParams.srcPtr = make_hipPitchedPtr(d_arrayA, array_size * sizeof(*d_arrayA), array_size, 1);
-        cpyToHostParams.dstArray = nullptr;
-        cpyToHostParams.dstPos = make_hipPos(0, 0, 0);
-        cpyToHostParams.dstPtr = make_hipPitchedPtr(h_array.data(), array_size * sizeof(h_array[0]), array_size, 1);
-        cpyToHostParams.extent = make_hipExtent(array_size * sizeof(*d_arrayA), 1, 1);
-        cpyToHostParams.kind = hipMemcpyDeviceToHost;
-
-        // Create graph and add nodes with their respective parameters
-        hipGraph_t graph;
-        hipGraphNode_t kernelANode, kernelBNode, kernelCNode, cpyToDevNode, cpyToHostNode;
-        HIP_CHECK(hipGraphCreate(&graph, 0));
-
-        // Add copy operations
-        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToDevNode, graph, nullptr, 0, &cpyToDevParams));
-        HIP_CHECK(hipGraphAddMemcpyNode(&cpyToHostNode, graph, nullptr, 0, &cpyToHostParams));
-
-        // Add kernels to graph
-        HIP_CHECK(hipGraphAddKernelNode(&kernelANode, graph, nullptr, 0, &kernelAParams));
-        HIP_CHECK(hipGraphAddKernelNode(&kernelBNode, graph, nullptr, 0, &kernelBParams));
-        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, nullptr, 0, &kernelCParams));
-
-        // Add dependencies between nodes
-        // kernels A and B have to wait for the copy operation
-        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelANode, 1));
-        HIP_CHECK(hipGraphAddDependencies(graph, &cpyToDevNode, &kernelBNode, 1));
-        // kernel C is dependent on kernels A and B
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelANode, &kernelCNode, 1));
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelBNode, &kernelCNode, 1));
-        // The copy back to the host has to wait for kernel C to finish
-        HIP_CHECK(hipGraphAddDependencies(graph, &kernelCNode, &cpyToHostNode, 1));
-
-        // Instantiate graph the just created graph in order to execute it
+        kernelCParams.gridDim = numOfBlocks;
+        kernelCParams.blockDim = threadsPerBlock;
+        kernelCParams.sharedMemBytes = 0;
+        kernelCParams.kernelParams = kernelCArgs;
+        kernelCParams.extra = nullptr;
+
+        // Add the node for kernelC. It has to wait on both kernelA and kernelB to finish, as it depends on their results.
+        hipGraphNode_t kernelCNode;
+        hipGraphNode_t kernelCDependencies[] = {kernelANode, kernelBNode};
+        HIP_CHECK(hipGraphAddKernelNode(&kernelCNode, graph, kernelCDependencies, 2, &kernelCParams));
+
+        // Copy the results back to the host. Has to wait for kernelC to finish.
+ hipGraphNode_t cpyToHostNode; + HIP_CHECK(hipGraphAddMemcpyNode1D(&cpyToHostNode, graph, &kernelCNode, 1, h_array.data(), allocArrayAParams.dptr, arraySize * sizeof(double), hipMemcpyDeviceToHost)); + + // Free array of allocNodeA. It needs to wait for the copy to finish, as kernelC stores its results in it. + hipGraphNode_t freeNodeA; + HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeA, graph, &cpyToHostNode, 1, allocArrayAParams.dptr)); + // Free array of allocNodeB. It only needs to wait for kernelC to finish, as it is not written back to the host. + hipGraphNode_t freeNodeB; + HIP_CHECK(hipGraphAddMemFreeNode(&freeNodeB, graph, &kernelCNode, 1, allocArrayBParams.dptr)); + + // Instantiate the graph in order to execute it hipGraphExec_t graphExec; HIP_CHECK(hipGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0)); - // Launch the executable graph + // The graph can be freed after the instantiation if it's not needed for other purposes + HIP_CHECK(hipGraphDestroy(graph)); + + // Actually launch the graph hipStream_t graphStream; HIP_CHECK(hipStreamCreate(&graphStream)); HIP_CHECK(hipGraphLaunch(graphExec, graphStream)); HIP_CHECK(hipGraphExecDestroy(graphExec)); - HIP_CHECK(hipFree(d_arrayA)); - HIP_CHECK(hipFree(d_arrayB)); - HIP_CHECK(hipGraphDestroy(graph)); HIP_CHECK(hipStreamDestroy(graphStream)); } diff --git a/docs/how-to/programming_manual.md b/docs/how-to/programming_manual.md index 22847adaf9..bac20c9996 100644 --- a/docs/how-to/programming_manual.md +++ b/docs/how-to/programming_manual.md @@ -146,7 +146,7 @@ For Linux developers, the link [here](https://github.com/ROCm/hip-tests/blob/dev ## HIP Graph -HIP graphs are supported. For more details, refer to the [HIP API Guide](../doxygen/html/group___graph) or the [understand section for HIP graphs](../understand/hipgraph). +HIP graphs are supported. For more details, refer to the [HIP API Guide](../doxygen/html/group___graph) or the [how-to section for HIP graphs](../how-to/hipgraph). ## Device-Side Malloc diff --git a/docs/index.md b/docs/index.md index b9c2a788e6..fed497c24d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,6 @@ On non-AMD platforms, like NVIDIA, HIP provides header files required to support * {doc}`./understand/programming_model` * {doc}`./understand/hardware_implementation` -* {doc}`./understand/hipgraph` * {doc}`./understand/amd_clr` * [Texture fetching](./understand/texture_fetching) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 7a9d56480f..d59455f10b 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -17,7 +17,6 @@ subtrees: entries: - file: understand/programming_model - file: understand/hardware_implementation - - file: understand/hipgraph - file: understand/amd_clr - file: understand/texture_fetching title: Texture fetching @@ -37,7 +36,7 @@ subtrees: - file: how-to/virtual_memory title: Virtual memory - file: how-to/hipgraph - title: HIP graph + title: HIP graphs - file: how-to/faq - caption: Reference diff --git a/docs/understand/hipgraph.rst b/docs/understand/hipgraph.rst deleted file mode 100644 index 12652beb1c..0000000000 --- a/docs/understand/hipgraph.rst +++ /dev/null @@ -1,114 +0,0 @@ -.. meta:: - :description: This chapter provides an overview over the usage of HIP graph. - :keywords: ROCm, HIP, graph, stream - -.. _understand_HIP_graph: - -******************************************************************************** -HIP graph -******************************************************************************** - -.. 
note:: - The HIP graph API is currently in Beta. Some features can change and might - have outstanding issues. Not all features supported by CUDA graphs are yet - supported. For a list of all currently supported functions see the - :doc:`HIP graph API documentation<../doxygen/html/group___graph>`. - -A HIP graph is made up of nodes and edges. The nodes of a HIP graph represent -the operations performed, while the edges mark dependencies between those -operations. - -The nodes can be one of the following: - -- empty nodes -- nested graphs -- kernel launches -- host-side function calls -- HIP memory functions (copy, memset, ...) -- HIP events -- signalling or waiting on external semaphores - -The following figure visualizes the concept of graphs, compared to using streams. - -.. figure:: ../data/understand/hipgraph/hip_graph.svg - :alt: Diagram depicting the difference between using streams to execute - kernels with dependencies, resolved by explicitly calling - hipDeviceSynchronize, or using graphs, where the edges denote the - dependencies. - -Node types --------------------------------------------------------------------------------- - -The available node types are specified by :cpp:enumerator:`hipGraphNodeType`. - -Memory management nodes -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Memory management nodes handle allocating and freeing of memory of a graph. -Memory management nodes can be created by using the explicit API functions, or -by capturing :cpp:func:`hipMallocAsync` or :cpp:func:`hipFreeAsync`. -Unlike the normal memory management API, which is controlled by host-side execution, -this enables HIP to take care of memory reuse and optimizations. -The lifetime of memory allocated in a graph begins when the execution reaches the -node allocating the memory, and ends when either reaching the corresponding -free node within the graph, or after graph execution with a corresponding -:cpp:func:`hipFreeAsync` call, or a corresponding :cpp:func:`hipFree` call. -The memory can also be freed with a free node in a different graph that is -associated with the same memory address. - -The same rules as for normal memory allocations apply for memory allocated and -freed by nodes, meaning that the nodes that access memory allocated in a graph -must be ordered after the allocation node and before the freeing node. - -These memory allocations can also be set up to allow access from multiple GPUs, -just like normal allocations. HIP then takes care of allocating and mapping the -memory to the GPUs. When capturing a graph from a stream, the node sets the -accessibility according to hipMemPoolSetAccess at the time of capturing. - -HIP graph advantages -================================================================================ - -The standard way of launching work on GPUs via streams incurs a small overhead -for each iteration of the operation involved. For kernels that perform large -operations during an iteration this overhead is usually negligible. However -in many workloads, such as scientific simulations and AI, a kernel performs a -small operation for many iterations, and so the overhead of launching kernels -can be a significant cost on performance. - -HIP graphs have been specifically designed to tackle this problem by only -requiring one launch from the host per iteration, and minimizing that overhead -by performing most of the initialization beforehand. 
Graphs can provide -additional performance benefits, by enabling optimizations that are only -possible when knowing the dependencies between the operations. - -.. figure:: ../data/understand/hipgraph/hip_graph_speedup.svg - :alt: Diagram depicting the speed up achievable with HIP graphs compared to - HIP streams when launching many short-running kernels. - - Qualitative presentation of the execution time of many short-running kernels - when launched using HIP stream versus HIP graph. This does not include the - time needed to set up the graph. - -HIP graph usage -================================================================================ - -Using HIP graphs to execute your work requires three different steps, where the -first two are the initial setup and only need to be executed once. First the -definition of the operations (nodes) and the dependencies (edges) between them. -The second step is the instantiation of the graph. This takes care of validating -and initializing the graph, to reduce the overhead when executing the graph. - -The third step is the actual execution of the graph, which then takes care of -launching all the kernels and executing the operations while respecting their -dependencies and necessary synchronizations as specified. - -As HIP graphs require some set up and initialization overhead before their first -execution, they only provide a benefit for workloads that require many iterations to complete. - -Setting up HIP graphs -================================================================================ - -HIP graphs can be created by explicitly defining them, or using stream capture. -For further information on how to use HIP graphs see :ref:`the how-to-chapter about HIP graphs`. -For the available functions see the -:doc:`HIP graph API documentation<../doxygen/html/group___graph>`.