diff --git a/docs/how-to/faq.md b/docs/faq.md similarity index 100% rename from docs/how-to/faq.md rename to docs/faq.md diff --git a/docs/how-to/hip_runtime_api.rst b/docs/how-to/hip_runtime_api.rst new file mode 100644 index 0000000000..7e2c9611c2 --- /dev/null +++ b/docs/how-to/hip_runtime_api.rst @@ -0,0 +1,58 @@ +.. meta:: + :description: HIP runtime API reference page + :keywords: AMD, ROCm, HIP, CUDA, HIP runtime API, HIP runtime + +.. _runtime_api_reference: + +******************************************************************************* +HIP runtime API +******************************************************************************* + +This is the full HIP Runtime API how-to page. + +Modules +============================================================ + +The API is organized into modules based on functionality. + +* :ref:`global_enum_defines_reference` +* :ref:`driver_reference` +* :ref:`device_management_reference` +* :ref:`execution_control_reference` +* :ref:`error_handling_reference` +* :ref:`stream_management_reference` +* :ref:`stream_memory_operations_reference` +* :ref:`event_management_reference` +* :ref:`memory_management_reference` + + * :ref:`external_resource_interoperability_reference` + * :ref:`stream_memory_operations_reference` + * :ref:`unified_memory_reference` + * :ref:`virtual_memory_reference` + +* :ref:`peer_to_peer_device_memory_access_reference` +* :ref:`context_management_reference` +* :ref:`module_management_reference` +* :ref:`occupancy_reference` +* :ref:`profiler_control_reference` +* :ref:`launch_api_reference` +* :ref:`texture_management_reference` +* :ref:`runtime_compilation_reference` +* :ref:`callback_activity_apis_reference` +* :ref:`graph_management_reference` +* :ref:`opengl_interop_reference` +* :ref:`surface_object_reference` + +Data structures +============================================================ + +* :doc:`/doxygen/html/annotated` +* :doc:`/doxygen/html/classes` +* :doc:`/doxygen/html/inherits` +* :doc:`/doxygen/html/functions_data_fields` + +Files +============================================================ + +* :doc:`/doxygen/html/files` +* :doc:`/doxygen/html/globals_globals` diff --git a/docs/how-to/cooperative_groups.rst b/docs/how-to/hip_runtime_api/cooperative_groups.rst similarity index 100% rename from docs/how-to/cooperative_groups.rst rename to docs/how-to/hip_runtime_api/cooperative_groups.rst diff --git a/docs/how-to/hipgraph.rst b/docs/how-to/hip_runtime_api/hipgraph.rst similarity index 100% rename from docs/how-to/hipgraph.rst rename to docs/how-to/hip_runtime_api/hipgraph.rst diff --git a/docs/how-to/programming_manual.md b/docs/how-to/hip_runtime_api/memory_management.md similarity index 60% rename from docs/how-to/programming_manual.md rename to docs/how-to/hip_runtime_api/memory_management.md index 22847adaf9..116ed7c5a0 100644 --- a/docs/how-to/programming_manual.md +++ b/docs/how-to/hip_runtime_api/memory_management.md @@ -1,4 +1,4 @@ -# HIP programming manual +# Memory management ## Host Memory @@ -119,94 +119,3 @@ Note, CPU access to the semaphore's memory requires volatile keyword to disable For more details, please check the documentation `HIP-API.pdf`. Please note, HIP stream does not guarantee concurrency on AMD hardware for the case of multiple (at least 6) long-running streams executing concurrently, using `hipStreamSynchronize(nullptr)` for synchronization. - -## Direct Dispatch - -HIP runtime has Direct Dispatch enabled by default in ROCM 4.4 on Linux. -With this feature we move away from our conventional producer-consumer model where the runtime creates a worker thread(consumer) for each HIP Stream, and the host thread(producer) enqueues commands to a command queue(per stream). - -For Direct Dispatch, HIP runtime would directly enqueue a packet to the AQL queue (user mode queue on GPU) on the Dispatch API call from the application. That has shown to reduce the latency to launch the first wave on the idle GPU and total time of tiny dispatches synchronized with the host. - -In addition, eliminating the threads in runtime has reduced the variance in the dispatch numbers as the thread scheduling delays and atomics/locks synchronization latencies are reduced. - -This feature can be disabled via setting the following environment variable, -AMD_DIRECT_DISPATCH=0 - -Note, Direct Dispatch is implemented on Linux. It is currently not supported on Windows. - -## HIP Runtime Compilation - -HIP now supports runtime compilation (HIP RTC), the usage of which will provide the possibility of optimizations and performance improvement compared with other APIs via regular offline static compilation. - -HIP RTC APIs accept HIP source files in character string format as input parameters and create handles of programs by compiling the HIP source files without spawning separate processes. - -For more details on HIP RTC APIs, refer to [HIP Runtime API Reference](../doxygen/html/index). - -For Linux developers, the link [here](https://github.com/ROCm/hip-tests/blob/develop/samples/2_Cookbook/23_cmake_hiprtc/saxpy.cpp) shows an example how to program HIP application using runtime compilation mechanism, and a detailed [HIP RTC programming guide](./hip_rtc) is also available. - -## HIP Graph - -HIP graphs are supported. For more details, refer to the [HIP API Guide](../doxygen/html/group___graph) or the [understand section for HIP graphs](../understand/hipgraph). - -## Device-Side Malloc - -HIP-Clang now supports device-side malloc and free. -This implementation does not require the use of `hipDeviceSetLimit(hipLimitMallocHeapSize,value)` nor respects any setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed. - -## Use of Per-thread default stream - -The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that the command issued to the per-thread default stream by the thread does not implicitly synchronize with other streams (like explicitly created streams), or default per-thread stream on other threads. -The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program. -The per-thread default stream can be enabled via adding a compilation option, -`-fgpu-default-stream=per-thread`. - -And users can explicitly use `hipStreamPerThread` as per-thread default stream handle as input in API commands. There are test codes as examples in the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/streamperthread). - -## Use of Long Double Type - -In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type. - -## Use of `_Float16` Type - -If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, `_Float16` or aggregates containing `_Float16` should not be used as function argument or return type. This is due to lack of stable ABI for `_Float16` on x86_64. Passing `_Float16` or aggregates containing `_Float16` between clang and gcc could cause undefined behavior. - -## FMA and contractions - -By default HIP-Clang assumes `-ffp-contract=fast-honor-pragmas`. -Users can use `#pragma clang fp contract(on|off|fast)` to control `fp` contraction of a block of code. -For x86_64, FMA is off by default since the generic x86_64 target does not -support FMA by default. To turn on FMA on x86_64, either use `-mfma` or `-march=native` -on CPU's supporting FMA. - -When contractions are enabled and the CPU has not enabled FMA instructions, the -GPU can produce different numerical results than the CPU for expressions that -can be contracted. Tolerance should be used for floating point comparisons. - -## Math functions with special rounding modes - -Note: Currently, HIP only supports basic math functions with rounding modern (round to nearest). HIP does not support basic math functions with rounding modes `ru` (round up), `rd` (round down), and `rz` (round towards zero). - -## Creating Static Libraries - -HIP-Clang supports generating two types of static libraries. The first type of static library does not export device functions, and only exports and launches host functions within the same library. The advantage of this type is the ability to link with a non-hipcc compiler such as gcc. The second type exports device functions to be linked by other code objects. However, this requires using hipcc as the linker. - -In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib. The second type of library contains relocatable device objects and is generated using `ar`. - -Here is an example to create and use static libraries: - -* Type 1 using `--emit-static-lib`: - - ```cpp - hipcc hipOptLibrary.cpp --emit-static-lib -fPIC -o libHipOptLibrary.a - gcc test.cpp -L. -lhipOptLibrary -L/path/to/hip/lib -lamdhip64 -o test.out - ``` - -* Type 2 using system `ar`: - - ```cpp - hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o - ar rcsD libHipDevice.a hipDevice.o - hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out - ``` - -For more information, please see [HIP samples host functions](https://github.com/ROCm/hip-tests/tree/develop/samples/2_Cookbook/15_static_library/host_functions) and [device_functions](https://github.com/ROCm/hip-tests/tree/rocm-5.5.x/samples/2_Cookbook/15_static_library/device_functions). diff --git a/docs/how-to/stream_ordered_allocator.rst b/docs/how-to/hip_runtime_api/memory_management/stream_ordered_allocator.rst similarity index 100% rename from docs/how-to/stream_ordered_allocator.rst rename to docs/how-to/hip_runtime_api/memory_management/stream_ordered_allocator.rst diff --git a/docs/how-to/unified_memory.rst b/docs/how-to/hip_runtime_api/memory_management/unified_memory.rst similarity index 100% rename from docs/how-to/unified_memory.rst rename to docs/how-to/hip_runtime_api/memory_management/unified_memory.rst diff --git a/docs/how-to/virtual_memory.rst b/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst similarity index 100% rename from docs/how-to/virtual_memory.rst rename to docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst diff --git a/docs/index.md b/docs/index.md index 0bdaa9827b..b4346404d8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -22,9 +22,8 @@ The HIP documentation is organized into the following categories: * {doc}`./understand/programming_model` * {doc}`./understand/hip_runtime_api` -* {doc}`./understand/hardware_implementation` -* {doc}`./understand/hipgraph` * {doc}`./understand/amd_clr` +* {doc}`./understand/hardware_implementation` * {doc}`./understand/hardware_capabilities` * [Texture fetching](./understand/texture_fetching) @@ -32,19 +31,19 @@ The HIP documentation is organized into the following categories: :::{grid-item-card} How to -* [Programming manual](./how-to/programming_manual) +* [HIP runtime API](./how-to/hip_runtime_api) + * [Memory management](./how-to/hip_runtime_api/memory_management) + * [Unified memory](./how-to/hip_runtime_api/memory_management/unified_memory) + * [Virtual memory](./how-to/hip_runtime_api/memory_management/virtual_memory) + * {doc}`./how-to/hip_runtime_api/memory_management/stream_ordered_allocator` + * [HIP graphs](./how-to/hip_runtime_api/hipgraph) + * [Cooperative groups](./how-to/hip_runtime_api/cooperative_groups) * [HIP porting guide](./how-to/hip_porting_guide) -* [HIP porting: driver API guide](./how-to/hip_porting_driver_api) + * [HIP porting: driver API guide](./how-to/hip_porting_driver_api) * {doc}`./how-to/hip_rtc` * {doc}`./how-to/performance_guidelines` * [Debugging with HIP](./how-to/debugging) * {doc}`./how-to/logging` -* [Unified memory](./how-to/unified_memory) -* [Virtual memory](./how-to/virtual_memory) -* {doc}`./how-to/stream_ordered_allocator` -* [Cooperative groups](./how-to/cooperative_groups) -* [HIP graphs](./how-to/hipgraph) -* {doc}`./how-to/faq` ::: @@ -55,11 +54,10 @@ The HIP documentation is organized into the following categories: * {doc}`doxygen/html/annotated_data_structures` * {doc}`doxygen/html/files_files` * [HSA runtime API for ROCm](./reference/virtual_rocr) -* [HIP math API](./reference/math_api) -* [C++ language extensions](./reference/cpp_language_extensions) -* [C++ language support](./reference/cpp_language_support) * [Driver API](./reference/driver_api_reference) -* [HIP math API](./reference/math_api) +* [C++ language extensions](./reference/cpp_language_extensions) + * [C++ language support](./reference/cpp_language_support) + * [HIP math API](./reference/math_api) * [HIP environment variables](./reference/env_variables) * [Comparing syntax for different APIs](./reference/terms) * [List of deprecated APIs](./reference/deprecated_api_list) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 0482243dbf..c4a88f277f 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -34,22 +34,25 @@ subtrees: - caption: How to entries: - - file: how-to/programming_manual + - file: how-to/hip_runtime_api + subtrees: + - entries: + - file: how-to/hip_runtime_api/memory_management + subtrees: + - entries: + - file: how-to/hip_runtime_api/memory_management/unified_memory + - file: how-to/hip_runtime_api/memory_management/stream_ordered_allocator + - file: how-to/hip_runtime_api/memory_management/virtual_memory + - file: how-to/hip_runtime_api/hipgraph + - file: how-to/hip_runtime_api/cooperative_groups - file: how-to/hip_porting_guide - - file: how-to/hip_porting_driver_api + subtrees: + - entries: + - file: how-to/hip_porting_driver_api - file: how-to/hip_rtc - file: how-to/performance_guidelines - file: how-to/debugging - file: how-to/logging - - file: how-to/cooperative_groups - - file: how-to/unified_memory - title: Unified memory - - file: how-to/virtual_memory - title: Virtual memory - - file: how-to/stream_ordered_allocator - - file: how-to/hipgraph - title: HIP graphs - - file: how-to/faq - caption: Reference entries: @@ -91,10 +94,11 @@ subtrees: - file: reference/virtual_rocr - file: reference/cpp_language_extensions title: C++ language extensions - - file: reference/cpp_language_support - title: C++ language support + subtrees: + - entries: + - file: reference/cpp_language_support + - file: reference/math_api - file: reference/driver_api_reference - - file: reference/math_api - file: reference/env_variables - file: reference/terms title: Comparing syntax for different APIs @@ -116,3 +120,4 @@ subtrees: - caption: About entries: - file: license.md + - file: faq.md diff --git a/docs/understand/hip_runtime_api.rst b/docs/understand/hip_runtime_api.rst index ed8e068d2e..e5e3f84d07 100644 --- a/docs/understand/hip_runtime_api.rst +++ b/docs/understand/hip_runtime_api.rst @@ -273,11 +273,6 @@ The following figure visualizes the concept of graphs, compared to using streams hipDeviceSynchronize, or using graphs, where the edges denote the dependencies. -Node types -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. doxygenenum:: hipGraphNodeType - Memory management nodes ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/what_is_hip.rst b/docs/what_is_hip.rst index 841689d4eb..66c390f6d6 100644 --- a/docs/what_is_hip.rst +++ b/docs/what_is_hip.rst @@ -19,13 +19,13 @@ to build and run application kernels. :align: center * HIP is a thin API with little or no performance impact over coding directly - in NVIDIA CUDA or AMD :doc:`ROCm `. + in NVIDIA CUDA or AMD :doc:`ROCm `. * HIP enables coding in a single-source C++ programming language including features such as templates, C++11 lambdas, classes, namespaces, and more. * Developers can specialize for the platform (CUDA or ROCm) to tune for performance or handle tricky cases. -:doc:`ROCm ` offers compilers (``clang``, ``hipcc``), code +:doc:`ROCm ` offers compilers (``clang``, ``hipcc``), code profilers (``rocprof``, ``omnitrace``), debugging tools (``rocgdb``), libraries and HIP with the runtime API and kernel language, to create heterogeneous applications running on both CPUs and GPUs. ROCm provides marshalling libraries like