From 9a161d82b119f643a4e229579340955eb2b6213f Mon Sep 17 00:00:00 2001
From: Peter Jun Park
Date: Wed, 5 Jun 2024 09:35:00 -0400
Subject: [PATCH] Impl editorial suggestions from @lpaoletti

Add IPs to wordlist
---
 .wordlist.txt            |   3 +-
 docs/tutorials/saxpy.rst | 426 +++++++++++++++++++--------------------
 2 files changed, 211 insertions(+), 218 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index eb2fffa8ba..e22afa6da0 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -51,6 +51,7 @@ interoperate
 Intrinsics
 intrinsics
 IPC
+IPs
 isa
 Lapack
 latencies
@@ -100,4 +101,4 @@ syntaxes
 tradeoffs
 typedefs
 WinGDB
-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
\ No newline at end of file
+zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
diff --git a/docs/tutorials/saxpy.rst b/docs/tutorials/saxpy.rst
index 8da7e95a50..a27c33cab4 100644
--- a/docs/tutorials/saxpy.rst
+++ b/docs/tutorials/saxpy.rst
@@ -6,51 +6,48 @@
 Tutorial: SAXPY - Hello, HIP
 *******************************************************************************
 
-This tutorial will show you some of the basic concepts of the single-source HIP
-programming model, the most essential tooling around it and briefly rehash some
-commonalities of heterogenous APIs in general. Mild familiarity with the C/C++
-compilation model and the language is assumed throughout this article.
+This tutorial explains the basic concepts of the single-source
+Heterogeneous-computing Interface for Portability (HIP) programming model and
+the essential tooling around it. It also reviews some commonalities of
+heterogeneous APIs in general. This topic assumes basic familiarity with the
+C/C++ compilation model and language.
 
 Prerequisites
 =============
 
-In order to follow this tutorial you will need properly installed drivers and a
-HIP compiler toolchain to compile your code.
Because HIP provided by ROCm
-supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs
-alike, the combination of install instructions are more then worth covering as
-part of this tutorial. Please refer to :doc:`/install/install` on how to
-install HIP development packages.
+To follow this tutorial, you'll need properly installed drivers and a HIP
+compiler toolchain to compile your code. Because HIP for ROCm supports
+compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the
+combination of install instructions is more than worth covering as part of
+this tutorial. For more information about installing HIP development packages,
+see :doc:`/install/install`.
 
-Heterogenous Programming
-========================
+.. _hip-tutorial-saxpy-heterogeneous-programming:
 
-Heterogenous programming and offloading APIs are often mentioned together.
-Heterogenous programming deals with devices of varying capabilities at once
-while the term offloading focuses on the "remote" and asynchronous aspect of
-the computation. HIP encompasses both: it exposes GPGPU (General Purpose GPU)
-programming much like ordinary host-side CPU programming and let's us move data
-to and from device as need be.
+Heterogeneous programming
+=========================
 
-.. note::
+*Heterogeneous programming* and *offloading APIs* are often said in the same
+breath. Heterogeneous programming deals with devices of varying capabilities
+simultaneously. Offloading focuses on the "remote" and asynchronous aspects of
+computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU)
+programming much like ordinary host-side CPU programming and lets you move data
+across various devices.
 
-    Most of the HIP-specific code will deal with "where", "what" and "how".
+When programming in HIP (and other heterogeneous APIs for that matter), remember
+that target devices are built for a specific purpose.
They are designed with
+different tradeoffs than traditional CPUs and therefore have very different
+performance characteristics. Even subtle changes in code might adversely affect
+execution time.
 
-One thing to keep in mind while programming in HIP (and other heterogenous APIs
-for that matter), is that target devices are built for a specific purpose. They
-are designed with different tradeoffs than traditional CPUs and therefore have
-very different performance characteristics. Ever so subtle changes in code may
-effect execution time adversely. Rest assured one can always find the root
-cause and design a fix.
-
-Your First Piece of HIP Code
+Your first lines of HIP code
 ============================
 
-First, let's take the "Hello, World!" of GPGPU: SAXPY. The name comes from
-math, a vector equation at that: :math:`a\cdot x+y=z` where
-:math:`a\in\mathbb{R}` is a scalar and :math:`x,y,z\in\mathbb{V}` are vector
-quantities of some large dimensionality. (Our vector space is defined over the
-set of reals.) From a practical perspective we can compute this using a single
-``for`` loop over 3 arrays.
+First, let's do the "Hello, World!" of GPGPU: SAXPY. *SAXPY* is a mathematical
+acronym: a vector equation :math:`a\cdot x+y=z` where :math:`a\in\mathbb{R}` is
+a scalar and :math:`x,y,z\in\mathbb{V}` are vector quantities of some large
+dimensionality. This vector space is defined over the set of reals. Practically
+speaking, you can compute this using a single ``for`` loop over three arrays.
 
 .. code-block:: C++
 
@@ -58,28 +55,29 @@ set of reals.) From a practical perspective we can compute this using a single
     z[i] = a * x[i] + y[i];
 
 In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms) this
-operation is defined as AXPY "A times X Plus Y". The "S" comes from
-single-precision, meaning that every element of our array are ``float``
-s. (We choose IEEE 754: binary32 arithmetic as the representation of our algebra.)
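The scalar loop above is easy to capture as a host-only reference implementation. The function name below is illustrative and not part of the sample, but such a routine is handy for validating device results later:

```cpp
#include <cstddef>
#include <vector>

// Host-side reference SAXPY: z[i] = a * x[i] + y[i], element by element.
// Assumes x and y have the same length.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y)
{
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = a * x[i] + y[i];
    return z;
}
```

Comparing the device output against such a loop is the simplest way to catch an incorrectly launched or mis-targeted kernel.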
+operation is defined as AXPY: "A times X Plus Y". The term SAXPY refers to the
+single-precision version of this operation.
+
+The "S" comes from
+*single-precision*, meaning that every array element is a ``float``
+(IEEE 754 binary32 representation).
 
-To get quickly off the ground, we'll take off-the-shelf piece of code, the set
-of `HIP samples from GitHub `_. Assuming
-Git is on your Path, open a command-line and navigate to where you'll want to
-work, then issue:
+To quickly get started, use the set of `HIP samples from GitHub
+`_. With Git configured on your machine,
+open a command line and navigate to your desired working directory, then run:
 
 .. code-block:: shell
 
     git clone https://github.com/amd/rocm-examples.git
 
-Inside the repository, you should find ``HIP-Basic\saxpy\main.hip``, which is a
-sufficiently simple implementation of SAXPY. It was already mentioned
-that HIP code will mostly deal with where and when data has to be and
-how devices will transform it. The very first HIP calls deal with
-allocating device-side memory and dispatching data from host-side
-memory in a C Runtime-like fashion.
+A simple implementation of SAXPY resides in the ``HIP-Basic/saxpy/main.hip``
+file in this repository. The HIP code here mostly deals with where data
+has to be and when, and how devices transform this data. The first HIP calls
+deal with allocating device-side memory and dispatching data from host-side
+memory in a C runtime-like fashion.
 
 .. code-block:: C++
-
+
     // Allocate and copy vectors to device memory.
     float* d_x{};
     float* d_y{};
@@ -90,14 +88,14 @@ memory in a C Runtime-like fashion.
 
 ``HIP_CHECK`` is a custom macro borrowed from the examples utilities which
 checks the error code returned by API functions for errors and reports them to
-the console. It's not quintessential to the API.
+the console. It's not essential to the API.
 
-If you're wondering: how does it know which device to allocate on / dispatch
-to... wonder no more.
Commands are issued to the HIP runtime on a per-thread
-basis and every thread has a device set as the target of commands. The default
-device is ``0``, which is equivalent to calling ``hipSetDevice(0)``.
+The code selects the device to allocate on and dispatch to. Commands are
+issued to the HIP runtime per thread, and every thread has a device set as the
+target of commands. The default device is ``0``, which is equivalent to calling
+``hipSetDevice(0)``.
 
-Once our data has been dispatched, we can launch our calculation on the device.
+Launch the calculation on the device after the data has been dispatched.
 
 .. code-block:: C++
 
@@ -114,7 +112,7 @@ Once our data has been dispatched, we can launch our calculation on the device.
         saxpy_kernel<<>>(a, d_x, d_y, size);
     }
 
-First let's discuss the signature of the offloaded function:
+Analyze the signature of the offloaded function:
 
 - ``__global__`` instructs the compiler to generate code for this function as
   an entrypoint to a device program, such that it can be launched from the host.
@@ -135,19 +133,19 @@ First let's discuss the signature of the offloaded function:
   two pointers to the function. We'll be canonically reusing one of the inputs
   as outputs.
 
-There's quite a lot to unpack already. How is this function launched from the
-host? Using a language extension, the so-called triple chevron syntax. Inside
-the angle brackets we can provide the following:
+This function is launched from the host using a language extension often called
+the triple chevron syntax. Inside the angle brackets, provide the following:
 
 - The number of blocks to launch (our grid size)
 - The number of threads in a block (our block size)
 - The amount of shared memory to allocate by the host
 - The device stream to enqueue the operation on
 
-The block size and shared memory will become important later in
-:doc:`reduction`, for the time being a hardcoded ``256`` is a safe default for
-simple kernels, such as this.
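The grid size passed in the triple chevron is normally derived from the element count with a rounding-up division, so that the last, partially filled block still covers the tail of the data. A minimal sketch (the helper name is illustrative, not taken from the sample):

```cpp
// Round-up integer division: the number of thread blocks of block_dim
// threads needed to cover size elements. The final block may be only
// partially occupied by useful work, which is why kernels bounds-check.
inline int grid_size(int size, int block_dim)
{
    return (size + block_dim - 1) / block_dim;
}
```

With one million elements and a block size of 256, this yields 3907 blocks.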
Following the triple chevron is ordinary function
-argument passing. Now let's take a look how the kernel is implemented.
+The block size and shared memory become important later in :doc:`reduction`. For
+now, a hardcoded ``256`` is a safe default for simple kernels such as this.
+Following the triple chevron is ordinary function argument passing.
+
+Look at how the kernel is implemented.
 
 .. code-block:: C++
 
@@ -169,98 +167,97 @@ argument passing. Now let's take a look how the kernel is implemented.
 - A check is made to avoid overindexing the input.
 - The useful part of the computation is carried out.
 
-Retrieval of the result from the device is done much like its dispatch:
+Retrieving the result from the device is done much like dispatching it:
 
 .. code-block:: C++
 
     HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
 
-Compiling on the Command-Line
+Compiling on the command line
 =============================
 
 .. _setting_up_the_command-line:
 
-Setting Up the Command-Line
+Setting up the command line
 ---------------------------
 
-While strictly speaking there's no such thing as "setting up the command-line
-for compilation" on Linux, just to make invocations more terse let's do it on
-both Linux and Windows.
+Strictly speaking there's no such thing as "setting up the command line
+for compilation" on Linux. To make invocations more terse, Linux and Windows
+examples follow.
 
 .. tab-set::
 
-    .. tab-item:: Linux & AMD
+    .. tab-item:: Linux and AMD
       :sync: linux-amd
 
-    While distro maintainers may package ROCm such that they install to
-    system-default locations, AMD's installation don't and need to be added to the
-    Path by the user.
+    While distro maintainers might package ROCm so that it installs to
+    system-default locations, AMD's installation packages don't. They need to
+    be added to the PATH by the user.
 
    ..
code-block:: bash - + export PATH=/opt/rocm/bin:${PATH} - You should be able to call the compiler on the command-line now: + You should be able to call the compiler on the command line now: .. code-block:: bash - + amdclang++ --version .. note:: Docker images distributed by AMD, such as - `rocm-terminal `_ already have - `/opt/rocm/bin` on the Path for convenience. (This subtly affects CMake package - detection logic of ROCm libraries.) + `rocm-terminal `_ already + have `/opt/rocm/bin` on the Path for convenience. This subtly affects + CMake package detection logic of ROCm libraries. - .. tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia - Both distro maintainers and NVIDIA package CUDA as such that ``nvcc`` and related - tools are on the command-line by default. You should be able to call the - compiler on the command-line simply: + Both distro maintainers and NVIDIA package CUDA so that ``nvcc`` and related + tools are available on the command line by default. You can call the + compiler on the command line with: .. code-block:: bash - + nvcc --version - .. tab-item:: Windows & AMD + .. tab-item:: Windows and AMD :sync: windows-amd - Windows compilers and command-line tooling have traditionally - relied on extra environmental variables and Path entries to function correctly. - Visual Studio refers to command-lines with these setup as "Developer + Windows compilers and command line tooling have traditionally relied on + extra environmental variables and PATH entries to function correctly. + Visual Studio refers to command lines with this setup as "Developer Command Prompt" or "Developer PowerShell" for ``cmd.exe`` and PowerShell respectively. - The HIP SDK on Windows doesn't ship a complete toolchain, you will also need: + The HIP SDK on Windows doesn't include a complete toolchain. 
You will also
+    need:
 
-    - the Windows SDK, most crucially providing the import libs to crucial system
-      libraries all executables must link to and some auxiliary compiler tooling.
-    - a Standard Template Library, aka. STL, which HIP too relies on.
+    - The Microsoft Windows SDK. It provides the import libs to crucial system
+      libraries that all executables must link to and some auxiliary compiler
+      tooling.
+    - A Standard Template Library (STL). Installed as part of the Microsoft
+      Visual C++ compiler (MSVC) or with Visual Studio.
 
-    The prior may be installed separately, though it's most conveniently obtained
-    through the Visual Studio installer, while the latter is part of the Microsoft
-    Visual C++ compiler, aka. MSVC, also installed via Visual Studio.
-
-    If you don't already have some SKU of Visual Studio 2022 installed, for a
-    minimal command-line experience, install the
+    If you don't have a version of Visual Studio 2022 installed, for a
+    minimal command line experience, install the
     `Build Tools for Visual Studio 2022 `_
-    with the Desktop Developemnt Workload and under Individual Components select:
+    with the Desktop Development Workload. Under Individual Components select:
 
-    - some version of the Windows SDK
+    - A version of the Windows SDK
     - "MSVC v143 - VS 2022 C++ x64/x86 build tools (Latest)"
    - "C++ CMake tools for Windows" (optional)
 
    .. note::
 
      The "C++ CMake tools for Windows" individual component is a convenience which
-      puts both ``cmake.exe`` and ``ninja.exe`` onto the ``PATH`` inside developer
-      command-prompts. You can install these manually, but then you need to manage
+      puts both ``cmake.exe`` and ``ninja.exe`` onto the PATH inside developer
+      command prompts. You can install these manually, but then you must manage
      them manually.
 
-    Visual Studio installations as of VS 2017 are detectable as COM object
-    instances via WMI.
To setup a command-line from any shell for the latest
-    Visual Studio's default (latest) Visual C++ toolset issue:
+    Visual Studio 2017 and later are detectable as COM object instances via WMI.
+    To set up a command line from any shell for the latest Visual Studio's
+    default Visual C++ toolset, run:
 
    .. code-block:: powershell
 
@@ -269,51 +266,49 @@ both Linux and Windows.
       Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
       $env:PATH = "${env:HIP_PATH}bin;${env:PATH}"
 
-    You should be able to call the compiler on the command-line now:
+    You should be able to call the compiler on the command line now:
 
    .. code-block:: powershell
 
      clang++ --version
 
-    .. tab-item:: Windows & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
-    Windows compilers and command-line tooling have traditionally
-    relied on extra environmental variables and Path entries to function correctly.
-    Visual Studio refers to command-lines with these setup as "Developer
+    Windows compilers and command line tooling have traditionally relied on
+    extra environment variables and PATH entries to function correctly.
+    Visual Studio refers to command lines with this setup as "Developer
    Command Prompt" or "Developer PowerShell" for ``cmd.exe`` and PowerShell
    respectively.
 
-    The HIP and CUDA SDKs on Windows doesn't ship complete toolchains, you will
+    The HIP and CUDA SDKs on Windows don't include complete toolchains. You will
    also need:
 
-    - the Windows SDK, most crucially providing the import libs to crucial system
-      libraries all executables must link to and some auxiliary compiler tooling.
-    - a Standard Template Library, aka. STL, which HIP too relies on.
-
-    The prior may be installed separately, though it's most conveniently obtained
-    through the Visual Studio installer, while the latter is part of the Microsoft
-    Visual C++ compiler, aka. MSVC, also installed via Visual Studio.
+    - The Microsoft Windows SDK.
It provides the import libs to crucial system
+      libraries that all executables must link to and some auxiliary compiler
+      tooling.
+    - A Standard Template Library (STL). Installed as part of the Microsoft
+      Visual C++ compiler (MSVC) or with Visual Studio.
 
-    If you don't already have some SKU of Visual Studio 2022 installed, for a
-    minimal command-line experience, install the
+    If you don't have a version of Visual Studio 2022 installed, for a
+    minimal command line experience, install the
    `Build Tools for Visual Studio 2022 `_
-    with the Desktop Developemnt Workload and under Individual Components select:
+    with the Desktop Development Workload. Under Individual Components select:
 
-    - some version of the Windows SDK
+    - A version of the Windows SDK
    - "MSVC v143 - VS 2022 C++ x64/x86 build tools (Latest)"
    - "C++ CMake tools for Windows" (optional)
 
    .. note::
 
      The "C++ CMake tools for Windows" individual component is a convenience which
-      puts both ``cmake.exe`` and ``ninja.exe`` onto the ``PATH`` inside developer
-      command-prompts. You can install these manually, but then you need to manage
+      puts both ``cmake.exe`` and ``ninja.exe`` onto the PATH inside developer
+      command prompts. You can install these manually, but then you must manage
      them manually.
 
-    Visual Studio installations as of VS 2017 are detectable as COM object
-    instances via WMI. To setup a command-line from any shell for the latest
-    Visual Studio's default (latest) Visual C++ toolset issue:
+    Visual Studio 2017 and later are detectable as COM object instances via WMI.
+    To set up a command line from any shell for the latest Visual Studio's
+    default Visual C++ toolset, run:
 
    .. code-block:: powershell
 
@@ -321,93 +316,93 @@ both Linux and Windows.
Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo' - You should be able to call the compiler on the command-line now: + You should be able to call the compiler on the command line now: .. code-block:: powershell - + nvcc --version -Invoking the Compiler Manually +Invoking the compiler manually ------------------------------ -To compile and link a single-file application, one may use the following -command: +To compile and link a single-file application, use the following commands: .. tab-set:: - .. tab-item:: Linux & AMD + .. tab-item:: Linux and AMD :sync: linux-amd .. code-block:: bash amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 - .. tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia .. code-block:: bash nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu - .. tab-item:: Windows & AMD + .. tab-item:: Windows and AMD :sync: windows-amd .. code-block:: powershell clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 - .. tab-item:: Windows & NVIDIA + .. tab-item:: Windows and NVIDIA :sync: windows-nvidia .. code-block:: powershell nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu -Depending on your computer, the resulting binary may or may not run. If not, it -will typically complain about about "Invalid device function". That error +Depending on your computer, the resulting binary might or might not run. If not, +it typically complains about "Invalid device function". That error (corresponding to the ``hipErrorInvalidDeviceFunction`` entry of ``hipError_t``) means that the runtime could not find a device program binary of the appropriate flavor embedded into the executable. 
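An error such as ``hipErrorInvalidDeviceFunction`` is exactly what a ``HIP_CHECK``-style macro surfaces. The sketch below shows the general pattern with a stand-in status enum so it compiles without the HIP headers; the real macro from the samples compares a ``hipError_t`` against ``hipSuccess``, so treat the names here as illustrative:

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in for hipError_t, kept self-contained for illustration. The real
// HIP runtime returns hipSuccess (0) or a failure code such as
// hipErrorInvalidDeviceFunction.
enum status_t { status_success = 0, status_failure = 1 };

// HIP_CHECK-like pattern: evaluate the call exactly once, then report the
// failure location and abort on error.
#define STATUS_CHECK(call)                                              \
    do {                                                                \
        const status_t status_check_result_ = (call);                   \
        if (status_check_result_ != status_success) {                   \
            std::fprintf(stderr, "error %d at %s:%d\n",                 \
                         static_cast<int>(status_check_result_),        \
                         __FILE__, __LINE__);                           \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)
```

The ``do { ... } while (0)`` wrapper makes the macro behave like a single statement, and storing the result in a local ensures the wrapped API call is evaluated only once.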
-So far we've only talked about how our data makes it from the host to the
-device and back. We've also seen our device code as source, but the HIP runtime
-was arguing about not finding the right binary to dispatch for execution. How
-can one find out what device binary flavors are embedded into the executable?
+So far, the discussion has covered how data makes it from the host to the
+device and back. It has also shown the device code as source, while the HIP
+runtime complained about not finding the correct binary to dispatch for
+execution. How can you find out what device binary flavors are embedded into
+the executable?
 
 .. tab-set::
-    .. tab-item:: Linux & AMD
+
+    .. tab-item:: Linux and AMD
      :sync: linux-amd
 
-    The set of ``roc-*`` utilities shipping with ROCm help significantly to inspect
-    binary artifacts on disk. If you wish to use these utilities, add the ROCmCC
-    installation folder to your PATH (the utilities expect them to be on the PATH).
+    The utilities included with ROCm help significantly to inspect binary
+    artifacts on disk. Add the ROCmCC installation folder to your PATH if you
+    want to use these utilities (the utilities expect them to be on the PATH).
 
-    Lisitng of the embedded program binaries can be done using ``roc-obj-ls``
+    You can list embedded program binaries using ``roc-obj-ls``.
 
    .. code-block:: bash
 
      roc-obj-ls ./saxpy
 
-    It may return something like:
+    It should return something like:
 
    .. code-block:: shell
 
      1       host-x86_64-unknown-linux        file://./saxpy#offset=12288&size=0
      1       hipv4-amdgcn-amd-amdhsa--gfx803  file://./saxpy#offset=12288&size=9760
 
-    We can see that the compiler embedded a version 4 code object (more on `code
+    The compiler embeds a version 4 code object (more on `code
    object versions `_) and
    used the LLVM target triple `amdgcn-amd-amdhsa--gfx803` (more on `target triples
    `_).
You can
+    extract that program object in a disassembled fashion for human consumption
+    via ``roc-obj``.
 
    .. code-block:: bash
 
      roc-obj -t gfx803 -d ./saxpy
 
-    Which will create two files on disk and we'll be interested in the one with the
-    ``.s`` extension. Opening up said file or dumping it to the console using ``cat``
-    one will find the disassembled binary of our saxpy compute kernel, something
+    This creates two files on disk; the one with the ``.s`` extension is of most
+    interest. Opening this file or dumping it to the console using ``cat``
+    lets you find the disassembled binary of the SAXPY compute kernel, something
    similar to:
 
    .. code-block::
 
@@ -442,14 +437,14 @@ can one find out what device binary flavors are embedded into the executable?
        flat_store_dword v[0:1], v3     // 000000001080: DC700000 00000300
        s_endpgm                        // 000000001088: BF810000
 
-    Alternatively we can call the compiler with ``--save-temps`` to dump all device
+    Alternatively, call the compiler with ``--save-temps`` to dump all device
    binary to disk in separate files.
 
    .. code-block:: bash
 
      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps
 
-    Now we can list all the temporaries created while compiling ``main.hip`` via
+    List all the temporaries created while compiling ``main.hip`` with:
 
    .. code-block:: bash
 
@@ -461,11 +456,11 @@ can one find out what device binary flavors are embedded into the executable?
      main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
      main-hip-amdgcn-amd-amdhsa-gfx803.s
 
-    Files with the ``.s`` extension hold the disassembled contents of the binary and
-    the filename directly informs us of the graphics IPs used by the compiler. The
-    contents of this file is very similar to what ``roc-obj`` printed to the console.
+    Files with the ``.s`` extension hold the disassembled contents of the binary.
+    The filename notes the graphics IPs used by the compiler. The contents of
+    this file are similar to what ``roc-obj`` printed to the console.
 
-    ..
tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia Unlike HIP on AMD, when compiling using the NVIDIA support of HIP the resulting @@ -477,11 +472,11 @@ can one find out what device binary flavors are embedded into the executable? .. code-block:: bash - cuobjdump --list-ptx ./saxpy + cuobjdump --list-ptx ./saxpy Which will print something like: - .. code-block:: + .. code-block:: PTX file 1: saxpy.1.sm_52.ptx @@ -507,7 +502,7 @@ can one find out what device binary flavors are embedded into the executable? The output may look like: - .. code-block:: + .. code-block:: 000000014004C000: 5F474E414C435F5F 5F44414F4C46464F __CLANG_OFFLOAD_ 000000014004C010: 5F5F454C444E5542 0000000000000002 BUNDLE__........ @@ -524,7 +519,7 @@ can one find out what device binary flavors are embedded into the executable? We can see that the compiler embedded a version 4 code object (more on code `object versions `_) and - used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples + used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples `_). Don't be alarmed about linux showing up as a binary format, AMDGPU binaries uploaded to the GPU for execution are proper linux ELF binaries in their format. @@ -595,40 +590,40 @@ can one find out what device binary flavors are embedded into the executable? s_endpgm ... - .. tab-item:: Windows & NVIDIA + .. tab-item:: Windows and NVIDIA :sync: windows-nvidia - Unlike HIP on AMD, when compiling using the NVIDIA support of HIP the resulting - binary will be a valid CUDA executable as far as the binary goes. Therefor - it'll incorporate PTX ISA (Parallel Thread eXecution Instruction Set - Architecture) instead of AMDGPU binary. As s result, tooling shipping with the - CUDA SDK can be used to inspect which device ISA got compiled into a specific - executable. The tool most useful to us currently is ``cuobjdump``. 
+    Unlike HIP on AMD, when compiling using the NVIDIA support for HIP, the resulting
+    binary will be a valid CUDA executable. Therefore, it'll incorporate PTX ISA
+    (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU
+    binary. As a result, tooling included with the CUDA SDK can be used to
+    inspect which device ISA was compiled into a specific executable. The most
+    helpful tool for us currently is ``cuobjdump``.
 
    .. code-block:: bash
 
      cuobjdump.exe --list-ptx .\saxpy.exe
 
-    Which will print something like:
+    Which prints something like:
 
-    .. code-block:: 
+    .. code-block::
 
      PTX file 1: saxpy.1.sm_52.ptx
 
-    From this we can see that the saxpy kernel is stored as ``sm_52``, which shows
-    that a compute capability 5.2 ISA got embedded into the executable, so devices
-    which sport compute capability 5.2 or newer will be able to run this code.
+    This example shows that the SAXPY kernel is stored as ``sm_52``. It also shows
+    that a compute capability 5.2 ISA was embedded into the executable, so devices
+    that support compute capability 5.2 or newer will be able to run this code.
 
-Now that we've found what binary got embedded into the executable, we only need
-to find which format our available devices use.
+Now that you've found what binary got embedded into the executable, find which
+format your available devices use.
 
 .. tab-set::
 
-    .. tab-item:: Linux & AMD
+    .. tab-item:: Linux and AMD
      :sync: linux-amd
 
-    On Linux a utility called ``rocminfo`` can help us list all the properties of the
+    On Linux, a utility called ``rocminfo`` helps you list all the properties of the
    devices available on the system, including which version of graphics IP
-    (``gfxXYZ``) they employ. We'll filter the output to have only these lines:
+    (``gfxXYZ``) they employ. You can filter the output to have only these lines:
 
    .. code-block:: bash
 
@@ -636,27 +631,23 @@
Name: gfx906
      Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
 
-    _(For the time being let's not discuss what the colon-dlimited list of device
-    features are after the graphics IP. Until further notice we'll treat them as
-    part of the binary version.)_
-
-    .. tab-item:: Linux & NVIDIA
+    .. tab-item:: Linux and NVIDIA
      :sync: linux-nvidia
 
-    On Linux HIP with the NVIDIA back-end a CUDA SDK sample called ``deviceQuery``
+    On Linux HIP with the NVIDIA back-end, the ``deviceQuery`` CUDA SDK sample
    can help us list all the properties of the devices available on the system,
    including which version of compute capability a device sports.
 
-    (``.`` compute capability is passed to ``nvcc`` on the
-    command-line as ``sm_``, for eg. ``8.6`` is ``sm_86``.)
+    ``.`` compute capability is passed to ``nvcc`` on the
+    command-line as ``sm_``, for example, ``8.6`` is ``sm_86``.
 
-    Because it's not shipped as a binary, we may as well compile the matching
+    Because it's not included as a binary, compile the matching
    example from ROCm.
 
    .. code-block:: bash
 
      nvcc ./HIP-Basic/device_query/main.cpp -o device_query -I ./Common -I /opt/rocm/include -O2
 
-    We'll filter the output to have only the lines of interest, for eg.:
+    Filter the output to have only the lines of interest, for example:
 
    .. code-block:: bash
 
@@ -666,18 +657,18 @@ to find which format our available devices use.
 
    .. note::
 
-      Next to the ``nvcc`` executable is another tool called ``__nvcc_device_query``
-      which simply prints the SM Architecture numbers to standard out as a comma
-      separated list of numbers. The naming of this utility suggests it's not a user
-      facing executable but is used by ``nvcc`` to determine what devices are in the
+      In addition to the ``nvcc`` executable is another tool called ``__nvcc_device_query``
+      which prints the SM Architecture numbers to standard out as a
+      comma-separated list of numbers.
The utility's name suggests it's not a user-facing
+      executable but is used by ``nvcc`` to determine what devices are in the
+      system at hand.
 
-    .. tab-item:: Windows & AMD
+    .. tab-item:: Windows and AMD
      :sync: windows-amd
 
-    On Windows a utility called ``hipInfo.exe`` can help us list all the properties
+    On Windows, a utility called ``hipInfo.exe`` helps you list all the properties
    of the devices available on the system, including which version of graphics IP
-    (``gfxXYZ``) they employ. We'll filter the output to have only these lines:
+    (``gfxXYZ``) they employ. Filter the output to have only these lines:
 
    .. code-block:: powershell
 
      .\hipInfo.exe
 
      gcnArchName: gfx1032
      gcnArchName: gfx1035
 
-    .. tab-item:: Winodws & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
-    On Windows HIP with the NVIDIA back-end a CUDA SDK sample called ``deviceQuery``
+    On Windows HIP with the NVIDIA back-end, the ``deviceQuery`` CUDA SDK sample
    can help us list all the properties of the devices available on the system,
    including which version of compute capability a device sports.
 
-    (``.`` compute capability is passed to ``nvcc`` on the
-    command-line as ``sm_``, for eg. ``8.6`` is ``sm_86``.)
+    ``.`` compute capability is passed to ``nvcc`` on the
+    command-line as ``sm_``, for example, ``8.6`` is ``sm_86``.
 
-    Because it's not shipped as a binary, we may as well compile the matching
+    Because it's not included as a binary, compile the matching
    example from ROCm.
 
    .. code-block:: powershell
 
      nvcc .\HIP-Basic\device_query\main.cpp -o device_query.exe -I .\Common -I ${env:HIP_PATH}include -O2
 
-    We'll filter the output to have only the lines of interest, for eg.:
+    Filter the output to have only the lines of interest, for example:
 
    .. code-block:: powershell
 
@@ -719,26 +710,27 @@ to find which format our available devices use.
 
      facing executable but is used by ``nvcc`` to determine what devices are in the
      system at hand.
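The mapping from a reported compute capability such as ``8.6`` to the ``sm_86`` value passed to ``nvcc`` is purely textual: drop the dot and prefix ``sm_``. A small illustrative helper (the function is ours, not part of any SDK):

```cpp
#include <string>

// Turn a compute capability string like "8.6" into the architecture
// suffix nvcc expects, e.g. "sm_86".
std::string to_sm(const std::string& capability)
{
    std::string digits;
    for (char c : capability)
        if (c != '.')
            digits += c;  // drop the dot, keep the major/minor digits
    return "sm_" + digits;
}
```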
-Now that we know which versions of graphics IP our devices use, we can
-recompile our program with said parameters.
+Now that you know which graphics IPs your devices use, recompile your program
+with the appropriate parameters.
 
 .. tab-set::
 
+
     .. tab-item:: Linux and AMD
      :sync: linux-amd
 
    .. code-block:: bash
 
      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --offload-arch=gfx906:sramecc+:xnack-
 
-    Now our sample will surely run.
+    Now the sample will run.
 
-    .. code-block:: 
+    .. code-block::
 
      ./saxpy
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Linux & NVIDIA
+    .. tab-item:: Linux and NVIDIA
      :sync: linux-nvidia
 
    .. code-block:: bash
 
@@ -750,22 +742,22 @@ recompile our program with said parameters.
 
    If you want to portably target the development machine which is compiling, you
    may specify ``-arch=native`` instead.
 
-    Now our sample will surely run.
+    Now the sample will run.
 
-    .. code-block:: 
+    .. code-block::
 
      ./saxpy
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Windows & AMD
+    .. tab-item:: Windows and AMD
      :sync: windows-amd
 
    .. code-block:: powershell
 
      clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
 
-    Now our sample will surely run.
+    Now the sample will run.
 
    .. code-block::
 
      .\saxpy.exe
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Windows & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
    .. code-block:: powershell
 
@@ -785,9 +777,9 @@ recompile our program with said parameters.
If you want to portably target the development machine which is compiling, you may specify ``-arch=native`` instead. - Now our sample will surely run. + Now the sample will run. - .. code-block:: + .. code-block:: .\saxpy.exe Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
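As a sanity check, the expected output above can be re-derived on the host. The sketch below assumes the sample initializes ``a = 2``, ``x[i] = i + 1``, and ``y[i] = 1``; that assumption is only inferred from the printed results, so check ``main.hip`` for the values the example actually uses:

```cpp
#include <cstddef>
#include <vector>

// Host-side re-computation of the sample's first results, under the ASSUMED
// initialization a = 2, x[i] = i + 1, y[i] = 1 (inferred from the printed
// output; verify against main.hip).
std::vector<float> expected_results(std::size_t count)
{
    const float a = 2.0f;
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float x = static_cast<float>(i + 1);
        const float y = 1.0f;
        out[i] = a * x + y;  // y[i] = a * x[i] + y[i]
    }
    return out;
}
```

Under that assumed setup, the first ten elements come out as 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, matching the sample's printed results.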