From 9a161d82b119f643a4e229579340955eb2b6213f Mon Sep 17 00:00:00 2001
From: Peter Jun Park
Date: Wed, 5 Jun 2024 09:35:00 -0400
Subject: [PATCH] Impl editorial suggestions from @lpaoletti

Add IPs to wordlist
---
 .wordlist.txt            |   3 +-
 docs/tutorials/saxpy.rst | 426 +++++++++++++++++++--------------------
 2 files changed, 211 insertions(+), 218 deletions(-)

diff --git a/.wordlist.txt b/.wordlist.txt
index eb2fffa8ba..e22afa6da0 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -51,6 +51,7 @@ interoperate
 Intrinsics
 intrinsics
 IPC
+IPs
 isa
 Lapack
 latencies
@@ -100,4 +101,4 @@ syntaxes
 tradeoffs
 typedefs
 WinGDB
-zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
\ No newline at end of file
+zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
diff --git a/docs/tutorials/saxpy.rst b/docs/tutorials/saxpy.rst
index 8da7e95a50..a27c33cab4 100644
--- a/docs/tutorials/saxpy.rst
+++ b/docs/tutorials/saxpy.rst
@@ -6,51 +6,48 @@
 Tutorial: SAXPY - Hello, HIP
 *******************************************************************************
 
-This tutorial will show you some of the basic concepts of the single-source HIP
-programming model, the most essential tooling around it and briefly rehash some
-commonalities of heterogenous APIs in general. Mild familiarity with the C/C++
-compilation model and the language is assumed throughout this article.
+This tutorial explains the basic concepts of the single-source
+Heterogeneous-computing Interface for Portability (HIP) programming model and
+the essential tooling around it. It also reviews some commonalities of
+heterogeneous APIs in general. This topic assumes basic familiarity with the
+C/C++ compilation model and language.
 
 Prerequisites
 =============
 
-In order to follow this tutorial you will need properly installed drivers and a
-HIP compiler toolchain to compile your code.
Because HIP provided by ROCm
-supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs
-alike, the combination of install instructions are more then worth covering as
-part of this tutorial. Please refer to :doc:`/install/install` on how to
-install HIP development packages.
+To follow this tutorial, you'll need properly installed drivers and a HIP
+compiler toolchain to compile your code. Because HIP for ROCm supports
+compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the
+combination of install instructions is more than worth covering as part of
+this tutorial. For more information about installing HIP development packages,
+see :doc:`/install/install`.
 
-Heterogenous Programming
-========================
+.. _hip-tutorial-saxpy-heterogeneous-programming:
 
-Heterogenous programming and offloading APIs are often mentioned together.
-Heterogenous programming deals with devices of varying capabilities at once
-while the term offloading focuses on the "remote" and asynchronous aspect of
-the computation. HIP encompasses both: it exposes GPGPU (General Purpose GPU)
-programming much like ordinary host-side CPU programming and let's us move data
-to and from device as need be.
+Heterogeneous programming
+=========================
 
-.. note::
+*Heterogeneous programming* and *offloading APIs* are often said in the same
+breath. Heterogeneous programming deals with devices of varying capabilities
+simultaneously. Offloading focuses on the "remote" and asynchronous aspects of
+computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU)
+programming much like ordinary host-side CPU programming and lets you move data
+across various devices.
 
-    Most of the HIP-specific code will deal with "where", "what" and "how".
+When programming in HIP (and other heterogeneous APIs for that matter), remember
+that target devices are built for a specific purpose.
They are designed with
+different tradeoffs than traditional CPUs and therefore have very different
+performance characteristics. Even subtle changes in code might adversely affect
+execution time.
 
-One thing to keep in mind while programming in HIP (and other heterogenous APIs
-for that matter), is that target devices are built for a specific purpose. They
-are designed with different tradeoffs than traditional CPUs and therefore have
-very different performance characteristics. Ever so subtle changes in code may
-effect execution time adversely. Rest assured one can always find the root
-cause and design a fix.
-
-Your First Piece of HIP Code
+Your first lines of HIP code
 ============================
 
-First, let's take the "Hello, World!" of GPGPU: SAXPY. The name comes from
-math, a vector equation at that: :math:`a\cdot x+y=z` where
-:math:`a\in\mathbb{R}` is a scalar and :math:`x,y,z\in\mathbb{V}` are vector
-quantities of some large dimensionality. (Our vector space is defined over the
-set of reals.) From a practical perspective we can compute this using a single
-``for`` loop over 3 arrays.
+First, let's do the "Hello, World!" of GPGPU: SAXPY. *SAXPY* is a mathematical
+acronym: a vector equation :math:`a\cdot x+y=z` where :math:`a\in\mathbb{R}` is
+a scalar and :math:`x,y,z\in\mathbb{V}` are vector quantities of some large
+dimensionality. This vector space is defined over the set of reals. Practically
+speaking, you can compute this using a single ``for`` loop over three arrays.
 
 .. code-block:: C++
 
@@ -58,28 +55,29 @@ set of reals.) From a practical perspective we can compute this using a single
     z[i] = a * x[i] + y[i];
 
 In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms) this
-operation is defined as AXPY "A times X Plus Y". The "S" comes from
-single-precision, meaning that every element of our array are ``float``
-s. (We choose IEEE 754: binary32 arithmetic as the representation of our algebra.)
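The scalar loop above is easy to capture as a host-only reference implementation. The function name below is illustrative and not part of the sample, but such a routine is handy for validating device results later:

```cpp
#include <cstddef>
#include <vector>

// Host-side reference SAXPY: z[i] = a * x[i] + y[i], element by element.
// Assumes x and y have the same length.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y)
{
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = a * x[i] + y[i];
    return z;
}
```

Comparing the device output against such a loop is the simplest way to catch an incorrectly launched or mis-targeted kernel.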
+operation is defined as AXPY: "A times X Plus Y". The term SAXPY refers to the
+single-precision version of this operation.
+
+The "S" comes from
+*single-precision*, meaning that every array element is a ``float``
+(IEEE 754 binary32 representation).
 
-To get quickly off the ground, we'll take off-the-shelf piece of code, the set
-of `HIP samples from GitHub `_. Assuming
-Git is on your Path, open a command-line and navigate to where you'll want to
-work, then issue:
+To quickly get started, use the set of `HIP samples from GitHub
+`_. With Git configured on your machine,
+open a command line and navigate to your desired working directory, then run:
 
 .. code-block:: shell
 
     git clone https://github.com/amd/rocm-examples.git
 
-Inside the repository, you should find ``HIP-Basic\saxpy\main.hip``, which is a
-sufficiently simple implementation of SAXPY. It was already mentioned
-that HIP code will mostly deal with where and when data has to be and
-how devices will transform it. The very first HIP calls deal with
-allocating device-side memory and dispatching data from host-side
-memory in a C Runtime-like fashion.
+A simple implementation of SAXPY resides in the ``HIP-Basic/saxpy/main.hip``
+file in this repository. The HIP code here mostly deals with where data
+has to be and when, and how devices transform this data. The first HIP calls
+deal with allocating device-side memory and dispatching data from host-side
+memory in a C runtime-like fashion.
 
 .. code-block:: C++
-
+
     // Allocate and copy vectors to device memory.
     float* d_x{};
     float* d_y{};
@@ -90,14 +88,14 @@ memory in a C Runtime-like fashion.
 
 ``HIP_CHECK`` is a custom macro borrowed from the examples utilities which
 checks the error code returned by API functions for errors and reports them to
-the console. It's not quintessential to the API.
+the console. It's not essential to the API.
 
-If you're wondering: how does it know which device to allocate on / dispatch
-to... wonder no more.
Commands are issued to the HIP runtime on a per-thread
-basis and every thread has a device set as the target of commands. The default
-device is ``0``, which is equivalent to calling ``hipSetDevice(0)``.
+The code selects the device to allocate on and dispatch to. Commands are
+issued to the HIP runtime per thread, and every thread has a device set as the
+target of commands. The default device is ``0``, which is equivalent to calling
+``hipSetDevice(0)``.
 
-Once our data has been dispatched, we can launch our calculation on the device.
+Launch the calculation on the device after the data has been dispatched.
 
 .. code-block:: C++
 
@@ -114,7 +112,7 @@ Once our data has been dispatched, we can launch our calculation on the device.
         saxpy_kernel<<>>(a, d_x, d_y, size);
     }
 
-First let's discuss the signature of the offloaded function:
+Analyze the signature of the offloaded function:
 
 - ``__global__`` instructs the compiler to generate code for this function as
   an entrypoint to a device program, such that it can be launched from the host.
@@ -135,19 +133,19 @@ First let's discuss the signature of the offloaded function:
   two pointers to the function. We'll be canonically reusing one of the inputs
   as outputs.
 
-There's quite a lot to unpack already. How is this function launched from the
-host? Using a language extension, the so-called triple chevron syntax. Inside
-the angle brackets we can provide the following:
+This function is launched from the host using a language extension often called
+the triple chevron syntax. Inside the angle brackets, provide the following:
 
 - The number of blocks to launch (our grid size)
 - The number of threads in a block (our block size)
 - The amount of shared memory to allocate by the host
 - The device stream to enqueue the operation on
 
-The block size and shared memory will become important later in
-:doc:`reduction`, for the time being a hardcoded ``256`` is a safe default for
-simple kernels, such as this.
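The grid size passed in the triple chevron is normally derived from the element count with a rounding-up division, so that the last, partially filled block still covers the tail of the data. A minimal sketch (the helper name is illustrative, not taken from the sample):

```cpp
// Round-up integer division: the number of thread blocks of block_dim
// threads needed to cover size elements. The final block may be only
// partially occupied by useful work, which is why kernels bounds-check.
inline int grid_size(int size, int block_dim)
{
    return (size + block_dim - 1) / block_dim;
}
```

With one million elements and a block size of 256, this yields 3907 blocks.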
Following the triple chevron is ordinary function
-argument passing. Now let's take a look how the kernel is implemented.
+The block size and shared memory become important later in :doc:`reduction`. For
+now, a hardcoded ``256`` is a safe default for simple kernels such as this.
+Following the triple chevron is ordinary function argument passing.
+
+Look at how the kernel is implemented.
 
 .. code-block:: C++
 
@@ -169,98 +167,97 @@ argument passing. Now let's take a look how the kernel is implemented.
 - A check is made to avoid overindexing the input.
 - The useful part of the computation is carried out.
 
-Retrieval of the result from the device is done much like its dispatch:
+Retrieving the result from the device is done much like dispatching it:
 
 .. code-block:: C++
 
     HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
 
-Compiling on the Command-Line
+Compiling on the command line
 =============================
 
 .. _setting_up_the_command-line:
 
-Setting Up the Command-Line
+Setting up the command line
 ---------------------------
 
-While strictly speaking there's no such thing as "setting up the command-line
-for compilation" on Linux, just to make invocations more terse let's do it on
-both Linux and Windows.
+Strictly speaking there's no such thing as "setting up the command line
+for compilation" on Linux. To make invocations more terse, Linux and Windows
+examples follow.
 
 .. tab-set::
 
-    .. tab-item:: Linux & AMD
+    .. tab-item:: Linux and AMD
       :sync: linux-amd
 
-    While distro maintainers may package ROCm such that they install to
-    system-default locations, AMD's installation don't and need to be added to the
-    Path by the user.
+    While distro maintainers might package ROCm so that it installs to
+    system-default locations, AMD's installation packages don't. They need to
+    be added to the PATH by the user.
 
    ..
code-block:: bash - + export PATH=/opt/rocm/bin:${PATH} - You should be able to call the compiler on the command-line now: + You should be able to call the compiler on the command line now: .. code-block:: bash - + amdclang++ --version .. note:: Docker images distributed by AMD, such as - `rocm-terminal `_ already have - `/opt/rocm/bin` on the Path for convenience. (This subtly affects CMake package - detection logic of ROCm libraries.) + `rocm-terminal `_ already + have `/opt/rocm/bin` on the Path for convenience. This subtly affects + CMake package detection logic of ROCm libraries. - .. tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia - Both distro maintainers and NVIDIA package CUDA as such that ``nvcc`` and related - tools are on the command-line by default. You should be able to call the - compiler on the command-line simply: + Both distro maintainers and NVIDIA package CUDA so that ``nvcc`` and related + tools are available on the command line by default. You can call the + compiler on the command line with: .. code-block:: bash - + nvcc --version - .. tab-item:: Windows & AMD + .. tab-item:: Windows and AMD :sync: windows-amd - Windows compilers and command-line tooling have traditionally - relied on extra environmental variables and Path entries to function correctly. - Visual Studio refers to command-lines with these setup as "Developer + Windows compilers and command line tooling have traditionally relied on + extra environmental variables and PATH entries to function correctly. + Visual Studio refers to command lines with this setup as "Developer Command Prompt" or "Developer PowerShell" for ``cmd.exe`` and PowerShell respectively. - The HIP SDK on Windows doesn't ship a complete toolchain, you will also need: + The HIP SDK on Windows doesn't include a complete toolchain. 
You will also
+    need:
 
-    - the Windows SDK, most crucially providing the import libs to crucial system
-      libraries all executables must link to and some auxiliary compiler tooling.
-    - a Standard Template Library, aka. STL, which HIP too relies on.
+    - The Microsoft Windows SDK. It provides the import libs to crucial system
+      libraries that all executables must link to and some auxiliary compiler
+      tooling.
+    - A Standard Template Library (STL). Installed as part of the Microsoft
+      Visual C++ compiler (MSVC) or with Visual Studio.
 
-    The prior may be installed separately, though it's most conveniently obtained
-    through the Visual Studio installer, while the latter is part of the Microsoft
-    Visual C++ compiler, aka. MSVC, also installed via Visual Studio.
-
-    If you don't already have some SKU of Visual Studio 2022 installed, for a
-    minimal command-line experience, install the
+    If you don't have a version of Visual Studio 2022 installed, for a
+    minimal command line experience, install the
     `Build Tools for Visual Studio 2022 `_
-    with the Desktop Developemnt Workload and under Individual Components select:
+    with the Desktop Development Workload. Under Individual Components select:
 
-    - some version of the Windows SDK
+    - A version of the Windows SDK
     - "MSVC v143 - VS 2022 C++ x64/x86 build tools (Latest)"
    - "C++ CMake tools for Windows" (optional)
 
    .. note::
 
      The "C++ CMake tools for Windows" individual component is a convenience which
-      puts both ``cmake.exe`` and ``ninja.exe`` onto the ``PATH`` inside developer
-      command-prompts. You can install these manually, but then you need to manage
+      puts both ``cmake.exe`` and ``ninja.exe`` onto the PATH inside developer
+      command prompts. You can install these manually, but then you must manage
      them manually.
 
-    Visual Studio installations as of VS 2017 are detectable as COM object
-    instances via WMI.
To setup a command-line from any shell for the latest
-    Visual Studio's default (latest) Visual C++ toolset issue:
+    Visual Studio 2017 and later are detectable as COM object instances via WMI.
+    To set up a command line from any shell for the latest Visual Studio's
+    default Visual C++ toolset, run:
 
    .. code-block:: powershell
 
@@ -269,51 +266,49 @@ both Linux and Windows.
       Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
       $env:PATH = "${env:HIP_PATH}bin;${env:PATH}"
 
-    You should be able to call the compiler on the command-line now:
+    You should be able to call the compiler on the command line now:
 
    .. code-block:: powershell
 
      clang++ --version
 
-    .. tab-item:: Windows & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
-    Windows compilers and command-line tooling have traditionally
-    relied on extra environmental variables and Path entries to function correctly.
-    Visual Studio refers to command-lines with these setup as "Developer
+    Windows compilers and command line tooling have traditionally relied on
+    extra environment variables and PATH entries to function correctly.
+    Visual Studio refers to command lines with this setup as "Developer
    Command Prompt" or "Developer PowerShell" for ``cmd.exe`` and PowerShell
    respectively.
 
-    The HIP and CUDA SDKs on Windows doesn't ship complete toolchains, you will
+    The HIP and CUDA SDKs on Windows don't include complete toolchains. You will
    also need:
 
-    - the Windows SDK, most crucially providing the import libs to crucial system
-      libraries all executables must link to and some auxiliary compiler tooling.
-    - a Standard Template Library, aka. STL, which HIP too relies on.
-
-    The prior may be installed separately, though it's most conveniently obtained
-    through the Visual Studio installer, while the latter is part of the Microsoft
-    Visual C++ compiler, aka. MSVC, also installed via Visual Studio.
+    - The Microsoft Windows SDK.
It provides the import libs to crucial system
+      libraries that all executables must link to and some auxiliary compiler
+      tooling.
+    - A Standard Template Library (STL). Installed as part of the Microsoft
+      Visual C++ compiler (MSVC) or with Visual Studio.
 
-    If you don't already have some SKU of Visual Studio 2022 installed, for a
-    minimal command-line experience, install the
+    If you don't have a version of Visual Studio 2022 installed, for a
+    minimal command line experience, install the
    `Build Tools for Visual Studio 2022 `_
-    with the Desktop Developemnt Workload and under Individual Components select:
+    with the Desktop Development Workload. Under Individual Components select:
 
-    - some version of the Windows SDK
+    - A version of the Windows SDK
    - "MSVC v143 - VS 2022 C++ x64/x86 build tools (Latest)"
    - "C++ CMake tools for Windows" (optional)
 
    .. note::
 
      The "C++ CMake tools for Windows" individual component is a convenience which
-      puts both ``cmake.exe`` and ``ninja.exe`` onto the ``PATH`` inside developer
-      command-prompts. You can install these manually, but then you need to manage
+      puts both ``cmake.exe`` and ``ninja.exe`` onto the PATH inside developer
+      command prompts. You can install these manually, but then you must manage
      them manually.
 
-    Visual Studio installations as of VS 2017 are detectable as COM object
-    instances via WMI. To setup a command-line from any shell for the latest
-    Visual Studio's default (latest) Visual C++ toolset issue:
+    Visual Studio 2017 and later are detectable as COM object instances via WMI.
+    To set up a command line from any shell for the latest Visual Studio's
+    default Visual C++ toolset, run:
 
    .. code-block:: powershell
 
@@ -321,93 +316,93 @@ both Linux and Windows.
Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo' - You should be able to call the compiler on the command-line now: + You should be able to call the compiler on the command line now: .. code-block:: powershell - + nvcc --version -Invoking the Compiler Manually +Invoking the compiler manually ------------------------------ -To compile and link a single-file application, one may use the following -command: +To compile and link a single-file application, use the following commands: .. tab-set:: - .. tab-item:: Linux & AMD + .. tab-item:: Linux and AMD :sync: linux-amd .. code-block:: bash amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 - .. tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia .. code-block:: bash nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu - .. tab-item:: Windows & AMD + .. tab-item:: Windows and AMD :sync: windows-amd .. code-block:: powershell clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 - .. tab-item:: Windows & NVIDIA + .. tab-item:: Windows and NVIDIA :sync: windows-nvidia .. code-block:: powershell nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu -Depending on your computer, the resulting binary may or may not run. If not, it -will typically complain about about "Invalid device function". That error +Depending on your computer, the resulting binary might or might not run. If not, +it typically complains about "Invalid device function". That error (corresponding to the ``hipErrorInvalidDeviceFunction`` entry of ``hipError_t``) means that the runtime could not find a device program binary of the appropriate flavor embedded into the executable. 
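An error such as ``hipErrorInvalidDeviceFunction`` is exactly what a ``HIP_CHECK``-style macro surfaces. The sketch below shows the general pattern with a stand-in status enum so it compiles without the HIP headers; the real macro from the samples compares a ``hipError_t`` against ``hipSuccess``, so treat the names here as illustrative:

```cpp
#include <cstdio>
#include <cstdlib>

// Stand-in for hipError_t, kept self-contained for illustration. The real
// HIP runtime returns hipSuccess (0) or a failure code such as
// hipErrorInvalidDeviceFunction.
enum status_t { status_success = 0, status_failure = 1 };

// HIP_CHECK-like pattern: evaluate the call exactly once, then report the
// failure location and abort on error.
#define STATUS_CHECK(call)                                              \
    do {                                                                \
        const status_t status_check_result_ = (call);                   \
        if (status_check_result_ != status_success) {                   \
            std::fprintf(stderr, "error %d at %s:%d\n",                 \
                         static_cast<int>(status_check_result_),        \
                         __FILE__, __LINE__);                           \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)
```

The ``do { ... } while (0)`` wrapper makes the macro behave like a single statement, and storing the result in a local ensures the wrapped API call is evaluated only once.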
-So far we've only talked about how our data makes it from the host to the
-device and back. We've also seen our device code as source, but the HIP runtime
-was arguing about not finding the right binary to dispatch for execution. How
-can one find out what device binary flavors are embedded into the executable?
+So far, the discussion has covered how data makes it from the host to the
+device and back. It has also shown the device code as source, while the HIP
+runtime complained about not finding the correct binary to dispatch for
+execution. How can you find out what device binary flavors are embedded into
+the executable?
 
 .. tab-set::
-    .. tab-item:: Linux & AMD
+
+    .. tab-item:: Linux and AMD
      :sync: linux-amd
 
-    The set of ``roc-*`` utilities shipping with ROCm help significantly to inspect
-    binary artifacts on disk. If you wish to use these utilities, add the ROCmCC
-    installation folder to your PATH (the utilities expect them to be on the PATH).
+    The utilities included with ROCm help significantly to inspect binary
+    artifacts on disk. Add the ROCmCC installation folder to your PATH if you
+    want to use these utilities (the utilities expect them to be on the PATH).
 
-    Lisitng of the embedded program binaries can be done using ``roc-obj-ls``
+    You can list embedded program binaries using ``roc-obj-ls``.
 
    .. code-block:: bash
 
      roc-obj-ls ./saxpy
 
-    It may return something like:
+    It should return something like:
 
    .. code-block:: shell
 
      1       host-x86_64-unknown-linux        file://./saxpy#offset=12288&size=0
      1       hipv4-amdgcn-amd-amdhsa--gfx803  file://./saxpy#offset=12288&size=9760
 
-    We can see that the compiler embedded a version 4 code object (more on `code
+    The compiler embeds a version 4 code object (more on `code
    object versions `_) and
    used the LLVM target triple `amdgcn-amd-amdhsa--gfx803` (more on `target triples
    `_).
You can
+    extract that program object in a disassembled fashion for human consumption
+    via ``roc-obj``.
 
    .. code-block:: bash
 
      roc-obj -t gfx803 -d ./saxpy
 
-    Which will create two files on disk and we'll be interested in the one with the
-    ``.s`` extension. Opening up said file or dumping it to the console using ``cat``
-    one will find the disassembled binary of our saxpy compute kernel, something
+    This creates two files on disk; the one with the ``.s`` extension is of most
+    interest. Opening this file or dumping it to the console using ``cat``
+    lets you find the disassembled binary of the SAXPY compute kernel, something
    similar to:
 
    .. code-block::
 
@@ -442,14 +437,14 @@ can one find out what device binary flavors are embedded into the executable?
        flat_store_dword v[0:1], v3     // 000000001080: DC700000 00000300
        s_endpgm                        // 000000001088: BF810000
 
-    Alternatively we can call the compiler with ``--save-temps`` to dump all device
+    Alternatively, call the compiler with ``--save-temps`` to dump all device
    binary to disk in separate files.
 
    .. code-block:: bash
 
      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --save-temps
 
-    Now we can list all the temporaries created while compiling ``main.hip`` via
+    List all the temporaries created while compiling ``main.hip`` with:
 
    .. code-block:: bash
 
@@ -461,11 +456,11 @@ can one find out what device binary flavors are embedded into the executable?
      main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
      main-hip-amdgcn-amd-amdhsa-gfx803.s
 
-    Files with the ``.s`` extension hold the disassembled contents of the binary and
-    the filename directly informs us of the graphics IPs used by the compiler. The
-    contents of this file is very similar to what ``roc-obj`` printed to the console.
+    Files with the ``.s`` extension hold the disassembled contents of the binary.
+    The filename notes the graphics IPs used by the compiler. The contents of
+    this file are similar to what ``roc-obj`` printed to the console.
 
-    ..
tab-item:: Linux & NVIDIA + .. tab-item:: Linux and NVIDIA :sync: linux-nvidia Unlike HIP on AMD, when compiling using the NVIDIA support of HIP the resulting @@ -477,11 +472,11 @@ can one find out what device binary flavors are embedded into the executable? .. code-block:: bash - cuobjdump --list-ptx ./saxpy + cuobjdump --list-ptx ./saxpy Which will print something like: - .. code-block:: + .. code-block:: PTX file 1: saxpy.1.sm_52.ptx @@ -507,7 +502,7 @@ can one find out what device binary flavors are embedded into the executable? The output may look like: - .. code-block:: + .. code-block:: 000000014004C000: 5F474E414C435F5F 5F44414F4C46464F __CLANG_OFFLOAD_ 000000014004C010: 5F5F454C444E5542 0000000000000002 BUNDLE__........ @@ -524,7 +519,7 @@ can one find out what device binary flavors are embedded into the executable? We can see that the compiler embedded a version 4 code object (more on code `object versions `_) and - used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples + used the LLVM target triple `amdgcn-amd-amdhsa--gfx906` (more on `target triples `_). Don't be alarmed about linux showing up as a binary format, AMDGPU binaries uploaded to the GPU for execution are proper linux ELF binaries in their format. @@ -595,40 +590,40 @@ can one find out what device binary flavors are embedded into the executable? s_endpgm ... - .. tab-item:: Windows & NVIDIA + .. tab-item:: Windows and NVIDIA :sync: windows-nvidia - Unlike HIP on AMD, when compiling using the NVIDIA support of HIP the resulting - binary will be a valid CUDA executable as far as the binary goes. Therefor - it'll incorporate PTX ISA (Parallel Thread eXecution Instruction Set - Architecture) instead of AMDGPU binary. As s result, tooling shipping with the - CUDA SDK can be used to inspect which device ISA got compiled into a specific - executable. The tool most useful to us currently is ``cuobjdump``. 
+    Unlike HIP on AMD, when compiling using the NVIDIA support for HIP, the resulting
+    binary will be a valid CUDA executable. Therefore, it'll incorporate PTX ISA
+    (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU
+    binary. As a result, tooling included with the CUDA SDK can be used to
+    inspect which device ISA was compiled into a specific executable. The most
+    helpful tool for us currently is ``cuobjdump``.
 
    .. code-block:: bash
 
      cuobjdump.exe --list-ptx .\saxpy.exe
 
-    Which will print something like:
+    Which prints something like:
 
-    .. code-block:: 
+    .. code-block::
 
      PTX file 1: saxpy.1.sm_52.ptx
 
-    From this we can see that the saxpy kernel is stored as ``sm_52``, which shows
-    that a compute capability 5.2 ISA got embedded into the executable, so devices
-    which sport compute capability 5.2 or newer will be able to run this code.
+    This example shows that the SAXPY kernel is stored as ``sm_52``. It also shows
+    that a compute capability 5.2 ISA was embedded into the executable, so devices
+    that support compute capability 5.2 or newer will be able to run this code.
 
-Now that we've found what binary got embedded into the executable, we only need
-to find which format our available devices use.
+Now that you've found what binary got embedded into the executable, find which
+format your available devices use.
 
 .. tab-set::
 
-    .. tab-item:: Linux & AMD
+    .. tab-item:: Linux and AMD
      :sync: linux-amd
 
-    On Linux a utility called ``rocminfo`` can help us list all the properties of the
+    On Linux, a utility called ``rocminfo`` helps you list all the properties of the
    devices available on the system, including which version of graphics IP
-    (``gfxXYZ``) they employ. We'll filter the output to have only these lines:
+    (``gfxXYZ``) they employ. You can filter the output to have only these lines:
 
    .. code-block:: bash
 
@@ -636,27 +631,23 @@
Name: gfx906
      Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
 
-    _(For the time being let's not discuss what the colon-dlimited list of device
-    features are after the graphics IP. Until further notice we'll treat them as
-    part of the binary version.)_
-
-    .. tab-item:: Linux & NVIDIA
+    .. tab-item:: Linux and NVIDIA
      :sync: linux-nvidia
 
-    On Linux HIP with the NVIDIA back-end a CUDA SDK sample called ``deviceQuery``
+    On Linux HIP with the NVIDIA back-end, the ``deviceQuery`` CUDA SDK sample
    can help us list all the properties of the devices available on the system,
    including which version of compute capability a device sports.
 
-    (``.`` compute capability is passed to ``nvcc`` on the
-    command-line as ``sm_``, for eg. ``8.6`` is ``sm_86``.)
+    ``.`` compute capability is passed to ``nvcc`` on the
+    command-line as ``sm_``, for example, ``8.6`` is ``sm_86``.
 
-    Because it's not shipped as a binary, we may as well compile the matching
+    Because it's not included as a binary, compile the matching
    example from ROCm.
 
    .. code-block:: bash
 
      nvcc ./HIP-Basic/device_query/main.cpp -o device_query -I ./Common -I /opt/rocm/include -O2
 
-    We'll filter the output to have only the lines of interest, for eg.:
+    Filter the output to have only the lines of interest, for example:
 
    .. code-block:: bash
 
@@ -666,18 +657,18 @@ to find which format our available devices use.
 
    .. note::
 
-      Next to the ``nvcc`` executable is another tool called ``__nvcc_device_query``
-      which simply prints the SM Architecture numbers to standard out as a comma
-      separated list of numbers. The naming of this utility suggests it's not a user
-      facing executable but is used by ``nvcc`` to determine what devices are in the
+      In addition to the ``nvcc`` executable is another tool called ``__nvcc_device_query``
+      which prints the SM Architecture numbers to standard out as a
+      comma-separated list of numbers.
The utility's name suggests it's not a user-facing
+      executable but is used by ``nvcc`` to determine what devices are in the
+      system at hand.
 
-    .. tab-item:: Windows & AMD
+    .. tab-item:: Windows and AMD
      :sync: windows-amd
 
-    On Windows a utility called ``hipInfo.exe`` can help us list all the properties
+    On Windows, a utility called ``hipInfo.exe`` helps you list all the properties
    of the devices available on the system, including which version of graphics IP
-    (``gfxXYZ``) they employ. We'll filter the output to have only these lines:
+    (``gfxXYZ``) they employ. Filter the output to have only these lines:
 
    .. code-block:: powershell
 
      .\hipInfo.exe
 
      gcnArchName: gfx1032
      gcnArchName: gfx1035
 
-    .. tab-item:: Winodws & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
-    On Windows HIP with the NVIDIA back-end a CUDA SDK sample called ``deviceQuery``
+    On Windows HIP with the NVIDIA back-end, the ``deviceQuery`` CUDA SDK sample
    can help us list all the properties of the devices available on the system,
    including which version of compute capability a device sports.
 
-    (``.`` compute capability is passed to ``nvcc`` on the
-    command-line as ``sm_``, for eg. ``8.6`` is ``sm_86``.)
+    ``.`` compute capability is passed to ``nvcc`` on the
+    command-line as ``sm_``, for example, ``8.6`` is ``sm_86``.
 
-    Because it's not shipped as a binary, we may as well compile the matching
+    Because it's not included as a binary, compile the matching
    example from ROCm.
 
    .. code-block:: powershell
 
      nvcc .\HIP-Basic\device_query\main.cpp -o device_query.exe -I .\Common -I ${env:HIP_PATH}include -O2
 
-    We'll filter the output to have only the lines of interest, for eg.:
+    Filter the output to have only the lines of interest, for example:
 
    .. code-block:: powershell
 
@@ -719,26 +710,27 @@ to find which format our available devices use.
 
      facing executable but is used by ``nvcc`` to determine what devices are in the
      system at hand.
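The mapping from a reported compute capability such as ``8.6`` to the ``sm_86`` value passed to ``nvcc`` is purely textual: drop the dot and prefix ``sm_``. A small illustrative helper (the function is ours, not part of any SDK):

```cpp
#include <string>

// Turn a compute capability string like "8.6" into the architecture
// suffix nvcc expects, e.g. "sm_86".
std::string to_sm(const std::string& capability)
{
    std::string digits;
    for (char c : capability)
        if (c != '.')
            digits += c;  // drop the dot, keep the major/minor digits
    return "sm_" + digits;
}
```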
-Now that we know which versions of graphics IP our devices use, we can
-recompile our program with said parameters.
+Now that you know which graphics IPs your devices use, recompile your program
+with the appropriate parameters.
 
 .. tab-set::
 
+
     .. tab-item:: Linux and AMD
      :sync: linux-amd
 
    .. code-block:: bash
 
      amdclang++ ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -lamdhip64 -L /opt/rocm/lib -O2 --offload-arch=gfx906:sramecc+:xnack-
 
-    Now our sample will surely run.
+    Now the sample will run.
 
-    .. code-block:: 
+    .. code-block::
 
      ./saxpy
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Linux & NVIDIA
+    .. tab-item:: Linux and NVIDIA
      :sync: linux-nvidia
 
    .. code-block:: bash
 
@@ -750,22 +742,22 @@ recompile our program with said parameters.
 
    If you want to portably target the development machine which is compiling, you
    may specify ``-arch=native`` instead.
 
-    Now our sample will surely run.
+    Now the sample will run.
 
-    .. code-block:: 
+    .. code-block::
 
      ./saxpy
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Windows & AMD
+    .. tab-item:: Windows and AMD
      :sync: windows-amd
 
    .. code-block:: powershell
 
      clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
 
-    Now our sample will surely run.
+    Now the sample will run.
 
    .. code-block::
 
      .\saxpy.exe
      Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
      First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
 
-    .. tab-item:: Windows & NVIDIA
+    .. tab-item:: Windows and NVIDIA
      :sync: windows-nvidia
 
    .. code-block:: powershell
 
@@ -785,9 +777,9 @@ recompile our program with said parameters.
If you want to portably target the development machine which is compiling, you may specify ``-arch=native`` instead. - Now our sample will surely run. + Now the sample will run. - .. code-block:: + .. code-block:: .\saxpy.exe Calculating y[i] = a * x[i] + y[i] over 1000000 elements.
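As a sanity check, the expected output above can be re-derived on the host. The sketch below assumes the sample initializes ``a = 2``, ``x[i] = i + 1``, and ``y[i] = 1``; that assumption is only inferred from the printed results, so check ``main.hip`` for the values the example actually uses:

```cpp
#include <cstddef>
#include <vector>

// Host-side re-computation of the sample's first results, under the ASSUMED
// initialization a = 2, x[i] = i + 1, y[i] = 1 (inferred from the printed
// output; verify against main.hip).
std::vector<float> expected_results(std::size_t count)
{
    const float a = 2.0f;
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const float x = static_cast<float>(i + 1);
        const float y = 1.0f;
        out[i] = a * x + y;  // y[i] = a * x[i] + y[i]
    }
    return out;
}
```

Under that assumed setup, the first ten elements come out as 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, matching the sample's printed results.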