WIP

ROCm · May 24, 2024 · b0ec7c3 · b0ec7c3
1 parent 7dd32ef
commit b0ec7c3
Show file tree

Hide file tree

Showing 2 changed files with 39 additions and 32 deletions.
diff --git a/.wordlist.txt b/.wordlist.txt
@@ -8,6 +8,7 @@ clr
 enqueue
 enqueues
 embeded
+extern
 fatbinary
 GPGPU
 hipcc

diff --git a/docs/how-to/faq.md b/docs/how-to/faq.md
@@ -4,16 +4,16 @@
 
 HIP provides the following:
 
-* Devices (hipSetDevice(), hipGetDeviceProperties(), etc.)
-* Memory management (hipMalloc(), hipMemcpy(), hipFree(), etc.)
-* Streams (hipStreamCreate(),hipStreamSynchronize(), hipStreamWaitEvent(),  etc.)
-* Events (hipEventRecord(), hipEventElapsedTime(), etc.)
+* Devices (`hipSetDevice()`, `hipGetDeviceProperties()`, etc.)
+* Memory management (`hipMalloc()`, `hipMemcpy()`, `hipFree()`, etc.)
+* Streams (`hipStreamCreate()`, `hipStreamSynchronize()`, `hipStreamWaitEvent()`, etc.)
+* Events (`hipEventRecord()`, `hipEventElapsedTime()`, etc.)
 * Kernel launching (hipLaunchKernel/hipLaunchKernelGGL is the preferred way of launching kernels. hipLaunchKernelGGL is a standard C/C++ macro that can serve as an alternative way to launch kernels, replacing the CUDA triple-chevron (<<< >>>) syntax).
-* HIP Module API to control when adn how code is loaded.
-* CUDA-style kernel coordinate functions (threadIdx, blockIdx, blockDim, gridDim)
+* HIP Module API to control when and how code is loaded.
+* CUDA-style kernel coordinate functions (`threadIdx`, `blockIdx`, `blockDim`, `gridDim`)
 * Cross-lane instructions including shfl, ballot, any, all
 * Most device-side math built-ins
-* Error reporting (hipGetLastError(), hipGetErrorString())
+* Error reporting (`hipGetLastError()`, `hipGetErrorString()`)
 
 The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.
 
@@ -56,44 +56,44 @@ However, we can provide a rough summary of the features included in each CUDA SD
   * HIP supports CUDA 4.0 except for the limitations described above.
 * CUDA 5.0 :
   * Dynamic Parallelism (not supported)
-  * cuIpc functions (under development).
+  * `cuIpc` functions (under development).
 * CUDA 6.0 :
   * Managed memory (under development)
 * CUDA 6.5 :
-  * __shfl intrinsic (supported)
+  * `__shfl` intrinsic (supported)
 * CUDA 7.0 :
   * Per-thread default streams (supported)
   * C++11 (Hip-Clang supports all of C++11, all of C++14 and some C++17 features)
 * CUDA 7.5 :
   * float16 (supported)
 * CUDA 8.0 :
-  * Page Migration including cudaMemAdvise, cudaMemPrefetch, other cudaMem* APIs(not supported)
+  * Page Migration including `cudaMemAdvise`, `cudaMemPrefetch`, other `cudaMem*` APIs(not supported)
 * CUDA 9.0 :
   * Cooperative Launch, Surface Object Management, Version Management
 
 ## What libraries does HIP support?
 
-HIP includes growing support for the four key math libraries using hipBlas, hipFFt, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications.
+HIP includes growing support for the four key math libraries using hipBLAS, hipFFt, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications.
 These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications.
 The hip interfaces support both ROCm and CUDA paths, with familiar library interfaces.
 
-* [hipBlas](https://github.com/ROCmSoftwarePlatform/hipBLAS), which utilizes [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS).
+* [hipBLAS](https://github.com/ROCmSoftwarePlatform/hipBLAS), which utilizes [rocBlas](https://github.com/ROCmSoftwarePlatform/rocBLAS).
 * [hipFFt](https://github.com/ROCmSoftwarePlatform/hipfft)
 * [hipsSPARSE](https://github.com/ROCmSoftwarePlatform/hipsparse)
 * [hipRAND](https://github.com/ROCmSoftwarePlatform/hipRAND)
 * [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen)
 
-Additionally, some of the cublas routines are automatically converted to hipblas equivalents by the HIPIFY tools. These APIs use cublas or hcblas depending on the platform and replace the need to use conditional compilation.
+Additionally, some of the cuBLAS routines are automatically converted to hipblas equivalents by the HIPIFY tools. These APIs use cuBLAS or hcBLAS depending on the platform and replace the need to use conditional compilation.
 
 ## How does HIP compare with OpenCL?
 
-Both AMD and Nvidia support OpenCL 1.2 on their devices so that developers can write portable code.
+Both AMD and NVIDIA support OpenCL 1.2 on their devices so that developers can write portable code.
 HIP offers several benefits over OpenCL:
 
 * Developers can code in C++ as well as mix host and device C++ code in their source files. HIP C++ code can use templates, lambdas, classes and so on.
 * The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
 * Because both CUDA and HIP are C++ languages, porting from CUDA to HIP is significantly easier than porting from CUDA to OpenCL.
-* HIP uses the best available development tools on each platform: on Nvidia GPUs, HIP code compiles using NVCC and can employ the nSight profiler and debugger (unlike OpenCL on Nvidia GPUs).
+* HIP uses the best available development tools on each platform: on NVIDIA GPUs, HIP code compiles using NVCC and can employ the nSight profiler and debugger (unlike OpenCL on NVIDIA GPUs).
 * HIP provides pointers and host-side pointer arithmetic.
 * HIP provides device-level control over memory allocation and placement.
 * HIP offers an offline compilation model.
@@ -113,7 +113,7 @@ The tools also struggle with more complex CUDA applications, in particular, thos
 ## What hardware does HIP support?
 
 * For AMD platforms, see the [ROCm documentation](https://github.com/RadeonOpenCompute/ROCm#supported-gpus) for the list of supported platforms.
-* For Nvidia platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the Nvidia Titan and Tesla K40.
+* For NVIDIA platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the NVIDIA Titan and Tesla K40.
 
 ## Do HIPIFY tools automatically convert all source code?
 
@@ -125,7 +125,7 @@ In general, developers should always expect to perform some platform-specific tu
 
 ## What is NVCC?
 
-NVCC is Nvidia's compiler driver for compiling "CUDA C++" code into PTX or device code for Nvidia GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
+NVCC is NVIDIA's compiler driver for compiling "CUDA C++" code into PTX or device code for NVIDIA GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
 
 ## What is HIP-Clang?
 
@@ -134,10 +134,10 @@ HIP-Clang is a Clang/LLVM based compiler to compile HIP programs which can run o
 ## Why use HIP rather than supporting CUDA directly?
 
 While HIP is a strong subset of the CUDA, it is a subset.  The HIP layer allows that subset to be clearly defined and documented.
-Developers who code to the HIP API can be assured their code will remain portable across Nvidia and AMD platforms.
+Developers who code to the HIP API can be assured their code will remain portable across NVIDIA and AMD platforms.
 In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
 
-## Can I develop HIP code on an Nvidia CUDA platform?
+## Can I develop HIP code on an NVIDIA CUDA platform?
 
 Yes.  HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends.
 "Extra" APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors.
@@ -177,7 +177,7 @@ hipother supports the HIP back-end implementation on some non-AMD platforms, lik
 
 No, there is no HIP repository open publicly on Windows.
 
-## Can a HIP binary run on both AMD and Nvidia platforms?
+## Can a HIP binary run on both AMD and NVIDIA platforms?
 
 HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a "fat binary" that can run on either platform, however.
 
@@ -264,19 +264,22 @@ export HIP_PLATFORM=nvidia
 ```
 
 In this case, HIP will set and use the following,
+
+```shell
 HIP_COMPILER=cuda
 HIP_RUNTIME=nvcc
+```
 
 One symptom of this problem is the message "error: 'unknown error'(11) at square.hipref.cpp:56".  This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as nvcc.  HIP may be able to compile the application using the nvcc tool-chain but will generate this error at runtime since the platform does not have a CUDA device.
 
 ## On CUDA, can I mix CUDA code with HIP code?
 
 Yes.  Most HIP data structures (hipStream_t, hipEvent_t) are typedefs to CUDA equivalents and can be intermixed.  Both CUDA and HIP use integer device ids.
-One notable exception is that hipError_t is a new type, and cannot be used where a cudaError_t is expected.  In these cases, refactor the code to remove the expectation.  Alternatively, hip_runtime_api.h defines functions which convert between the error code spaces:
+One notable exception is that `hipError_t` is a new type, and cannot be used where a `cudaError_t` is expected.  In these cases, refactor the code to remove the expectation.  Alternatively, hip_runtime_api.h defines functions which convert between the error code spaces:
 
-hipErrorToCudaError
-hipCUDAErrorTohipError
-hipCUResultTohipError
+`hipErrorToCudaError`
+`hipCUDAErrorTohipError`
+`hipCUResultTohipError`
 
 If platform portability is important, use #ifdef __HIP_PLATFORM_NVIDIA__ to guard the CUDA-specific code.
 
@@ -297,7 +300,7 @@ __shfl_*_sync is not supported on HIP but for nvcc path CUDA 9.0 and above all s
 
 The compiler defines the `__HIP_DEVICE_COMPILE__` macro only when compiling the code for the GPU.  It could be used to guard code that is specific to the host or the GPU.
 
-## Why _OpenMP is undefined when compiling with -fopenmp?
+## Why _OpenMP is undefined when compiling with `-fopenmp`?
 
 When compiling an OpenMP source file with `hipcc -fopenmp`, the compiler may generate error if there is a reference to the `_OPENMP` macro.  This is due to a limitation in hipcc that treats any source file type (for example `.cpp`) as an HIP translation unit leading to some conflicts with the OpenMP language switch.  If the OpenMP source file doesn't contain any HIP language constructs you could work around this issue by adding the `-x c++` switch to force the compiler to treat the file as regular C++.  Another approach would be to guard the OpenMP code with `#ifdef _OPENMP` so that the code block is disabled when compiling for the GPU.  The `__HIP_DEVICE_COMPILE__` macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.
 
@@ -317,10 +320,10 @@ If you have compiled the application yourself, make sure you have given the corr
 If you have a precompiled application/library (like rocblas, tensorflow etc) which gives you such error, there are one of two possibilities.
 
 * The application/library does not ship code object bundles for __all__ of your device(s): in this case you need to recompile the application/library yourself with correct `--offload-arch`.
-* The application/library does not ship code object bundles for __some__ of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` or `CUDA_VISIBLE_DEVICES` on NVdia platform, to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run.
+* The application/library does not ship code object bundles for __some__ of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable `HIP_VISIBLE_DEVICES` or `CUDA_VISIBLE_DEVICES` on NVIDIA platform, to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run.
 
-Note: In previous releases, the error code is hipErrorNoBinaryForGpu with message "Unable to find code object for all current devices".
-The error code handling behavior is changed. HIP runtime shows the error code hipErrorSharedObjectInitFailed with message "Error: shared object initialization failed" on unsupported GPU.
+Note: In previous releases, the error code is `hipErrorNoBinaryForGpu` with message "Unable to find code object for all current devices".
+The error code handling behavior is changed. HIP runtime shows the error code `hipErrorSharedObjectInitFailed` with message "Error: shared object initialization failed" on unsupported GPU.
 
 ## How to use per-thread default stream in HIP?
 
@@ -329,21 +332,24 @@ The per-thread default stream is an implicit stream local to both the thread and
 The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
 
 In ROCm, a compilation option should be added in order to compile the translation unit with per-thread default stream enabled.
-"-fgpu-default-stream=per-thread".
+`-fgpu-default-stream=per-thread`.
 Once source is compiled with per-thread default stream enabled, all APIs will be executed on per thread default stream, hence there will not be any implicit synchronization with other streams.
 
 Besides, per-thread default stream be enabled per translation unit, users can compile some files with feature enabled and some with feature disabled. Feature enabled translation unit will have default stream as per thread and there will not be any implicit synchronization done but other modules will have legacy default stream which will do implicit synchronization.
 
 ## How to use complex muliplication and division operations?
 
-In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types,
+In HIP, `hipFloatComplex` and `hipDoubleComplex` are defined as complex data types,
+
+```c++
 typedef float2 hipFloatComplex;
 typedef double2 hipDoubleComplex;
+```
 
 Any application uses complex multiplication and division operations, need to replace '*' and '/' operators with the following,
 
-* hipCmulf() and hipCdivf() for hipFloatComplex
-* hipCmul() and hipCdiv() for hipDoubleComplex
+* `hipCmulf()` and `hipCdivf()` for `hipFloatComplex`
+* `hipCmul()` and `hipCdiv()` for `hipDoubleComplex`
 
 Note: These complex operations are equivalent to corresponding types/functions on the NVIDIA platform.