Update error hanfling
neon60 committed Oct 11, 2024
1 parent 6b337b2 commit aad09d3
Showing 1 changed file with 59 additions and 121 deletions.
180 changes: 59 additions & 121 deletions docs/how-to/hip_runtime_api/error_handling.rst
@@ -6,146 +6,85 @@
Error handling
********************************************************************************

HIP provides functionality to detect, report, and manage errors that occur during the execution of HIP runtime functions or when launching kernels. Every HIP runtime function, apart from kernel launches, returns `hipError_t`.
:cpp:func:`hipGetLastError()` and :cpp:func:`hipPeekAtLastError()` can be used to catch errors from kernel launches, because kernel launches don't return an error directly. HIP maintains an internal state that includes the last error code. hipGetLastError returns that error and resets it to hipSuccess, while hipPeekAtLastError returns the error without changing it.
To get a human-readable version of the errors, use :cpp:func:`hipGetErrorString()` and hipGetErrorName().
error-checking macros, developers can ensure their HIP applications are robust,
easier to debug, and more reliable. Proper error handling is crucial for
identifying issues early in the development process and ensuring that
applications behave as expected.

Strategies
================================================================================

One of the fundamental best practices in error handling is to develop a
consistent strategy across the entire application. This involves defining
how errors are reported, logged, and managed. This can be achieved by using a
centralized error handling mechanism that ensures consistency and reduces
redundancy. For instance, using macros to simplify error checking and reduce
code duplication is a common practice. A macro like ``HIP_CHECK`` can be defined
to check the return value of HIP API calls and handle errors appropriately.

Granular error reporting
--------------------------------------------------------------------------------

Granular error reporting means reporting errors at the appropriate level of detail. Too much detail
can overwhelm users, while too little can make debugging difficult.
Differentiating between user-facing errors and internal errors is crucial.

Fail-fast principle
--------------------------------------------------------------------------------

The fail-fast principle means detecting and handling errors as early as
possible to prevent them from propagating and causing more significant issues,
for example by validating inputs and preconditions before performing operations.

Resource management
--------------------------------------------------------------------------------

Ensuring that resources such as memory, file handles, and network connections
are properly managed and released in the event of an error is essential.
HIP provides functionality to detect, report, and manage errors that occur
during the execution of HIP runtime functions or when launching kernels. Every
HIP runtime function, apart from kernel launches, returns
:cpp:type:`hipError_t`. :cpp:func:`hipGetLastError()` and
:cpp:func:`hipPeekAtLastError()` can be used to catch errors from kernel
launches, because kernel launches don't return an error directly. HIP maintains
an internal state that includes the last error code.
:cpp:func:`hipGetLastError` returns that error and resets it to
:cpp:enumerator:`hipSuccess`, while :cpp:func:`hipPeekAtLastError` returns the
error without changing it. To get a human-readable version of the errors, use
:cpp:func:`hipGetErrorString()` and :cpp:func:`hipGetErrorName()`.
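
The difference between reading and peeking at the stored error can be sketched
without a GPU. The following is a minimal model, assuming a single stored error
code. The ``sketch`` namespace and every name in it are hypothetical stand-ins
written for this illustration; they are not part of the HIP API.

```cpp
#include <cassert>

// Hypothetical stand-in for the "last error" state described above.
// None of these names belong to HIP; they only model the behavior.
namespace sketch {

enum Error { Success = 0, ErrorInvalidValue = 1 };

// HIP keeps this state internally; one variable models it here.
static Error last_error = Success;

// Models a kernel launch: it returns nothing and only records a failure.
void launch(bool valid_configuration) {
    if (!valid_configuration) {
        last_error = ErrorInvalidValue;
    }
}

// Models hipPeekAtLastError(): report the stored error and leave it in place.
Error peek_at_last_error() { return last_error; }

// Models hipGetLastError(): report the stored error, then reset it.
Error get_last_error() {
    const Error error = last_error;
    last_error = Success;
    return error;
}

// Walks through the sequence described in the text.
void demo() {
    launch(false);                                      // the "launch" fails
    assert(peek_at_last_error() == ErrorInvalidValue);  // error is still stored
    assert(get_last_error() == ErrorInvalidValue);      // returned, then cleared
    assert(get_last_error() == Success);                // state was reset
}

}  // namespace sketch
```

In a HIP program the same sequence would be a kernel launch followed by a call
to :cpp:func:`hipPeekAtLastError` or :cpp:func:`hipGetLastError`.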

Error handling usage
================================================================================

:cpp:func:`hipGetLastError` and :cpp:func:`hipPeekAtLastError`
are used to detect errors after HIP API calls. This ensures that any issues are
caught early in the execution flow.
Error handling functions enable developers to implement appropriate error
handling strategies, such as retry mechanisms, resource cleanup, or graceful
degradation.

For reporting, :cpp:func:`hipGetErrorName` and :cpp:func:`hipGetErrorString`
provide meaningful error messages that can be logged or displayed to users.
This helps in understanding the nature of the error and facilitates debugging.
The most important :ref:`error handling functions <error_handling_reference>`
are:

By checking for errors and providing detailed information, these functions
enable developers to implement appropriate error handling strategies, such as
retry mechanisms, resource cleanup, or graceful degradation.
* :cpp:func:`hipGetLastError` returns the last error that occurred during a HIP
runtime API call and resets the error code to :cpp:enumerator:`hipSuccess`.
* :cpp:func:`hipPeekAtLastError` returns the last error that occurred during a HIP
runtime API call **without** resetting the error code.
* :cpp:func:`hipGetErrorName` converts a HIP error code to a string representing
the error name.
* :cpp:func:`hipGetErrorString` converts a HIP error code to a string describing
the error.

Error handling function examples
--------------------------------------------------------------------------------
Best practices for HIP error handling:

:cpp:func:`hipGetLastError` returns the last error that occurred during a HIP
runtime API call and resets the error code to :cpp:enumerator:`hipSuccess`:
1. Check errors after each API call to keep them from propagating.
2. Use macros for error checking, see :ref:`hip_check_macros`.
3. Handle errors gracefully: free resources and provide meaningful error
messages to the user.

.. code-block:: cpp
.. _hip_check_macros:

hipError_t err = hipGetLastError();
if (err != hipSuccess)
{
std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl;
}
HIP check macros
--------------------------------------------------------------------------------

:cpp:func:`hipPeekAtLastError` returns the last error that occurred during a HIP
runtime API call **without** resetting the error code:
HIP applications commonly use check macros to simplify error checking and
reduce code duplication. The ``HIP_CHECK`` macros are mainly used to detect and
report errors, but some versions also exit the application. Examples of
``HIP_CHECK`` macros:

.. code-block:: cpp
hipError_t err = hipPeekAtLastError();
if (err != hipSuccess)
{
std::cerr << "HIP error: " << hipGetErrorString(err) << std::endl;
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
:cpp:func:`hipGetErrorName` converts a HIP error code to a string representing
the error name:

.. code-block:: cpp
std::cerr << "Error name: " << hipGetErrorName(err) << std::endl;
:cpp:func:`hipGetErrorString` converts a HIP error code to a string describing
the error:

.. code-block:: cpp
std::cerr << "Error description: " << hipGetErrorString(err) << std::endl;
Best practices
--------------------------------------------------------------------------------

1. Check errors after each API call

Always check the return value of HIP API calls to catch errors early. For
example:

.. code-block:: cpp
hipError_t err = hipMalloc(&d_A, size);
if (err != hipSuccess) {
std::cerr << "hipMalloc failed: " << hipGetErrorString(err) << std::endl;
return -1;
}
2. Use macros for error checking

Define macros to simplify error checking and reduce code duplication. For
example:

.. code-block:: cpp
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
} \
}
// Allocate memory on the device with error check.
HIP_CHECK(hipMalloc(&device_pointer, N * sizeof(float)));
3. Handle errors gracefully

Ensure the application can handle errors gracefully, such as by freeing
resources or providing meaningful error messages to the user.
#define HIP_CHECK_EXIT(expression) \
{ \
const hipError_t status = expression; \
if(status != hipSuccess){ \
std::cerr << "HIP error " \
<< status << ": " \
<< hipGetErrorString(status) \
<< " at " << __FILE__ << ":" \
<< __LINE__ << std::endl; \
exit(1); \
} \
}
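
Both macros above expand to a bare ``{ }`` block, so writing
``HIP_CHECK(...);`` directly before an ``else`` does not compile. A common
remedy is to wrap the body in ``do { } while (0)``. The following is a minimal
sketch of that variant, assuming stand-in types so that it builds without the
HIP headers: ``status_t``, ``describe``, ``fake_call``, ``run_checks`` and
``SKETCH_CHECK`` are hypothetical names, not HIP identifiers.

```cpp
#include <cstdio>

// Hypothetical stand-ins so the sketch compiles without the HIP headers.
// With HIP available, status_t would be hipError_t, kSuccess would be
// hipSuccess, and describe() would be hipGetErrorString().
enum status_t { kSuccess = 0, kFailure = 1 };

static const char* describe(status_t status) {
    return status == kSuccess ? "no error" : "failure";
}

// Counts reported errors so the caller can decide how to react afterwards.
static int reported_errors = 0;

// Report-only check in the spirit of HIP_CHECK. The do { } while (0) wrapper
// makes the macro expand to exactly one statement, so it can be used in an
// unbraced if/else.
#define SKETCH_CHECK(expression)                                        \
    do {                                                                \
        const status_t status_ = (expression);                          \
        if (status_ != kSuccess) {                                      \
            ++reported_errors;                                          \
            std::fprintf(stderr, "error %d: %s at %s:%d\n",             \
                         static_cast<int>(status_), describe(status_),  \
                         __FILE__, __LINE__);                           \
        }                                                               \
    } while (0)

// Hypothetical API call that fails on request.
static status_t fake_call(bool ok) { return ok ? kSuccess : kFailure; }

// Uses the macro in an unbraced if/else and returns the number of errors
// that were reported during this call.
static int run_checks(bool first_ok, bool second_ok) {
    reported_errors = 0;
    if (first_ok)
        SKETCH_CHECK(fake_call(second_ok));
    else
        SKETCH_CHECK(fake_call(false));
    return reported_errors;
}
```

The trailing underscore in ``status_`` lowers the chance that the macro's local
variable shadows a name used inside the checked expression.
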
Complete example
--------------------------------------------------------------------------------

A complete example demonstrating error handling:
A complete example that demonstrates error handling with a simple kernel that
adds two values:

.. code-block:: cpp
Expand Down Expand Up @@ -193,7 +132,6 @@ A complete example demonstrating error handling:
// Check for kernel execution error
HIP_CHECK(hipDeviceSynchronize());
// Copy the result back to the host.
HIP_CHECK(hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost));