OptiX PTX pipeline overhaul (#1680)
## Description

This PR is an overhaul of PTX compilation in OSL. The goal is to minimize the size of the generated PTX to help reduce time spent in subsequent stages (e.g., module creation, pipeline linking, etc.).

### rend_lib

The main functional change is in how the functions defined in rend_lib.cu are handled. In the current setup, rend_lib.cu is compiled to a standalone PTX file using nvcc. The functions are considered `extern` in generated shaders, and the dependencies are satisfied when the final pipeline is linked together. Since the functions are `extern`, they are not available for optimization when each shader is generated. This prevents the inlining of small functions, which incurs a lot of function call overhead in terms of PTX size and run-time cost.

So instead of compiling rend_lib.cu separately, it is compiled to LLVM bitcode along with the rest of the "shadeops" sources. This bitcode is used to seed the LLVM Module for each shader. This makes the definitions available when each shader is being generated and optimized, which allows the optimizer to make better decisions. It also allows for better control of inlining decisions which affect the size and quality of the generated PTX.

### shadeops

Another significant functional change is in how the shadeops functions (including those defined in rend_lib.cu) are treated. In the current pipeline, each shader must carry along with it definitions for all non-inlined shadeops functions that are used. This bloats the size of the PTX for each shader, and results in many duplicate copies of those functions existing in the final linked pipeline.

In the new pipeline, the unified shadeops+rend_lib bitcode is compiled to PTX using the offline `llc` tool. This PTX is used by the renderer to create a single "shadeops" module that provides the definitions for those functions. This makes it possible to drop the definitions for those functions from the PTX for each shader, which can significantly reduce the size of the generated PTX.

### quirks

This PR contains some interesting "quirks" which are largely fallout from compiling rend_lib.cu using clang instead of nvcc:

- The clang-generated PTX is processed by a Python script to transform some function & symbol declarations to better match what was generated by nvcc. It might be possible to add the right decorators to the C++/CUDA code to get the correct visibility for these symbols, but I was not successful in doing so.
- It was necessary to change the function signatures for some of the rend_lib functions to use `void*` instead of the actual OSL types (e.g., `OSL::Color3*`). Trying to use the actual types produces an LLVM assertion failure, for reasons that I don't understand.
- I have enabled flush-to-zero (FTZ) for clang-compiled code, to better match the behavior of nvcc. I added a CMake variable to enable or disable FTZ in the event that somebody wants to disable it.
- I have added some alignment hints on pointer accesses in rend_lib.cu. clang does not appear to be able to deduce the alignment of pointers in many cases, which turns things like `memcpy` and `memset` into long series of byte operations. Functions like `osl_get_matrix` are bloated substantially if the hints aren't provided.

---------

Signed-off-by: Tim Grant <[email protected]>
tgrant-nv authored Sep 8, 2023
1 parent 8c2d8c0 commit f449749
Showing 24 changed files with 1,632 additions and 283 deletions.
112 changes: 112 additions & 0 deletions doc/app_integration/OptiX-Inlining-Options.md
@@ -0,0 +1,112 @@
<!-- SPDX-License-Identifier: CC-BY-4.0 -->
<!-- Copyright Contributors to the Open Shading Language Project. -->

Inlining Options for OptiX and CUDA
===================================

When compiling shaders for OptiX and CUDA (and in general), there is a tradeoff
between compile speed and shade-time performance. The LLVM optimizer generally
does a good job of balancing these concerns, but there might be cases where a
renderer can give additional hints to the optimizer to tip the balance one way
or the other.

Aggressive inlining can increase the run-time performance, but can negatively
impact the compile speed. Inlining can be very helpful with small functions
where the function call overhead tends to dwarf the useful instructions, so it
is important to inline such functions when possible.

Choosing to __not__ inline certain functions (e.g., very large noise functions)
allows them to be excluded from the module prior to running the LLVM optimizer
and JIT engine, which can greatly improve the compile time. This is particularly
beneficial when a function is not likely to be inlined anyway; removing such
large functions from the module prior to optimization can speed up compilation
considerably without affecting the generated PTX.

ShadingSystem Attributes
------------------------

There are a number of `ShadingSystem` attributes to help control the inlining
behavior. The default settings should work well in most circumstances, but they
can be adjusted to favor compile speed over shade-time performance, or vice
versa.

* `optix_no_inline_thresh`: Don't inline functions whose size is greater than or
equal to the threshold. This allows them to be excluded from the module prior to
optimization, which reduces the size of the module and can greatly speed up the
optimization and JIT stages.

* `optix_force_inline_thresh`: Force inlining of functions whose size is less
than or equal to the threshold. This tends to be most helpful with relatively
low values (less than 30).

* `optix_no_inline`: Don't inline any functions. Offers the best compile times
at the expense of shade-time performance. This option is not recommended, but is
included for benchmarking and tuning purposes.

* `optix_no_inline_layer_funcs`: Don't inline the shader layer functions. This
can moderately improve compile times at the expense of shade-time performance.

* `optix_merge_layer_funcs`: Allow layer functions that are only called once to
be merged into their caller, even if `optix_no_inline_layer_funcs` is set. This
can help restore some of the shade-time performance lost by enabling
`optix_no_inline_layer_funcs`.

* `optix_no_inline_rend_lib`: Don't inline any functions defined in the
renderer-supplied `rend_lib` module. As an alternative, the renderer can simply
not supply the LLVM bitcode for the `rend_lib` module to the `ShadingSystem`.

Inline/Noinline Function Registration
-------------------------------------

In addition to the `ShadingSystem` attributes, individual functions can be
registered with the `ShadingSystem` as `inline` or `noinline`. Functions can
be unregistered to restore the default inlining behavior. This registration
takes precedence over the `ShadingSystem` inlining attributes, which allows
very fine-grained control when needed.

```C++
// Register
shadingsys->register_inline_function(ustring("osl_abs_ff"));
shadingsys->register_noinline_function(ustring("osl_gabornoise_dfdfdf"));

// Unregister
shadingsys->unregister_inline_function(ustring("osl_abs_ff"));
shadingsys->unregister_noinline_function(ustring("osl_gabornoise_dfdfdf"));
```

It might be best to prefer the `ShadingSystem` attributes to control the inlining
behavior, and to strategically register functions when it is known to be
beneficial through benchmarking and profiling.

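
The interaction between the attributes and per-function registration can be
modeled with a small Python sketch. This is only an illustration of the rules
as described in this document, not the logic used inside OSL. The function
`inline_decision`, its parameters, the notion of a numeric `size`, and the
ordering among the attributes themselves are all assumptions; the only ordering
taken from the text is that registration takes precedence over the attributes.

```python
def inline_decision(name, size, *, no_inline=False,
                    no_inline_thresh=None, force_inline_thresh=None,
                    inline_registered=frozenset(),
                    noinline_registered=frozenset()):
    """Return "inline", "noinline", or "default" for one library function.

    Hypothetical model: `size` is whatever size measure the thresholds are
    compared against, and "default" means the choice is left to the LLVM
    optimizer.
    """
    # Explicit registration takes precedence over the attributes.
    if name in noinline_registered:
        return "noinline"
    if name in inline_registered:
        return "inline"

    # Assumed ordering among the attributes (not stated in the document).
    if no_inline:
        return "noinline"
    if force_inline_thresh is not None and size <= force_inline_thresh:
        return "inline"
    if no_inline_thresh is not None and size >= no_inline_thresh:
        return "noinline"
    return "default"
```

Under this model, a registered function such as `osl_abs_ff` would still be
inlined when `no_inline` is set, which is the kind of fine-grained override the
registration mechanism is meant to provide.
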

Tuning and Analysis
-------------------

We have added a Python script (`src/build-scripts/analyze-ptx.py`) to help
identify functions that might be good candidates for inlining/noinlining. This
script will generate a summary of the functions in the input PTX file, with a
list of all functions and their sizes in CSV format. It will also generate a
graphical representation of the callgraph in DOT and PDF format.

An example tuning workflow might include the following steps:

1. Run `analyze-ptx.py` on the "shadeops" and "rend_lib" PTX files to generate
a list of the functions contained in those modules.

```bash
$ analyze-ptx.py shadeop_cuda.ptx
$ analyze-ptx.py rend_lib_myrender.ptx
```

2. Run `analyze-ptx.py` on the generated PTX for a representative shader:

```bash
$ analyze-ptx.py myshader.ptx
```

3. View the summary file (`myshader-summary.txt`) and the callgraph
(`myshader-callgraph.gv`) to determine which library functions were _not_
inlined. They will appear as boxes with a dashed outline in the callgraph.

In particular, be on the lookout for trivial functions (e.g., `osl_floor_ff`)
which have not been inlined. If such functions appear, that might be a sign
that the inline thresholds need to be adjusted, or that it might be
beneficial to register specific functions.
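
The last step can also be approximated with a few lines of Python when a quick
check is all that is needed. The sketch below is not `analyze-ptx.py` and does
not reproduce its output. It is a rough, hypothetical approximation that
estimates each function's size by counting instruction lines in its body and
then lists small library functions that a shader still calls. The names
`function_sizes`, `called_functions`, and `small_uninlined` are made up for
this example, and the regular expressions only cover common PTX layouts.

```python
import re

# Name following a .func/.entry keyword, skipping an optional return parameter.
FUNC_RE = re.compile(r"\.(?:func|entry)\b(?:\s*\([^)]*\))?\s*([A-Za-z_$][\w$]*)")
# Callee of a direct call; the call may be split across several lines.
CALL_RE = re.compile(r"\bcall(?:\.uni)?\s+(?:\([^)]*\)\s*,\s*)?([A-Za-z_$][\w$]*)")


def strip_comments(ptx_text):
    """Drop // comments so they cannot confuse the patterns above."""
    return "\n".join(line.split("//", 1)[0] for line in ptx_text.splitlines())


def function_sizes(ptx_text):
    """Return {name: instruction line count} for functions that have a body."""
    sizes = {}
    current = None  # function whose header was seen most recently
    depth = 0       # current brace nesting depth
    for raw in strip_comments(ptx_text).splitlines():
        line = raw.strip()
        if not line:
            continue
        if depth == 0:
            match = FUNC_RE.search(line)
            if match:
                current = match.group(1)
        opens, closes = line.count("{"), line.count("}")
        # Inside a body, count statements but skip directives such as .reg.
        if (depth > 0 and current is not None
                and line.endswith(";") and not line.startswith(".")):
            sizes[current] += 1
        if depth == 0 and opens and current is not None:
            sizes.setdefault(current, 0)
        depth += opens - closes
        # A closed body or a bare declaration ends the current function.
        if depth == 0 and (closes or line.endswith(";")):
            current = None
    return sizes


def called_functions(ptx_text):
    """Return the set of functions that are called directly by name."""
    return set(CALL_RE.findall(strip_comments(ptx_text)))


def small_uninlined(lib_sizes, shader_ptx, max_size):
    """List library functions of at most max_size that the shader still calls."""
    return sorted(name for name in called_functions(shader_ptx)
                  if lib_sizes.get(name, max_size + 1) <= max_size)
```

A typical use would be to pass the text of the "shadeops" PTX file to
`function_sizes`, then call `small_uninlined` with the text of a shader's PTX
and a small size limit. Any names it reports are candidates for registration or
for a threshold adjustment, to be confirmed by benchmarking.
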