## Description

This PR is an overhaul of PTX compilation in OSL. The goal is to minimize the size of the generated PTX to help reduce time spent in subsequent stages (e.g., module creation, pipeline linking, etc.).

### rend_lib

The main functional change is in how the functions defined in rend_lib.cu are handled. In the current setup, rend_lib.cu is compiled to a standalone PTX file using nvcc. The functions are considered `extern` in generated shaders, and the dependencies are satisfied when the final pipeline is linked together.

Since the functions are `extern`, they are not available for optimization when each shader is generated. This prevents the inlining of small functions, which incurs a lot of function call overhead in terms of PTX size and run-time cost.

So instead of compiling rend_lib.cu separately, it is compiled to LLVM bitcode along with the rest of the "shadeops" sources. This bitcode is used to seed the LLVM Module for each shader. This makes the definitions available when each shader is being generated and optimized, which allows the optimizer to make better decisions. It also allows for better control of inlining decisions which affect the size and quality of the generated PTX.

### shadeops

Another significant functional change is in how the shadeops functions (including those defined in rend_lib.cu) are treated. In the current pipeline, each shader must carry along with it definitions for all non-inlined shadeops functions that are used. This bloats the size of the PTX for each shader, and results in many duplicate copies of those functions existing in the final linked pipeline.

In the new pipeline, the unified shadeops+rend_lib bitcode is compiled to PTX using the offline `llc` tool. This PTX is used by the renderer to create a single "shadeops" module that provides the definitions for those functions. This makes it possible to drop the definitions for those functions from the PTX for each shader, which can significantly reduce the size of the generated PTX.

### quirks

This PR contains some interesting "quirks" which are largely fallout from compiling rend_lib.cu using clang instead of nvcc:

- The clang-generated PTX is processed by a Python script to transform some function & symbol declarations to better match what was generated by nvcc. It might be possible to add the right decorators to the C++/CUDA code to get the correct visibility for these symbols, but I was not successful in doing so.
- It was necessary to change the function signatures for some of the rend_lib functions to use `void*` instead of the actual OSL types (e.g., `OSL::Color3*`). Trying to use the actual types produces an LLVM assertion failure, for reasons that I don't understand.
- I have enabled flush-to-zero (FTZ) for clang-compiled code, to better match the behavior of nvcc. I added a CMake variable to enable or disable FTZ in the event that somebody wants to disable it.
- I have added some alignment hints on pointer accesses in rend_lib.cu. clang does not appear to be able to deduce the alignment of pointers in many cases, which turns things like `memcpy` and `memset` into long series of byte operations. Functions like `osl_get_matrix` are bloated substantially if the hints aren't provided.

---------

Signed-off-by: Tim Grant <[email protected]>
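
For readers unfamiliar with the alignment-hint quirk above, here is a minimal sketch of the general technique in plain C++. It is not code from rend_lib.cu: the `Matrix44` struct, the `copy_matrix_aligned` helper, and the 16-byte alignment are all assumptions chosen for illustration. The only real API used is the GCC/Clang builtin `__builtin_assume_aligned`. The helper takes `void*` parameters, loosely mirroring the signature change described in the second bullet.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 4x4 float matrix; not the actual OSL type.
struct alignas(16) Matrix44 {
    float m[16];
};

// Hypothetical helper (not from rend_lib.cu). Without a hint, the compiler
// may have to assume byte alignment for a void* and can lower the memcpy
// into many narrow loads and stores.
inline void copy_matrix_aligned(void* dst, const void* src)
{
    // Promising an alignment that does not hold is undefined behavior,
    // so check the assumption in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);

    // __builtin_assume_aligned returns the same pointer, annotated with
    // the promised alignment so the optimizer can use wide accesses.
    float* d = static_cast<float*>(__builtin_assume_aligned(dst, 16));
    const float* s =
        static_cast<const float*>(__builtin_assume_aligned(src, 16));

    std::memcpy(d, s, sizeof(float) * 16);
}
```

Calling `copy_matrix_aligned(&b, &a)` on two `Matrix44` objects copies all 16 floats. Whether the hint actually shrinks the generated PTX depends on the compiler version and target, so it is worth inspecting the output for the functions of interest.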