Modify curand_init for 2x performance improvement #2
Comments
Hi there, I also found that using the old syntax:
Thanks for the report. That Stack Overflow discussion concluded that the Nsight debugger was at fault. Is that your case as well?
In my case, changing the original line to the modified version gives an error, whereas the original version executes successfully without any issue. Bizarrely, after changing the line back to the original version, the behaviour depends on what else was changed alongside it.
In the first two cases, the original version then runs successfully every time; in the latter two cases, reverting those changes results in the error again, even for the original version. I have absolutely no idea what is going on here, but it seems like the modified version of the line is somehow involved.

Sorry I can't be more specific; I've only just started learning about this stuff and I'm not yet familiar with the debugging tools etc. I also normally do all development work on Unix-y systems, but I've been doing this on Windows. Happy to dig further into this if anyone can point me in the right direction.

Windows 10, GTX 760 2GB, compiling with the Visual Studio 2019 CUDA tools.
@r-gr Since it seems related to the complexity of the shader, can you look at whether TDR timeouts might be the issue? See #7 (comment).

An error 719 does seem rather serious and could indicate that this code doesn't work well on Kepler (GK10x) class chips; I personally have not tried. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
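For reference, here is a minimal error-checking sketch (not from this repo) that prints the symbolic error name after a launch, so a watchdog timeout (cudaErrorLaunchTimeout) can be told apart from an unspecified launch failure (in recent toolkits, code 719 is cudaErrorLaunchFailure). The checkCuda name and structure are illustrative, loosely modelled on the checkCudaErrors pattern common in CUDA samples.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the numeric code and symbolic name of a CUDA error, then exit.
// The checkCuda name is an assumption for this sketch, not this repo's macro.
#define checkCuda(val) checkCudaImpl((val), #val, __FILE__, __LINE__)

inline void checkCudaImpl(cudaError_t result, const char *expr,
                          const char *file, int line) {
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA error %d (%s) at %s:%d for '%s'\n",
                static_cast<int>(result), cudaGetErrorName(result),
                file, line, expr);
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }
}

// Typical use after a kernel launch:
//   render<<<blocks, threads>>>(/* ... */);
//   checkCuda(cudaGetLastError());
//   checkCuda(cudaDeviceSynchronize());
```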
@rogerallen Increasing the timeout does appear to allow the modified version to execute successfully. Thanks!

This might be outside the scope of this discussion, but this finding immediately leads to the question: how might one go about writing software which utilises the GPU for computation but which is stable across the full range of compatible hardware? When writing software to run on the CPU under an OS, it's taken for granted that no matter how complex and long-running the computations, the OS scheduler will have some fairness mechanism so that one process doesn't starve all the others of resources. Is there any such mechanism for the GPU? In the context of this ray tracer, I could manually split the work into smaller chunks.

Edit: reading the Blender documentation on GPU rendering, it seems to imply that it's really a matter of hand-picking a sensible amount of computation to send to the GPU at one time (the tile size). If I'm not just completely wrong and misunderstanding this, is anyone working on some kind of fairness-based work scheduling system for GPUs? Would such a system even make sense on existing hardware, or would there be issues like context switching being prohibitively expensive?
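A minimal, self-contained sketch of that tile idea, under stated assumptions: the heavy_pixel kernel below is only a stand-in for the ray-tracing work, not this repo's render kernel. The point is that each launch covers one tile and finishes quickly, so no single launch approaches the display watchdog (TDR) limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for an expensive per-pixel computation (e.g. tracing many samples).
__global__ void heavy_pixel(float *fb, int nx, int ny, int x0, int y0) {
    int i = x0 + threadIdx.x + blockIdx.x * blockDim.x;
    int j = y0 + threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= nx || j >= ny) return;
    float v = 0.0f;
    for (int k = 0; k < 100000; k++)          // placeholder for heavy work
        v += sinf(i * 0.001f + k) * cosf(j * 0.001f + k);
    fb[j * nx + i] = v;
}

int main() {
    const int nx = 1200, ny = 800, tile = 256;
    float *fb;
    cudaMallocManaged(&fb, nx * ny * sizeof(float));

    dim3 threads(16, 16);
    dim3 blocks(tile / threads.x, tile / threads.y);
    // One launch per tile; each launch stays well under the watchdog limit.
    for (int y0 = 0; y0 < ny; y0 += tile) {
        for (int x0 = 0; x0 < nx; x0 += tile) {
            heavy_pixel<<<blocks, threads>>>(fb, nx, ny, x0, y0);
            cudaDeviceSynchronize();
        }
    }
    printf("fb[0] = %f\n", fb[0]);
    cudaFree(fb);
    return 0;
}
```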
Glad to hear that helped. Yeah, this is a rather involved discussion and the answers will depend on the details of your program. You are basically on the right track, but this seems like a better topic for the NVIDIA CUDA Developer forums: https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206

I will say that NVIDIA GPUs have had improved context-switching capabilities since the Kepler days, e.g. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
Resolved in the ch12_where_next_cuda branch. Leaving open as this is not fixed in all branches.
Change
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
to
curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
for a 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes
The first call has a fixed random seed, a different sequence id per thread, and a fixed offset into that sequence. That creates a different sequence per thread (more overhead).
The second call has a different seed per thread and the same sequence & offset. I think this means it only generates one sequence for all the threads, and the different seeds allow for enough randomness without so much overhead.
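As a concrete illustration, here is a minimal sketch of how the change sits inside a per-pixel RNG init kernel. The kernel and variable names (render_init, max_x, max_y, rand_state) follow the book's CUDA port and are assumptions here, not necessarily this repo's exact code.

```cuda
#include <curand_kernel.h>

// Per-pixel RNG initialization; names assumed from the book's CUDA port.
__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if ((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j * max_x + i;

    // Original: one shared seed, a distinct sequence id per thread. cuRAND
    // must skip ahead within the generator to give each thread its own
    // subsequence, which makes curand_init() expensive.
    // curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

    // Faster: a distinct seed per thread, sequence 0, offset 0. No skip-ahead
    // work, at the cost of weaker statistical separation between threads.
    curand_init(1984 + pixel_index, 0, 0, &rand_state[pixel_index]);
}
```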
I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!