Modify curand_init for 2x performance improvement #2
Comments
Hi there, I also found that using the old syntax:
Thanks for the report. That Stack Overflow discussion concluded that the Nsight debugger was at fault. Is that your case as well?
In my case, changing the original line to the modified version gives an error, whereas the original version executes successfully without any issue. Bizarrely, after changing the line back to the original version, the behaviour depends on what else was changed alongside it.
In the first two cases, the original version then runs successfully every time; in the latter two cases, reverting those changes results in the error again, even for the original version. I have absolutely no idea what is going on here, but it seems like the modified version of the line is somehow involved.

Sorry I can't be more specific; I've only just started learning about this stuff and I'm not yet familiar with the debugging tools etc. I also normally do all development work on Unix-y systems, but I've been doing this on Windows. Happy to dig further into this if anyone can point me in the right direction.

Windows 10, GTX 760 2GB, compiling with the Visual Studio 2019 CUDA tools.
@r-gr Since it seems related to the complexity of the shader, can you look at whether TDR timeouts might be the issue? See #7 (comment).

An error 719 does seem rather serious and could indicate that this code doesn't work well on Kepler (GK10x) class chips; I personally have not tried. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
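For reference, here is a minimal error-checking sketch (not from this repo) that prints the symbolic error name after a launch, so a watchdog timeout (cudaErrorLaunchTimeout) can be told apart from an unspecified launch failure (in recent toolkits, code 719 is cudaErrorLaunchFailure). The checkCuda name and structure are illustrative, loosely modelled on the checkCudaErrors pattern common in CUDA samples.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Print the numeric code and symbolic name of a CUDA error, then exit.
// The checkCuda name is an assumption for this sketch, not this repo's macro.
#define checkCuda(val) checkCudaImpl((val), #val, __FILE__, __LINE__)

inline void checkCudaImpl(cudaError_t result, const char *expr,
                          const char *file, int line) {
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA error %d (%s) at %s:%d for '%s'\n",
                static_cast<int>(result), cudaGetErrorName(result),
                file, line, expr);
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }
}

// Typical use after a kernel launch:
//   render<<<blocks, threads>>>(/* ... */);
//   checkCuda(cudaGetLastError());
//   checkCuda(cudaDeviceSynchronize());
```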
@rogerallen Increasing the timeout does appear to allow the modified version to execute successfully. Thanks!

This might be outside the scope of this discussion, but this finding immediately leads to the question: how might one go about writing software which utilises the GPU for computation but which is stable across the full range of compatible hardware? When writing software to run on the CPU under an OS, it's taken for granted that no matter how complex and long-running the computations, the OS scheduler will have some fairness mechanism so that one process doesn't starve all the others of resources. Is there any such mechanism for the GPU? In the context of this ray tracer, I could manually split the work into smaller chunks.

Edit: reading the Blender documentation on GPU rendering, it seems to imply that it's really a matter of hand-picking a sensible amount of computation to send to the GPU at one time (the tile size). If I'm not just completely wrong and misunderstanding this, is anyone working on some kind of fairness-based work scheduling system for GPUs? Would such a system even make sense on existing hardware, or would there be issues like context switching being prohibitively expensive?
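A minimal, self-contained sketch of that tile idea, under stated assumptions: the heavy_pixel kernel below is only a stand-in for the ray-tracing work, not this repo's render kernel. The point is that each launch covers one tile and finishes quickly, so no single launch approaches the display watchdog (TDR) limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for an expensive per-pixel computation (e.g. tracing many samples).
__global__ void heavy_pixel(float *fb, int nx, int ny, int x0, int y0) {
    int i = x0 + threadIdx.x + blockIdx.x * blockDim.x;
    int j = y0 + threadIdx.y + blockIdx.y * blockDim.y;
    if (i >= nx || j >= ny) return;
    float v = 0.0f;
    for (int k = 0; k < 100000; k++)          // placeholder for heavy work
        v += sinf(i * 0.001f + k) * cosf(j * 0.001f + k);
    fb[j * nx + i] = v;
}

int main() {
    const int nx = 1200, ny = 800, tile = 256;
    float *fb;
    cudaMallocManaged(&fb, nx * ny * sizeof(float));

    dim3 threads(16, 16);
    dim3 blocks(tile / threads.x, tile / threads.y);
    // One launch per tile; each launch stays well under the watchdog limit.
    for (int y0 = 0; y0 < ny; y0 += tile) {
        for (int x0 = 0; x0 < nx; x0 += tile) {
            heavy_pixel<<<blocks, threads>>>(fb, nx, ny, x0, y0);
            cudaDeviceSynchronize();
        }
    }
    printf("fb[0] = %f\n", fb[0]);
    cudaFree(fb);
    return 0;
}
```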
Glad to hear that helped. Yeah, this is a rather involved discussion and the answers will depend on the details of your program. You are basically on the right track, but this seems like a better topic for the NVIDIA CUDA Developer forums: https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206

I will say that NVIDIA GPUs have had improved context-switching capabilities since the Kepler days, e.g. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10
Resolved in the ch12_where_next_cuda branch. Leaving open as this is not fixed in all branches.
Change
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
to
curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
for a 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes
The first call has a fixed random seed, a different sequence id per thread, and a fixed offset into that sequence. That creates a different sequence per thread (more overhead).
The second call has a different seed per thread and the same sequence & offset. I think this means it only generates one sequence for all the threads, and the different seeds allow for enough randomness without so much overhead.
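As a concrete illustration, here is a minimal sketch of how the change sits inside a per-pixel RNG init kernel. The kernel and variable names (render_init, max_x, max_y, rand_state) follow the book's CUDA port and are assumptions here, not necessarily this repo's exact code.

```cuda
#include <curand_kernel.h>

// Per-pixel RNG initialization; names assumed from the book's CUDA port.
__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if ((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j * max_x + i;

    // Original: one shared seed, a distinct sequence id per thread. cuRAND
    // must skip ahead within the generator to give each thread its own
    // subsequence, which makes curand_init() expensive.
    // curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

    // Faster: a distinct seed per thread, sequence 0, offset 0. No skip-ahead
    // work, at the cost of weaker statistical separation between threads.
    curand_init(1984 + pixel_index, 0, 0, &rand_state[pixel_index]);
}
```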
I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!