
Modify curand_init for 2x performance improvement #2

Open

rogerallen opened this issue Apr 16, 2019 · 7 comments

Comments

@rogerallen
Owner

Change

curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

to

curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);

for 2x speedup. Some info at: https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes

The first call uses a fixed random seed, a different sequence id per thread, and a fixed offset into each sequence. That creates a different sequence per thread (more overhead).

The second call uses a different seed per thread with the same sequence id and offset. I think this means it only sets up one sequence position for all the threads, and the different seeds provide enough randomness without so much overhead.

I had tried this out when I originally created the code, but read the instructions too quickly & messed up. I modified the 3rd parameter, not the first. Doh!
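For reference, a minimal sketch of how the two variants look inside the per-pixel init kernel (the kernel and variable names roughly follow the book's code, but this is only illustrative):

```cuda
#include <curand_kernel.h>

__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if ((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j * max_x + i;

    // Original: same seed for every thread, a different sequence id per
    // thread. cuRAND has to skip ahead to carve out an independent
    // subsequence for each thread, which is the expensive part.
    // curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);

    // Faster: a different seed per thread, sequence 0, offset 0.
    curand_init(1984 + pixel_index, 0, 0, &rand_state[pixel_index]);
}
```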

@rogerallen rogerallen changed the title Modify Modify curand_init for performance improvement Apr 16, 2019
@rogerallen rogerallen changed the title Modify curand_init for performance improvement Modify curand_init for 2x performance improvement Jul 2, 2019
@artmortal93

Hi there, I also found that using the old syntax:
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);
generates an error that makes cudaDeviceSynchronize return error code 4, but after changing to the new syntax everything works fine. I am testing the code on a GTX 1060 on Windows; it seems related to the post below:
https://stackoverflow.com/questions/42607562/curand-init-produces-memory-access-errors
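For anyone trying to pin down which call actually fails rather than seeing the error surface later at cudaDeviceSynchronize, a minimal error-checking wrapper along the lines of the book's checkCudaErrors macro might look like this (names and the exit code here are just an example, not necessarily identical to the repo's version):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report the file/line of the first failing CUDA call.
#define checkCudaErrors(val) check_cuda((val), #val, __FILE__, __LINE__)

void check_cuda(cudaError_t result, const char *func, const char *file, int line) {
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA error %d (%s) at %s:%d '%s'\n",
                (int)result, cudaGetErrorString(result), file, line, func);
        cudaDeviceReset();
        exit(99);
    }
}

// Typical usage after a kernel launch:
//   render_init<<<blocks, threads>>>(nx, ny, d_rand_state);
//   checkCudaErrors(cudaGetLastError());
//   checkCudaErrors(cudaDeviceSynchronize());
```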

@rogerallen
Owner Author

Thanks for the report. That Stack Overflow discussion concluded that the Nsight debugger was at fault. Is that the case for you as well?

@r-gr

r-gr commented Jun 8, 2020

In my case, changing the original line to
curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
causes a cudaErrorLaunchFailure (719) at run time in the cudaDeviceSynchronize() after the render<<<>>>() call.

The original version executes successfully without any issue.

Bizarrely, after changing the line back to the original version,
curand_init(1984, pixel_index, 0, &rand_state[pixel_index]);,
recompiling and running, the CUDA error 719 persists in the same place every time until I do one of a few things:

  • reboot my machine
  • wait about 15-20 minutes
  • reduce the image dimensions (e.g. 800x450 -> 400x225)
  • reduce the number of samples per pixel (e.g. 100 -> 60)

In the first two cases, the original version then runs successfully every time; in the latter two cases, reverting those changes results in the error again, even for the original version.

I have absolutely no idea what is going on here, but it seems like the modified version of the curand_init() call breaks something to do with the device's memory at the driver level, so rebooting or waiting for some background cleanup operation to run fixes whatever was broken.

Sorry I can't be more specific, I've only just started learning about this stuff and I'm not yet familiar with the debugging tools etc. I also normally do all development work on Unix-y systems but I've been doing this on Windows. Happy to dig further into this if anyone can point me in the right direction.

Windows 10, GTX 760 2GB, compiling with the Visual Studio 2019 CUDA tools.

@rogerallen
Owner Author

@r-gr Since it seems related to the complexity of the shader, can you look at whether TDR timeouts might be the issue? See #7 (comment)

An error 719 does seem rather serious and could indicate that maybe this code doesn't work well on Kepler (GK10x) class chips? I personally have not tried. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html
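As a quick way to check whether the display watchdog can even apply on a given GPU, the CUDA runtime exposes a per-device flag; a small standalone sketch (not part of this repo's code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Non-zero means the OS/driver can kill kernels that run too long on
    // this device (typical when the same GPU is driving a display on Windows).
    printf("%s: kernelExecTimeoutEnabled = %d\n",
           prop.name, prop.kernelExecTimeoutEnabled);
    return 0;
}
```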

@r-gr

r-gr commented Jun 9, 2020

@rogerallen Increasing the timeout does appear to allow the modified version to execute successfully. Thanks!

This might be outside the scope of this discussion but this finding immediately leads to the question: how might one go about writing software which utilises the GPU for computation but which is stable across the full range of compatible hardware?

When writing software to run on the CPU under an OS, it's taken for granted that no matter how complex and long-running the computations, the OS scheduler will have some fairness mechanism so that one process doesn't starve all the others of resources. Is there any such mechanism for the GPU?

In the context of this ray tracer, I could manually split the render<<<>>>() call into multiple calls, each covering a smaller block of pixels... But is there a programmatic way to split up GPU computations so that they run in reasonably sized chunks in order to best utilise the most powerful hardware while avoiding hitting the timeout limits on weaker hardware?

Edit: reading the Blender documentation on GPU rendering, it seems to imply that it's really a matter of hand-picking a sensible amount of computations to send the GPU at one time (the tile size). If I'm not just completely wrong and misunderstanding this, is anyone working on some kind of fairness-based work scheduling system for GPUs? Would such a system even make sense in existing hardware or would there be issues like context switching being prohibitively expensive?
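As a rough illustration of the manual splitting idea described above: launch the kernel once per band of rows and synchronize between launches, so no single launch outlives the driver's timeout. This is only a sketch; the render_band kernel and its parameters are hypothetical, not code from this repo.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: renders rows [row_start, row_start + rows) of an
// nx-by-ny image into fb. The real render<<<>>>() in this repo would need
// its camera, world, and rand_state parameters threaded through the same way.
__global__ void render_band(float *fb, int nx, int ny, int row_start, int rows);

void render_in_bands(float *fb, int nx, int ny) {
    const int band_rows = 64;   // tune: smaller bands mean shorter launches
    dim3 threads(8, 8);
    for (int row = 0; row < ny; row += band_rows) {
        int rows = (row + band_rows <= ny) ? band_rows : (ny - row);
        dim3 blocks((nx + threads.x - 1) / threads.x,
                    (rows + threads.y - 1) / threads.y);
        render_band<<<blocks, threads>>>(fb, nx, ny, row, rows);
        cudaGetLastError();        // catch launch-configuration errors
        cudaDeviceSynchronize();   // keep each launch short; hand control back to the driver
    }
}
```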

@rogerallen
Owner Author

Glad to hear that helped.

Yeah, this is a rather involved discussion and answers will depend on details of your program. You are basically on the right track, but this seems like a better topic for NVIDIA CUDA Developer forums. https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206

I will say that NVIDIA GPUs do have improved context switching capabilities since the Kepler days. E.g. https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10

rogerallen added a commit that referenced this issue Oct 12, 2020
- fix issue #2 - for this branch only.
- fix issue #8 - forgot to free d_rand_state2
@rogerallen
Owner Author

Resolved in the ch12_where_next_cuda branch. Leaving open as this is not fixed in all branches.
