Process on GPU killed after long run and/or restart #348
Comments
I have also tested the code on the HPC cluster, where we are using Tesla V100s, and I run into the same errors. I have attached the corresponding error log here as well, but I don't think it gives us any new information.
See also inducer/pyopencl#562 (comment).
Hi, sorry about the slow response. Is it possible that the particles blow up and the domain size increases substantially? As for the restart, there may be an issue with restarting on the GPU because some necessary state has not been saved. Do you have a small reproducible example? That would help debug this issue.
I don't see any blow-up or large increase in the domain size when I look at the results of the last data frame. The crash happens after hours of simulation on the GPU (approximately 70k iterations). If I restart from the last output file on a CPU, it continues fine without a blow-up or a large increase in domain size. I have attached the files of the simulations that gave us these errors (I changed the .py extension to .txt, otherwise I couldn't include them in this message). I will check whether I also encounter this issue with a smaller example.
We have been looking into the issue ourselves for a while as well. Inducer mentioned in the comment above: "An unsigned integer underflow comes to mind as a possible reason." The only places where I found unsigned integers have to do with particle indices, which are used in e.g. neighbor lists. Since the code runs fine on a CPU and this error only occurs on a GPU, we thought it might be related to the specific implementation of neighbor lists on the GPU.
Might it be that neighbor list memory on the GPU is not dynamic and that the length of the neighbor lists has a fixed maximum? Say the maximum number of neighbors in the neighbor list is 30: if at some moment during the simulation the number of neighbors exceeds this value, we might end up with this kind of unsigned integer underflow problem. If there is such a hard cap on the number of neighbors, I could try changing it to a larger number and see if that solves the issue, but I couldn't find anything on that matter.
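To make the hypothesis above concrete, here is a minimal sketch of how an unsigned counter could wrap around if a neighbor buffer had a fixed capacity. This is purely illustrative and is not taken from PySPH or PyOpenCL: the function `remaining_slots` and the capacity of 30 are hypothetical, and whether such a cap exists in the GPU neighbor search is exactly the open question.

```python
import numpy as np


def remaining_slots(capacity, count):
    """Free slots in a fixed-size neighbor buffer, computed in uint32.

    Hypothetical helper for illustration only; it does not mirror any
    actual PySPH data structure.
    """
    cap = np.array([capacity], dtype=np.uint32)
    cnt = np.array([count], dtype=np.uint32)
    # Array arithmetic on unsigned integers wraps around silently,
    # much like unsigned arithmetic inside an OpenCL kernel.
    return int((cap - cnt)[0])


# Assumed cap of 30 neighbors (a made-up number, as in the comment above).
print(remaining_slots(30, 29))  # one slot left
print(remaining_slots(30, 31))  # one neighbor too many: wraps to a huge value
```

If a wrapped value like this were later used as a loop bound or an array offset, the kernel would read or write far outside its buffer, which could explain a crash that only shows up once some particle gains enough neighbors.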
Dear Prabhu,
The simulations continue perfectly fine on a CPU, but when I continue them on a GPU they crash. If it were a blow-up of particles, it would crash on the CPU as well as on the GPU, right? I also don't think it is related to some necessary state not being saved for the restart: when we run the simulation from the start, it also crashes at the same point every time, and restarting from there gives the same error log as a run straight from the start. Maybe we can look into this issue in more detail together?
Best,
Stephan
Dear developers,
I was using PySPH on my old GPU (GeForce GTX 680) for a while and I recently started using it on a newer GPU (NVIDIA TITAN V) as well. Unfortunately, there are some strange issues showing up, so hopefully someone can help me out here:
This problem doesn't appear on my GTX 680 but on that machine I am using an older version of PyOpenCL. I tried using this same older version of PyOpenCL on the TITAN V but it didn't solve the issue.
Best,
Stephan
gpu_err.log