I have gotten the CUDA interface to work in Python now, and it's way faster than what I was doing before. However, I am seeing a lot of memory allocations every time I call the DLL.
I was messing around with the NVIDIA Performance Primitives (NPP), and many of those functions require a pre-allocated scratch buffer: I query the required size, allocate the device array once, and then pass it in with every call, so all of the memory is set up before the library runs.
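The pattern looks roughly like this (a minimal sketch of the NPP scratch-buffer idiom, using nppsSum_32f as an example; check npps.h for the exact signatures in your CUDA version):

```cpp
// Minimal sketch of the NPP scratch-buffer pattern (nppsSum_32f as an example;
// exact function names/signatures may vary slightly between CUDA versions).
#include <npp.h>
#include <cuda_runtime.h>

int main()
{
    int const n = 1 << 20;

    // Device input and device output for the reduction.
    Npp32f * d_src = nullptr;
    Npp32f * d_sum = nullptr;
    cudaMalloc((void **)&d_src, n * sizeof(Npp32f));
    cudaMalloc((void **)&d_sum, sizeof(Npp32f));
    cudaMemset(d_src, 0, n * sizeof(Npp32f));

    // 1) Ask NPP how much scratch space this reduction needs.
    int buffer_size = 0;
    nppsSumGetBufferSize_32f(n, &buffer_size);

    // 2) Allocate the scratch buffer once, up front.
    Npp8u * d_scratch = nullptr;
    cudaMalloc((void **)&d_scratch, buffer_size);

    // 3) Every subsequent call reuses the same buffer -- no allocations inside NPP.
    for (int i = 0; i < 100; ++i)
        nppsSum_32f(d_src, n, d_sum, d_scratch);

    cudaFree(d_scratch);
    cudaFree(d_sum);
    cudaFree(d_src);
    return 0;
}
```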
So what I am asking is the following:
Am I correct in assuming that the CUDA interface does not accept pointers to pre-allocated device memory for everything it needs during the calculation, and instead allocates that memory internally on every call?
Is there, in theory, a way to change the interface so that the amount of "scratch buffer" Gpufit needs can be precomputed, allocated up front, and passed in, eliminating some of the per-call memory allocations? (A sketch of the kind of interface I have in mind follows below.)
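Concretely, I am imagining something like this. None of these functions exist in Gpufit today; the names and signatures are made up just to show the shape of the idea, modeled on the NPP pattern:

```cpp
// Hypothetical extension of the Gpufit C interface -- these functions do NOT
// exist; this is only a sketch of the kind of API I have in mind.
#include <cstddef>

extern "C"
{
    // Would report how many bytes of device scratch space a fit of this size
    // and model needs (per-fit working arrays, Jacobians, deltas, etc.).
    int gpufit_get_workspace_size(std::size_t n_fits,
                                  std::size_t n_points,
                                  int model_id,
                                  std::size_t * workspace_size_bytes);

    // Same idea as gpufit(), but working entirely inside a caller-provided
    // device buffer instead of calling cudaMalloc internally.
    int gpufit_with_workspace(std::size_t n_fits,
                              std::size_t n_points,
                              float * data,
                              /* ... the rest of the existing gpufit() arguments ... */
                              void * device_workspace,
                              std::size_t workspace_size_bytes);
}
```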
Looking at the traces, roughly 50% of a call to Gpufit's CUDA interface is spent in memory allocations, whereas with the NPP calls I never see any allocations at all, because I give them a pre-allocated scratch buffer.
I'm willing to look into this myself and contribute; it would just be helpful to hear some thoughts on it first.
EDIT: oh, pybind11. Are there any performance advantages to using pybind11 instead of ctypes? I just used ctypes, but I was interested in the idea that pybind11 requires you to add binding code on the C++ side, so I would assume the integration is a bit tighter.
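For reference, this is the kind of thing I mean by modifying the C++ source; a generic minimal pybind11 module with made-up names, not Gpufit's actual binding:

```cpp
// Generic minimal pybind11 module -- names are made up, unrelated to Gpufit.
// pybind11 generates the argument conversion at compile time, whereas ctypes
// builds the foreign call at runtime from Python-level type descriptions.
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Example wrapper: sums a 1-D float32 NumPy array on the C++ side.
double sum_array(py::array_t<float, py::array::c_style | py::array::forcecast> a)
{
    auto buf = a.unchecked<1>();  // direct, bounds-unchecked view of the data
    double total = 0.0;
    for (py::ssize_t i = 0; i < buf.shape(0); ++i)
        total += buf(i);
    return total;
}

PYBIND11_MODULE(example_ext, m)
{
    m.doc() = "Minimal pybind11 example";
    m.def("sum_array", &sum_array, "Sum a 1-D float32 array");
}
```

Building that module lets you call `example_ext.sum_array(arr)` directly from Python, instead of declaring argtypes/restype by hand as with ctypes.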