Why is DirectML 2x as slow as CUDA on Nvidia 1080 GTX (actually Nvidia Quadro P5000)? #14353
-
I am running Stable Diffusion converted to float16 using ONNX Runtime, and I am getting about 1.5 seconds per cycle when using DirectML. So I installed the CUDA files (not easy, as there are 10 DLLs to add, which together with the GPU ONNX DLL come to nearly 2 GB) and now it is 0.7 seconds. Twice as fast. Am I doing something wrong, or is the DirectML framework just badly optimised for the Nvidia 1080 GTX? Specifically I have an Nvidia Quadro P5000, but it's supposed to be equivalent.

Also, DirectML takes 10 seconds to load the UNet ONNX model, while CUDA takes only about 5 seconds. Obviously I would prefer to use DirectML, but a 50% speed reduction doesn't seem good.

BTW, I am running it in Unity C# with a DirectX 12 environment, if that's important. I am using the Shadow PC cloud service.

So my question is: have I missed some vital setting to make DirectML run 2x as fast? Or is it just how things are? I would prefer to use DirectML because the CUDA runtime is up to 2 GB!
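For anyone comparing the two backends, here is a minimal sketch of how execution-provider selection looks in ONNX Runtime. Python is shown for brevity (the C# API exposes the same concepts through `SessionOptions`/`AppendExecutionProvider`); the model path `unet.onnx` is a placeholder, and the fallback branch is only there so the snippet runs even where onnxruntime is not installed.

```python
# Sketch: choosing between the DirectML and CUDA execution providers.
# "unet.onnx" is a placeholder path, not a file from this thread.
import os

# Preference order: try DirectML first, fall back to CUDA, then CPU.
preferred = ["DmlExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

try:
    import onnxruntime as ort
    # Keep only the providers this onnxruntime build actually ships with.
    available = ort.get_available_providers()
    providers = [p for p in preferred if p in available]
    if os.path.exists("unet.onnx"):
        session = ort.InferenceSession("unet.onnx", providers=providers)
        print(session.get_providers())
    else:
        print(providers)
except ImportError:
    # onnxruntime not installed here; just show the intended order.
    print(preferred)
```

The relevant point for the size complaint above: the DirectML package is a single small EP, while the CUDA EP pulls in the large cuDNN/cuBLAS DLLs, which is where the ~2 GB comes from.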
Replies: 2 comments 2 replies
-
@fdwr (In case you know offhand)
-
I have solved this problem by updating to the latest build of ONNX Runtime 1.14. There is still a slight problem: it becomes slower than CUDA again if you change the batch size of your input, whereas CUDA lets you change the batch size with no penalty.
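The batch-size penalty is consistent with DirectML compiling kernels for concrete input shapes, so a new batch size can trigger recompilation. A common workaround (sketched below in plain Python; `pad_batch` and `unpad_outputs` are hypothetical helpers, not ONNX Runtime API) is to pad every request up to a fixed maximum batch so the session always sees the same shape, then discard the padded rows.

```python
# Sketch of a fixed-batch workaround. The session is compiled once for
# MAX_BATCH; smaller requests are padded up to that size and the extra
# outputs are thrown away afterwards.

MAX_BATCH = 4

def pad_batch(batch, pad_item):
    """Pad a list of inputs up to MAX_BATCH with copies of pad_item."""
    if len(batch) > MAX_BATCH:
        raise ValueError("batch larger than compiled MAX_BATCH")
    real = len(batch)
    return batch + [pad_item] * (MAX_BATCH - real), real

def unpad_outputs(outputs, real):
    """Keep only the outputs for the real (unpadded) inputs."""
    return outputs[:real]

padded, n = pad_batch(["a", "b"], pad_item="a")
# padded always has MAX_BATCH elements, so the shape stays constant
results = [x.upper() for x in padded]   # stand-in for session.Run(...)
print(unpad_outputs(results, n))        # ['A', 'B']
```

The cost is wasted compute on the padding rows, which may or may not beat paying the recompilation once per new batch size.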