Why is DirectML 2x as slow as CUDA on Nvidia 1080 GTX (actually Nvidia Quadro P5000)? #14353
-
I am running Stable Diffusion converted to float16 using ONNX Runtime, and I am getting about 1.5 seconds per cycle when using DirectML. So I installed the CUDA files (not easy, as there are 10 DLLs to add, which together with the GPU ONNX DLL come to nearly 2 GB) and now it is 0.7 seconds. Twice as fast. Am I doing something wrong, or is the DirectML framework just badly optimised for the Nvidia 1080 GTX? Specifically I have an Nvidia Quadro P5000, but it's supposed to be equivalent.

Also, DirectML takes 10 seconds to load the UNet ONNX model, while CUDA takes only about 5 seconds. Obviously I would prefer to use DirectML, but a 50% speed reduction doesn't seem good.

BTW, I am running it in Unity C# with a DirectX 12 environment, if that's important. I am using the Shadow PC cloud service.

So my question is: have I missed some vital setting to make DirectML run 2x as fast? Or is it just how things are? I would prefer to use DirectML because the CUDA runtime is up to 2 GB!
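For anyone comparing the two backends, here is a minimal sketch of how execution-provider selection looks in ONNX Runtime. Python is shown for brevity (the C# API exposes the same concepts through `SessionOptions`/`AppendExecutionProvider`); the model path `unet.onnx` is a placeholder, and the fallback branch is only there so the snippet runs even where onnxruntime is not installed.

```python
# Sketch: choosing between the DirectML and CUDA execution providers.
# "unet.onnx" is a placeholder path, not a file from this thread.
import os

# Preference order: try DirectML first, fall back to CUDA, then CPU.
preferred = ["DmlExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

try:
    import onnxruntime as ort
    # Keep only the providers this onnxruntime build actually ships with.
    available = ort.get_available_providers()
    providers = [p for p in preferred if p in available]
    if os.path.exists("unet.onnx"):
        session = ort.InferenceSession("unet.onnx", providers=providers)
        print(session.get_providers())
    else:
        print(providers)
except ImportError:
    # onnxruntime not installed here; just show the intended order.
    print(preferred)
```

The relevant point for the size complaint above: the DirectML package is a single small EP, while the CUDA EP pulls in the large cuDNN/cuBLAS DLLs, which is where the ~2 GB comes from.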
Replies: 2 comments 2 replies
-
@fdwr (In case you know offhand)
-
I have solved this problem by updating to the latest build of ONNX Runtime 1.14. There is still a slight problem: it becomes slower than CUDA again if you change the batch size of your input, whereas CUDA lets you change the batch size with no penalty.
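The batch-size penalty is consistent with DirectML compiling kernels for concrete input shapes, so a new batch size can trigger recompilation. A common workaround (sketched below in plain Python; `pad_batch` and `unpad_outputs` are hypothetical helpers, not ONNX Runtime API) is to pad every request up to a fixed maximum batch so the session always sees the same shape, then discard the padded rows.

```python
# Sketch of a fixed-batch workaround. The session is compiled once for
# MAX_BATCH; smaller requests are padded up to that size and the extra
# outputs are thrown away afterwards.

MAX_BATCH = 4

def pad_batch(batch, pad_item):
    """Pad a list of inputs up to MAX_BATCH with copies of pad_item."""
    if len(batch) > MAX_BATCH:
        raise ValueError("batch larger than compiled MAX_BATCH")
    real = len(batch)
    return batch + [pad_item] * (MAX_BATCH - real), real

def unpad_outputs(outputs, real):
    """Keep only the outputs for the real (unpadded) inputs."""
    return outputs[:real]

padded, n = pad_batch(["a", "b"], pad_item="a")
# padded always has MAX_BATCH elements, so the shape stays constant
results = [x.upper() for x in padded]   # stand-in for session.Run(...)
print(unpad_outputs(results, n))        # ['A', 'B']
```

The cost is wasted compute on the padding rows, which may or may not beat paying the recompilation once per new batch size.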