C++ API, AMD GPU (Radeon Pro W5700).
Using @fdwr's sample as a guide (thanks!), I'm able to run our ONNX models with GPU tensors using D3D12 buffers.
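For context, the binding setup follows the sample roughly like this. This is a minimal sketch, not our exact code: the shape, the buffer, and the byte size are placeholders, and cleanup (`FreeGPUAllocation`) is only noted in a comment.

```cpp
#include <array>
#include <d3d12.h>
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Wrap an existing ID3D12Resource as an ORT tensor on the DML device (no copy).
Ort::Value MakeDmlTensor(ID3D12Resource* buffer, size_t bufferByteSize)
{
    const OrtDmlApi* ortDmlApi = nullptr;
    Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi(
        "DML", ORT_API_VERSION, reinterpret_cast<const void**>(&ortDmlApi)));

    // Wrap the D3D12 resource so ORT can treat it as a DML allocation.
    // The allocation must later be released with ortDmlApi->FreeGPUAllocation.
    void* dmlAllocation = nullptr;
    Ort::ThrowOnError(ortDmlApi->CreateGPUAllocationFromD3DResource(buffer, &dmlAllocation));

    // Create an OrtValue over that GPU memory.
    Ort::MemoryInfo dmlMemory("DML", OrtDeviceAllocator, /*device id*/ 0, OrtMemTypeDefault);
    std::array<int64_t, 2> shape{1, 1000};  // placeholder shape
    return Ort::Value::CreateTensor(dmlMemory, dmlAllocation, bufferByteSize,
                                    shape.data(), shape.size(),
                                    ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);
}
```

The resulting values are then bound through `Ort::IoBinding` (`BindInput`/`BindOutput`) before calling `session.Run()`.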
Timing the execution with both ORT's internal profiling option and `std::chrono::high_resolution_clock` (measurement sketch below the numbers), I get:
- Using CPU tensors:
  - `session.Run()` ~17 ms
  - `ioBinding.SynchronizeOutputs()` ~6 μs
- Using GPU tensors:
  - `session.Run()` ~150 μs
  - `ioBinding.SynchronizeOutputs()` ~16 ms
(Remark: these numbers are for the second call to `session.Run()`, as the first seems to carry some initialization overhead.)
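The `std::chrono` numbers come from a loop along these lines (a sketch; the first `Run()` is treated as warm-up, and `session`/`ioBinding` are set up as above):

```cpp
#include <chrono>
#include <cstdio>
#include <onnxruntime_cxx_api.h>

void TimeRun(Ort::Session& session, Ort::IoBinding& ioBinding)
{
    using Clock = std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    Ort::RunOptions runOptions;

    session.Run(runOptions, ioBinding);  // warm-up: 1st call has init overhead

    auto t0 = Clock::now();
    session.Run(runOptions, ioBinding);  // 2nd call: the one being measured
    auto t1 = Clock::now();
    ioBinding.SynchronizeOutputs();
    auto t2 = Clock::now();

    std::printf("Run: %lld us, SynchronizeOutputs: %lld us\n",
        static_cast<long long>(duration_cast<microseconds>(t1 - t0).count()),
        static_cast<long long>(duration_cast<microseconds>(t2 - t1).count()));
}
```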
So overall both cases end up with roughly the same total execution time, which strikes me as peculiar.

If I understand correctly, in the first case the call to `Run()` performs the inference and also copies the output from the GPU to the output tensor buffer on the CPU, so `SynchronizeOutputs()` is a no-op there and the overall time is about what I'd expect.

With GPU tensors the situation is reversed: the inference itself(?) is very fast, but the IoBinding synchronization now takes up the bulk of the execution time. Why?

Edit: even if `SynchronizeOutputs()` flushes the command list, I'd still expect at least slightly better performance overall, since there should be no GPU-CPU transfers.

Can anyone shed some light on this? What does `SynchronizeOutputs()` actually do in the GPU-tensor case?
Label: ep:DML (issues related to the DirectML execution provider)