C++ API, AMD GPU (Radeon Pro W5700).
Using @fdwr's sample as a guide (thanks!), I'm able to run our ONNX models with GPU tensors using D3D12 buffers.
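For context, the binding setup follows the sample roughly like this. This is a minimal sketch, not our exact code: the shape, the buffer, and the byte size are placeholders, and cleanup (`FreeGPUAllocation`) is only noted in a comment.

```cpp
#include <array>
#include <d3d12.h>
#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>

// Wrap an existing ID3D12Resource as an ORT tensor on the DML device (no copy).
Ort::Value MakeDmlTensor(ID3D12Resource* buffer, size_t bufferByteSize)
{
    const OrtDmlApi* ortDmlApi = nullptr;
    Ort::ThrowOnError(Ort::GetApi().GetExecutionProviderApi(
        "DML", ORT_API_VERSION, reinterpret_cast<const void**>(&ortDmlApi)));

    // Wrap the D3D12 resource so ORT can treat it as a DML allocation.
    // The allocation must later be released with ortDmlApi->FreeGPUAllocation.
    void* dmlAllocation = nullptr;
    Ort::ThrowOnError(ortDmlApi->CreateGPUAllocationFromD3DResource(buffer, &dmlAllocation));

    // Create an OrtValue over that GPU memory.
    Ort::MemoryInfo dmlMemory("DML", OrtDeviceAllocator, /*device id*/ 0, OrtMemTypeDefault);
    std::array<int64_t, 2> shape{1, 1000};  // placeholder shape
    return Ort::Value::CreateTensor(dmlMemory, dmlAllocation, bufferByteSize,
                                    shape.data(), shape.size(),
                                    ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT);
}
```

The resulting values are then bound through `Ort::IoBinding` (`BindInput`/`BindOutput`) before calling `session.Run()`.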
Timing the execution with both ORT's internal profiling option and `std::chrono::high_resolution_clock` (measurement sketch below the numbers), I get:
- Using CPU tensors:
  - `session.Run()` ~17 ms
  - `ioBinding.SynchronizeOutputs()` ~6 μs
- Using GPU tensors:
  - `session.Run()` ~150 μs
  - `ioBinding.SynchronizeOutputs()` ~16 ms
(Remark: these numbers are for the second call to `session.Run()`, as the first seems to carry some initialization overhead.)
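The `std::chrono` numbers come from a loop along these lines (a sketch; the first `Run()` is treated as warm-up, and `session`/`ioBinding` are set up as above):

```cpp
#include <chrono>
#include <cstdio>
#include <onnxruntime_cxx_api.h>

void TimeRun(Ort::Session& session, Ort::IoBinding& ioBinding)
{
    using Clock = std::chrono::high_resolution_clock;
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    Ort::RunOptions runOptions;

    session.Run(runOptions, ioBinding);  // warm-up: 1st call has init overhead

    auto t0 = Clock::now();
    session.Run(runOptions, ioBinding);  // 2nd call: the one being measured
    auto t1 = Clock::now();
    ioBinding.SynchronizeOutputs();
    auto t2 = Clock::now();

    std::printf("Run: %lld us, SynchronizeOutputs: %lld us\n",
        static_cast<long long>(duration_cast<microseconds>(t1 - t0).count()),
        static_cast<long long>(duration_cast<microseconds>(t2 - t1).count()));
}
```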
So overall both cases end up with roughly the same total execution time, which strikes me as peculiar.

If I understand correctly, in the first case the call to `Run()` performs the inference and also copies the output from the GPU to the output tensor buffer on the CPU, so `SynchronizeOutputs()` is a no-op there and the overall time is about what I'd expect.

With GPU tensors the situation is reversed: the inference itself(?) is very fast, but the IoBinding synchronization now takes up the bulk of the execution time. Why?

Edit: even if `SynchronizeOutputs()` flushes the command list, I'd still expect at least slightly better performance overall, since there should be no GPU-CPU transfers.

Can anyone shed some light on this? What does `SynchronizeOutputs()` actually do in the GPU-tensor case?
Label: ep:DML (issues related to the DirectML execution provider)