-
@trainchoo - it might help to check what the model looks like. Some ops may introduce memcpy between CPU and GPU, which could impact perf. Is that model sharable? Also - did you have your ORT built with CUDA enabled? @yuslepukhin - do you spot any potential issue in the C# code above? Thx
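One way to check for those memcpy nodes is to enable ORT profiling and search the resulting JSON trace for MemcpyToHost / MemcpyFromHost events. A minimal sketch, assuming the Microsoft.ML.OnnxRuntime.Gpu package; the model path, input name, and shape are placeholders, not your actual model:

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Run one inference with the CUDA EP and profiling enabled, then
// inspect the profile JSON for Memcpy nodes. "model.onnx" and the
// input name "input_1" are placeholders.
using var options = new SessionOptions();
options.AppendExecutionProvider_CUDA(0);  // device 0; throws if this ORT build lacks CUDA
options.EnableProfiling = true;

using var session = new InferenceSession("model.onnx", options);

var input = new DenseTensor<float>(new[] { 1, 1, 30 });  // batch, seq, features (assumed)
var inputs = new[] { NamedOnnxValue.CreateFromTensor("input_1", input) };
using (session.Run(inputs)) { }

// Path of the chrome-trace JSON; grep it for "MemcpyToHost"/"MemcpyFromHost".
string profilePath = session.EndProfiling();
System.Console.WriteLine($"Profile written to {profilePath}");
```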
-
Thanks. I exported the model using this library: https://github.com/onnx/tensorflow-onnx. I don't mind sharing the model: easyupload link
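For reference, the conversion was along these lines (file names and opset are placeholders, not my exact command):

```
python -m tf2onnx.convert --keras model.h5 --output model.onnx --opset 13
```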
-
I have a small-ish LSTM model built in Keras and exported to ONNX so I can do inference in a C# app.
Running 100 inference loops on the CPU takes ~270 ms total compute time, but if I enable the CUDA execution provider, I get a compute time of 1500+ ms. I've looked through the tutorials and made sure all the versions of CUDA and cuDNN are correct, and I couldn't figure out the cause of the slowdown. My only guess is that running one inference at a time on the GPU is inefficient. Is there any way I can run a batch through the same model to speed up inference?
Model sequence:
Input - 30 floats
LSTM - 120 nodes
LSTM - 120 nodes
Dense - 30 nodes
Dense - 2 nodes
Test code:
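(The original snippet is not shown; below is a hypothetical reconstruction of the benchmark described above. The model path, input name "input_1", and input shape {1, 1, 30} are assumptions, not the poster's actual code.)

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

using var options = new SessionOptions();
options.AppendExecutionProvider_CUDA(0);  // comment out for the CPU baseline

using var session = new InferenceSession("model.onnx", options);

var input = new DenseTensor<float>(new[] { 1, 1, 30 });  // assumed shape
var inputs = new[] { NamedOnnxValue.CreateFromTensor("input_1", input) };

// Warm-up run: the first CUDA Run pays one-time costs (kernel selection,
// memory arena growth) that shouldn't be included in the timing.
using (session.Run(inputs)) { }

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
    using var results = session.Run(inputs);
}
sw.Stop();
Console.WriteLine($"100 runs: {sw.ElapsedMilliseconds} ms");

// To batch instead of looping, stack the 100 inputs into one tensor of
// shape {100, 1, 30} (requires a dynamic batch dimension in the exported
// model) and call session.Run once.
```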