Hi :) In our Halide applications, we see that OpenCL compile times can be quite high, especially on Intel. After startup, it takes about 6 seconds before our application can use the Halide pipelines. We use almost all of the kernels in one go, so lazy-loading kernels is not an option. Ultimately, our goal is to reduce the time between application start and running a full Halide pipeline. Do you have any tips for improving this? Is it possible to pre-compile the OpenCL kernels to multiple IRs, such as SPIR or PTX, and load whichever one the driver supports at runtime? If this is not currently supported, where would be the right place to implement it?
Background: Our Halide pipeline is written in Python and is consumed by our applications using the output of `compile_to_static_library`. Our moderate-sized Halide pipeline boils down to 28 `clBuildProgram` calls at runtime. The plain-text kernels total 1.3 MiB. All `clBuildProgram` calls together take about 6 seconds, with the largest kernels taking ~0.75 seconds per call. We measured this on Intel GPUs on Linux using Halide's OpenCL target. Other GPU vendors may have faster `clBuildProgram` times, but we do not restrict which GPU or OpenCL driver can be used with the Halide OpenCL backend.
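To illustrate the kind of approach this question is asking about, here is a minimal sketch of an on-disk cache for built kernel binaries. It is an assumption about how such a cache could look, not an existing Halide feature. The names `cache_key`, `load_or_build` and `build_fn` are hypothetical. `build_fn` stands in for "call `clBuildProgram`, then fetch the result with `clGetProgramInfo(CL_PROGRAM_BINARIES)`", and a cache hit would be handed to `clCreateProgramWithBinary` instead of being rebuilt from source. The key includes the device name and driver version because a driver-specific binary is not expected to be valid on another device or after a driver update.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(source: str, device_name: str, driver_version: str) -> str:
    """Hash everything that could invalidate a driver-specific binary."""
    h = hashlib.sha256()
    for part in (device_name, driver_version, source):
        data = part.encode("utf-8")
        # Length-prefix each field so adjacent fields cannot be confused.
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()


def load_or_build(
    source: str,
    device_name: str,
    driver_version: str,
    build_fn: Callable[[str], bytes],  # hypothetical: wraps the real OpenCL build
    cache_dir: str,
) -> bytes:
    """Return a cached binary if one exists, otherwise build and store it."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (cache_key(source, device_name, driver_version) + ".bin")

    if path.exists():
        return path.read_bytes()

    binary = build_fn(source)
    # Write to a temporary file first, then rename, so that an interrupted
    # run does not leave a truncated binary under the final name.
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(binary)
    tmp.replace(path)
    return binary
```

With a cache like this, only the first start on a given device and driver would pay the full build cost. It would not help the very first run, which is why shipping pre-compiled IRs is still the more interesting option for us.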