OpenCL sin Performance

Yichao Yu edited this page Jan 9, 2021 · 5 revisions

Similar to the CPU test, we will first measure the performance while involving as little memory access as possible. Due to the complexity of the OpenCL driver, we will also measure the overhead of scheduling a job from the CPU, of waiting for a previous dependent job to finish, and of running a dummy kernel. These numbers should give us an idea of how much work we need to schedule to avoid being bottlenecked by this overhead. The round-trip time for scheduling a single dummy kernel also gives us an upper bound on the overhead latency we should expect, though we still need to measure that in a more realistic setting later.
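To make the "how much work do we need" question concrete, here is a minimal sketch of the arithmetic, assuming a simple model where each kernel launch costs a fixed overhead plus a constant time per work item. The `LaunchModel` struct, the function names, and any numbers plugged into it are hypothetical and are not taken from the measurements on this page.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cost model (an assumption, not a measured result):
// each launch pays a fixed overhead plus a constant cost per work item.
struct LaunchModel {
    double overhead_ns; // fixed cost per scheduled kernel
    double per_item_ns; // compute cost per work item
};

// Fraction of the total launch time spent on overhead when `nitems`
// work items are scheduled in a single kernel.
double overhead_fraction(const LaunchModel &m, double nitems)
{
    return m.overhead_ns / (m.overhead_ns + m.per_item_ns * nitems);
}

// Smallest number of work items per launch that keeps the overhead
// fraction at or below `max_fraction`.
double min_items_for_fraction(const LaunchModel &m, double max_fraction)
{
    assert(max_fraction > 0 && max_fraction < 1);
    // overhead / (overhead + t * n) <= f  <=>  n >= overhead * (1 - f) / (f * t)
    return std::ceil(m.overhead_ns * (1 - max_fraction) /
                     (max_fraction * m.per_item_ns));
}
```

For example, with a made-up overhead of 20 us per launch and 0.5 ns per work item, calling `min_items_for_fraction(LaunchModel{20000.0, 0.5}, 0.25)` gives the number of work items needed to keep the overhead at or below a quarter of the total time. The real driver is unlikely to follow such a simple model exactly, which is why the numbers have to be measured.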

On the CPU, we force the computation to happen without accessing memory by using asm volatile to create a dummy use of the result. AFAICT, OpenCL does not have anything as direct as this, so we need to find another way. In this test, we do this by storing the result to memory behind a branch that will never be taken at runtime. We also make the condition of the store depend on the calculated value, so that the compiler cannot move the computation into the branch. More specifically, we have something similar to

float res = amp * sin(...);
if (res > threshold) {
   // store `res` to memory
}

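Since the magnitude of sin never exceeds 1, the magnitude of res is bounded by that of amp. As long as we pass in an amp that is significantly smaller than threshold, the branch will never be taken, and the compiler will generally not optimize the computation out.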
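The two ways of keeping a result alive can be compared side by side on the CPU. The following is only a sketch: `dummy_use` assumes GCC or Clang extended inline assembly and is not necessarily the exact form used in the CPU test, and `count_stores` is a hypothetical CPU emulation of the guarded-store pattern, not the actual OpenCL kernel.

```cpp
#include <cassert>
#include <cmath>

// Dummy use of a value: the empty asm statement takes `v` as an input, so the
// compiler must compute it but no memory store of the result is required.
// Assumes GCC/Clang extended asm; the constraint choice is an assumption.
static inline void dummy_use(float v)
{
    asm volatile("" :: "r,m"(v) : "memory");
}

// CPU emulation of the guarded store over `n` inputs. The store condition
// depends on the computed value, so the computation cannot be sunk into the
// branch. Returns how many stores were actually executed.
static int count_stores(float amp, float threshold, int n, float *out)
{
    assert(out != nullptr);
    int nstores = 0;
    for (int i = 0; i < n; i++) {
        float res = amp * std::sin(0.01f * i);
        if (res > threshold) {
            out[i] = res; // never reached when amp is well below threshold
            nstores++;
        }
    }
    return nstores;
}
```

With `amp = 0.5f` and `threshold = 2.0f`, `count_stores` reports zero stores and leaves the buffer untouched, which is the situation the test aims for. Raising `amp` above `threshold` makes the branch reachable again.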

As mentioned in the accuracy test, we will test both sin and native_sin. For each test, including the dummy one mentioned above, we vary the dimension we run each kernel on and the number of repetitions we schedule in the command queue. We do this for command queues that are either in order or out of order. For the computation tests (i.e. not the dummy one), we also vary the number of times we evaluate the sin/native_sin function inside the kernel to minimize the effect of the kernel overhead on the measurement.
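The parameter sweep for the computation tests can be sketched as a set of nested loops, together with the normalization that turns a total time into a time per evaluation. This is an illustration only: the `Config` struct, the field names, the specific values, and the reading of "dimension" as the number of work items are all assumptions, and the dummy kernel (which has no evaluation count) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// One point in the sweep. All names and values here are hypothetical.
struct Config {
    bool out_of_order;       // queue mode; in OpenCL this is chosen at queue
                             // creation via CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
    bool native;             // native_sin (true) or sin (false)
    std::size_t global_size; // work items per kernel (assumed meaning of "dimension")
    int nrep;                // kernels enqueued back to back
    int ncalc;               // sin evaluations per work item
};

// Enumerate every combination of the varied parameters.
std::vector<Config> make_sweep()
{
    std::vector<Config> configs;
    for (bool ooo : {false, true})
        for (bool native : {false, true})
            for (std::size_t gsize : {std::size_t(1) << 10, std::size_t(1) << 16,
                                      std::size_t(1) << 22})
                for (int nrep : {1, 16, 256})
                    for (int ncalc : {1, 4, 16})
                        configs.push_back(Config{ooo, native, gsize, nrep, ncalc});
    return configs;
}

// Average time per sin/native_sin evaluation for one configuration, given
// the total wall time measured for all `nrep` kernels.
double ns_per_eval(const Config &c, double total_ns)
{
    return total_ns / (double(c.global_size) * c.nrep * c.ncalc);
}
```

Increasing `ncalc` spreads the fixed per-kernel cost over more evaluations, so `ns_per_eval` should approach the pure compute cost as `ncalc` grows.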

The full code for the test can be found in opencl-dry-compute.cpp and the results can be found under data/cl-dry-compute.

As mentioned before, we have three different platforms to test.

  1. Intel OpenCL CPU runtime

    1. i7-6700K

      Performance

    2. i9-10885H

      Performance

  2. Intel Compute OpenCL runtime

    1. UHD 530

      Performance

    2. UHD 640

      Performance

  3. AMD ROCm OpenCL driver

    AMD Radeon RX 5500 XT

    Performance