OpenCL Event Overhead

  1. Event callback overhead

    We noticed some possible overhead when we enabled event callbacks on each piece of work. Since we may actually want to use event callbacks, let's see if the overhead is real and, if it is, whether it is manageable.

    In the previous test we enabled three event callbacks for every event, but in the real code the most important one is the CL_COMPLETE event, which tells us when we can push the data to the AWG driver. So let's measure the overhead with all eight combinations of event callbacks enabled. Each combination will be marked with a three-tuple in which the enable state of each callback is shown as 0 or 1, in the order of the CL_SUBMITTED, CL_RUNNING and CL_COMPLETE events.

    The full code for the test can be found in opencl-event-overhead.cpp and the results can be found under data/cl-event-overhead.
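
    For reference, enabling the callbacks boils down to clSetEventCallback calls along these lines (a minimal sketch, not the actual test code; `event_notify` and `set_callbacks` are illustrative names):

    ```cpp
    #include <CL/cl.h>
    #include <cstdio>

    // Illustrative callback: report which execution state fired.
    static void CL_CALLBACK event_notify(cl_event evt, cl_int status, void *user_data)
    {
        (void)evt;
        std::printf("event reached state %d (%s)\n", status, (const char*)user_data);
    }

    // Enable any subset of the three callbacks on `evt`, matching the
    // (CL_SUBMITTED, CL_RUNNING, CL_COMPLETE) three-tuple described above.
    static void set_callbacks(cl_event evt, bool submitted, bool running, bool complete)
    {
        if (submitted)
            clSetEventCallback(evt, CL_SUBMITTED, event_notify, (void*)"submitted");
        if (running)
            clSetEventCallback(evt, CL_RUNNING, event_notify, (void*)"running");
        if (complete)
            clSetEventCallback(evt, CL_COMPLETE, event_notify, (void*)"complete");
    }
    ```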

    The test results are plotted below. Each line corresponds to a different set of event callbacks enabled and is labeled with the three-tuple mentioned above. The time per run is derived from the total time of 1024 repetitions, and each data point is measured about 150 to 200 times. The dots show the individual measurements and the line shows the average.
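
    The per-run time is computed roughly like this (a sketch; `run_batch` is a hypothetical stand-in for one repetition of the enqueue-and-finish cycle, not a function from the test code):

    ```cpp
    #include <chrono>

    // Hypothetical stand-in for one repetition (enqueue the kernels and,
    // depending on the variant, fire callbacks / wait for completion).
    void run_batch();

    // Average time per run over 1024 repetitions, in nanoseconds.
    double time_per_run_ns()
    {
        constexpr int reps = 1024;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; i++)
            run_batch();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
    }
    ```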

    1. Intel UHD 530

      [Timing plot]

    2. Intel UHD 640

      [Timing plot]

    3. AMD Radeon RX 5500 XT

      [Timing plot]

    Huh, so there is a pretty big overhead on both the Intel and AMD GPUs, and the overhead also seems to increase the uncertainty of the timing. Relatively speaking, the overhead on the AMD GPU is much larger (3x for small sizes; it remains significant until a buffer size of about 2 to 4 MiB, and it also explains the drop in throughput at the 32 MiB buffer size that we saw in the last test). However, the absolute overhead is larger for the Intel GPUs at small sizes, though the overhead on the Intel UHD 640 seems to become negative starting from the 512 KiB buffer... This overhead is a little too much for us, although it could be slightly better in the real code, where we'll only be waiting for the final events rather than all the intermediate ones, and the time per kernel is also higher (both because we are doing more computation and because the final transfer to the main memory has a much lower bandwidth). We should do more tests that map more closely to the real workload, including enabling fewer event callbacks and doing so only on the result of clEnqueueMapBuffer. We'll also try to see if waiting on the event in another thread makes any difference...

  2. Overhead of waiting for events

    Since the event callback seems to cause a lot of overhead, let's see if waiting on the events has a similar effect. We'll use exactly the same code as before but replace the event callbacks with clWaitForEvents called in a loop on a different thread. This time we have only one waiting variant, and it corresponds to the CL_COMPLETE one in the previous test.

    The full code for the test can be found in opencl-wait-overhead.cpp and the results can be found under data/cl-wait-overhead. The test results are plotted below in the same manner as the previous test.
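
    The waiting side is essentially a consumer loop like the following sketch (the shared queue and thread setup here are illustrative, not the actual test code):

    ```cpp
    #include <CL/cl.h>
    #include <condition_variable>
    #include <deque>
    #include <mutex>

    // Illustrative shared state between the enqueueing thread and the waiter.
    static std::mutex mtx;
    static std::condition_variable cv;
    static std::deque<cl_event> pending;
    static bool done = false;

    // Runs on a separate thread: pop each event and block until it completes.
    static void waiter_loop()
    {
        for (;;) {
            cl_event evt;
            {
                std::unique_lock<std::mutex> lk(mtx);
                cv.wait(lk, [] { return !pending.empty() || done; });
                if (pending.empty())
                    return;
                evt = pending.front();
                pending.pop_front();
            }
            clWaitForEvents(1, &evt); // returns once the event is CL_COMPLETE
            clReleaseEvent(evt);
        }
    }
    ```

    The enqueueing thread pushes the event from each enqueue call into `pending` under the lock and notifies `cv`.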

    1. Intel UHD 530

      [Timing plot]

    2. Intel UHD 640

      [Timing plot]

    3. AMD Radeon RX 5500 XT

      [Timing plot]

    It seems that the overhead on the Intel GPUs is roughly the same or maybe slightly lower, but it's almost completely gone on the AMD GPU (basically negligible). I assume this tells us something about the implementation in the driver, but I'm too lazy to figure that out. If I get time in the future, I may post this as an improvement request to the ROCm repo, after doing some more measurements (e.g. whether the two give different time delays). For now, we may prefer waiting from a different thread, which isn't really harder for us to implement, or even polling on the same thread (maybe by checking the event state, though it's possible that polling the event state has a similar overhead). Next, let's see if adding more work as well as transferring the buffer to the host has any effect on the overhead.
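
    For reference, polling the event state would look something like this (a sketch; whether this is actually cheaper than clWaitForEvents is exactly what would need measuring):

    ```cpp
    #include <CL/cl.h>

    // Spin on the event state until it completes. CL_COMPLETE is 0 and the
    // in-flight states (CL_RUNNING, CL_SUBMITTED, CL_QUEUED) are positive;
    // a negative value means the command terminated with an error.
    static cl_int poll_event(cl_event evt)
    {
        cl_int status;
        do {
            clGetEventInfo(evt, CL_EVENT_COMMAND_EXECUTION_STATUS,
                           sizeof(status), &status, nullptr);
        } while (status > CL_COMPLETE);
        return status;
    }
    ```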

  3. Event overhead with a more complex dependency structure

    The structure we'll test will be very similar to what we use on the CPU. We'll also organize the work items into "workers" just like in the CPU case, although these do not correspond to actual worker threads and are just chains of work items, each depending on the previous one (for the phase computation in the real code). The number of workers is the maximum number of work items that can in principle be done in parallel, since only one work item in each dependency chain can be worked on at a time. On the other hand, the minimum number of work items that can be done in parallel is the number of workers that take no input from other workers.
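
    In OpenCL terms, each "worker" is just a chain built from event wait lists, along these lines (a minimal sketch with a single self-contained chain; taking input from another worker would simply mean a longer wait list):

    ```cpp
    #include <CL/cl.h>

    // Illustrative: enqueue `nsteps` kernels where each one waits on the
    // previous one in the same chain, forming one "worker".
    static void enqueue_chain(cl_command_queue queue, cl_kernel kernel,
                              size_t global_size, int nsteps, cl_event *last_event)
    {
        cl_event prev = nullptr;
        for (int i = 0; i < nsteps; i++) {
            cl_event evt;
            clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr,
                                   prev ? 1 : 0, prev ? &prev : nullptr, &evt);
            if (prev)
                clReleaseEvent(prev);
            prev = evt;
        }
        *last_event = prev; // only the final event needs a wait or callback
    }
    ```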

    As an aside, we could also schedule the work on the CPU in a similar fashion (roughly equivalent to inventing an OpenCL-like API). The biggest challenge would be to write an efficient queue that tracks the dependencies and allows each work item to be distributed to the workers in a fair and efficient manner. This should allow us to "oversubscribe" the CPU by creating more logical "workers" than there are CPU cores. If we create the same number of input-free logical workers as the number of CPU cores, we'll be able to guarantee that all CPUs always have some work to do. The memory throughput requirement would be higher, since now each logical worker, instead of each physical worker, is going to be writing to memory, so each buffer needs to be slightly smaller in order to fit in the L3 cache. The advantage is that we could have very similar scheduling logic/code for the GPU and CPU paths, and it might also simplify the issue of adding/removing channels.
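
    One possible shape for such a dependency-tracking queue (purely a sketch of the idea, not something we've written): give each work item a count of unfinished inputs and push it onto a shared ready-queue once the count reaches zero.

    ```cpp
    #include <atomic>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <vector>

    // Illustrative work item: becomes runnable when all inputs have finished.
    struct Work {
        std::function<void()> run;
        std::atomic<int> pending_inputs{0};
        std::vector<Work*> dependents; // items consuming this item's output
    };

    static std::mutex qmtx;
    static std::queue<Work*> ready; // drained by the pool of CPU worker threads

    // Called by whichever thread finishes executing `w`.
    static void finish(Work *w)
    {
        for (Work *dep : w->dependents) {
            // The last input to finish makes the dependent runnable.
            if (dep->pending_inputs.fetch_sub(1) == 1) {
                std::lock_guard<std::mutex> lk(qmtx);
                ready.push(dep);
            }
        }
    }
    ```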

    Anyway, back to this test. We'll test cases with 7 and 28 workers, in the configurations 7 = 1 + 2 + 2 * 2 and 28 = 1 + 3 + 3 * 2 + 3 * 2 * 3. The output of the final kernel is written to a buffer created with a host pointer, and we enqueue a buffer map after the kernel. We'll test with and without a callback as well as with and without waiting on the event from the map operation. Due to an initial coding error, we'll also test the cases where we do and don't unmap the buffer. We map it at most 512 times, so the leaking map count should not be a problem here.
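
    The map/unmap part of the loop looks roughly like this (a sketch; names and error handling are illustrative, not from the test code):

    ```cpp
    #include <CL/cl.h>

    // Illustrative: map the result buffer once the final kernel is done,
    // read it out, and optionally unmap it again (the condition under test).
    static void map_result(cl_command_queue queue, cl_mem buffer, size_t size,
                           cl_event kernel_done, bool do_unmap)
    {
        cl_int err;
        cl_event map_evt;
        void *ptr = clEnqueueMapBuffer(queue, buffer, CL_FALSE, CL_MAP_READ,
                                       0, size, 1, &kernel_done, &map_evt, &err);
        // This is the event we either wait on, register a callback on, or both.
        clWaitForEvents(1, &map_evt);
        // ... read the result through `ptr` (e.g. push it to the AWG driver) ...
        if (do_unmap)
            clEnqueueUnmapMemObject(queue, buffer, ptr, 0, nullptr, nullptr);
        clReleaseEvent(map_evt);
    }
    ```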

    The full code for the test can be found in opencl-event-overhead-2.cpp and the results can be found under data/cl-event-overhead-2. (Not a very creative name for the test, I know... I ran out of ideas...)

    The test results are plotted below. The left plots show the timing under the different conditions, with the legend marking each condition with (unmap) or without unmapping the buffer, with (wait) or without waiting for the event, and with (cb) or without registering a callback on the event. The number of workers is shown in the title. The right plots show the difference between each line in the left plot and the minimum average time at each buffer size (which may or may not come from the same condition at every buffer size). This tells us the maximum overhead in each case.

    1. Intel UHD 530

      [Timing plot]

    2. Intel UHD 640

      [Timing plot]

    3. AMD Radeon RX 5500 XT

      [Timing plot]

    In general the relative overhead is much smaller, although it is still significant for the event callback on the AMD GPU at smaller worker numbers (i.e. fewer work items per event).

    On the Intel UHD 530, there's almost no observable overhead at all, though unmapping the buffer does seem to cause some "outlier" points with long delays. This trend becomes more consistent on the Intel UHD 640, where unmapping the buffer seems to have a fixed relative overhead (maybe the driver runs a loop to flush the cache or something?). There's still no observable effect from the event callback or wait, probably because the overhead of roughly 10 us we've seen before is too small to be seen now.

    On the AMD GPU, unmapping has almost zero effect, which is nice since we can write proper code without worrying about it. Waiting for the event seems to cause a pretty small overhead (two to three us) that we'll just ignore for now... The main overhead still seems to come from the event callback, and it also seems to depend on the buffer size in a non-trivial way: it starts as a roughly fixed amount for small buffer sizes, increases slightly between 4 MiB and 15 MiB total buffer size (it's slightly different between the 7- and 28-worker cases, so it's not purely a function of work size), and then saturates at about 0.4 ms for larger buffer sizes. I'm still not really sure what causes it, and hopefully we can avoid it by using clWaitForEvents.