OpenCL Event Overhead
Event callback overhead
We noticed some possible overhead when we enabled event callbacks on each piece of work. Since we may actually want to use the event callbacks, let's see if the overhead is real and whether it is manageable.

In the previous test we enabled three event callbacks for every event, but in the real code the most important one is the `CL_COMPLETE` event, which tells us when we can push the data to the AWG driver. So let's measure the overhead with all (8) combinations of event callbacks enabled. The event callback types will be marked with a three-tuple in which the enable state of the callbacks is shown as `0` or `1`, in the order of the `CL_SUBMITTED`, `CL_RUNNING` and `CL_COMPLETE` events.

The full code for the test can be found in `opencl-event-overhead.cpp` and the results can be found under `data/cl-event-overhead`.

The test results are plotted below. Each line corresponds to a different set of event callbacks enabled and is labeled using the three-tuple mentioned above. The time per run is measured from the total time of 1024 repetitions and each data point is measured about 150 to 200 times. The dots show each individual measurement and the line shows the average.
- Intel UHD 530
- Intel UHD 640
- AMD Radeon RX 5500 XT
Huh, so there is a pretty big overhead on both GPUs, and the overhead seems to increase the uncertainty of the timing as well. Relatively speaking, the overhead on the AMD GPU is much larger (3x for small sizes, and it remains significant until a buffer size of about `2` to `4 MiB`; it also explains the drop in throughput for the `32 MiB` buffer size we saw in the last test). However, the absolute overhead is larger for the Intel GPUs at small sizes, though the overhead on the Intel UHD 640 seems to become negative starting from the `512 KiB` buffer...

This overhead is a little too much for us, although it could be slightly better in the real code, where we'll only be waiting for the final events rather than all the intermediate ones, and the time per kernel is also higher (both because we are doing more computation and because the final transfer to the main memory has a much lower bandwidth). We should do more tests that map more closely to the real workload, including enabling fewer event callbacks and doing it only on the result of `clEnqueueMapBuffer`. We'll also try to see if waiting on the event in another thread makes any difference...
Overhead of waiting for events
Since the event callback seems to cause a lot of overhead, let's see if waiting on the events has a similar effect. We'll use exactly the same code as before but replace the event callbacks with `clWaitForEvents` called in a loop on a different thread. This time we have only one waiting variant, and it corresponds to `CL_COMPLETE` in the previous test.

The full code for the test can be found in `opencl-wait-overhead.cpp` and the results can be found under `data/cl-wait-overhead`. The test results are plotted below in the same manner as the previous test.

- Intel UHD 530
- Intel UHD 640
- AMD Radeon RX 5500 XT
It seems that the overhead on the Intel GPUs is roughly the same or maybe slightly lower, but it's almost completely gone on the AMD GPU (basically negligible). I assume this tells us something about the implementation in the driver, but I'm too lazy to figure that out. If I have time in the future, I may post this as an improvement request to the ROCm repo, after doing some more measurements (e.g. whether the two give different time delays). For now, we may prefer waiting from a different thread, which isn't really harder for us to implement, or even polling on the same thread (maybe by checking the event state, though it's possible that polling the event state has a similar overhead). Next, let's see if adding more work, as well as transferring the buffer to the host, has any effect on the overhead.
Event overhead with a more complex dependency structure
The structure we'll test will be very similar to what we use on the CPU. We'll also organize the work into "workers" just like in the CPU case, although these do not correspond to actual worker threads and are just chains of work items, each depending on the previous one (for the phase computation in the real code). The worker count is the maximum number of work items that can in principle be done in parallel, since only one work item in each dependency chain can be worked on at a time. OTOH, the minimum number of work items that can be done in parallel is the number of workers that take no input from other workers.
As an aside, we could also schedule the work on the CPU in a similar fashion (roughly equivalent to inventing an OpenCL-like API). The biggest challenge would be to write an efficient queue to track the dependencies and allow each work item to be distributed to workers in a fair and efficient manner. This should allow us to "oversubscribe" the CPU by creating more logical "workers" than there are CPU cores. If we create the same number of logical workers with no input as the number of CPU cores, we'll be able to guarantee that all CPUs always have some work to do. The memory throughput requirement would be higher, since now each logical worker, instead of each physical worker, is going to be writing to memory, so each buffer needs to be slightly smaller in order to fit in the `L3` cache. The advantage of this is that we can have very similar scheduling logic/code for the GPU and CPU paths, and it might also simplify the issue of adding/removing channels.

Anyway, back to this test. We'll test cases with `7` and `28` workers in the configurations `7 = 1 + 2 + 2 * 2` and `28 = 1 + 3 + 3 * 2 + 3 * 2 * 3`. The output of the final kernel goes to a buffer with a host pointer, and we enqueue a map buffer after the kernel. We'll test with or without a callback and a wait on the event from the map buffer. Due to an initial coding error, we'll also test the case where we unmap the buffer vs. not. We'll map it at most `512` times, so the leaking of the map count should not be a problem here.

The full code for the test can be found in `opencl-event-overhead-2.cpp` and the results can be found under `data/cl-event-overhead-2`. (Not very creative naming for the test, I know... I ran out of ideas...)

The test results are plotted below. The left plots show the timing under different conditions, with the legend marking the exact condition: with (`unmap`) or without unmapping the buffer, with (`wait`) or without waiting for the event, and with (`cb`) or without registering a callback on the event. The worker number is shown in the title. The right plot shows the difference between each line in the left plot and the minimum averaged time at each buffer size (which may or may not come from the same condition at each buffer size). This tells us the maximum overhead in each case.

- Intel UHD 530
- Intel UHD 640
- AMD Radeon RX 5500 XT
In general the relative overhead is much smaller, although it is still significant for smaller worker numbers (i.e. fewer work items per event) with the event callback on the AMD GPU.
On the Intel UHD 530, there's almost no observable overhead at all, though unmapping the buffer does seem to cause some "outlier" points with long delays. This trend becomes more consistent on the Intel UHD 640, where unmapping the buffer seems to have a fixed relative overhead (maybe the driver runs a loop to flush the cache or something?). There's still no observable effect from the event callback or wait, probably because the overhead of roughly `10 us` we've seen before is too small to be seen now.

On the AMD GPU, the unmapping has almost zero effect, which is nice since we can write proper code without worrying about this. Waiting for the event seems to cause a pretty small overhead (two to three `us`) that we'll just ignore for now... The main overhead still seems to come from the event callback, and it also seems to depend on the buffer size in a non-trivial way. It starts as a roughly fixed amount for small buffer sizes, increases slightly between `4 MiB` and `15 MiB` total buffer size (it's slightly different between the `7` and `28` worker cases, so it's not purely a function of work size), and then saturates at about `0.4 ms` for larger buffer sizes. I'm still not really sure what causes it, and hopefully we can avoid it by using `clWaitForEvents`.