
Ring Buffer Between Two Threads


For generating the output, the data flow is unidirectional and converging, i.e. we are always passing data from one computation to the next without any feedback, and each computation has at most one consumer. Because of this, we only need synchronization between two threads at a time and only need to pass data from one to the next.

In this page, we'll test what the best strategy is for doing this. In general, we want to make sure the workers are always doing useful work (or, if there is an imbalance in the workload, that the thread with more work is never idle). This means that:

  1. The threads are not spending time waiting for one another.

    There should be enough buffer between each worker so that time jitter in one worker does not directly affect another.

  2. The threads are not fighting with each other.

    We need to minimize the bus traffic and flip-flop caused by the synchronization. These could cause one thread to spend a lot of time trying to do synchronization even though both sides are actually ready to go.

More concretely, here is a list of issues/options to investigate.

  1. Whether a worker is memory bound or compute bound.

    Note that here, "computation" includes all the work that is not spent synchronizing with the other thread. It might include real computation, I/O, or synchronization with a third thread. What matters is whether the thread is saturating the bandwidth of a memory channel.

    Based on what we measured with a single thread, I would assume that we are most likely not going to be memory bound unless the buffer is in the main memory. OTOH, this might change if a thread is consuming the output from more than one thread.

  2. The balance of the workload on the producer and consumer side.

    Of course we want the workload to be as balanced as possible, but it'll never be perfect. It'll be interesting to see how the workload balance affects the performance. Due to the symmetry of the ring buffer, I would expect the cases where the producer is faster and where the consumer is faster to behave very similarly.

  3. Buffer size

    A larger buffer means less probability of conflict. OTOH, we would most likely still want the buffer to be in the L3 cache to maximize the memory bandwidth. (Do note that if we are compute bound overall we might get away with always hitting the main memory. This might be the case since even with 1.25 GS/s sampling rate we only need 5 GB/s of float32 which is within the bandwidth limit.)

  4. Frequency and granularity of buffer synchronization

    For a ring buffer, each side owns a section of the buffer and passes it to the other side when it finishes working on it (whether that work is reading or writing). Two related questions are how often this synchronization should be done and what the basic unit of data passed at one time should be.

    More frequent synchronization minimizes the wait time on the other thread but also causes more slowdown on the current thread. Similarly, a finer granularity allows a more flexible synchronization strategy but also requires more frequent synchronization. OTOH, with a very coarse granularity, say if the whole buffer is split into a few blocks which are synchronized one at a time, we could simplify the synchronization logic and potentially even use separate flags in memory to control the ownership of the blocks to reduce contention.

  5. Back-off strategy

    As in a spin lock, when one thread fails to acquire a new section of memory in the ring buffer, it needs to back off from checking again to reduce bus contention and to allow the other thread to release the buffer. We may need to implement an exponential back-off strategy using pause on x86. This may not be necessary on ARM when using the wfe and sev pair.

  6. Cache hint / manipulation

    Before handing off memory from one thread to another, we in principle want to make sure that it is not in a modified state in our private cache. This is ideally done with a "push for sharing" instruction which will be coming to the next generation of server CPUs from Intel but is unfortunately not available at the moment. The next best option may be the clwb instruction which should be available on Skylake-X (i.e. "Skylake (server)") as well as Zen2. However, it is a mandatory control instead of a hint so it may actually have a negative impact on the performance. On ARM, dc cvac (AArch64) or dccmvac (AArch32) should have a similar function to clwb on x86.
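
    As a rough illustration of how such a write-back could be issued on x86, here is a minimal sketch assuming CLWB support and the immintrin.h intrinsics (the helper name and loop are for illustration only, not code from this repository):

    ```c++
    #include <immintrin.h> // _mm_clwb, _mm_sfence (compile with -mclwb)
    #include <cstddef>
    #include <cstdint>

    // Hypothetical helper: write back (without invalidating) every cache line of a
    // block before handing it to the consumer.  CLWB is a mandatory write-back
    // rather than a hint, so as noted above it may hurt performance instead of helping.
    static void writeback_block(const void *ptr, size_t size)
    {
        constexpr uintptr_t line = 64; // cache line size on current x86 CPUs
        auto p = reinterpret_cast<uintptr_t>(ptr) & ~(line - 1);
        auto end = reinterpret_cast<uintptr_t>(ptr) + size;
        for (; p < end; p += line)
            _mm_clwb(reinterpret_cast<const void*>(p));
        _mm_sfence(); // order the write-backs before the store that publishes the block
    }
    ```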

Note that we'll only test two threads here so the communication channel between them can use the full memory bandwidth available to a core. This will likely change when more threads are added.

  1. First, we'll test the case where there's no computation on either the consumer or the producer thread. This should tell us the maximum memory bandwidth available to us. We'll also test the case where no actual memory is accessed (i.e. simply passing the synchronization counter back and forth) to see what's the maximum stress we can put on the synchronization itself.

    For the backoff strategy, we use a loop of pause instructions on x86. As we'll see in the tests below, for most reasonable buffer sizes the synchronization isn't the most time consuming part, so we saw minimal differences between backoff strategies. We currently use a starting point of 8 pauses, with each iteration increasing the count by 1/4.
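
    As a concrete sketch, the back-off loop could look roughly like the following (the fixed-point bookkeeping and the names are illustrative assumptions, not the actual code in cpu-two-threads-ring-buffer.cpp):

    ```c++
    #include <immintrin.h> // _mm_pause
    #include <cstdint>

    // Hypothetical back-off helper: start at 8 pauses and grow the pause count by
    // 1/4 on every failed attempt, as described above.
    struct Backoff {
        // Keep the count in fixed point (value * 4) so the 1/4 increment stays exact.
        uint32_t count4 = 8 * 4;
        void pause()
        {
            for (uint32_t i = 0; i < count4 / 4; i++)
                _mm_pause();
            count4 += count4 / 4; // multiply by 1.25
        }
        void reset() // call once new buffer space has been acquired
        {
            count4 = 8 * 4;
        }
    };
    ```

    Each failed attempt to acquire new buffer space would call pause(), and a successful one would call reset().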

    I tested clwb, clflush and clflushopt on the machines that support them. All seem to yield much worse performance and much higher cache misses than without. I guess we'll basically have to wait for cldemote to have any useful cache hint on the producer thread.

    All the tests are run with the producer thread pinned to core 0 and the consumer thread pinned to core 1. I think this does not matter on current Intel CPUs since the core-to-core delays are fairly uniform. AMD (and likely ARM) CPUs may have clusters within the package/die, and one may need to be more careful about which threads are connected together.
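
    For reference, pinning a thread to a specific core can be done along these lines on Linux (a sketch assuming glibc/pthreads; the helper name is made up):

    ```c++
    #include <pthread.h>
    #include <sched.h> // CPU_ZERO/CPU_SET (exposed under _GNU_SOURCE, which g++ defines by default)

    // Hypothetical helper: pin the calling thread to a single CPU,
    // e.g. core 0 for the producer and core 1 for the consumer.
    static bool pin_to_cpu(int cpu)
    {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) == 0;
    }
    ```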

    We use two different implementations of the ring buffer. First is BlockRing, which only allows passing "blocks" of memory at a time; the ownership of each block is recorded as a separate flag (aligned to 64 bytes so that each flag is in a different cache line to reduce false sharing). Second is DataPipe, which allows the producer/consumer to transfer an arbitrary number of elements at a time. The ownership of the memory is recorded by two read/write pointers, with optimizations to minimize conflicting accesses.
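
    To make the BlockRing scheme more concrete, here is a minimal single-producer/single-consumer sketch of the idea described above (an illustration under these assumptions, not the implementation used in the tests):

    ```c++
    #include <atomic>
    #include <cstddef>
    #include <vector>

    // Sketch of a BlockRing-style buffer: the ring is split into fixed-size blocks
    // and each block's ownership is recorded in its own flag, aligned to 64 bytes
    // so that the flags sit in separate cache lines and avoid false sharing.
    struct BlockRing {
        struct alignas(64) Flag {
            std::atomic<bool> owned_by_consumer{false};
        };
        std::vector<float> data; // nblocks * block_size elements
        std::vector<Flag> flags; // one ownership flag per block
        size_t block_size;
        size_t wr_idx = 0; // next block the producer will try to acquire
        size_t rd_idx = 0; // next block the consumer will try to acquire
        // (A real implementation would also keep the two indices in separate
        // cache lines since each is only touched by one thread.)

        BlockRing(size_t nblocks, size_t blksz)
            : data(nblocks * blksz), flags(nblocks), block_size(blksz)
        {}

        // Producer: return the next writable block, or nullptr if the consumer
        // still owns it, in which case the caller backs off and retries.
        float *try_get_write_block()
        {
            if (flags[wr_idx].owned_by_consumer.load(std::memory_order_acquire))
                return nullptr;
            return &data[wr_idx * block_size];
        }
        void done_write_block()
        {
            flags[wr_idx].owned_by_consumer.store(true, std::memory_order_release);
            wr_idx = (wr_idx + 1) % flags.size();
        }

        // Consumer side: the same logic with the flag sense inverted.
        const float *try_get_read_block()
        {
            if (!flags[rd_idx].owned_by_consumer.load(std::memory_order_acquire))
                return nullptr;
            return &data[rd_idx * block_size];
        }
        void done_read_block()
        {
            flags[rd_idx].owned_by_consumer.store(false, std::memory_order_release);
            rd_idx = (rd_idx + 1) % flags.size();
        }
    };
    ```

    DataPipe, in contrast, shares a pair of read/write pointers between the two threads, so the transfer size is not tied to a fixed block.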

    In addition to the normal tests that use the buffer to do memory transfer (with the producer writing to the buffer and the consumer reading from it), we also do a test at each condition without any memory read/write on the buffer. This allows us to estimate the overhead due to the synchronization or due to the two threads fighting each other.

    We vary the granularity of the data transfer by changing the number of blocks. For BlockRing, this is simply the number of blocks in the ring. For DataPipe, the effective block size is the maximum allowed transfer size, and due to an implementation detail, the producer must remain at least one element behind the consumer. Because of this, BlockRing will have exactly one transfer per block (shown as R/W per Block in the plots below), whereas this number will be equal to or greater than one for the DataPipe implementation, which mostly happens when the producer catches up to the consumer. Other terminology used in the data below includes:

    • Pipe Stall

      When a read/write attempt does not immediately return enough memory.

    • Sync

      Attempts made during pipe stall (including the first one) to obtain new memory from the other thread. (It is so named since for DataPipe each additional attempt will cause the thread to sync its copy of the read/write pointers by reading from the shared one.) Each sync will trigger an increase in the backoff (by the factor of 1.25 mentioned above).

    We also vary the total buffer size. Note that since we would like to make use of all cores, it is buffer sizes below L3/core that are the most interesting to us.

    The code for the test is in cpu-two-threads-ring-buffer.cpp.

    1. Intel Core i9-10885H

      For the raw data, see these CSV files when using BlockRing (for read and write) with the corresponding no memory access test (for read and write), and using DataPipe (for read and write) with the corresponding no memory access test (for read and write).

      First, let's look at the performance.

      Performance

      The solid lines show the time it takes (per KiB of data) when the data is actually passed through the buffer whereas the dashed lines show the corresponding condition when there's no memory access on the buffer.

      Looking at the overhead in the no-memory-access data for small buffer sizes, it seems that the overhead for BlockRing depends more strongly on the number of blocks (potentially due to the larger number of flag variables) whereas the overhead for DataPipe is fairly constant. For block counts up to about 32, BlockRing seems to have a lower overhead. Possibly because of/related to this, the performance of BlockRing is slightly better over a wider range of buffer sizes below L3/core, even though the two have roughly equal performance for 1 to 4 MiB buffer sizes. We can also see the drop in performance when the buffer gets bigger than 4 MiB, before it reaches main memory, as we've previously seen in the write bandwidth tests.

      The maximum achieved bandwidth is about 25 B/ns, which is lower than the 34 B/ns measured for the write bandwidth but isn't that far from the 28 B/ns when writing to a larger buffer in L3. Maybe the drop in performance in both cases is due to interaction with other cores, L3 slices, the memory controller, or cache associativity.

      Next, let's check the values for some other counters.

      For BlockRing

      BlockRing

      For DataPipe

      DataPipe

      The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.

      It's unsurprising that there are cache misses when the buffer size is equal to or larger than L3. However, what's interesting is that the cache misses start as early as half the L3 size (also about the same 4 to 5 MiB threshold where we observed a decrease in L3 bandwidth before). The read (but not really the write) miss rate also depends significantly on the block count, with smaller blocks causing fewer misses. This trend isn't surprising since a smaller block means that it can be read by the consumer before being evicted from the cache, but I don't think this should happen when the whole buffer can fit in L3. Maybe the cache miss counter also includes some event when the data has to be transferred from a different L3 slice? Or maybe it's something related to how cache associativity works at the L3 level.

      For the R/W per block, the number for BlockRing is exactly 1 as expected. It is higher for DataPipe, and more so for fewer blocks. The write count seems to be consistently higher than the read count, suggesting that the producer is more likely to catch up to the consumer. This is despite the higher cache miss rate for the producer, which is a bit strange. The trend also seems to be fairly regular, with the write count for large buffer sizes approaching the read/normal number for half the block count (i.e. twice the block size).

      The pipe stall percentage clearly shows three different regimes.

      1. When the buffer is small, the throughput is limited by the synchronization overhead.

      2. When the buffer is big and there are a lot of cache misses, the throughput is limited by the main memory bandwidth.

      3. In between the two cases above, the throughput is limited by the L3 bandwidth. This is also the case we care about the most.

      Somehow, for cases 1 and 2 the producer can catch up with the consumer, whereas for case 3 the consumer catches up with the producer most of the time. I do understand that for the intermediate sizes where the synchronization overhead is insignificant, the consumer thread has an advantage since it can consume "hot" data that is still in the cache, which might help it catch up. However, I don't really see a good reason for the producer to consistently catch up in the other cases (maybe because stores don't have to wait for their results?), i.e. why the synchronization and cache miss overhead would hit the consumer thread harder than the producer thread.

      Comparing the two implementations regarding the stall percentage, DataPipe appears to have a lower stall percentage for case 3. However, since it issues more reads/writes, the total number of stalls is actually about the same. On the other hand, the write stall percentage seems to be much higher for DataPipe in this region, whereas it's almost zero for BlockRing. It also seems that for 8 to 64 blocks, there is a fairly wide range of buffer sizes for BlockRing where there are almost no stalls at all.

      Finally, we can look at the syncs per stall. There's a fairly strong anti-correlation between the syncs per stall and the stall count, so it's likely that most stalls actually involve a fairly small number of syncs but a few outliers hang for a long time. This could be related to other activity on the machine, or to the initial part of the run, so I'll not pay too much attention to it as long as it doesn't affect the performance too much.

    2. Intel Core i7-6700K

      For the raw data, see these CSV files when using BlockRing (for read and write) with the corresponding no memory access test (for read and write), and using DataPipe (for read and write) with the corresponding no memory access test (for read and write).

      First, the performance.

      Performance

      The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.

      And then the more detailed numbers.

      For BlockRing

      BlockRing

      For DataPipe

      DataPipe

      The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.

      Overall very similar, just faster. The average syncs per stall is lower. The R/W per block for DataPipe seems cleaner and shows the switch-over for large buffer sizes more clearly (still not exactly sure why...).

    3. Intel Core i9-7900X

      For the raw data, see these CSV files when using BlockRing (for read and write) with the corresponding no memory access test (for read and write), and using DataPipe (for read and write) with the corresponding no memory access test (for read and write).

      First, the performance.

      Performance

      The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.

      And then the more detailed numbers.

      For BlockRing

      BlockRing

      For DataPipe

      DataPipe

      The solid lines in each of the plots correspond to the number for the read/consumer thread, whereas the dashed line is for the write/producer thread.

      There are some differences this time. We did not measure to large enough buffer sizes for the producer to catch up to the consumer again. The pipe stalls for the producer also end at much lower buffer sizes but seem to go back up a little for 4 to 16 blocks.

    In general, using a buffer size that's about or slightly lower than L3/core, BlockRing seems to give slightly more consistent performance. A block count of about 16 also seems to be about the best tradeoff. Another slight advantage of BlockRing is that this way of managing the buffer is closer to what we will likely need to do for the GPU, since we have much less flexibility when scheduling a job there. Using it on the CPU would make the two more consistent, though there may not be too much code sharing.

  2. Next, let's run the tests between different CPU pairs to see if there's any structure in the cpu-cpu bandwidth.

    For all the tests we'll use a 1 MiB buffer and 16 blocks.

    The code for the test is in the same file as above, cpu-two-threads-ring-buffer.cpp, updated to allow changing the cores the threads are pinned to.

    1. Intel Core i9-10885H

      For the raw data, see these CSV files when using BlockRing (for read and write), and using DataPipe (for read and write).

      Performance

    2. Intel Core i7-6700K

      For the raw data, see these CSV files when using BlockRing (for read and write), and using DataPipe (for read and write).

      Performance

    3. Intel Core i9-7900X

      For the raw data, see these CSV files when using BlockRing (for read and write), and using DataPipe (for read and write).

      Performance

    I don't really see any obvious structure, which is what we expect, and the variation seems to be about 10 to 15%, so it doesn't matter that much. It does seem that writing from a lower-numbered core to a higher-numbered one somehow has slightly better bandwidth, especially between the "middle" cores (the upper right part in the middle is generally brighter than the lower left part), though this isn't very significant for the i9-7900X. Maybe it has something to do with the ring bus.

  3. Finally, let's add computation to the test.

    The producer thread will compute the sum of a number of sine functions (channels) (amp * sin(freq * t), the same as the one used in the single thread performance test). The consumer thread will read from the ring buffer and sum it with a potentially different number of sine functions. We will record the performance as the time it takes to compute each sine function on the thread with more channels.
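
    A scalar sketch of this per-thread workload is shown below (the real test uses a vectorized sine implementation; the function and parameter names here are made up for illustration):

    ```c++
    #include <cmath>
    #include <cstddef>

    // Each output sample is the sum of `nchn` sine channels, amp[c] * sin(freq[c] * t).
    // The consumer does the same computation but adds its channels onto the data it
    // reads from the ring buffer instead of writing to a fresh buffer.
    static void fill_sines(float *out, size_t nsamples, size_t t0,
                           const float *amp, const float *freq, size_t nchn)
    {
        for (size_t i = 0; i < nsamples; i++) {
            float t = float(t0 + i);
            float v = 0;
            for (size_t c = 0; c < nchn; c++)
                v += amp[c] * std::sin(freq[c] * t);
            out[i] = v;
        }
    }
    ```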

    We'll test using buffer sizes around and below L3/core, and we'll use only the BlockRing implementation since it gives more consistent results in this range.

    Optionally, each thread may have a local buffer for the sums of sine functions. When this is enabled, instead of using pause for backoff, the thread will first fill this local buffer with computational results, which will later be written to or added to the ring buffer to produce the final results. For simplicity, we fix the local buffer size to 32 KiB and the thread will check for ring buffer availability once after filling the first 16 KiB of it.
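
    As an illustration of how the local buffer could be used on the producer side, here is a sketch that builds on the BlockRing and fill_sines sketches above (the sizes, names, and fallback behavior are assumptions, not the code used in the test):

    ```c++
    #include <cassert>
    #include <cstddef>
    #include <cstring>

    constexpr size_t local_samples = 32 * 1024 / sizeof(float); // 32 KiB local buffer
    constexpr size_t check_samples = 16 * 1024 / sizeof(float); // recheck the ring every 16 KiB

    // Hypothetical producer step: if the ring is full, keep computing into the
    // local buffer and recheck the ring after each 16 KiB half instead of just
    // executing pause.
    void produce_block(BlockRing &ring, const float *amp, const float *freq,
                       size_t nchn, size_t &t)
    {
        assert(ring.block_size >= local_samples); // so the local buffer always fits in a block
        static thread_local float local[local_samples];
        size_t filled = 0;
        float *dst;
        while (!(dst = ring.try_get_write_block())) {
            if (filled < local_samples) {
                // The ring is still full: do useful work locally instead of spinning.
                fill_sines(&local[filled], check_samples, t + filled, amp, freq, nchn);
                filled += check_samples;
            }
            // (If the local buffer also fills up, a real implementation would fall
            // back to the pause-based back-off.)
        }
        // Copy whatever was buffered locally, then finish the block in place.
        std::memcpy(dst, local, filled * sizeof(float));
        fill_sines(dst + filled, ring.block_size - filled, t + filled, amp, freq, nchn);
        t += ring.block_size;
        ring.done_write_block();
    }
    ```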

    This test should be very close to what we'll actually do to generate the data. The real one will need more logic on each thread to generate the parameters and to accumulate the time/phase. There'll also be more threads, and at least some threads will need to consume and produce at the same time and potentially consume from more than one other thread.

    We run all the tests by writing from core 0 to core 1.

    The code for the test is in cpu-two-threads-ring-buffer-compute.cpp.

    For the plots below, the performance/throughput plots (left) use solid lines for no local buffer and dashed lines for a 32 KiB local buffer. The pipe stall plots (right) use solid lines for read with no local buffer, dotted lines for write with no local buffer, dashed lines for read with a 32 KiB local buffer, and dash-dotted lines for write with a 32 KiB local buffer. The colors are the same as what is used above.

    The red dotted horizontal line marks the performance calculated from the pure computation test. Note that due to a frequency change on the i7-6700K, the performance number for the red line there is scaled accordingly.

    1. Intel Core i9-10885H

      CPU frequency governor: powersave

      For the raw data, see these CSV files for read and write.

      With equal amount of computation on the producer and consumer.

      Balanced

      With 3 times more computation on the producer than the consumer.

      Write Limited

      With 3 times less computation on the producer than the consumer.

      Read Limited

    2. Intel Core i7-6700K

      CPU frequency governor: powersave. However, the frequency appears to be 4.6 GHz rather than the 4.11 GHz observed in the previous test. The performance limit below has been scaled accordingly, though it still seems to be off a little...

      For the raw data, see these CSV files for read and write.

      With equal amount of computation on the producer and consumer.

      Balanced

      (Note that the drop in the performance for 8 channels does not seem to be caused by the CPU frequency change.)

      With 3 times more computation on the producer than the consumer.

      Write Limited

      With 3 times less computation on the producer than the consumer.

      Read Limited

    3. Intel Core i9-7900X

      For the raw data, see these CSV files for read and write.

      With equal amount of computation on the producer and consumer.

      Balanced

      With 3 times more computation on the producer than the consumer.

      Write Limited

      With 3 times less computation on the producer than the consumer.

      Read Limited

    There are very few things worth discussing for each CPU separately, so I'll just summarize all of them together. The performance for a higher number of channels is much closer to the limit. This is especially true for smaller (about 512 KiB) buffers, maybe because of better cache usage? (Still, things have to go through L3, so I'm not sure.) The asymmetric cases get closer to the limit, likely because of the larger number of channels on one thread.

    The local buffer does seem to reduce the pipe stalls, though the effect on the overall performance is minimal. There don't seem to be too many stalls in general, which is good. Maybe we'll include a small one to help with some edge cases.

In general, there isn't anything too surprising and we are fairly close to the computation-limited throughput overall, which is the regime we want to be in. Unless there's some catastrophic interference when we add more pipes to each thread, we should be able to sustain this performance across all cores. The use of complex AVX512 instructions on all threads will of course downclock the CPUs on this generation, but we should still be able to generate more than 100 traps at 625 MS/s on the CPU alone.

Next we'll add more threads and test combining the computation on multiple threads to a single output.