
Single Thread CPU Performance


We can measure the performance of the computation on the CPU to get a reference for what the performance should be when we have multiple threads. The test is set up so that the CPU does all the computation without actually writing the results to memory. The computation is a simple a * sin(b * c). This is very similar to the computation for generating the sine wave, but it does not include the steps to advance the time/phase or to add the sine waves together.

This is done using the kernels (*::calc_dry) in helpers/cpu_kernel.h. The test itself is in cpu-single-thread-dry.cpp.
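
To give an idea of the structure (this is a rough sketch, not the actual code from helpers/cpu_kernel.h), a scalar version of such a dry-run loop could look like the following. The function name, the per-element input and the inline-asm trick used to keep the compiler from discarding the result are all assumptions; the real kernels are vectorized and use their own sin approximation:

    #include <cmath>
    #include <cstdint>

    // Rough scalar sketch of a "dry" kernel: evaluate a * sin(b * c) for many
    // elements without ever storing the per-element results, so that only the
    // computation itself is timed.
    void calc_dry_sketch(uint64_t nele, float amp, float freq, float phase_step)
    {
        for (uint64_t i = 0; i < nele; i++) {
            float c = phase_step * (float)i;    // hypothetical per-element input
            float v = amp * std::sin(freq * c); // the a * sin(b * c) computation
            // Pretend to consume the value in a register so the compiler keeps
            // the computation but never writes it to memory (GCC/Clang, x86 only).
            asm volatile("" :: "x"(v));
        }
    }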

For the accuracy of the sin functions used in this test and how they compare to other implementations, see the computation accuracy test.

  1. Intel Core i9-10885H

    CPU frequency governor: powersave

    Kernel: AVX2

    Compiler: GCC 10.2.0

    Kernel disassembly (with dependency chain depth labeled):

    0x1b08:   vmulps %ymm1,%ymm0,%ymm3 ; 1
    0x1b0c:   vcvtps2dq %ymm3,%ymm5 ; 2
    0x1b10:   vcvtdq2ps %ymm5,%ymm4 ; 3
    0x1b14:   vsubps %ymm4,%ymm3,%ymm3 ; 4
    0x1b18:   vpslld $0x1f,%ymm5,%ymm4 ; 3
    0x1b1d:   vmulps %ymm3,%ymm3,%ymm6 ; 5
    0x1b21:   vpxor  %ymm3,%ymm4,%ymm4 ; 5
    0x1b25:   vmovaps %ymm6,%ymm3 ; 5* (assume the move is free)
    0x1b29:   vfmadd132ps %ymm9,%ymm8,%ymm3 ; 6
    0x1b2e:   vfmadd132ps %ymm6,%ymm7,%ymm3 ; 7
    0x1b33:   vmulps %ymm4,%ymm6,%ymm6 ; 6
    0x1b37:   vfmadd132ps %ymm6,%ymm4,%ymm3 ; 8
    0x1b3c:   vmulps %ymm3,%ymm2,%ymm3 ; 9
    0x1b40:   add    $0x1,%rax
    0x1b44:   cmp    %rax,%rsi
    0x1b47:   jne    0x1b08

    Calculating 2^35 elements in total, each element takes 0.216 ns, 2.00 instructions and 0.772 cycles as measured in the program. This corresponds to a frequency of 3.57 GHz and an IPC of 2.59. Each loop iteration contains 13 AVX/AVX2 instructions (including a vpxor, which has a higher throughput, and a vmovaps, which is likely free) that form a long dependency chain (see the scalar sketch after this list). Since each AVX2 iteration processes 8 elements, one iteration takes approximately 0.772 × 8 ≈ 6.18 cycles to execute, which is pretty good given that there are only two floating point vector ALUs per core.

  2. Intel Core i7-6700K

    CPU frequency governor: powersave

    Kernel: AVX2

    Compiler: GCC 10.2.0

    Kernel disassembly: same as machine 1

    Calculating 2^35 elements in total, each element takes 0.190 ns, 2.00 instructions and 0.780 cycles as measured in the program. This corresponds to a frequency of 4.11 GHz and an IPC of 2.56. Each loop iteration is the same as on machine 1 but takes slightly longer, 6.24 cycles, compared to the i9-10885H.

  3. Intel Core i9-7900X

    CPU frequency governor: performance

    Kernel: AVX512

    Compiler: GCC 7.5.0

    Kernel disassembly (with dependency chain depth labeled):

    0x18b0:   vmulps %zmm1,%zmm0,%zmm3 ; 1
    0x18b6:   vcvtps2dq %zmm3,%zmm5 ; 2
    0x18bc:   vcvtdq2ps %zmm5,%zmm4 ; 3
    0x18c2:   vsubps %zmm4,%zmm3,%zmm3 ; 4
    0x18c8:   vpslld $0x1f,%zmm5,%zmm4 ; 3
    0x18cf:   vmulps %zmm3,%zmm3,%zmm6 ; 5
    0x18d5:   vpxorq %zmm3,%zmm4,%zmm4 ; 5
    0x18db:   vmovaps %zmm6,%zmm3 ; 5* (assume the move is free)
    0x18e1:   vfmadd132ps %zmm9,%zmm8,%zmm3 ; 6
    0x18e7:   vfmadd132ps %zmm6,%zmm7,%zmm3 ; 7
    0x18ed:   vmulps %zmm4,%zmm6,%zmm6 ; 6
    0x18f3:   vfmadd132ps %zmm6,%zmm4,%zmm3 ; 8
    0x18f9:   vmulps %zmm2,%zmm3,%zmm3 ; 9
    0x18ff:   add    $0x1,%rax
    0x1903:   cmp    %rax,%rsi
    0x1906:   jne    0x18b0

    Calculating 2^36 elements in total, each element takes 0.104 ns, 1.00 instructions and 0.415 cycles as measured in the program. This corresponds to a frequency of 3.99 GHz and an IPC of 2.41. The loop body is the same as the previous ones except for the change from ymm to zmm registers, so each AVX512 iteration processes 16 elements and takes approximately 0.415 × 16 ≈ 6.64 cycles to execute, which is slightly worse than even the i7-6700K but still pretty good.
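
This is the scalar sketch referenced in the machine 1 entry: reading one vector lane of the disassembly as C++ gives roughly the following. The coefficient names C1–C3 (with Taylor-series values used purely for illustration), the use of lrintf for the rounding step and the assumption that any remaining constant factors are folded into the scale/amplitude are my inferences from the assembly, not code taken from the repository:

    #include <cmath>
    #include <cstdint>
    #include <cstring>

    // Placeholder coefficients for illustration only; the real kernel
    // presumably uses a minimax fit and may use a different argument convention.
    static const float C1 = -1.6449f, C2 = 0.8117f, C3 = -0.1907f;

    // Hypothetical scalar reading of one lane of the vectorized kernel.
    float kernel_lane_sketch(float scale, float phase, float amp)
    {
        float x = scale * phase;              // vmulps              ; depth 1
        int32_t n = (int32_t)std::lrintf(x);  // vcvtps2dq           ; depth 2
        float r = x - (float)n;               // vcvtdq2ps + vsubps  ; depth 3-4
        uint32_t sign = (uint32_t)n << 31;    // vpslld $0x1f        ; depth 3
        float r2 = r * r;                     // vmulps              ; depth 5
        // vpxor: flip the sign of r when n is odd (bit trick via memcpy) ; depth 5
        uint32_t ri;
        std::memcpy(&ri, &r, 4);
        ri ^= sign;
        float sr;
        std::memcpy(&sr, &ri, 4);
        float p = r2 * C3 + C2;               // vfmadd132ps         ; depth 6
        p = p * r2 + C1;                      // vfmadd132ps         ; depth 7
        float sr3 = r2 * sr;                  // vmulps              ; depth 6
        p = p * sr3 + sr;                     // vfmadd132ps         ; depth 8
        return amp * p;                       // vmulps              ; depth 9
    }

The depth comments mirror the labels in the disassembly: within one iteration the final multiply sits at the end of a nine-step dependent chain, which is the long dependency chain discussed below.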

Other than the obvious frequency difference and the vector width, the loop body has exactly the same (kind and number of) instructions on each machine, thanks to an optimization inspired by clang (i.e. using a shift to generate the sign mask instead of using a compare). Each new generation of *Lake processor does seem to achieve a slightly higher IPC.

Based on the execution port limitation, I believe each iteration should take at least 5.5 cycles (the 11 more expensive AVX/AVX2/AVX512 instructions have to compete for the two execution ports). The difference between the observed timing and this limit is probably related to the throughput of the most expensive FMA instructions and the long dependency chain. We could potentially break up the dependency chain by changing how the polynomial is evaluated, but I'm happy enough with this for now. (If this is the only limit we hit, we should be able to generate about 134 traps on the i9-7900X.)
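
As a generic illustration of what "changing how the polynomial is evaluated" could mean (the kernel's current polynomial is short enough that the gain would be small, so this is a sketch of the technique rather than a proposed change), Estrin's scheme trades an extra multiply for a shorter dependency chain compared to Horner's rule:

    #include <cmath>

    // Evaluate p(y) = a0 + a1*y + a2*y^2 + a3*y^3 two ways.

    // Horner: three FMAs that each depend on the previous one (chain depth 3).
    float poly_horner(float y, float a0, float a1, float a2, float a3)
    {
        return std::fmaf(std::fmaf(std::fmaf(a3, y, a2), y, a1), y, a0);
    }

    // Estrin: one extra multiply, but the two inner FMAs and y*y are
    // independent of each other, so the chain depth drops to 2.
    float poly_estrin(float y, float a0, float a1, float a2, float a3)
    {
        float t0 = std::fmaf(a1, y, a0);
        float t1 = std::fmaf(a3, y, a2);
        float y2 = y * y;
        return std::fmaf(t1, y2, t0);
    }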

After measuring the pure computational throughput, let's measure the pure memory throughput.