Single Thread CPU Performance
We can measure the performance of the computation on the CPU
in order to get a reference for what the performance should be when we have multiple threads.
We do this in a way to make sure the CPU does all the computation
without actually writing the result to memory.
The computation we do is a simple a * sin(b * c).
This is very similar to the computation for generating the sine wave but does not include
the steps that advance the time/phase or add the sine waves together.
This is done using the kernels (*::calc_dry) in helpers/cpu_kernel.h.
The test itself is in cpu-single-thread-dry.cpp.
For the accuracy of the sin functions used in this test
and how they compare to other implementations,
see the computation accuracy test.
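A minimal scalar sketch of how a loop can force the computation without storing the result is shown here. It is not the code in helpers/cpu_kernel.h: the name calc_dry_sketch, the FP_IN/FP_INOUT macros and the use of std::sin in place of the vectorized sin approximation are all assumptions made for illustration, and GCC/Clang extended asm is assumed.

```cpp
#include <cmath>
#include <cstddef>

// Register constraint for a scalar float (GCC/Clang extended asm assumed).
#if defined(__x86_64__)
#define FP_IN "x"      // any SSE/AVX register
#define FP_INOUT "+x"
#elif defined(__aarch64__)
#define FP_IN "w"      // any SIMD/FP register
#define FP_INOUT "+w"
#else
#define FP_IN "r"      // fallback: general register, may cost an extra move
#define FP_INOUT "+r"
#endif

// Hypothetical scalar stand-in for a *::calc_dry kernel.
// It returns the last value only so that the sketch can be checked.
inline float calc_dry_sketch(std::size_t n, float a, float b, float c)
{
    float res = 0;
    for (std::size_t i = 0; i < n; i++) {
        // Empty asm that claims to modify the inputs, so the compiler
        // cannot hoist the computation out of the loop.
        asm volatile("" : FP_INOUT(a), FP_INOUT(b), FP_INOUT(c));
        res = a * std::sin(b * c);
        // Empty asm that claims to read the result, so the value has to be
        // computed into a register; the asm itself needs no store.
        asm volatile("" :: FP_IN(res));
    }
    return res;
}
```

With optimization enabled, a call such as calc_dry_sketch(1000, 2.0f, 0.5f, 1.0f) should evaluate the expression 1000 times while leaving the loop body free of memory writes for the result.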
- Intel Core i9-10885H
CPU frequency governor: powersave
Kernel: AVX2
Compiler: GCC 10.2.0
Kernel disassembly (with dependency chain depth labeled):
    0x1b08: vmulps %ymm1,%ymm0,%ymm3        ; 1
    0x1b0c: vcvtps2dq %ymm3,%ymm5           ; 2
    0x1b10: vcvtdq2ps %ymm5,%ymm4           ; 3
    0x1b14: vsubps %ymm4,%ymm3,%ymm3        ; 4
    0x1b18: vpslld $0x1f,%ymm5,%ymm4        ; 3
    0x1b1d: vmulps %ymm3,%ymm3,%ymm6        ; 5
    0x1b21: vpxor %ymm3,%ymm4,%ymm4         ; 5
    0x1b25: vmovaps %ymm6,%ymm3             ; 5* (assume the move is free)
    0x1b29: vfmadd132ps %ymm9,%ymm8,%ymm3   ; 6
    0x1b2e: vfmadd132ps %ymm6,%ymm7,%ymm3   ; 7
    0x1b33: vmulps %ymm4,%ymm6,%ymm6        ; 6
    0x1b37: vfmadd132ps %ymm6,%ymm4,%ymm3   ; 8
    0x1b3c: vmulps %ymm3,%ymm2,%ymm3        ; 9
    0x1b40: add $0x1,%rax
    0x1b44: cmp %rax,%rsi
    0x1b47: jne 0x1b08
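One way to read the listing, lane by lane, is the scalar sketch below. It is a reading of the disassembly, not the repository's source. The function name sin_lane_sketch is hypothetical, and the three coefficients are plain Taylor terms of sin(pi * t) / pi chosen for illustration, since the actual constants held in ymm7 to ymm9 are not visible in the listing. The comments give the matching instruction and its dependency chain depth.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one vector lane of the loop body (hypothetical helper).
// k7..k9 are Taylor placeholders, NOT the constants used by the real kernel.
inline float sin_lane_sketch(float a, float b, float c)
{
    const float k7 = -1.6449341f;  // -pi^2 / 6
    const float k8 = 0.8117424f;   //  pi^4 / 120
    const float k9 = -0.1907518f;  // -pi^6 / 5040

    float t = b * c;                                   // vmulps            (1)
    int32_t n = static_cast<int32_t>(std::lrintf(t));  // vcvtps2dq         (2)
    float x = t - static_cast<float>(n);               // vcvtdq2ps, vsubps (3, 4)
    uint32_t flip = static_cast<uint32_t>(n) << 31;    // vpslld            (3)
    float x2 = x * x;                                  // vmulps            (5)

    // Flip the sign of x when n is odd by XOR-ing the sign bit.
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits ^= flip;                                      // vpxor             (5)
    float xs;
    std::memcpy(&xs, &bits, sizeof xs);

    float p = std::fma(x2, k9, k8);                    // vfmadd132ps       (6)
    p = std::fma(p, x2, k7);                           // vfmadd132ps       (7)
    float x3 = x2 * xs;                                // vmulps            (6)
    float s = std::fma(p, x3, xs);                     // vfmadd132ps       (8)
    return a * s;                                      // vmulps            (9)
}
```

With these placeholder coefficients the function approximates a * sin(pi * t) / pi for t = b * c, with an error of roughly 5e-5 * |a| at the edge of the reduced range. The shift by 31 moves the lowest bit of the rounded integer into the sign position, which is the shift-based mask mentioned later in this section.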
Calculating 2^35 elements in total, each element takes 0.216 ns, 2.00 insts and 0.772 cycles as measured in the program. This corresponds to a frequency of 3.57 GHz and an IPC of 2.59. Within each loop, there are 13 AVX/AVX2 instructions (including a vpxor which has a higher throughput and a vmovaps which is likely free) which form a long dependency chain. With 8 single-precision elements per ymm register, each iteration takes approximately 6.18 cycles to execute, which is pretty good given that there are only two floating point vector ALUs per core.
- Intel Core i7-6700K
CPU frequency governor: powersave
Kernel: AVX2
Compiler: GCC 10.2.0
Kernel disassembly: same as machine 1
Calculating 2^35 elements in total, each element takes 0.190 ns, 2.00 insts and 0.780 cycles as measured in the program. This corresponds to a frequency of 4.11 GHz and an IPC of 2.56. Each loop iteration is the same as on machine 1 but takes slightly longer: 6.24 cycles, compared to 6.18 on the i9-10885H.
- Intel Core i9-7900X
CPU frequency governor: performance
Kernel: AVX512
Compiler: GCC 7.5.0
Kernel disassembly (with dependency chain depth labeled):
    0x18b0: vmulps %zmm1,%zmm0,%zmm3        ; 1
    0x18b6: vcvtps2dq %zmm3,%zmm5           ; 2
    0x18bc: vcvtdq2ps %zmm5,%zmm4           ; 3
    0x18c2: vsubps %zmm4,%zmm3,%zmm3        ; 4
    0x18c8: vpslld $0x1f,%zmm5,%zmm4        ; 3
    0x18cf: vmulps %zmm3,%zmm3,%zmm6        ; 5
    0x18d5: vpxorq %zmm3,%zmm4,%zmm4        ; 5
    0x18db: vmovaps %zmm6,%zmm3             ; 5* (assume the move is free)
    0x18e1: vfmadd132ps %zmm9,%zmm8,%zmm3   ; 6
    0x18e7: vfmadd132ps %zmm6,%zmm7,%zmm3   ; 7
    0x18ed: vmulps %zmm4,%zmm6,%zmm6        ; 6
    0x18f3: vfmadd132ps %zmm6,%zmm4,%zmm3   ; 8
    0x18f9: vmulps %zmm2,%zmm3,%zmm3        ; 9
    0x18ff: add $0x1,%rax
    0x1903: cmp %rax,%rsi
    0x1906: jne 0x18b0
Calculating 2^36 elements in total, each element takes 0.104 ns, 1.00 insts and 0.415 cycles as measured in the program. This corresponds to a frequency of 3.99 GHz and an IPC of 2.41. The loop body is also the same as the previous ones except for the change from ymms to zmms. With 16 single-precision elements per zmm register, each iteration takes approximately 6.64 cycles to execute, which is slightly worse than the i7-6700K but still pretty good.
Other than the obvious frequency difference and the vector width, the loop body has exactly the same (kind and number of) instructions on each machine, thanks to an optimization inspired by clang (i.e. using a shift to generate the mask, instead of using compare). Each new generation of *Lake processor does seem to get slightly better at increasing the IPC.
Based on the limitation from execution ports, I believe each iteration
should take at least 5.5 cycles (the 11 more expensive AVX/AVX2/AVX512 instructions
need to fight for the two execution ports).
The difference between the observed timing and this limit is probably related to
the throughput of the most expensive FMA instructions and the long dependency chain.
We can potentially break down the dependency chain by changing how the polynomial is evaluated
but I'm happy enough for now. (If this is the only limit we hit,
we should be able to generate about 134 traps on the i9-7900X.)
After measuring the pure computational throughput, let's measure the pure memory throughput.