Benchmark: add a concurrent fixed work benchmark
As long as the number of threads is <= the number of CPUs, the time taken by 'fixed work' should be the same. However, it may be more if there is overhead in dispatching work from the OCaml side.

Results on `Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz` (8 CPUs):

```
│ fixedwork/concurrent fixedwork:1    │             0.0000 mjw/run│             3.6585 mnw/run│       12831863.4328 ns/run│
│ fixedwork/concurrent fixedwork:16   │             0.0000 mjw/run│             6.5217 mnw/run│       45015232.3024 ns/run│
│ fixedwork/concurrent fixedwork:2    │             0.0000 mjw/run│             3.8462 mnw/run│       14234923.1372 ns/run│
│ fixedwork/concurrent fixedwork:4    │             0.0000 mjw/run│             4.2857 mnw/run│       16573979.6790 ns/run│
│ fixedwork/concurrent fixedwork:8    │             0.0000 mjw/run│             4.8387 mnw/run│       21940491.7677 ns/run│
│ fixedwork/fixedwork                 │             0.0000 mjw/run│             2.6316 mnw/run│       12746205.5882 ns/run│
```

Overhead with 8 threads is already quite significant: ~70%, and even 4 threads has 30% overhead.

This machine had turbo enabled. After disabling turbo (and working around the bug in `xenpm` which requires rerunning `set-scaling-governor` after `disable-turbo-mode`):

```
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│ fixedwork/concurrent fixedwork:1    │             0.0000 mjw/run│             3.8462 mnw/run│       13525498.9640 ns/run│
│ fixedwork/concurrent fixedwork:16   │             0.0000 mjw/run│             7.1429 mnw/run│       49291752.0987 ns/run│
│ fixedwork/concurrent fixedwork:2    │             0.0000 mjw/run│             3.8462 mnw/run│       14284943.0644 ns/run│
│ fixedwork/concurrent fixedwork:4    │             0.0000 mjw/run│             4.5455 mnw/run│       19029750.6638 ns/run│
│ fixedwork/concurrent fixedwork:8    │             0.0000 mjw/run│             5.1724 mnw/run│       24823535.6315 ns/run│
│ fixedwork/fixedwork                 │             0.0000 mjw/run│             2.7273 mnw/run│       13468350.0593 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯
```

This machine isn't really suitable for benchmarking how XAPI scales: thread switching overhead is too high, and is not what is seen on the other machine. Using 16 workers results in a massive slowdown, as expected.

Results on `Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz`:

```
│ fixedwork/concurrent fixedwork:1    │             0.0000 mjw/run│             4.2857 mnw/run│       18232376.7910 ns/run│
│ fixedwork/concurrent fixedwork:16   │             0.0000 mjw/run│             4.5455 mnw/run│       19158780.9472 ns/run│
│ fixedwork/concurrent fixedwork:2    │             0.0000 mjw/run│             4.2857 mnw/run│       18165547.7630 ns/run│
│ fixedwork/concurrent fixedwork:4    │             0.0000 mjw/run│             4.5455 mnw/run│       18204930.7199 ns/run│
│ fixedwork/concurrent fixedwork:8    │             0.0000 mjw/run│             4.5455 mnw/run│       18293562.4699 ns/run│
│ fixedwork/fixedwork                 │             0.0000 mjw/run│             3.0612 mnw/run│       18095615.8910 ns/run│
```

Using 16 workers is fine here: this Dom0 has 16 vCPUs, and the overhead is ~5% compared to the single threaded case.

Signed-off-by: Edwin Török <[email protected]>
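
The overhead percentages quoted above can be rechecked from the `ns/run` column. The following is a minimal sketch in Python, not part of the benchmark itself: the helper name `overhead_pct` is hypothetical, the numbers are copied from the turbo-enabled E3-1230 v6 run and the Gold 6126 run, and the choice of baseline (`fixedwork/fixedwork` for the E3-1230 v6, `concurrent fixedwork:1` for the Gold 6126) is an assumption about how the quoted figures were derived.

```python
def overhead_pct(ns_per_run: float, baseline_ns: float) -> float:
    """Extra time relative to the baseline, as a percentage."""
    return (ns_per_run / baseline_ns - 1.0) * 100.0


# monotonic-clock values (ns/run) copied from the results above.
# Assumed baselines: fixedwork/fixedwork (E3-1230 v6, turbo enabled)
# and concurrent fixedwork:1 (Gold 6126).
E3_TURBO = {"baseline": 12746205.5882, 4: 16573979.6790, 8: 21940491.7677}
GOLD = {"baseline": 18232376.7910, 16: 19158780.9472}

for threads in (4, 8):
    pct = overhead_pct(E3_TURBO[threads], E3_TURBO["baseline"])
    print(f"E3-1230 v6, {threads} threads: {pct:.1f}% overhead")

pct = overhead_pct(GOLD[16], GOLD["baseline"])
print(f"Gold 6126, 16 threads: {pct:.1f}% overhead")
```

The same function applied to the turbo-disabled table gives the corresponding figures for that run; only the input numbers change.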