From d303123c97667cded0f6d7b940e5addedff8a0a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= <edwin.torok@cloud.com>
Date: Mon, 9 Oct 2023 17:07:06 +0100
Subject: [PATCH] Benchmark: add a concurrent fixed work benchmark
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

As long as the number of threads is <= the number of CPUs, the time taken by 'fixed work' should stay the same regardless of the thread count.
However, it may take longer if there is overhead in dispatching work from the OCaml side.

Results on `Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz` (8 CPUs):
```
│  fixedwork/concurrent fixedwork:1             │             0.0000 mjw/run│             3.6585 mnw/run│       12831863.4328 ns/run│
│  fixedwork/concurrent fixedwork:16            │             0.0000 mjw/run│             6.5217 mnw/run│       45015232.3024 ns/run│
│  fixedwork/concurrent fixedwork:2             │             0.0000 mjw/run│             3.8462 mnw/run│       14234923.1372 ns/run│
│  fixedwork/concurrent fixedwork:4             │             0.0000 mjw/run│             4.2857 mnw/run│       16573979.6790 ns/run│
│  fixedwork/concurrent fixedwork:8             │             0.0000 mjw/run│             4.8387 mnw/run│       21940491.7677 ns/run│
│  fixedwork/fixedwork                          │             0.0000 mjw/run│             2.6316 mnw/run│       12746205.5882 ns/run│
```

The overhead with 8 threads is already significant, ~70% relative to 1 thread, and even 4 threads have ~30% overhead.
This machine had turbo enabled.
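
For reference, the overhead percentages quoted above can be rederived from the `ns/run` column. The snippet below is only an illustrative sketch and not part of the benchmark: the helper name `overhead_pct` is made up here, and it assumes the `concurrent fixedwork:1` row is the baseline. With the turbo-enabled figures it gives about 29% for 4 threads and about 71% for 8 threads.

```ocaml
(* ns/run values copied from the turbo-enabled results above.
   Assumption: "concurrent fixedwork:1" is used as the baseline. *)
let baseline_ns = 12831863.4328

(* Hypothetical helper: relative slowdown versus the baseline, in percent. *)
let overhead_pct ns = (ns -. baseline_ns) /. baseline_ns *. 100.

let () =
  List.iter
    (fun (threads, ns) ->
      Printf.printf "%2d threads: %.0f%% overhead\n" threads (overhead_pct ns))
    [
      (2, 14234923.1372)
    ; (4, 16573979.6790)
    ; (8, 21940491.7677)
    ; (16, 45015232.3024)
    ]
```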

After disabling turbo (and working around the bug in `xenpm` which requires rerunning `set-scaling-governor` after `disable-turbo-mode`):
```
╭─────────────────────────────────────┬───────────────────────────┬───────────────────────────┬───────────────────────────╮
│name                                 │  major-allocated          │  minor-allocated          │  monotonic-clock          │
├─────────────────────────────────────┼───────────────────────────┼───────────────────────────┼───────────────────────────┤
│  fixedwork/concurrent fixedwork:1   │             0.0000 mjw/run│             3.8462 mnw/run│       13525498.9640 ns/run│
│  fixedwork/concurrent fixedwork:16  │             0.0000 mjw/run│             7.1429 mnw/run│       49291752.0987 ns/run│
│  fixedwork/concurrent fixedwork:2   │             0.0000 mjw/run│             3.8462 mnw/run│       14284943.0644 ns/run│
│  fixedwork/concurrent fixedwork:4   │             0.0000 mjw/run│             4.5455 mnw/run│       19029750.6638 ns/run│
│  fixedwork/concurrent fixedwork:8   │             0.0000 mjw/run│             5.1724 mnw/run│       24823535.6315 ns/run│
│  fixedwork/fixedwork                │             0.0000 mjw/run│             2.7273 mnw/run│       13468350.0593 ns/run│
╰─────────────────────────────────────┴───────────────────────────┴───────────────────────────┴───────────────────────────╯
```

This machine isn't really suitable for benchmarking how XAPI scales: its thread switching overhead is too high, and does not match what is seen on the other machine (results further down).

Using 16 workers results in a massive slowdown, as expected: this machine only has 8 CPUs.

Results on `Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz`:
```
│  fixedwork/concurrent fixedwork:1             │             0.0000 mjw/run│             4.2857 mnw/run│       18232376.7910 ns/run│
│  fixedwork/concurrent fixedwork:16            │             0.0000 mjw/run│             4.5455 mnw/run│       19158780.9472 ns/run│
│  fixedwork/concurrent fixedwork:2             │             0.0000 mjw/run│             4.2857 mnw/run│       18165547.7630 ns/run│
│  fixedwork/concurrent fixedwork:4             │             0.0000 mjw/run│             4.5455 mnw/run│       18204930.7199 ns/run│
│  fixedwork/concurrent fixedwork:8             │             0.0000 mjw/run│             4.5455 mnw/run│       18293562.4699 ns/run│
│  fixedwork/fixedwork                          │             0.0000 mjw/run│             3.0612 mnw/run│       18095615.8910 ns/run│
```

Using 16 workers is fine here: this Dom0 has 16 vCPUs, and the overhead is ~5% compared to the single-threaded case.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
---
 ocaml/tests/bench/test_basics/ezbechamel_basics.ml | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/ocaml/tests/bench/test_basics/ezbechamel_basics.ml b/ocaml/tests/bench/test_basics/ezbechamel_basics.ml
index db299bcb261..ad3fcf4b023 100644
--- a/ocaml/tests/bench/test_basics/ezbechamel_basics.ml
+++ b/ocaml/tests/bench/test_basics/ezbechamel_basics.ml
@@ -32,11 +32,19 @@ let parallel_c_work () =
 
 let args = [1; 2; 4; 8; 16]
 
+open Ezbechamel_concurrent
+
 let () =
   Ezbechamel_alcotest_notty.run
     [
       Test.make ~name:"overhead" (Staged.stage ignore)
-    ; Test.make ~name:"fixedwork" (Staged.stage parallel_c_work)
+    ; Test.make_grouped ~name:"fixedwork"
+        [
+          Test.make ~name:"fixedwork" (Staged.stage parallel_c_work)
+        ; test_concurrently ~allocate:ignore ~free:ignore
+            ~name:"concurrent fixedwork"
+            Staged.(stage parallel_c_work)
+        ]
     ; Test.make_indexed ~name:"Thread create/join" ~args (fun n ->
           Staged.stage @@ fun () ->
           let threads = Array.init n @@ Thread.create ignore in
@@ -60,7 +68,9 @@ let () =
         ; test_barrier (module BarrierBinary)
         ; test_barrier (module BarrierCounting)
         ; test_barrier (module BarrierBinaryArray)
-        ; Ezbechamel_concurrent.test_concurrently ~allocate:ignore ~free:ignore ~name:"concurrent workers" Staged.(stage ignore)
+        ; test_concurrently ~allocate:ignore ~free:ignore
+            ~name:"concurrent workers"
+            Staged.(stage ignore)
         ; test_barrier (module BarrierYield)
         ]
       )