Improve timers (lower overhead using rdtsc) and profile additional Fortran components (other than MEs) #962

Open
wants to merge 96 commits into
base: master

@valassi valassi commented Aug 10, 2024

This is very much a WIP PR extending the work in #960. It is again related to the CMS issue #943 reported by @choij1589.

The idea is to further improve the timers and to profile other Fortran components.

I added profiling to:

  • the translation from random numbers to momenta in phase space sampling, i.e. x_to_f_args (X2F)
  • nnevolvepdf (PDF)
  • sample_put_point (I/O)

The first two are related to the findings, with nice flamegraphs, by @Qubitol.

So far I only had time to look at a simple gg_tt. Even here the results are quite interesting:

  • sampling is actually not much (but it is there)
  • PDF is a lot (even in ggtt, I thought it would only appear in pp)
  • I/O is also potentially a lot

The problem is also that the overhead from the counters themselves starts being important, so it is difficult to do this well. The I/O counters in particular have a large overhead.

Anyway, this is WIP. For ggtt

    ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
     [COUNTERS] PROGRAM TOTAL          :    0.7442s
     [COUNTERS] Fortran Overhead ( 0 ) :    0.2437s
     [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0871s for    16384 events => throughput is 5.32E-06 events/s
     [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
     [COUNTERS] Fortran X2F      ( 4 ) :    0.0162s for    16399 events => throughput is 9.86E-07 events/s
     [COUNTERS] Fortran PDF      ( 5 ) :    0.1335s for    98304 events => throughput is 1.36E-06 events/s
     [COUNTERS] Fortran I/O      ( 6 ) :    0.2629s for    16399 events => throughput is 1.60E-05 events/s
    
    ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
     [COUNTERS] PROGRAM TOTAL          :    1.9099s
     [COUNTERS] Fortran Overhead ( 0 ) :    0.3233s
     [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5203s for    98304 events => throughput is 5.29E-06 events/s
     [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
     [COUNTERS] Fortran X2F      ( 4 ) :    0.0956s for    98371 events => throughput is 9.71E-07 events/s
     [COUNTERS] Fortran PDF      ( 5 ) :    0.7980s for   589824 events => throughput is 1.35E-06 events/s
     [COUNTERS] Fortran I/O      ( 6 ) :    0.1719s for    98371 events => throughput is 1.75E-06 events/s

But very often the same command takes 4s, so these timings are not very reliable...

@valassi valassi self-assigned this Aug 10, 2024
…toring of counters using maps and explicit register methods
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4510s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.3466s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0871s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0164s for    16399 events => throughput is 1.00E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
INFO: No Floating Point Exceptions have been reported
 [COUNTERS] PROGRAM TOTAL          :    1.9073s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.2890s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5218s for    98304 events => throughput is 5.31E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0958s for    98371 events => throughput is 9.74E-07 events/s
…ke cleanall and rebuild)

Note: the counter itself has a huge overhead...

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7742s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.5162s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0906s for    16384 events => throughput is 5.53E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0174s for    16399 events => throughput is 1.06E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1493s for    98304 events => throughput is 1.52E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1335s
 [COUNTERS] Fortran Overhead ( 0 ) :    2.6717s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5176s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0961s for    98371 events => throughput is 9.77E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8474s for   589824 events => throughput is 1.44E-06 events/s
…ain, to reduce performance overhead from counters themselves

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4700s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.2236s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0867s for    16384 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0162s for    16399 events => throughput is 9.88E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1428s for    98304 events => throughput is 1.45E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9569s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.4895s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5181s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0958s for    98371 events => throughput is 9.74E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8528s for   589824 events => throughput is 1.45E-06 events/s
…points

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7442s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2437s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0871s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0162s for    16399 events => throughput is 9.86E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1335s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2629s for    16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9099s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3233s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5203s for    98304 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0956s for    98371 events => throughput is 9.71E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7980s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1719s for    98371 events => throughput is 1.75E-06 events/s
@valassi
Member Author

valassi commented Aug 10, 2024

Note: I should now rename 'Fortran Overhead' as 'Other' (i.e. not PDF, not X2F, not I/O?)... I suspect it is related to I/O anyway.

@valassi
Member Author

valassi commented Aug 10, 2024

The instrumentation of sample_put_point is here
22ce65a

NB: there is some hysteresis: the timing results depend on what was executed before.
For instance, x1 results may be 0.7s or 1.5s, and x10 results may be 1.9s or 4.1s: this does NOT depend on the software version!

Start with x1 and run it several times: eventually it gives 0.7s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7417s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2435s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0861s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1345s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2603s for    16399 events => throughput is 1.59E-05 events/s

Then the FIRST execution of x10 gives 1.9s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9285s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3277s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5237s for    98304 events => throughput is 5.33E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0964s for    98371 events => throughput is 9.80E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8057s for   589824 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1741s for    98371 events => throughput is 1.77E-06 events/s

But the SECOND execution gives 4.1s! With the big increase coming from the I/O part
(And any subsequent execution also gives the same)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1048s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.1119s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5161s for    98304 events => throughput is 5.25E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0946s for    98371 events => throughput is 9.62E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7954s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    1.5861s for    98371 events => throughput is 1.61E-05 events/s

Now the FIRST execution of x1 gives 1.5s!
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.4677s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.5601s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0861s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0167s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1338s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.6702s for    16399 events => throughput is 4.09E-05 events/s

But the SECOND execution gives again 0.7s! And all subsequent executions too (so we are back at the beginning of the loop above)
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7480s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2472s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0870s for    16384 events => throughput is 5.31E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1337s for    98304 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2628s for    16399 events => throughput is 1.60E-05 events/s

In the following, I will quote results for the second x1 and the first x10 only...
…een defined

I had done this to try and decrease the 4.1s... but in the meantime I understood the problem is elsewhere.
In particular, this is not faster than string comparison - will revert!

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7451s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2426s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0875s for    16384 events => throughput is 5.34E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0170s for    16399 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1342s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2631s for    16399 events => throughput is 1.60E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.8970s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3151s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5182s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0007s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0952s for    98371 events => throughput is 9.67E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7950s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1729s for    98371 events => throughput is 1.76E-06 events/s
…g if a counter has been defined: use string comparison to "", it is not slower

Revert "[prof] in gg_tt.mad counters.cc add a flag showing if a counter has been defined"
This reverts commit ee6f9f5.
…BLECOUNTERS to disable individual counters

I initially wanted to use this to check if it is the individual counters that caused the 4.1s in x10 tests.
But in the meantime I understood that the problem is elsewhere, and that timings depend on execution order! Will probably revert!

Note, the second x1 execution takes 0.7s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7485s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.2472s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.0872s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.1346s for    98304 events => throughput is 1.37E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.2621s for    16399 events => throughput is 1.60E-05 events/s

CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    0.7349s

And then the first x10 execution takes 1.9s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.9127s
 [COUNTERS] Fortran Overhead ( 0 ) :    0.3268s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5172s for    98304 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0964s for    98371 events => throughput is 9.80E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.7992s for   589824 events => throughput is 1.36E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    0.1723s for    98371 events => throughput is 1.75E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    1.8511s

While the SECOND execution of x10 takes 4.1s, with or without CUDACPP_RUNTIME_DISABLECOUNTERS
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.1152s
 [COUNTERS] Fortran Overhead ( 0 ) :    1.1174s
 [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5173s for    98304 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0008s
 [COUNTERS] Fortran X2F      ( 4 ) :    0.0950s for    98371 events => throughput is 9.65E-07 events/s
 [COUNTERS] Fortran PDF      ( 5 ) :    0.8117s for   589824 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O      ( 6 ) :    1.5731s for    98371 events => throughput is 1.60E-05 events/s

CUDACPP_RUNTIME_DISABLECOUNTERS=1 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL          :    4.0680s

Will therefore revert this
…CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters

Revert "[prof] in gg_tt.mad counters add an env variable CUDACPP_RUNTIME_DISABLECOUNTERS to disable individual counters"
This reverts commit 0681a76.
…ther and make it counter[0]

No change in the timings

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    0.7531s
 [COUNTERS] Fortran Other    (  0 ) :    0.2447s
 [COUNTERS] CudaCpp MEs      (  2 ) :    0.0862s for    16384 events => throughput is 5.26E-06 events/s
 [COUNTERS] CudaCpp HEL      (  3 ) :    0.0007s
 [COUNTERS] Fortran X2F      (  4 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF      (  5 ) :    0.1395s for    98304 events => throughput is 1.42E-06 events/s
 [COUNTERS] Fortran I/O      (  6 ) :    0.2653s for    16399 events => throughput is 1.62E-05 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    1.9572s
 [COUNTERS] Fortran Other    (  0 ) :    0.3215s
 [COUNTERS] CudaCpp MEs      (  2 ) :    0.5202s for    98304 events => throughput is 5.29E-06 events/s
 [COUNTERS] CudaCpp HEL      (  3 ) :    0.0007s
 [COUNTERS] Fortran X2F      (  4 ) :    0.0941s for    98371 events => throughput is 9.57E-07 events/s
 [COUNTERS] Fortran PDF      (  5 ) :    0.8486s for   589824 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  6 ) :    0.1720s for    98371 events => throughput is 1.75E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    0.7543s
 [COUNTERS] Fortran Other    (  0 ) :    0.2451s
 [COUNTERS] Fortran X2F      (  1 ) :    0.0163s for    16399 events => throughput is 9.95E-07 events/s
 [COUNTERS] Fortran PDF      (  2 ) :    0.1419s for    98304 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  3 ) :    0.2617s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL      (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs      (  6 ) :    0.0885s for    16384 events => throughput is 5.40E-06 events/s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL           :    1.9649s
 [COUNTERS] Fortran Other    (  0 ) :    0.3239s
 [COUNTERS] Fortran X2F      (  1 ) :    0.0951s for    98371 events => throughput is 9.67E-07 events/s
 [COUNTERS] Fortran PDF      (  2 ) :    0.8467s for   589824 events => throughput is 1.44E-06 events/s
 [COUNTERS] Fortran I/O      (  3 ) :    0.1783s for    98371 events => throughput is 1.81E-06 events/s
 [COUNTERS] CudaCpp HEL      (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs      (  6 ) :    0.5202s for    98304 events => throughput is 5.29E-06 events/s
…xcluded from fortran other calculation)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7510s
 [COUNTERS] Fortran Other        (  0 ) :    0.2485s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0163s for    16399 events => throughput is 9.94E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1359s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.2628s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0868s for    16384 events => throughput is 5.30E-06 events/s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6822s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    1.9135s
 [COUNTERS] Fortran Other        (  0 ) :    0.3225s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0938s for    98371 events => throughput is 9.54E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.7961s for   589824 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.1819s for    98371 events => throughput is 1.85E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5184s for    98304 events => throughput is 5.27E-06 events/s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.8445s
… that what is left is something inside sample_full

Rephrasing: PROGRAM TOTAL = sample_full + initial_I/O
And 'Fortran Other' is inside sample_full

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7697s
 [COUNTERS] Fortran Other        (  0 ) :    0.1810s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0166s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1355s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.2672s for    16399 events => throughput is 1.63E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0877s for    16384 events => throughput is 5.35E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0808s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6860s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    2.0621s
 [COUNTERS] Fortran Other        (  0 ) :    0.2829s
 [COUNTERS] Fortran X2F          (  1 ) :    0.1024s for    98371 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.8580s for   589824 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran I/O          (  3 ) :    0.1838s for    98371 events => throughput is 1.87E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5532s for    98304 events => throughput is 5.63E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0811s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.9780s
…side the function to the calling sequence in sample_full

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7679s
 [COUNTERS] Fortran Other        (  0 ) :    0.1849s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0169s for    16399 events => throughput is 1.03E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1380s for    98304 events => throughput is 1.40E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2611s for    16399 events => throughput is 1.59E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0877s for    16384 events => throughput is 5.35E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0785s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6862s

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x10_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    1.9454s
 [COUNTERS] Fortran Other        (  0 ) :    0.2618s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0961s for    98371 events => throughput is 9.77E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.8161s for   589824 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.1695s for    98371 events => throughput is 1.72E-06 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.5216s for    98304 events => throughput is 5.31E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0794s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    1.8627s
…ing (as "test12" for the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7447s
 [COUNTERS] Fortran Other        (  0 ) :    0.1308s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0163s for    16399 events => throughput is 9.93E-07 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1328s for    98304 events => throughput is 1.35E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2614s for    16399 events => throughput is 1.59E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0878s for    16384 events => throughput is 5.36E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0649s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6768s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0499s for    16384 events => throughput is 3.05E-06 events/s
…or the moment, wip)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7526s
 [COUNTERS] Fortran Other        (  0 ) :    0.1163s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1428s for    98304 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2589s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0870s for    16384 events => throughput is 5.31E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0659s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6829s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0537s for    16384 events => throughput is 3.28E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0108s for    16384 events => throughput is 6.58E-07 events/s
This essentially completes the identification of all bottlenecks.
Must now clean up the timers (and remove double counting, "Fortran Other" is now negative?)

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7581s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0298s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1441s for    98304 events => throughput is 1.47E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2627s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0882s for    16384 events => throughput is 5.38E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0656s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6896s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0105s for    16384 events => throughput is 6.41E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1461s for    16384 events => throughput is 8.91E-06 events/s
./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7519s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0299s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.01E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1421s for    98304 events => throughput is 1.45E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2589s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0873s for    16384 events => throughput is 5.33E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0651s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6838s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0542s for    16384 events => throughput is 3.31E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0102s for    16384 events => throughput is 6.26E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1467s for    16384 events => throughput is 8.95E-06 events/s
…er.f

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7533s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0253s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0165s for    16399 events => throughput is 1.00E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.1355s for    98304 events => throughput is 1.38E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2633s for    16399 events => throughput is 1.61E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0897s for    16384 events => throughput is 5.48E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0649s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6855s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0490s for    16384 events => throughput is 2.99E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0102s for    16384 events => throughput is 6.20E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1488s for    16384 events => throughput is 9.08E-06 events/s
…g1.f

This changes the overall balance: 'Fortran Other' is now positive again.
This is because pdg2pdf is also called from elsewhere (e.g. in unwgt?), in code which was already profiled by another counter.

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7551s
 [COUNTERS] Fortran Other        (  0 ) :    0.0111s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0986s for    32768 events => throughput is 3.01E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2633s for    16399 events => throughput is 1.61E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0879s for    16384 events => throughput is 5.36E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0662s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6862s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0515s for    16384 events => throughput is 3.14E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0099s for    16384 events => throughput is 6.07E-07 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1492s for    16384 events => throughput is 9.11E-06 events/s
Now "Fortran Other" becomes negative again, there is again some double counting

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7511s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0373s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0168s for    16399 events => throughput is 1.02E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0965s for    32768 events => throughput is 2.94E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2598s for    16399 events => throughput is 1.58E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0008s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0868s for    16384 events => throughput is 5.30E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0670s
 [COUNTERS] PROGRAM sample_full  ( 11 ) :    0.6811s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0506s for    16384 events => throughput is 3.09E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0099s for    16384 events => throughput is 6.01E-07 events/s
 [COUNTERS] Fortran TEST3        ( 14 ) :    0.0541s for    16384 events => throughput is 3.30E-06 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1462s for    16384 events => throughput is 8.93E-06 events/s
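
The negative value is what one would expect if "Fortran Other" is a residual, i.e. the program total minus the sum of all explicitly timed regions. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and the residual definition is an assumption inferred from the numbers in the log above, not taken from counters.cc.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Residual "Other" time: program total minus the sum of all explicitly timed regions.
// (Assumption: this is how "Fortran Other" is derived; the name is hypothetical.)
// A negative result means that some timed regions overlap, i.e. one counter
// is running inside another and the same time is counted twice.
inline double residualOtherSeconds( double programTotal, const std::vector<double>& counters )
{
  assert( programTotal >= 0 );
  return programTotal - std::accumulate( counters.begin(), counters.end(), 0.0 );
}
```

With the values from the log above (total 0.7511s and counters 1 to 16 summing to 0.7885s), the residual is about -0.0374s, which agrees with the reported -0.0373s to within the rounding of the printed values.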
This makes it clearer that programtotal = samplefull + initialIO

./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL               :    0.7554s
 [COUNTERS] Fortran Other        (  0 ) :   -0.0393s
 [COUNTERS] Fortran X2F          (  1 ) :    0.0171s for    16399 events => throughput is 1.04E-06 events/s
 [COUNTERS] Fortran PDF          (  2 ) :    0.0984s for    32768 events => throughput is 3.00E-06 events/s
 [COUNTERS] Fortran final_I/O    (  3 ) :    0.2621s for    16399 events => throughput is 1.60E-05 events/s
 [COUNTERS] CudaCpp HEL          (  5 ) :    0.0007s
 [COUNTERS] CudaCpp MEs          (  6 ) :    0.0872s for    16384 events => throughput is 5.32E-06 events/s
 [COUNTERS] Fortran initial_I/O  (  7 ) :    0.0688s
 [COUNTERS] Fortran TEST         ( 12 ) :    0.0521s for    16384 events => throughput is 3.18E-06 events/s
 [COUNTERS] Fortran TEST2        ( 13 ) :    0.0100s for    16384 events => throughput is 6.08E-07 events/s
 [COUNTERS] Fortran TEST3        ( 14 ) :    0.0507s for    16384 events => throughput is 3.09E-06 events/s
 [COUNTERS] Fortran TEST5        ( 16 ) :    0.1478s for    16384 events => throughput is 9.02E-06 events/s
 [COUNTERS] PROGRAM initial_I/O  ( 19 ) :    0.0688s
 [COUNTERS] PROGRAM sample_full  ( 20 ) :    0.6838s
@valassi
Member Author

valassi commented Aug 21, 2024

I resynced this to upstream/master where I already merged hel #960

@valassi
Member Author

valassi commented Aug 21, 2024

Mac builds fail because rdtsc is not defined on ARM

https://github.com/madgraph5/madgraph4gpu/actions/runs/10495089681/job/29072804599?pr=962

gfortran-14 -w -fPIC -O3 -ffast-math -fbounds-check -ffixed-line-length-132 -w -cpp  -c matrix1.f -I../../Source/ -I../../Source/PDF/gammaUPC
c++ -O3 -Wall -Wshadow -Wextra -std=c++17 -mmacosx-version-min=11.3 -c counters.cc -o counters.o
In file included from counters.cc:6:
./timer.h:146:2: error: "rdtsc is not defined for this platform yet"
#error "rdtsc is not defined for this platform yet"
 ^
1 error generated.
make: *** [counters.o] Error 1
Error: Process completed with exit code 2.
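
For illustration, a tick reader that compiles on both x86 and ARM could look like the sketch below. This is not necessarily how timer.h was fixed in this PR: the function names are hypothetical, and the use of the ARMv8 virtual counter register `cntvct_el0` and of a `std::chrono` fallback are assumptions about one reasonable approach.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined( __x86_64__ ) || defined( __i386__ )
#include <x86intrin.h> // provides __rdtsc on GCC and clang
#endif

// Read a low-overhead tick counter (hypothetical helper, not the timer.h code).
inline std::uint64_t readTicks()
{
#if defined( __x86_64__ ) || defined( __i386__ )
  return __rdtsc(); // x86 time stamp counter
#elif defined( __aarch64__ )
  std::uint64_t ticks;
  asm volatile( "mrs %0, cntvct_el0" : "=r"( ticks ) ); // ARMv8 virtual counter
  return ticks;
#else
  // Portable fallback: higher overhead, but available everywhere
  return static_cast<std::uint64_t>( std::chrono::steady_clock::now().time_since_epoch().count() );
#endif
}

// Number of ticks between two readings taken on the same machine.
inline std::uint64_t elapsedTicks( std::uint64_t start, std::uint64_t stop )
{
  assert( stop >= start );
  return stop - start;
}
```

The tick frequency differs between the three branches, so converting ticks to seconds would need a separate calibration against a wall clock.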

@valassi
Member Author

valassi commented Aug 22, 2024

OK, Mac issues fixed; now back to the usual 3 expected failures


@valassi
Member Author

valassi commented Aug 22, 2024

OK current status for timer/counter, grid/runcard, cmsdy and new sampling branches:

@valassi
Member Author

valassi commented Aug 23, 2024

(There have been some updates in other branches, not in THIS prof)

Updated status for timer/counter, grid/runcard, cmsdy and new sampling branches:

…to the subprocess of pp_dy3j I focused on in the cmsdy branch)

Note: there is no need to use no_b_mass to test phase space sampling in this specific process
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8456s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1201s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0671s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.6146s for  1087437 events => throughput is 4.16E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0961s for    32768 events => throughput is 3.41E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1713s for    16384 events => throughput is 9.56E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0488s for    16384 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0685s for    16384 events => throughput is 2.39E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1250s for  1087437 events => throughput is 8.70E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4711s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0271s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8099s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9229s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1521s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0677s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.6424s for  1087437 events => throughput is 4.12E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0971s for    32768 events => throughput is 3.37E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1721s for    16384 events => throughput is 9.52E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0488s for    16384 events => throughput is 3.35E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0687s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1388s for  1087437 events => throughput is 7.83E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4717s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0278s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0355s for    16384 events => throughput is 4.61E+05 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8873s
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4690s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1183s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0676s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2338s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0967s for    32768 events => throughput is 3.39E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1720s for    16384 events => throughput is 9.53E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0493s for    16384 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0690s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1267s for  1087437 events => throughput is 8.59E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4725s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0273s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0359s for    16384 events => throughput is 4.57E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3445s for 14136681 events => throughput is 6.03E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4331s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0359s for    16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    5.1651s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1577s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0668s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.8723s for  1087437 events => throughput is 2.81E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0983s for    32768 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1724s for    16384 events => throughput is 9.50E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0510s for    16384 events => throughput is 3.21E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0689s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1411s for  1087437 events => throughput is 7.71E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4737s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0272s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.7722s for 14136681 events => throughput is 5.10E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1294s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
…ad from getTotalDurationSeconds calls

This should be ok for counters.cc but not enough for timermap.h
…econds() call and go back to the old getTotalDurationSeconds
…mer overhead if CUDACPP_RUNTIME_REMOVETIMEROVERHEAD is set

However, test counters like sample_get_x need special handling
…UNTERS, remove special meaning of PROGRAM counters
…ng a TEST counter as included in a non-TEST counter, to subtract overheads
…SpaceSampling

These are the first results where timer overhead is removed. They look nice,
but the overhead should be computed in the counters.cc calls rather than in the individual timers
(this would also make more sense with respect to timermap.h, where this will not be possible; rename the env, too)

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4608s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1171s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0690s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2317s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0917s for    32768 events => throughput is 3.57E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1719s for    16384 events => throughput is 9.53E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0483s for    16384 events => throughput is 3.39E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0691s for    16384 events => throughput is 2.37E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1276s for  1087437 events => throughput is 8.52E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3519s for 14136681 events => throughput is 6.01E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4251s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2204s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1550s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0697s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9335s for  1087437 events => throughput is 2.76E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0924s for    32768 events => throughput is 3.55E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1722s for    16384 events => throughput is 9.52E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0487s for    16384 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0689s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1401s for  1087437 events => throughput is 7.76E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4779s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8064s for 14136681 events => throughput is 5.04E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1846s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: RdtscTimer overhead :    0.0179s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.4668s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.2924s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.1745s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1190s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0696s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9612s for  1087437 events => throughput is 3.67E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0913s for    32768 events => throughput is 3.59E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1709s for    16384 events => throughput is 9.59E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0482s for    16384 events => throughput is 3.40E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0678s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1125s for  1087437 events => throughput is 9.67E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4716s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.0989s for 14136681 events => throughput is 6.74E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1387s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: ChronoTimer overhead :    0.0489s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.2779s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.7998s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4781s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1570s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2485s for  1087437 events => throughput is 3.35E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0930s for    32768 events => throughput is 3.52E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1716s for    16384 events => throughput is 9.55E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0474s for    16384 events => throughput is 3.46E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0681s for    16384 events => throughput is 2.41E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0929s for  1087437 events => throughput is 1.17E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4705s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.1629s for 14136681 events => throughput is 6.54E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4424s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8210s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8210s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8301s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8301s
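
As a cross-check of the RDTSC numbers above, the reported overhead can be roughly reproduced as the number of start/stop cycles times the calibrated cost per cycle. The sketch below uses hypothetical function names and assumes one start/stop cycle per counted call for the three high-count counters (SampleGetX, PhaseSpaceSampling, SamplePutPoint); that assumption is inferred from the log, not from the code.

```cpp
#include <cassert>
#include <cstdint>

// Estimated total counter overhead: number of start/stop cycles times the
// calibrated cost per cycle (given in seconds per 1M cycles, as in the INFO line).
constexpr double estimatedOverheadSeconds( std::uint64_t nCycles, double secondsPerMillionCycles )
{
  return nCycles * ( secondsPerMillionCycles / 1e6 );
}

// Program total after removing the counter overhead.
inline double correctedTotalSeconds( double totalPlusOverhead, double overhead )
{
  assert( overhead >= 0 && overhead <= totalPlusOverhead );
  return totalPlusOverhead - overhead;
}
```

With 14136681 + 1087437 + 1087437 cycles at 0.0179s per 1M cycles, the estimate is about 0.2920s, close to the reported 0.2924s. Subtracting the reported overhead from 4.4668s gives 4.1744s, which matches the reported PROGRAM TOTAL of 4.1745s to within rounding.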
…s: this will be moved to counters alone

Revert "[prof] in gux_taptamggux.mad timer.h, add instead a getTotalOverheadSeconds() call and go back to the old getTotalDurationSeconds"
This reverts commit ad9b747.

Revert "[prof] in gux_taptamggux.mad timer.h, add the option to remove overhead from getTotalDurationSeconds calls"
This reverts commit 5c0a2ed.
…unter overhead (remove it from timer.h: there will be none for timermap.h)

Rename the env as CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD to make it clear that this is in the counters.cc infrastructure
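
For reference, runtime switches of this kind are typically read once with `std::getenv`. The sketch below is only an illustration: the helper and struct names are hypothetical, and treating "set to any non-empty value" as enabled is an assumption, not necessarily what counters.cc does. Only the three environment variable names are taken from the logs in this PR.

```cpp
#include <cassert>
#include <cstdlib>

// True if the environment variable exists and is non-empty (assumed semantics).
inline bool envFlagIsSet( const char* name )
{
  assert( name != nullptr );
  const char* value = std::getenv( name );
  return value != nullptr && value[0] != '\0';
}

// Hypothetical bundle of the three runtime switches used in the logs.
struct SketchTimerConfig
{
  bool useChronoTimers;       // CUDACPP_RUNTIME_USECHRONOTIMERS
  bool removeCounterOverhead; // CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD
  bool disableCallTimers;     // CUDACPP_RUNTIME_DISABLECALLTIMERS
};

inline SketchTimerConfig readTimerConfig()
{
  return SketchTimerConfig{ envFlagIsSet( "CUDACPP_RUNTIME_USECHRONOTIMERS" ),
                            envFlagIsSet( "CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD" ),
                            envFlagIsSet( "CUDACPP_RUNTIME_DISABLECALLTIMERS" ) };
}
```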

These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.5315s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1198s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0678s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2691s for  1087437 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1044s for    32768 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1757s for    16384 events => throughput is 9.33E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0543s for    16384 events => throughput is 3.02E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0731s for    16384 events => throughput is 2.24E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1322s for  1087437 events => throughput is 8.23E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0274s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3686s for 14136681 events => throughput is 5.97E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4957s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2048s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1559s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0673s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9265s for  1087437 events => throughput is 2.77E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0993s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1648s for    16384 events => throughput is 9.94E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0514s for    16384 events => throughput is 3.19E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0700s for    16384 events => throughput is 2.34E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1365s for  1087437 events => throughput is 7.97E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4711s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0264s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8006s for 14136681 events => throughput is 5.05E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1691s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0331s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5208s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.5413s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9795s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1548s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7547s for  1087437 events => throughput is 3.95E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0988s for    32768 events => throughput is 3.32E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1639s for    16384 events => throughput is 1.00E+05 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0510s for    16384 events => throughput is 3.21E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0674s for    16384 events => throughput is 2.43E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0898s for  1087437 events => throughput is 1.21E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4700s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8855s for 14136681 events => throughput is 7.50E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.9439s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0640s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.3491s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.0455s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.3036s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2216s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0692s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.0230s for  1087437 events => throughput is 3.60E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0992s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1652s for    16384 events => throughput is 9.92E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0684s for    16384 events => throughput is 2.39E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0716s for  1087437 events => throughput is 1.52E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4727s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9427s for 14136681 events => throughput is 7.28E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.2679s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0039s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.7998s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.7998s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0038s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.9067s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9067s
@valassi
Member Author

valassi commented Aug 23, 2024

I have now added (only here in prof) a mechanism to remove the timer overhead from the timing measurements. For RDTSC timers this looks adequate (for chrono timers a bit less so, but this is not what I would use anyway).
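
The general idea can be sketched as follows: time a large number of empty start/stop cycles to estimate the cost of one cycle, then subtract that cost times the number of cycles from each counter. This is a simplified illustration using `std::chrono`, with hypothetical class and function names; it is not the counters.cc implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical minimal counter: accumulates wall time and counts start/stop cycles.
class SketchCounter
{
public:
  void start() { m_start = Clock::now(); }
  void stop()
  {
    m_total += Clock::now() - m_start;
    m_cycles++;
  }
  std::uint64_t cycles() const { return m_cycles; }
  double rawSeconds() const { return std::chrono::duration<double>( m_total ).count(); }
  // Subtract the calibrated per-cycle overhead, clamping the result at zero
  double correctedSeconds( double overheadPerCycle ) const
  {
    const double corrected = rawSeconds() - overheadPerCycle * m_cycles;
    return corrected > 0 ? corrected : 0;
  }
private:
  using Clock = std::chrono::steady_clock;
  Clock::time_point m_start{};
  Clock::duration m_total{ 0 };
  std::uint64_t m_cycles = 0;
};

// Estimate the cost of one empty start/stop cycle by timing nCycles of them.
inline double calibrateOverheadPerCycle( std::uint64_t nCycles = 1000000 )
{
  assert( nCycles > 0 );
  SketchCounter dummy;
  const auto t0 = std::chrono::steady_clock::now();
  for( std::uint64_t i = 0; i < nCycles; i++ )
  {
    dummy.start();
    dummy.stop();
  }
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>( t1 - t0 ).count() / nCycles;
}
```

Part of each cycle's cost falls outside the counter's own measured interval and is charged to whatever enclosing region is being timed, so a fuller treatment would also subtract the overhead of nested counters from the enclosing counter. This sketch ignores that and subtracts the whole per-cycle cost from the counter itself.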

These are the results, see valassi@6dcab81

(initially this was 6083af1, then I added the last bullet 4 and force pushed)

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4766s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1202s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0685s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2400s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1007s for    32768 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1673s for    16384 events => throughput is 9.79E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0521s for    16384 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0687s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1237s for  1087437 events => throughput is 8.79E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4728s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3496s for 14136681 events => throughput is 6.02E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4409s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.3144s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1588s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    4.0191s for  1087437 events => throughput is 2.71E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0996s for    32768 events => throughput is 3.29E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1660s for    16384 events => throughput is 9.87E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0508s for    16384 events => throughput is 3.22E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0704s for    16384 events => throughput is 2.33E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1482s for  1087437 events => throughput is 7.34E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8646s for 14136681 events => throughput is 4.94E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.2787s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0338s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.8244s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.8905s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9339s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2954s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7332s for  1087437 events => throughput is 3.98E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1003s for    32768 events => throughput is 3.27E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1688s for    16384 events => throughput is 9.71E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0507s for    16384 events => throughput is 3.23E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0695s for    16384 events => throughput is 2.36E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0924s for  1087437 events => throughput is 1.18E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4692s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8723s for 14136681 events => throughput is 7.55E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8982s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0637s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.8826s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.6786s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.2040s
 [COUNTERS] Fortran Other                  (  0 ) :    0.4831s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0691s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9924s for  1087437 events => throughput is 3.63E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0983s for    32768 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1669s for    16384 events => throughput is 9.81E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0506s for    16384 events => throughput is 3.24E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0676s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0698s for  1087437 events => throughput is 1.56E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4712s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9227s for 14136681 events => throughput is 7.35E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1690s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s

(3) remove overhead, disable individual timers (so here the per-call timer overhead is 0 and only the cost of estimating the overhead remains)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0333s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.1897s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.3330s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8567s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0659s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5119s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.6594s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8525s

(4) do not remove overhead, disable individual timers (this also removes the cost of estimating the overhead)
(this test was run on a different day, on the same machine and build; the results are compatible with the previous ones)

CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8072s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8214s
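
For readers comparing the two timer flavours in the logs above, here is a minimal sketch of how a tick-based counter and its overhead calibration could look. It is not the code in this PR: the names `readTicks`, `TickCounter`, `Calibration`, `calibrate` and `overheadSeconds` are hypothetical, and the sketch assumes GCC or Clang on x86 for `__rdtsc` (with a `std::chrono` fallback elsewhere).

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined( __x86_64__ ) || defined( __i386__ )
#include <x86intrin.h> // __rdtsc (assumption: GCC or Clang on x86)
#endif

// Hypothetical helper (not from the PR): read a raw tick count.
inline std::uint64_t readTicks()
{
#if defined( __x86_64__ ) || defined( __i386__ )
  return __rdtsc(); // CPU time-stamp counter, in TSC ticks
#else
  // Fallback on other architectures: steady_clock ticks
  return static_cast<std::uint64_t>( std::chrono::steady_clock::now().time_since_epoch().count() );
#endif
}

// Hypothetical accumulator: sums the ticks spent between start() and stop().
struct TickCounter
{
  std::uint64_t total = 0;
  std::uint64_t started = 0;
  std::uint64_t calls = 0;
  void start() { started = readTicks(); }
  void stop()
  {
    total += readTicks() - started;
    ++calls;
  }
};

struct Calibration
{
  double seconds;       // wall-clock time of the whole calibration loop
  std::uint64_t cycles; // number of start/stop pairs actually executed
};

// Time ncycles empty start/stop pairs with a wall clock.
inline Calibration calibrate( std::uint64_t ncycles )
{
  TickCounter dummy;
  const auto t0 = std::chrono::steady_clock::now();
  for( std::uint64_t i = 0; i < ncycles; ++i )
  {
    dummy.start();
    dummy.stop();
  }
  const auto t1 = std::chrono::steady_clock::now();
  return { std::chrono::duration<double>( t1 - t0 ).count(), dummy.calls };
}

// Scale the calibrated cost to ncalls timed calls (assumes a uniform cost per start/stop pair).
inline double overheadSeconds( const Calibration& c, std::uint64_t ncalls )
{
  assert( c.cycles > 0 );
  return c.seconds * static_cast<double>( ncalls ) / static_cast<double>( c.cycles );
}
```

With such a scheme, the value to subtract from the program total would be `overheadSeconds( calibrate( n ), ncalls )`, where `ncalls` is the total number of start/stop pairs executed by all counters. Note that the calibration loop itself also costs time, which is presumably why an overhead is still reported in (3) when the individual timers are disabled.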

…ter overhead


valassi commented Aug 27, 2024

Reminder of two things to do

  • make overhead removal the default
  • in the INFO message, print the time for the number of cycles actually used (here 10M, not 1M)
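
As a cross-check of the second point, the numbers in the logs above can be put through a small consistency test. This is only one reading of those logs, under two assumptions that the PR does not state explicitly: the calibration loop runs 10M start/stop cycles, and every "event" reported by a counter corresponds to one start/stop pair (the single MEs counter is neglected). The helper names `estimateOverhead` and `relativeDifference` are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Estimated counter overhead in seconds, assuming that every timed call and
// every calibration cycle costs the same calibrated start/stop time.
// secondsPer1MCycles is the value printed in the "INFO: COUNTERS overhead" line.
constexpr double estimateOverhead( double secondsPer1MCycles,
                                   std::uint64_t timedCalls,
                                   std::uint64_t calibrationCycles )
{
  return secondsPer1MCycles * 1e-6 * static_cast<double>( timedCalls + calibrationCycles );
}

// Relative difference between an estimate and the value reported in the logs.
inline double relativeDifference( double estimate, double reported )
{
  return std::fabs( estimate - reported ) / reported;
}

// Sum of the per-counter "events" in the logs (assumption: one start/stop pair each):
// PhaseSpaceSampling + PDFs + UpdateScaleCouplings + Reweight + Unweight
// + SamplePutPoint + SampleGetX
constexpr std::uint64_t timedCallsFromLogs =
  1087437 + 32768 + 16384 + 16384 + 16384 + 1087437 + 14136681;

// Assumed number of calibration cycles (the "10M" mentioned in the reminder).
constexpr std::uint64_t assumedCalibrationCycles = 10000000;
```

For example, `estimateOverhead( 0.0333, 0, assumedCalibrationCycles )` can be compared with the 0.3330s reported in (3) for the rdtsc timers, and `estimateOverhead( 0.0338, timedCallsFromLogs, assumedCalibrationCycles )` with the 0.8905s reported in (2). Under the assumptions above, both agree with the logs to within a fraction of a percent, which supports the reading that the printed time is per 1M cycles while 10M cycles are actually run.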

…r merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…Source/makefile madgraph5#980) into prof

(Checked that regenerating gg_tt.mad is all ok)
Development

Successfully merging this pull request may close these issues.

Faster timers based on rdtsc instead of chrono