Faster timers based on rdtsc instead of chrono #972

valassi · 2024-08-19T21:23:11Z

I am filing this explicitly as an issue to make things clearer. This took quite a bit of work over the last few days.

I have completed the whole task:

The new timer.h (I kept the same name for simplicity, but it's a different beast) has a new chrono timer with a different API and a nanosecond tick granularity, and above all it also includes an rdtsc timer.
The calibration of TSC ticks to time is done only ONCE at the end, so the relative tick counts may not be very precise if the frequency varies wildly during a job, but I think that's precise enough.
The new timers have lower overhead (not by much, but still useful) when they are called very often, e.g. inside sample_get_x.
I have completed the new counters.cc and the new timermap.h, adapting them to this new tick-based API with one calibration at the end.
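
To illustrate the idea of accumulating raw ticks and converting them to seconds only once at the end, here is a minimal sketch. It is not the actual timer.h code: the names `readTicks`, `TickCounter`, `TickCalibrator` and `measureLoopSeconds` are hypothetical. It assumes an x86 build where the `__rdtsc` intrinsic from `<x86intrin.h>` is available, and falls back to `std::chrono::steady_clock` nanoseconds elsewhere.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined( __x86_64__ ) || defined( __i386__ )
#include <x86intrin.h> // __rdtsc
#endif

// Read a raw tick count (hypothetical helper, not the timer.h API)
inline uint64_t readTicks()
{
#if defined( __x86_64__ ) || defined( __i386__ )
  return __rdtsc(); // TSC ticks
#else
  // Assumption: on non-x86 builds, use steady_clock nanoseconds as "ticks"
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
           std::chrono::steady_clock::now().time_since_epoch() ).count();
#endif
}

// Accumulates ticks over many start/stop pairs, with no conversion to seconds
class TickCounter
{
public:
  void start() { m_start = readTicks(); }
  void stop() { m_ticks += readTicks() - m_start; }
  uint64_t ticks() const { return m_ticks; }
private:
  uint64_t m_start = 0;
  uint64_t m_ticks = 0;
};

// Records wall-clock time and ticks at construction;
// secondsPerTick() is meant to be called once, at the end of the job
class TickCalibrator
{
public:
  TickCalibrator()
    : m_t0( std::chrono::steady_clock::now() ), m_ticks0( readTicks() ) {}
  double secondsPerTick() const
  {
    const double sec = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - m_t0 ).count();
    const uint64_t ticks = readTicks() - m_ticks0;
    return ticks > 0 ? sec / ticks : 0.;
  }
private:
  std::chrono::steady_clock::time_point m_t0;
  uint64_t m_ticks0;
};

// Example: time a cheap operation that is called very often
double measureLoopSeconds( int iterations )
{
  TickCalibrator calib; // started once, before any measurement
  TickCounter counter;
  volatile double sink = 0;
  for( int i = 0; i < iterations; i++ )
  {
    counter.start();
    sink = sink + i * 0.5; // stand-in for the real work
    counter.stop();
  }
  // Single ticks-to-seconds conversion at the end
  return counter.ticks() * calib.secondsPerTick();
}
```

The design choice in this sketch is that the hot path only reads the counter and does integer arithmetic. The price is the one mentioned above: a single average seconds-per-tick factor is applied to all intervals, whatever the tick rate was when each one was taken.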

The changes are used in many PRs, but I would say that they can be merged from PR #962

valassi · 2024-08-19T21:35:02Z

The code is essentially here
https://github.com/madgraph5/madgraph4gpu/blob/9a03440ebf8dfd8f9b7302747549c90b23ed5c01/epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/timer.h

Note, this was developed thanks to a lot of documentation by many people, including these and many others

valassi · 2024-08-20T09:45:15Z

(Note: the overhead of std chrono clocks is in any case much less than we had in the past due to O/S issues #116)

valassi · 2024-08-20T09:59:49Z

See this commit in PR #970: c863d69

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9808s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1248s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0676s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.7899s for  1170103 events => throughput is 2.38E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1042s for    49152 events => throughput is 2.12E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1327s for    16384 events => throughput is 8.10E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.08E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0652s for    16384 events => throughput is 3.98E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1165s for  1170103 events => throughput is 9.95E-08 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4685s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0261s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.6663s for 15211307 events => throughput is 1.10E-07 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.9459s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    4.7930s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1701s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.5324s for  1170103 events => throughput is 3.02E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1024s for    49152 events => throughput is 2.08E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1323s for    16384 events => throughput is 8.07E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0525s for    16384 events => throughput is 3.20E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0647s for    16384 events => throughput is 3.95E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1415s for  1170103 events => throughput is 1.21E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4695s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0258s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.0375s for 15211307 events => throughput is 1.34E-07 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.7584s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s

valassi · 2024-08-21T08:05:01Z

One thing that may still be interesting is subtracting the overhead of the counters from the counter measurements. (See for instance https://www.strchr.com/performance_measurements_with_rdtsc#:~:text=Subtracting%20overhead,%2C%20100%20000%20clock%20cycles)

To a first approximation, it should be enough to run, say, 1000 start/stop pairs and time how long that takes. One could then report the estimated timer overhead (and/or subtract it from each individual timer and from the overall timers).
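
A minimal sketch of that overhead estimate could look as follows. None of this is in the current code: `ChronoTimer`, `estimateOverheadSeconds` and `subtractOverhead` are hypothetical names, and a plain `std::chrono::steady_clock` timer stands in for the real timers to keep the example portable.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical minimal timer with a start/stop API
struct ChronoTimer
{
  using clock = std::chrono::steady_clock;
  void start() { m_start = clock::now(); }
  void stop() { m_total += clock::now() - m_start; }
  double seconds() const { return std::chrono::duration<double>( m_total ).count(); }
  clock::time_point m_start{};
  clock::duration m_total{ 0 };
};

// Run nCalls empty start/stop pairs and return the average cost per pair in seconds
template<typename Timer>
double estimateOverheadSeconds( int nCalls = 1000 )
{
  Timer probe;
  const auto t0 = std::chrono::steady_clock::now();
  for( int i = 0; i < nCalls; i++ )
  {
    probe.start();
    probe.stop();
  }
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>( t1 - t0 ).count() / nCalls;
}

// Subtract the estimated overhead of nCalls start/stop pairs from a measured time,
// clamping at zero so that a rough estimate cannot produce a negative time
inline double subtractOverhead( double measuredSeconds, uint64_t nCalls, double overheadPerCall )
{
  const double corrected = measuredSeconds - nCalls * overheadPerCall;
  return corrected > 0. ? corrected : 0.;
}
```

Note that the loop measures the full cost of a start/stop pair, whereas only part of that cost falls inside the interval recorded by an individual counter. Subtracting the full cost from an individual counter may therefore overcorrect, and is probably more appropriate for the overall program total.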

valassi self-assigned this Aug 19, 2024

valassi linked a pull request Aug 19, 2024 that will close this issue

WIP Improve timers (lower overhead using rdtcs) and profile additional fortran components (other than MEs) #962

Draft

valassi mentioned this issue Aug 19, 2024

WIP: studies on CMS DY #946

Draft

valassi mentioned this issue Aug 20, 2024

High CPU use in clock_gettime (TSC clocksource unavailable on itscrd03) #116

Closed

valassi added a commit to valassi/madgraph4gpu that referenced this issue Oct 7, 2024

[prof0] ** COMPLETE PROF0 ** in CHANGELOG.md, document the changes in…

0fcd5d3

… branch prof0 (new timers madgraph5#972) and add copyright

valassi linked a pull request Oct 7, 2024 that will close this issue

(4 in pipeline) Faster RDTSC-based timers and new timer/counter APIs #1018

Draft

valassi added a commit to valassi/madgraph4gpu that referenced this issue Oct 7, 2024

[prof0] ** COMPLETE PROF0 ** in CHANGELOG.md, document the changes in…

4aae106

… branch prof0 (new timers madgraph5#972) and add copyright
