Faster timers based on rdtsc instead of chrono #972

Open
valassi opened this issue Aug 19, 2024 · 4 comments · May be fixed by #962 or #1018

valassi commented Aug 19, 2024

Faster timers based on rdtsc instead of chrono

I am filing this explicitly as an issue to make it clearer. This was quite a bit of work over the last few days.

I have completed the whole task:

  • the new timer.h (I kept the same name for simplicity, but it's a different beast) has a new chrono timer with a different API and a nanosecond tick granularity, but more importantly it also includes an rdtsc timer
  • the calibration of tsc ticks to time is done only ONCE, at the end of the job... so the relative tick counts may not be very precise if the frequency varies wildly during a job, but I think that's precise enough
  • the new timers have lower overhead (not by much, but still useful) when they are called very often, e.g. inside sample_get_x
  • I have also completed the new counters.cc and the new timermap.h, adapting them to this new tick-based API with a single calibration at the end
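To illustrate the idea of counting raw ticks and calibrating them to seconds only once, here is a minimal sketch. It is not the actual timer.h code: the names `readTicks`, `TickCounter` and `TickCalibration` are hypothetical, and it assumes GCC or Clang on x86 (for `__rdtsc` from `<x86intrin.h>`), with a `std::chrono::steady_clock` fallback elsewhere.

```cpp
#include <chrono>
#include <cstdint>
#if defined( __x86_64__ ) || defined( __i386__ )
#include <x86intrin.h> // __rdtsc (GCC/Clang on x86)
#endif

// Hypothetical helper (not the real timer.h API): read a raw tick counter
inline std::uint64_t readTicks()
{
#if defined( __x86_64__ ) || defined( __i386__ )
  return __rdtsc();
#else
  // Assumed fallback for non-x86 builds: use steady_clock ticks instead of TSC ticks
  return static_cast<std::uint64_t>( std::chrono::steady_clock::now().time_since_epoch().count() );
#endif
}

// Hypothetical counter: accumulates raw ticks only, no conversion to time in the hot path
class TickCounter
{
public:
  void start() { m_start = readTicks(); }
  void stop() { m_ticks += readTicks() - m_start; }
  std::uint64_t ticks() const { return m_ticks; }
private:
  std::uint64_t m_start = 0;
  std::uint64_t m_ticks = 0;
};

// Hypothetical calibration: created once at program start, queried once at the end
class TickCalibration
{
public:
  TickCalibration()
    : m_t0( std::chrono::steady_clock::now() ), m_ticks0( readTicks() ) {}
  // Average seconds per tick over the whole interval since construction
  double secondsPerTick() const
  {
    const std::chrono::duration<double> dt = std::chrono::steady_clock::now() - m_t0;
    const std::uint64_t dticks = readTicks() - m_ticks0;
    return dticks > 0 ? dt.count() / static_cast<double>( dticks ) : 0.;
  }
private:
  std::chrono::steady_clock::time_point m_t0;
  std::uint64_t m_ticks0;
};
```

At the end of the job, each counter's time would be `counter.ticks() * calibration.secondsPerTick()`. Because the conversion factor is a single job-wide average, any variation of the tick rate during the job is not resolved, which is the trade-off described above.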

The changes are used in many PRs, but I would say that they can be merged from PR #962

valassi commented Aug 20, 2024

(Note: the overhead of std::chrono clocks is in any case much lower than what we had in the past because of O/S issues, see #116)

valassi commented Aug 20, 2024

See this commit in PR #970: c863d69

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9808s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1248s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0676s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.7899s for  1170103 events => throughput is 2.38E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1042s for    49152 events => throughput is 2.12E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1327s for    16384 events => throughput is 8.10E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.08E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0652s for    16384 events => throughput is 3.98E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1165s for  1170103 events => throughput is 9.95E-08 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4685s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0261s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.6663s for 15211307 events => throughput is 1.10E-07 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.9459s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    4.7930s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1701s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.5324s for  1170103 events => throughput is 3.02E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1024s for    49152 events => throughput is 2.08E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1323s for    16384 events => throughput is 8.07E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0525s for    16384 events => throughput is 3.20E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0647s for    16384 events => throughput is 3.95E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1415s for  1170103 events => throughput is 1.21E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4695s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0258s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.0375s for 15211307 events => throughput is 1.34E-07 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.7584s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s

valassi commented Aug 21, 2024

One thing that may still be interesting is subtracting the overhead of the counters from the counter measurements. (See for instance https://www.strchr.com/performance_measurements_with_rdtsc#:~:text=Subtracting%20overhead,%2C%20100%20000%20clock%20cycles)

To a first approximation, it should be enough to run, say, 1000 start/stop pairs and time how long that takes. One could then report the estimated timer overhead (and/or subtract it from each individual timer and from the overall timers).
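A minimal sketch of this overhead estimate could look as follows. Everything here is an assumption for illustration: `ChronoTimer`, `estimateOverheadPerPair` and `subtractOverhead` are hypothetical names, not part of timer.h or counters.cc, and a real implementation would apply the same loop to the rdtsc timer.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical minimal timer, used only for this illustration (not the real timer.h API)
struct ChronoTimer
{
  using clock = std::chrono::steady_clock;
  void start() { m_start = clock::now(); }
  void stop() { m_total += clock::now() - m_start; }
  double seconds() const { return std::chrono::duration<double>( m_total ).count(); }
  clock::time_point m_start{};
  clock::duration m_total{};
};

// Estimate the cost (in seconds) of one start/stop pair
// by timing nPairs back-to-back pairs from the outside
template<typename Timer>
double estimateOverheadPerPair( int nPairs = 1000 )
{
  Timer dummy;
  const auto t0 = std::chrono::steady_clock::now();
  for( int i = 0; i < nPairs; i++ )
  {
    dummy.start();
    dummy.stop();
  }
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>( t1 - t0 ).count() / nPairs;
}

// Subtract the estimated overhead of nCalls start/stop pairs from a measured time,
// clamping at zero so that a corrected time is never negative
inline double subtractOverhead( double measuredSeconds, std::uint64_t nCalls, double overheadPerPair )
{
  const double corrected = measuredSeconds - static_cast<double>( nCalls ) * overheadPerPair;
  return corrected > 0. ? corrected : 0.;
}
```

Note that the externally timed cost of a start/stop pair includes work done before the first clock read and after the second one, so only part of it ends up inside the counter being corrected. Subtracting the full value from an individual counter may therefore over-correct, while it is closer to the right amount for the program total.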

valassi added a commit to valassi/madgraph4gpu that referenced this issue Oct 7, 2024