Do we have a timing overhead again? #685
To be done anyway: compile counters with -O3, as suggested by @hageboeck. From the discussion: "we are being unfair to Fortran because there is an overhead in its timing measurements"
Great presentation by @hageboeck! The flamegraph analysis in this video (link: https://indico.cern.ch/event/1292145/) reveals some interesting findings about the overhead in the Fortran counters. It would be worth checking whether we are using the "good" clock source on our operating system or whether we are still relying on expensive system calls. See the long discussions in #116 (and maybe #117) for more on this. At the moment, however, this looks like low priority.
This is now addressed in #740 and will be fixed in an upcoming patch by @hageboeck.
…es the old Timer with a new interface, will also include a new RDTSC based timer) madgraph5#685

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                :    4.7737s
 [COUNTERS] Fortran Other                ( 0 ) :    0.1561s
 [COUNTERS] Fortran Initialise(I/O)      ( 1 ) :    0.0677s
 [COUNTERS] Fortran Random2Momenta       ( 3 ) :    3.5233s for  1170103 events => throughput is 3.01E-06 events/s
 [COUNTERS] Fortran PDFs                 ( 4 ) :    0.1038s for    49152 events => throughput is 2.11E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings ( 5 ) :    0.1370s for    16384 events => throughput is 8.36E-06 events/s
 [COUNTERS] Fortran Reweight             ( 6 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)    ( 7 ) :    0.0667s for    16384 events => throughput is 4.07E-06 events/s
 [COUNTERS] Fortran SamplePutPoint       ( 8 ) :    0.1347s for  1170103 events => throughput is 1.15E-07 events/s
 [COUNTERS] PROGRAM SampleGetX           ( 10 ) :   2.0438s for 15211307 events => throughput is 1.34E-07 events/s
 [COUNTERS] CudaCpp Initialise           ( 11 ) :   0.4699s
 [COUNTERS] CudaCpp Finalise             ( 12 ) :   0.0265s
 [COUNTERS] CudaCpp MEs                  ( 19 ) :   0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs              ( 21 ) :   4.7388s
 [COUNTERS] OVERALL MEs                  ( 22 ) :   0.0349s for    16384 events => throughput is 2.13E-06 events/s
```
… RdtscTimer - the overhead is clearly lower! madgraph5#685

The results are calibrated on the average frequency during the existence of the timer, but this should be good enough.

In P0_gux_taptamggux:

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                :    3.9949s
 [COUNTERS] Fortran Other                ( 0 ) :    0.1231s
 [COUNTERS] Fortran Initialise(I/O)      ( 1 ) :    0.0677s
 [COUNTERS] Fortran Random2Momenta       ( 3 ) :    2.7841s for  1170103 events => throughput is 2.38E-06 events/s
 [COUNTERS] Fortran PDFs                 ( 4 ) :    0.1070s for    49152 events => throughput is 2.18E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings ( 5 ) :    0.1397s for    16384 events => throughput is 8.53E-06 events/s
 [COUNTERS] Fortran Reweight             ( 6 ) :    0.0539s for    16384 events => throughput is 3.29E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)    ( 7 ) :    0.0684s for    16384 events => throughput is 4.18E-06 events/s
 [COUNTERS] Fortran SamplePutPoint       ( 8 ) :    0.1226s for  1170103 events => throughput is 1.05E-07 events/s
 [COUNTERS] PROGRAM SampleGetX           ( 10 ) :   1.6560s for 15211307 events => throughput is 1.09E-07 events/s
 [COUNTERS] CudaCpp Initialise           ( 11 ) :   0.4666s
 [COUNTERS] CudaCpp Finalise             ( 12 ) :   0.0266s
 [COUNTERS] CudaCpp MEs                  ( 19 ) :   0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs              ( 21 ) :   3.9599s
 [COUNTERS] OVERALL MEs                  ( 22 ) :   0.0350s for    16384 events => throughput is 2.13E-06 events/s
```
…f CUDACPP_RUNTIME_USECHRONOTIMERS is set (but rdtsc is now the default) madgraph5#685

```
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                :    4.0091s
 [COUNTERS] Fortran Other                ( 0 ) :    0.1216s
 [COUNTERS] Fortran Initialise(I/O)      ( 1 ) :    0.0700s
 [COUNTERS] Fortran Random2Momenta       ( 3 ) :    2.8138s for  1170103 events => throughput is 2.40E-06 events/s
 [COUNTERS] Fortran PDFs                 ( 4 ) :    0.1008s for    49152 events => throughput is 2.05E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings ( 5 ) :    0.1308s for    16384 events => throughput is 7.98E-06 events/s
 [COUNTERS] Fortran Reweight             ( 6 ) :    0.0497s for    16384 events => throughput is 3.03E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)    ( 7 ) :    0.0648s for    16384 events => throughput is 3.95E-06 events/s
 [COUNTERS] Fortran SamplePutPoint       ( 8 ) :    0.1283s for  1170103 events => throughput is 1.10E-07 events/s
 [COUNTERS] PROGRAM SampleGetX           ( 10 ) :   1.6825s for 15211307 events => throughput is 1.11E-07 events/s
 [COUNTERS] CudaCpp Initialise           ( 11 ) :   0.4686s
 [COUNTERS] CudaCpp Finalise             ( 12 ) :   0.0261s
 [COUNTERS] CudaCpp MEs                  ( 19 ) :   0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] OVERALL NON-MEs              ( 21 ) :   3.9745s
 [COUNTERS] OVERALL MEs                  ( 22 ) :   0.0346s for    16384 events => throughput is 2.11E-06 events/s
```

```
CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                :    4.9952s
 [COUNTERS] Fortran Other                ( 0 ) :    0.1855s
 [COUNTERS] Fortran Initialise(I/O)      ( 1 ) :    0.0721s
 [COUNTERS] Fortran Random2Momenta       ( 3 ) :    3.6892s for  1170103 events => throughput is 3.15E-06 events/s
 [COUNTERS] Fortran PDFs                 ( 4 ) :    0.1060s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings ( 5 ) :    0.1393s for    16384 events => throughput is 8.50E-06 events/s
 [COUNTERS] Fortran Reweight             ( 6 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)    ( 7 ) :    0.0693s for    16384 events => throughput is 4.23E-06 events/s
 [COUNTERS] Fortran SamplePutPoint       ( 8 ) :    0.1448s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] PROGRAM SampleGetX           ( 10 ) :   2.1239s for 15211307 events => throughput is 1.40E-07 events/s
 [COUNTERS] CudaCpp Initialise           ( 11 ) :   0.4745s
 [COUNTERS] CudaCpp Finalise             ( 12 ) :   0.0265s
 [COUNTERS] CudaCpp MEs                  ( 19 ) :   0.0347s for    16384 events => throughput is 2.12E-06 events/s
 [COUNTERS] OVERALL NON-MEs              ( 21 ) :   4.9604s
 [COUNTERS] OVERALL MEs                  ( 22 ) :   0.0347s for    16384 events => throughput is 2.12E-06 events/s
```
I am not sure we have a systematic timing overhead as in the past. Back then the problem was that the O/S was not using the right std::chrono implementation; I think that is gone. However, even on a good O/S, std::chrono-based timers do have an overhead. This is quite obvious in my timing exercises for sample_get_x in WIP PR #970. Because of that, I implemented a new timer based on rdtsc. This may (or may not?) have issues in multi-threaded environments, but for the timing exercises I do most of the time it looks good enough. This is in PR #970. See for instance here: 2cddbdb
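For context, the idea behind an rdtsc-based timer calibrated "on the average frequency during the existence of the timer" can be sketched as follows. This is not the actual PR #970 code: the class name `RdtscStopwatch` and its interface are illustrative only, it is x86-only, and it assumes an invariant TSC (constant tick rate across frequency scaling), which holds on most recent x86-64 CPUs but is not guaranteed.

```cpp
#include <chrono>
#include <cstdint>
#include <x86intrin.h> // __rdtsc (x86 only; GCC/clang)

// Illustrative sketch: laps are recorded as cheap rdtsc tick deltas, and
// ticks are converted to seconds only on demand, using the average TSC
// frequency observed between construction and the query. This keeps the
// expensive std::chrono calls out of the hot start()/stop() path.
class RdtscStopwatch
{
public:
  RdtscStopwatch()
    : m_refTicks( __rdtsc() )
    , m_refTime( std::chrono::steady_clock::now() )
  {
  }
  void start() { m_lapStart = __rdtsc(); }                 // cheap: one rdtsc
  void stop() { m_accumTicks += __rdtsc() - m_lapStart; }  // cheap: one rdtsc
  // Convert accumulated ticks to seconds using the average tick rate
  // measured over the stopwatch's whole lifetime so far.
  double seconds() const
  {
    const uint64_t totalTicks = __rdtsc() - m_refTicks;
    const double totalSec =
      std::chrono::duration<double>( std::chrono::steady_clock::now() - m_refTime ).count();
    return m_accumTicks * ( totalSec / totalTicks );
  }
private:
  uint64_t m_refTicks;
  std::chrono::steady_clock::time_point m_refTime;
  uint64_t m_lapStart = 0;
  uint64_t m_accumTicks = 0;
};
```

As noted above, such a calibration is "good enough" for profiling, but the raw TSC can differ across cores, which is why multi-threaded use would need more care.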
I keep rdtsc as the default; an environment variable falls back to chrono. I am closing this now.
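The per-call cost difference between the two time sources can be measured directly with a small microbenchmark. This is a hedged sketch (function names are mine, not from the repo); the absolute numbers depend heavily on the CPU and on which kernel clock source backs `steady_clock`, so only the relative comparison is meaningful.

```cpp
#include <chrono>
#include <cstdint>
#include <x86intrin.h> // __rdtsc (x86 only)

// Average cost in nanoseconds of one std::chrono::steady_clock::now() call,
// measured over n iterations.
inline double chronoCallCostNs( int n = 1000000 )
{
  uint64_t sink = 0; // accumulate results so the loop is not optimised away
  const auto t0 = std::chrono::steady_clock::now();
  for( int i = 0; i < n; ++i )
    sink += std::chrono::steady_clock::now().time_since_epoch().count();
  const auto t1 = std::chrono::steady_clock::now();
  if( sink == 0 ) return 0.0; // practically unreachable; defeats dead-code elimination
  return std::chrono::duration<double, std::nano>( t1 - t0 ).count() / n;
}

// Average cost in nanoseconds of one __rdtsc() call, for comparison.
inline double rdtscCallCostNs( int n = 1000000 )
{
  uint64_t sink = 0;
  const auto t0 = std::chrono::steady_clock::now();
  for( int i = 0; i < n; ++i )
    sink += __rdtsc();
  const auto t1 = std::chrono::steady_clock::now();
  if( sink == 0 ) return 0.0;
  return std::chrono::duration<double, std::nano>( t1 - t0 ).count() / n;
}
```

On a machine where `steady_clock` goes through the vDSO fast path the gap is modest; when it falls back to a real system call the chrono cost grows by an order of magnitude, which is the pathology discussed in this thread.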
From the @hageboeck presentation https://indico.cern.ch/event/1292145/
There is a flamegraph where the Fortran counters (again!) seem to have an overhead.
To be checked: are we using the "good" clock source on the O/S, or still the one that relies on expensive system calls?
See the long discussions in #116 (and maybe #117)
Low priority
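The "good clock source" check above can be done by reading the standard sysfs file on Linux; `tsc` is the fast vDSO-backed source, while `hpet` or `acpi_pm` force expensive reads. A minimal helper (the function name is illustrative; on non-Linux systems the file simply does not exist and the helper reports "unknown"):

```cpp
#include <fstream>
#include <string>

// Return the clock source the Linux kernel is currently using, read from
// the standard sysfs location, or "unknown" if the file is unavailable.
std::string currentClockSource()
{
  std::ifstream f( "/sys/devices/system/clocksource/clocksource0/current_clocksource" );
  std::string src;
  if( !( f >> src ) ) return "unknown";
  return src;
}
```

If this reports anything other than `tsc`, every `std::chrono` call in the counters is likely paying the system-call price discussed in #116.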