
Do we have a timing overhead again? #685

Closed
valassi opened this issue Jun 6, 2023 · 4 comments

Comments

@valassi
Member

valassi commented Jun 6, 2023

From @hageboeck's presentation https://indico.cern.ch/event/1292145/
There is a flamegraph where the Fortran counters (again!) seem to have an overhead.

To be checked: are we using the "good" clock source on the O/S, or still one that relies on expensive system calls?

See the long discussions in #116 (and maybe #117)

Low priority

@valassi
Member Author

valassi commented Jun 6, 2023

To be done anyway: compile counters with -O3 as suggested by @hageboeck
See commit [6aed8a8](https://github.com/madgraph5/madgraph4gpu/commit/6aed8a8957a7292d690d0cb5208bf889be482e57) in MR #642

From the discussion: "we are being unfair to Fortran because there is an overhead in its timing measurements"

@valassi
Member Author

valassi commented Aug 9, 2023

To be done anyway: compile counters with -O3 as suggested by @hageboeck. See commit [6aed8a8](https://github.com/madgraph5/madgraph4gpu/commit/6aed8a8957a7292d690d0cb5208bf889be482e57) in MR #642

From the discussion: "we are being unfair to Fortran because there is an overhead in its timing measurements"

This is now addressed in #740 and will be fixed in an upcoming patch by @hageboeck

valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 16, 2024
…es the old Timer with a new interface, will also include a new RDTSC based timer) madgraph5#685

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 Found          997  events.
 Wrote           59  events.
 Actual xsec    5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                         :    4.7737s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1561s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0677s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.5233s for  1170103 events => throughput is 3.01E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1038s for    49152 events => throughput is 2.11E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1370s for    16384 events => throughput is 8.36E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0667s for    16384 events => throughput is 4.07E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1347s for  1170103 events => throughput is 1.15E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    2.0438s for 15211307 events => throughput is 1.34E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4699s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0265s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.7388s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 16, 2024
… RdtscTimer - the overhead is clearly lower! madgraph5#685

The results are calibrated on the average frequency during the lifetime of the timer, but this should be good enough

In P0_gux_taptamggux
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 Found          997  events.
 Wrote           59  events.
 Actual xsec    5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                         :    3.9949s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1231s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0677s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.7841s for  1170103 events => throughput is 2.38E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1070s for    49152 events => throughput is 2.18E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1397s for    16384 events => throughput is 8.53E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0539s for    16384 events => throughput is 3.29E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0684s for    16384 events => throughput is 4.18E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1226s for  1170103 events => throughput is 1.05E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    1.6560s for 15211307 events => throughput is 1.09E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4666s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    3.9599s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 16, 2024
…f CUDACPP_RUNTIME_USECHRONOTIMERS is set (but rdtsc is now the default) madgraph5#685

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.0091s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1216s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0700s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8138s for  1170103 events => throughput is 2.40E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1008s for    49152 events => throughput is 2.05E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1308s for    16384 events => throughput is 7.98E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0497s for    16384 events => throughput is 3.03E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0648s for    16384 events => throughput is 3.95E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1283s for  1170103 events => throughput is 1.10E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    1.6825s for 15211307 events => throughput is 1.11E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4686s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0261s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    3.9745s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.9952s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1855s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0721s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.6892s for  1170103 events => throughput is 3.15E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1060s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1393s for    16384 events => throughput is 8.50E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0533s for    16384 events => throughput is 3.25E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0693s for    16384 events => throughput is 4.23E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1448s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    2.1239s for 15211307 events => throughput is 1.40E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4745s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0265s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.9604s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s
@valassi
Member Author

valassi commented Aug 16, 2024

I am not sure that we still have a systematic timing overhead as in the past. Back then the problem was the O/S not using the right std::chrono implementation; I think that is gone.

However, even on a good O/S, the std::chrono implementation of timers does have an overhead. This is quite obvious in my timing exercises for sample_get_x in WIP PR #970.

Because of that, I implemented a new timer based on rdtsc. This may (or may not?) have issues in multi-threaded environments, but for the timing exercises I do most of the time it looks good enough. This is in PR #970.

See for instance commit 2cddbdb:


 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9912s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1225s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8019s for  1170103 events => throughput is 2.39E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1025s for    49152 events => throughput is 2.09E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1329s for    16384 events => throughput is 8.11E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0513s for    16384 events => throughput is 3.13E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0653s for    16384 events => throughput is 3.99E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1188s for  1170103 events => throughput is 1.02E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    1.6723s for 15211307 events => throughput is 1.10E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4680s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0262s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    3.9563s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    4.8516s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1718s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0679s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.5816s for  1170103 events => throughput is 3.06E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1041s for    49152 events => throughput is 2.12E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1351s for    16384 events => throughput is 8.24E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0520s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0653s for    16384 events => throughput is 3.98E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1432s for  1170103 events => throughput is 1.22E-07 events/s
 [COUNTERS] PROGRAM SampleGetX             ( 10 ) :    2.0685s for 15211307 events => throughput is 1.36E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4697s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0261s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.8167s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

I keep rdtsc as the default; setting the env variable CUDACPP_RUNTIME_USECHRONOTIMERS=1 falls back to std::chrono.

I am closing this now.

@valassi valassi closed this as completed Aug 16, 2024
@valassi valassi reopened this Aug 16, 2024
@valassi valassi closed this as completed Aug 16, 2024