WIP: studies on CMS DY #946

Draft
wants to merge 303 commits into
base: master
Conversation

valassi
Member

@valassi valassi commented Aug 2, 2024

This is a WIP PR with various studies on CMS Drell-Yan, addressing a number of issues.

@valassi valassi self-assigned this Aug 2, 2024
@valassi valassi marked this pull request as draft August 2, 2024 10:20
Revert "[cmsdy] in tlau add the results of x10 ppttdy012j fortran tests (manually fix the directory name)"
This reverts commit f1a9800.
…(disabling FPEs)

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -nomakeclean -ALL pp_dy012j.mad
Revert "[grid] rerun ggtt cuda tlau with latest code"
This reverts commit e72e16d.
…to grid

Fix conflicts:	epochX/cudacpp/tlau/lauX.sh
./tlau/lauX.sh -fortran gg_tt.mad -togridpack
./tlau/lauX.sh -fortran gg_tt.mad -fromgridpack
…one (with backend switch)

./tlau/lauX.sh -fortran gg_tt.mad -fromgridpack
…LL backends (with backend switch)

./tlau/lauX.sh -ALL gg_tt.mad -fromgridpack

What remains TODO
- instrument a better profiling of the time spent
- add events.lhe comparison madgraph5#956 (once the fortran/cpp mismatch and the second helicity are fixed)
…n itgold91)

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -nomakeclean -fortran pp_dy012j.mad -fromgridpack
…ne (with backend switch)

./tlau/lauX.sh -cppnone gg_tt.mad -fromgridpack
…LL backends (with backend switch)

./tlau/lauX.sh -ALL gg_tt.mad -fromgridpack

What remains TODO
- instrument a better profiling of the time spent
- add events.lhe comparison madgraph5#956 (once the fortran/cpp mismatch and the second helicity are fixed)
…madevent_interface.py and prepare to modify it

cp -dpr gg_tt.mad/madevent/bin/internal/madevent_interface.py MG5aMC_patches/

It must then be symlinked in gg_tt.mad/madevent/bin/internal:
ln -sf ../../../../MG5aMC_patches/madevent_interface.py .
git checkout prof $(git ls-tree --name-only prof */CODEGEN*txt)
Fix conflicts:
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (just hashes)
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (keep sample_get_x patches from cmsdy)
…12j and pp_dy3j to ensure that everything is in sync
@valassi
Member Author

valassi commented Aug 23, 2024

(Several branches have been updated: in this cmsdy branch I included prof, included the regenerated grid, and regenerated all dy processes)

Updated status for timer/counter, grid/runcard, cmsdy and new sampling branches:

…to the subprocess of pp_dy3j I focused on in the cmsdy branch)

Note: there is no need to use no_b_mass to test phase space sampling in this specific process
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8456s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1201s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0671s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.6146s for  1087437 events => throughput is 4.16E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0961s for    32768 events => throughput is 3.41E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1713s for    16384 events => throughput is 9.56E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0488s for    16384 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0685s for    16384 events => throughput is 2.39E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1250s for  1087437 events => throughput is 8.70E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4711s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0271s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8099s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9229s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1521s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0677s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.6424s for  1087437 events => throughput is 4.12E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0971s for    32768 events => throughput is 3.37E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1721s for    16384 events => throughput is 9.52E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0488s for    16384 events => throughput is 3.35E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0687s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1388s for  1087437 events => throughput is 7.83E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4717s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0278s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0355s for    16384 events => throughput is 4.61E+05 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8873s
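
The per-counter lines above carry enough information to recompute the printed throughput (events divided by seconds). The following is a minimal Python sketch for doing that when comparing runs; the regular expression and the helper names `parse_counter` and `throughput` are illustrative assumptions based only on the log layout shown here, not part of counters.cc.

```python
import re

# Matches per-counter lines such as
#  [COUNTERS] Fortran PDFs ( 4 ) : 0.0961s for 32768 events => throughput is ...
# Lines without a "( id )" field (e.g. PROGRAM TOTAL) are deliberately not matched.
COUNTER_RE = re.compile(
    r"\[COUNTERS\]\s+(?P<name>.+?)\s+\(\s*(?P<id>\d+)\s*\)\s*:\s*"
    r"(?P<seconds>[\d.]+)s"
    r"(?:\s+for\s+(?P<events>\d+)\s+events)?"
)


def parse_counter(line):
    """Return (name, id, seconds, events-or-None) for one counter line, else None."""
    match = COUNTER_RE.search(line)
    if match is None:
        return None
    events = int(match.group("events")) if match.group("events") else None
    return (
        match.group("name"),
        int(match.group("id")),
        float(match.group("seconds")),
        events,
    )


def throughput(seconds, events):
    """Events per second, as printed after 'throughput is' in the log."""
    return events / seconds


if __name__ == "__main__":
    sample = (
        " [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.6146s"
        " for  1087437 events => throughput is 4.16E+05 events/s"
    )
    name, counter_id, seconds, events = parse_counter(sample)
    print(f"{name} ({counter_id}): {throughput(seconds, events):.2E} events/s")
```

Formatting the result with `.2E` gives the same notation as the log, which makes a line-by-line comparison straightforward.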
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4690s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1183s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0676s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2338s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0967s for    32768 events => throughput is 3.39E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1720s for    16384 events => throughput is 9.53E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0493s for    16384 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0690s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1267s for  1087437 events => throughput is 8.59E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4725s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0273s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0359s for    16384 events => throughput is 4.57E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3445s for 14136681 events => throughput is 6.03E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4331s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0359s for    16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS ***
 [COUNTERS] PROGRAM TOTAL                         :    5.1651s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1577s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0668s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.8723s for  1087437 events => throughput is 2.81E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0983s for    32768 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1724s for    16384 events => throughput is 9.50E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0510s for    16384 events => throughput is 3.21E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0689s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1411s for  1087437 events => throughput is 7.71E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4737s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0272s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.7722s for 14136681 events => throughput is 5.10E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1294s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
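
Comparing the two pairs of runs above gives a rough idea of what the extra SampleGetX counter costs per call. The sketch below is only back-of-the-envelope arithmetic in Python: it assumes that the added counter is the only difference between the runs with and without it, and it ignores run-to-run noise, so the numbers are indicative at best. The helper name `extra_ns_per_call` is hypothetical.

```python
# PROGRAM TOTAL values (seconds) copied from the logs above.
# "without" = runs with no SampleGetX counter, "with" = runs that include it.
CALLS_SAMPLE_GET_X = 14136681  # count printed on the SampleGetX line

PROGRAM_TOTALS = {
    "rdtsc": {"without": 3.8456, "with": 4.4690},
    "chrono": {"without": 3.9229, "with": 5.1651},
}


def extra_ns_per_call(total_without, total_with, calls):
    """Increase in PROGRAM TOTAL per instrumented call, in nanoseconds."""
    return (total_with - total_without) / calls * 1e9


if __name__ == "__main__":
    for timer, totals in PROGRAM_TOTALS.items():
        ns = extra_ns_per_call(totals["without"], totals["with"], CALLS_SAMPLE_GET_X)
        print(f"{timer}: about {ns:.0f} ns of extra wall time per SampleGetX call")
```

Under these assumptions the std::chrono timers cost roughly twice as much per call as the rdtsc-based ones, which is consistent with the larger PhaseSpaceSampling time in the chrono run.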
…ad from getTotalDurationSeconds calls

This should be ok for counters.cc but is not enough for timermap.h
…econds() call and go back to the old getTotalDurationSeconds
…mer overhead if CUDACPP_RUNTIME_REMOVETIMEROVERHEAD is set

However, test counters like sample_get_x need special handling
…UNTERS, remove special meaning of PROGRAM counters
…ng a TEST counter as included in a non-TEST counter, to subtract overheads
…SpaceSampling

These are the first results where timer overhead is removed: they look nice,
but the overhead should be computed in the counters.cc calls rather than in the individual timers
(this would also make more sense with respect to timermap.h, where this will not be possible; rename the env, too)

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4608s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1171s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0690s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2317s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0917s for    32768 events => throughput is 3.57E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1719s for    16384 events => throughput is 9.53E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0483s for    16384 events => throughput is 3.39E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0691s for    16384 events => throughput is 2.37E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1276s for  1087437 events => throughput is 8.52E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3519s for 14136681 events => throughput is 6.01E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4251s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2204s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1550s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0697s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9335s for  1087437 events => throughput is 2.76E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0924s for    32768 events => throughput is 3.55E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1722s for    16384 events => throughput is 9.52E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0487s for    16384 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0689s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1401s for  1087437 events => throughput is 7.76E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4779s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8064s for 14136681 events => throughput is 5.04E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1846s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: RdtscTimer overhead :    0.0179s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.4668s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.2924s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.1745s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1190s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0696s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9612s for  1087437 events => throughput is 3.67E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0913s for    32768 events => throughput is 3.59E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1709s for    16384 events => throughput is 9.59E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0482s for    16384 events => throughput is 3.40E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0678s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1125s for  1087437 events => throughput is 9.67E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4716s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.0989s for 14136681 events => throughput is 6.74E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1387s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.58E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: ChronoTimer overhead :    0.0489s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.2779s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.7998s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4781s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1570s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2485s for  1087437 events => throughput is 3.35E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0930s for    32768 events => throughput is 3.52E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1716s for    16384 events => throughput is 9.55E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0474s for    16384 events => throughput is 3.46E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0681s for    16384 events => throughput is 2.41E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0929s for  1087437 events => throughput is 1.17E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4705s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.1629s for 14136681 events => throughput is 6.54E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4424s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8210s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8210s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVETIMEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.8301s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8301s
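
The "overhead for 1M start/stop cycles" lines above suggest a calibration loop followed by a subtraction proportional to the number of calls. The Python sketch below illustrates that idea only; `ToyTimer`, `calibrate_overhead` and `remove_overhead` are hypothetical names, and the actual logic in timer.h and counters.cc may well differ.

```python
import time


class ToyTimer:
    """Hypothetical start/stop timer; a stand-in, not the class from timer.h."""

    def __init__(self):
        self.total_ns = 0
        self._t0 = 0

    def start(self):
        self._t0 = time.perf_counter_ns()

    def stop(self):
        self.total_ns += time.perf_counter_ns() - self._t0


def calibrate_overhead(cycles=1_000_000):
    """Wall-clock seconds spent on `cycles` empty start/stop pairs."""
    timer = ToyTimer()
    t0 = time.perf_counter()
    for _ in range(cycles):
        timer.start()
        timer.stop()
    return time.perf_counter() - t0


def remove_overhead(measured_seconds, calls, overhead_per_cycle):
    """Subtract calls * overhead_per_cycle, never going below zero."""
    return max(0.0, measured_seconds - calls * overhead_per_cycle)


if __name__ == "__main__":
    print(f"overhead: {calibrate_overhead():.4f}s for 1M start/stop cycles")
    # Figures from the logs above: 0.0179s per 1M cycles (rdtsc), and the
    # SampleGetX line of the rdtsc run that keeps the overhead (2.3519s).
    corrected = remove_overhead(2.3519, 14136681, 0.0179 / 1_000_000)
    print(f"SampleGetX with estimated overhead removed: {corrected:.4f}s")
```

Applied to the SampleGetX time from the run that keeps the overhead, this gives about 2.099s, close to the 2.0989s reported with removal enabled. The two figures come from separate runs, so the agreement is only indicative.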
…s: this will be moved to counters alone

Revert "[prof] in gux_taptamggux.mad timer.h, add instead a getTotalOverheadSeconds() call and go back to the old getTotalDurationSeconds"
This reverts commit ad9b747.

Revert "[prof] in gux_taptamggux.mad timer.h, add the option to remove overhead from getTotalDurationSeconds calls"
This reverts commit 5c0a2ed.
…unter overhead (remove it from timer.h: there will be none for timermap.h)

Rename the env as CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD to make it clear that this is in the counters.cc infrastructure

These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.5315s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1198s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0678s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2691s for  1087437 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1044s for    32768 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1757s for    16384 events => throughput is 9.33E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0543s for    16384 events => throughput is 3.02E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0731s for    16384 events => throughput is 2.24E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1322s for  1087437 events => throughput is 8.23E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0274s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3686s for 14136681 events => throughput is 5.97E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4957s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0358s for    16384 events => throughput is 4.57E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.2048s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1559s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0673s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.9265s for  1087437 events => throughput is 2.77E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0993s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1648s for    16384 events => throughput is 9.94E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0514s for    16384 events => throughput is 3.19E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0700s for    16384 events => throughput is 2.34E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1365s for  1087437 events => throughput is 7.97E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4711s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0264s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8006s for 14136681 events => throughput is 5.05E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.1691s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0331s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5208s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.5413s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9795s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1548s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7547s for  1087437 events => throughput is 3.95E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0988s for    32768 events => throughput is 3.32E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1639s for    16384 events => throughput is 1.00E+05 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0510s for    16384 events => throughput is 3.21E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0674s for    16384 events => throughput is 2.43E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0898s for  1087437 events => throughput is 1.21E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4700s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8855s for 14136681 events => throughput is 7.50E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.9439s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0356s for    16384 events => throughput is 4.60E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0640s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.3491s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.0455s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.3036s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2216s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0692s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.0230s for  1087437 events => throughput is 3.60E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0992s for    32768 events => throughput is 3.30E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1652s for    16384 events => throughput is 9.92E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0504s for    16384 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0684s for    16384 events => throughput is 2.39E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0716s for  1087437 events => throughput is 1.52E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4727s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9427s for 14136681 events => throughput is 7.28E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.2679s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0039s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.7998s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.7998s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0038s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    3.9067s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.0000s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9067s
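
Two quick cross-checks can be made on the numbers in (2) above: PROGRAM TOTAL should equal TOTAL+COUNTEROVERHEAD minus COUNTEROVERHEAD, and the reported overhead can be compared with the calibrated per-cycle cost times the number of counter calls. The Python sketch below does both. It assumes that the event counts of the three busiest counters (SampleGetX, PhaseSpaceSampling, SamplePutPoint) approximate the number of start/stop cycles, which is a guess: other counters may cover a whole batch of events per call. The helper name `estimated_overhead` is hypothetical.

```python
# (TOTAL+COUNTEROVERHEAD, COUNTEROVERHEAD, TOTAL) in seconds, from the logs in (2).
RUNS = {
    "rdtsc": (4.5208, 0.5413, 3.9795),
    "chrono": (5.3491, 1.0455, 4.3036),
}

# "COUNTERS overhead ... for 1M start/stop cycles" from the same logs (seconds).
SECONDS_PER_MILLION_CYCLES = {"rdtsc": 0.0331, "chrono": 0.0640}

# Assumption: one start/stop cycle per counted event for the three busiest
# counters (SampleGetX + PhaseSpaceSampling + SamplePutPoint).
CALLS = 14136681 + 1087437 + 1087437


def estimated_overhead(calls, seconds_per_million):
    """Calibrated cost per cycle multiplied by the number of cycles."""
    return calls * seconds_per_million / 1e6


if __name__ == "__main__":
    for timer, (gross, overhead, net) in RUNS.items():
        residual = (gross - overhead) - net
        estimate = estimated_overhead(CALLS, SECONDS_PER_MILLION_CYCLES[timer])
        print(
            f"{timer}: gross - overhead - net = {residual:+.4f}s, "
            f"estimated overhead {estimate:.4f}s vs reported {overhead:.4f}s"
        )
```

With these inputs the estimate lands within about 1% of the reported PROGRAM COUNTEROVERHEAD for both timers, which suggests, but does not prove, that the overhead is dominated by the per-call cost of these three counters.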
…ter overhead

These are the results

(1) keep overhead

./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.4766s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1202s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0685s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    3.2400s for  1087437 events => throughput is 3.36E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1007s for    32768 events => throughput is 3.25E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1673s for    16384 events => throughput is 9.79E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0521s for    16384 events => throughput is 3.14E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0687s for    16384 events => throughput is 2.38E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1237s for  1087437 events => throughput is 8.79E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4728s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0269s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.3496s for 14136681 events => throughput is 6.02E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.4409s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    5.3144s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1588s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    4.0191s for  1087437 events => throughput is 2.71E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0996s for    32768 events => throughput is 3.29E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1660s for    16384 events => throughput is 9.87E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0508s for    16384 events => throughput is 3.22E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0704s for    16384 events => throughput is 2.33E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1482s for  1087437 events => throughput is 7.34E+06 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4718s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    2.8646s for 14136681 events => throughput is 4.94E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    5.2787s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

(2) remove overhead

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0338s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.8244s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.8905s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.9339s
 [COUNTERS] Fortran Other                  (  0 ) :    0.2954s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.7332s for  1087437 events => throughput is 3.98E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1003s for    32768 events => throughput is 3.27E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1688s for    16384 events => throughput is 9.71E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0507s for    16384 events => throughput is 3.23E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0695s for    16384 events => throughput is 2.36E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0924s for  1087437 events => throughput is 1.18E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4692s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.8723s for 14136681 events => throughput is 7.55E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    3.8982s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0357s for    16384 events => throughput is 4.59E+05 events/s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0637s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    5.8826s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    1.6786s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    4.2040s
 [COUNTERS] Fortran Other                  (  0 ) :    0.4831s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0691s
 [COUNTERS] Fortran PhaseSpaceSampling     (  3 ) :    2.9924s for  1087437 events => throughput is 3.63E+05 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0983s for    32768 events => throughput is 3.33E+05 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1669s for    16384 events => throughput is 9.81E+04 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0506s for    16384 events => throughput is 3.24E+05 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0676s for    16384 events => throughput is 2.42E+05 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.0698s for  1087437 events => throughput is 1.56E+07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4712s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s
 [COUNTERS] TEST    SampleGetX             ( 21 ) :    1.9227s for 14136681 events => throughput is 7.35E+06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 31 ) :    4.1690s
 [COUNTERS] OVERALL MEs                    ( 32 ) :    0.0350s for    16384 events => throughput is 4.68E+05 events/s
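
The two logs above can be cross-checked with simple arithmetic: PROGRAM TOTAL should equal TOTAL+COUNTEROVERHEAD minus COUNTEROVERHEAD, and OVERALL NON-MEs plus OVERALL MEs should add up to PROGRAM TOTAL. A minimal sketch follows, with the numbers copied by hand from the logs. The dictionary layout, the `check` helper and the rounding tolerance are assumptions made for illustration and are not part of the cudacpp code.

```python
# Timings in seconds, copied by hand from the two [COUNTERS] logs above.
# The key names are illustrative only; they do not exist in the cudacpp code.
runs = {
    "rdtsc": {
        "total_plus_overhead": 4.8244,
        "overhead": 0.8905,
        "total": 3.9339,
        "non_mes": 3.8982,
        "mes": 0.0357,
    },
    "chrono": {
        "total_plus_overhead": 5.8826,
        "overhead": 1.6786,
        "total": 4.2040,
        "non_mes": 4.1690,
        "mes": 0.0350,
    },
}


def check(run, tol=2e-4):
    """Return two booleans for one run.

    First: TOTAL == (TOTAL+COUNTEROVERHEAD) - COUNTEROVERHEAD.
    Second: TOTAL == OVERALL NON-MEs + OVERALL MEs.
    The tolerance (an assumption) allows for the 4-decimal rounding in the log.
    """
    subtraction_ok = abs(run["total_plus_overhead"] - run["overhead"] - run["total"]) < tol
    split_ok = abs(run["non_mes"] + run["mes"] - run["total"]) < tol
    return subtraction_ok, split_ok


for name, run in runs.items():
    print(name, check(run))
```

Both relations hold for the RDTSC and the std::chrono runs, to within the rounding of the printed values.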

(3) remove overhead, disable individual timers (so here the overhead is 0)

CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0333s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.1897s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.3330s
 -------------------------------------------------------------
 [COUNTERS] *** USING RDTSC-BASED TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8567s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_REMOVECOUNTEROVERHEAD=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 INFO: COUNTERS overhead :    0.0659s for 1M start/stop cycles
 [COUNTERS] PROGRAM TOTAL+COUNTEROVERHEAD         :    4.5119s
 [COUNTERS] PROGRAM COUNTEROVERHEAD               :    0.6594s
 -------------------------------------------------------------
 [COUNTERS] *** USING STD::CHRONO TIMERS (remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8525s

(4) do not remove overhead, disable individual timers (this also removes the overhead of estimating the overhead)
(this test was done on another day on the same machine and build, but the results are compatible with the previous ones)

CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING RDTSC-BASED TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8072s

CUDACPP_RUNTIME_USECHRONOTIMERS=1 CUDACPP_RUNTIME_DISABLECALLTIMERS=1 \
./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggtt_x1_cudacpp
 [COUNTERS] *** USING STD::CHRONO TIMERS (do not remove timer overhead) ***
 [COUNTERS] PROGRAM TOTAL                         :    3.8214s
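
As a back-of-the-envelope reading of the numbers in tests (2) and (3), the reported PROGRAM COUNTEROVERHEAD can be divided by the calibrated cost per 1M start/stop cycles to get an implied number of cycles. This is only a sketch: the helper name `implied_cycles_millions` is hypothetical, and the assumption that the overhead is simply (number of cycles) x (calibrated cost per cycle) is an interpretation of the logs, not something confirmed by the code.

```python
def implied_cycles_millions(overhead_s, cost_per_million_s):
    """Implied number of timer start/stop cycles, in millions.

    Assumes (not confirmed by the source) that the reported overhead is
    the number of cycles times the calibrated cost per cycle.
    """
    return overhead_s / cost_per_million_s


# (label, PROGRAM COUNTEROVERHEAD in s, calibrated cost in s per 1M cycles),
# all copied by hand from the logs of tests (2) and (3).
cases = [
    ("rdtsc,  call timers enabled", 0.8905, 0.0338),
    ("chrono, call timers enabled", 1.6786, 0.0637),
    ("rdtsc,  call timers disabled", 0.3330, 0.0333),
    ("chrono, call timers disabled", 0.6594, 0.0659),
]

for label, overhead_s, cost_s in cases:
    cycles = implied_cycles_millions(overhead_s, cost_s)
    print(f"{label}: about {cycles:.1f}M cycles")
```

Under this assumption both timer types give about 26.3M cycles with the individual timers enabled and about 10.0M with them disabled, so the two timer implementations are at least mutually consistent in how the overhead scales.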
…r merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…Source/makefile madgraph5#980) into prof

(Checked that regenerating gg_tt.mad is all ok)
…r merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…er merging

git checkout upstream/master $(git ls-tree --name-only upstream/master */CODEGEN*txt)
…adgraph5#980) into cmsdy

Fix conflicts:
- epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common (remove Source/makefile)
- epochX/cudacpp/CODEGEN/allGenerateAndCompare.sh (add processes from both branches)

(Checked that regenerating gg_tt.mad is ok)