remove unnecessary first pass on all bridge events in cudacpp helicity calculation from madevent (and improve timers) #960
Conversation
Using more /tmp/avalassi/input_ggttggg_test
256 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! ICONFIG number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)

For CUDACPP_RUNTIME_VECSIZEUSED=256 ./madevent_fortran < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 3.2359s
[COUNTERS] Fortran Overhead ( 0 ) : 0.0972s
[COUNTERS] Fortran MEs ( 1 ) : 3.1387s for 256 events => throughput is 8.16E+01 events/s

For CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 7.1173s
[COUNTERS] Fortran Overhead ( 0 ) : 3.4293s
[COUNTERS] CudaCpp MEs ( 2 ) : 3.6880s for 256 events => throughput is 6.94E+01 events/s

For CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 1.5505s
[COUNTERS] Fortran Overhead ( 0 ) : 0.7714s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.7791s for 256 events => throughput is 3.29E+02 events/s
…1.f and counters.cc, remove "counters_smatrix1_" functions and calls, which are not used anywhere

There is a small but noticeable difference in ggttggg (probably much more in simpler processes?)

For CUDACPP_RUNTIME_VECSIZEUSED=256 ./madevent_fortran < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 3.1335s
[COUNTERS] Fortran Overhead ( 0 ) : 0.0983s
[COUNTERS] Fortran MEs ( 1 ) : 3.0352s for 256 events => throughput is 8.43E+01 events/s
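For context on the timer cleanup above: the madevent counters are chrono-based timers exposed to Fortran through extern "C" symbols in counters.cc. The snippet below is only a minimal sketch of that pattern, with illustrative names (counters_example_*) rather than the exact functions in the repository.

#include <chrono>
#include <cstdio>

namespace
{
  std::chrono::steady_clock::time_point exampleStart; // timestamp of the last 'start' call
  double exampleSeconds = 0;                          // accumulated wall-clock time
  int exampleCalls = 0;                               // number of start/stop pairs
}

extern "C"
{
  // Trailing underscores so that Fortran can call these as 'counters_example_start()' etc.
  void counters_example_start_() { exampleStart = std::chrono::steady_clock::now(); }
  void counters_example_stop_()
  {
    const std::chrono::duration<double> d = std::chrono::steady_clock::now() - exampleStart;
    exampleSeconds += d.count();
    exampleCalls++;
  }
  void counters_example_finalise_()
  {
    std::printf( "[COUNTERS] Example ( n ) : %9.4fs for %d calls\n", exampleSeconds, exampleCalls );
  }
}

Removing an unused counter like "counters_smatrix1_" then simply means deleting its start/stop pair and the corresponding Fortran call sites.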
…l fix in gg_tt.mad)
…CudaCpp helicities madgraph5#958

CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 7.0962s
[COUNTERS] Fortran Overhead ( 0 ) : 0.0969s
[COUNTERS] CudaCpp MEs ( 2 ) : 3.6843s for 256 events => throughput is 6.95E+01 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 3.3149s for 256 events => throughput is 7.72E+01 events/s

CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 1.5576s
[COUNTERS] Fortran Overhead ( 0 ) : 0.1012s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.7721s for 256 events => throughput is 3.32E+02 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.6843s for 256 events => throughput is 3.74E+02 events/s
…udaCpp helicities madgraph5#958 (remove event count and throughput)
…e.inc interface, add parameter goodHelOnly as in Bridge to quit after a few events in cudacpp helicity computation (fix madgraph5#958 aka madgraph5#546)

CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 4.1082s
[COUNTERS] Fortran Overhead ( 0 ) : 0.0979s
[COUNTERS] CudaCpp MEs ( 2 ) : 3.8176s for 256 events => throughput is 6.71E+01 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.1927s

CUDACPP_RUNTIME_VECSIZEUSED=256 ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_test
[COUNTERS] PROGRAM TOTAL : 0.9085s
[COUNTERS] Fortran Overhead ( 0 ) : 0.0995s
[COUNTERS] CudaCpp MEs ( 2 ) : 0.7692s for 256 events => throughput is 3.33E+02 events/s
[COUNTERS] CudaCpp HEL ( 3 ) : 0.0398s

(Also fix clang formatting in counters)
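The goodHelOnly flag is what lets the Fortran side stop the cudacpp computation right after the helicity-filtering step, instead of also computing matrix elements for the full event vector. The class below is only an illustrative sketch of that control flow (hypothetical names), not the actual Bridge API.

#include <vector>

// Sketch: an ME "sequence" that can quit right after the good-helicity
// filtering step when goodHelOnly is true.
class MEBridgeSketch
{
public:
  void sequence( const std::vector<double>& momenta, std::vector<double>& matrixElements, bool goodHelOnly )
  {
    if( !m_goodHelsComputed )
    {
      computeGoodHelicities( momenta ); // helicity filtering needs only a few events
      m_goodHelsComputed = true;
    }
    if( goodHelOnly ) return; // early exit: skip the full ME computation
    computeMatrixElements( momenta, matrixElements ); // full computation on all events
  }

private:
  void computeGoodHelicities( const std::vector<double>& /*momenta*/ ) { /* filter helicities */ }
  void computeMatrixElements( const std::vector<double>& /*momenta*/, std::vector<double>& mes )
  {
    for( auto& me : mes ) me = 0; // placeholder for the real computation
  }
  bool m_goodHelsComputed = false;
};

In this picture, madevent makes one initial call with goodHelOnly=true (timed under "CudaCpp HEL") and then normal calls with goodHelOnly=false for the actual event loop (timed under "CudaCpp MEs"), which is why the HEL counter drops from seconds to a small fraction of a second in the numbers above.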
…small issue to fix)
…alformed patches

The only files that still need to be patched are
- 3 in patch.common: Source/makefile, Source/genps.inc, SubProcesses/makefile
- 3 in patch.P1: auto_dsig1.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
…df (tested on a command line)
… added to generateAndCompare.sh
…ggg by a template
Hi @oliviermattelaer this is essentially ready for review. Can you please review? I renamed it as "remove unnecessary first pass on all bridge events in cudacpp helicity calculation from madevent (and improve timers)". Essentially, what this does is:
So I would say that all is understood. This was a nasty performance overhead. Now cudacpp should look a bit better with respect to fortran, especially if only a few events are generated.
I guess it would make more sense to include this in the master_goodhel branch, since the two are in a way related.
This would allow moving HELONLY from a boolean to a float and passing limhel to the code, so that this helicity selection is done as it should be.
Or we can merge this one, and then I do the switch in master_goodhel to have that parameter as a float (but we should not wait for merging master_goodhel anyway).
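To illustrate the bool-to-float idea (this is only a sketch of one possible convention, not what master_goodhel implements): a limhel-like float passed down to the filtering code could act as a relative threshold for dropping helicities, instead of a simple on/off switch.

#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only: keep the helicities whose largest
// contribution over the sampled events is above limhel times the overall maximum.
std::vector<int> filterGoodHelicities( const std::vector<double>& maxMEperHelicity, double limhel )
{
  double maxAll = 0;
  for( double me : maxMEperHelicity )
    if( me > maxAll ) maxAll = me;
  std::vector<int> goodHels;
  for( std::size_t ihel = 0; ihel < maxMEperHelicity.size(); ihel++ )
    if( maxMEperHelicity[ihel] > limhel * maxAll ) goodHels.push_back( (int)ihel );
  return goodHels;
}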
But can you comment on why some matrix.f files are modified? This is weird/bad.
(no issue here actually)
Cheers,
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (review comments resolved)
…ph5#960 (remove first cudacpp pass in helicity calculation)

On itgold91:
Code generation completed in 161 seconds
Code generation and additional checks completed in 246 seconds
…91, with 16384 vector_size, and the removal of cudacpp helicity pass madgraph5#960

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack
…ld91, vector_size=16384, overhead reduced PR madgraph5#960

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -nomakeclean -ALL pp_dy3j.mad -fromgridpack
./parseGridpackLogs.sh pp_dy3j.mad/

pp_dy3j.mad//fortran/output.txt
[GridPackCmd.launch] OVERALL TOTAL 443.0382 seconds
[madevent COUNTERS] PROGRAM TOTAL 438.801
[madevent COUNTERS] Fortran Overhead 131.826
[madevent COUNTERS] Fortran MEs 306.975
--------------------------------------------------------------------------------
pp_dy3j.mad//cppnone/output.txt
[GridPackCmd.launch] OVERALL TOTAL 443.1323 seconds
[madevent COUNTERS] PROGRAM TOTAL 438.864
[madevent COUNTERS] Fortran Overhead 131.804
[madevent COUNTERS] CudaCpp MEs 306.034
[madevent COUNTERS] CudaCpp HEL 1.025
--------------------------------------------------------------------------------
pp_dy3j.mad//cppsse4/output.txt
[GridPackCmd.launch] OVERALL TOTAL 290.4177 seconds
[madevent COUNTERS] PROGRAM TOTAL 286.159
[madevent COUNTERS] Fortran Overhead 131.795
[madevent COUNTERS] CudaCpp MEs 153.803
[madevent COUNTERS] CudaCpp HEL 0.5612
--------------------------------------------------------------------------------
pp_dy3j.mad//cppavx2/output.txt
[GridPackCmd.launch] OVERALL TOTAL 199.7083 seconds
[madevent COUNTERS] PROGRAM TOTAL 195.451
[madevent COUNTERS] Fortran Overhead 131.835
[madevent COUNTERS] CudaCpp MEs 63.3324
[madevent COUNTERS] CudaCpp HEL 0.2837
--------------------------------------------------------------------------------
pp_dy3j.mad//cpp512y/output.txt
[GridPackCmd.launch] OVERALL TOTAL 195.8398 seconds
[madevent COUNTERS] PROGRAM TOTAL 191.538
[madevent COUNTERS] Fortran Overhead 131.888
[madevent COUNTERS] CudaCpp MEs 59.3799
[madevent COUNTERS] CudaCpp HEL 0.2715
--------------------------------------------------------------------------------
pp_dy3j.mad//cpp512z/output.txt
[GridPackCmd.launch] OVERALL TOTAL 171.8862 seconds
[madevent COUNTERS] PROGRAM TOTAL 167.589
[madevent COUNTERS] Fortran Overhead 131.943
[madevent COUNTERS] CudaCpp MEs 35.4473
[madevent COUNTERS] CudaCpp HEL 0.1996
--------------------------------------------------------------------------------
pp_dy3j.mad//cuda/output.txt
File not found: SKIP backend cuda
--------------------------------------------------------------------------------
pp_dy3j.mad//hip/output.txt
File not found: SKIP backend hip
--------------------------------------------------------------------------------
Hi Olivier, thanks for looking at this. One point: I really think that we should try NOT to have too many master_xxx branches, and rather have everything in master; otherwise it becomes unmanageable. I work on master, and I also worked on master_june24 (to merge it into master, essentially ready), but I would avoid more branches. (What I try to do is to have several PRs on master in parallel; when I need one inside another I do include it, but always trying to make sure that they can be merged to master one after the other.) This specific PR makes sense against master, so I would put it on master. And it does NOT have much to do with the other helicity work you are doing in fortran (IIUC), because there you are reducing from two fortran helicity calculations to one. Here I am just modifying how many events are used in the cudacpp helicity computation; I am not changing one vs two computations in cudacpp, nor am I touching what is done in fortran. So I would really merge it in master.
Uh? This I really do not understand; I should see what you have done. But again, I'd prefer to merge this to master first and then look at the rest. All the tests I am doing for CMS use this now. Thanks :-)
Approve the merge of the CI, and of master_goodhel and ..., and they will be in master. The multiplication of master_xxx branches is just because I have parallel work and therefore multiple PRs (which for some weird reason take a while to be merged). While master_june24 was special, the others are quite "normal" branches (and the master_ prefix is just my typical naming of a branch, indicating which branch I am originating from).
No problem, I will then do a master_goodhel_limhel branch to show you what I want to do for it (but preferably we merge master_goodhel into master first --which should be uncontroversial-- so that I can do a master_limhel branch).
STARTED AT Thu Aug 8 07:40:53 PM CEST 2024
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Aug 8 08:05:50 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Aug 8 08:14:09 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Aug 8 08:22:29 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Aug 8 08:25:13 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Aug 8 08:27:57 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common
ENDED(6) AT Thu Aug 8 08:30:45 PM CEST 2024 [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean
ENDED(7) AT Thu Aug 8 08:42:09 PM CEST 2024 [Status=0]
…heft madgraph5#833, but gqttq madgraph5#845 crash is fixed)

STARTED AT Thu Aug 8 08:42:09 PM CEST 2024
(SM tests)
ENDED(1) AT Fri Aug 9 12:48:36 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Fri Aug 9 12:58:52 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
…an-14 on Mac github-hosted runners (fix madgraph5#971)
… merging
git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…MS nvcc without nvtx PR madgraph5#966) into hel
I have updated this PR, including:
This should be ready to merge once the CI tests have passed.
Hi @oliviermattelaer thanks again for the discussion yesterday. Ok, now I understand better what you mean. We could reuse the same parameter as a float in the following way:
Note also that I opened #975 about possibly adding channelid to helicity filtering (which is currently done on all channels).
The CI completed; there are the usual 3 expected failures. Hi @oliviermattelaer, as discussed yesterday, I now self-merge this. Some followup work remains to do in:
I also still need to review and merge your comprehensive helicity changes in fortran (which I think are #955 and related work). Thanks, Andrea
… mac madgraph5#974, nvcc madgraph5#966) into june24

Fix conflicts:
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/counters.cc
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/fbridge.cc
epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f
epochX/cudacpp/gg_tt.mad/SubProcesses/counters.cc
epochX/cudacpp/gg_tt.mad/SubProcesses/fbridge.cc

NB: here I essentially fixed gg_tt.mad, not CODEGEN, which will need to be adjusted a posteriori with a backport. In particular:
- Note1: patch.P1 is now taken from june24, but will need to be recomputed
  git checkout HEAD CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
- Note2: I need to manually port some upstream/master changes in auto_dsig1.f to smatrix_multi.f, which did not yet exist
… mac madgraph5#974, nvcc madgraph5#966) into pptt

Fix conflicts:
epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1 (take HEAD version, must recompute)
epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f (fix manually)
… mac madgraph5#974, nvcc madgraph5#966) into prof
… mac madgraph5#974, nvcc madgraph5#966) into grid
…dgraph5#960, mac madgraph5#974, nvcc madgraph5#966) into cmsdy

Fix conflict in tlau/fromgridpacks/parseGridpackLogs.sh (use the current cmsdy version: git checkout b125b65 tlau/fromgridpacks/parseGridpackLogs.sh)
…rge with hel madgraph5#960, mac madgraph5#974, nvcc madgraph5#966) into cmsdyps
This is a WIP PR to improve timers and the cudacpp helicity computation in madevent (#958 and #546).