fix FPEs and debug nobm_pp_ttW for ATLAS #706
…also affected by madgraph5#696 [avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/SubProcesses/P1_Sigma_loop_sm_no_b_mass_gg_ttx> make HRDCOD=1 OMPFLAGS=-fopenmp AVX=512y FPTYPE=d HELINL=0 HRDCOD=1 RNDGEN=hasCurand Building in BUILDDIR=. for tag=512y_d_inl0_hrd1_hasCurand (USEBUILDDIR is not set) make -C ../../src -f cudacpp_src.mk make[1]: Entering directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src' AVX=512y ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -O3 -std=c++17 -I. -fPIC -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -c Parameters_loop_sm_no_b_mass.cc -o Parameters_loop_sm_no_b_mass.o In file included from Parameters_loop_sm_no_b_mass.cc:15: Parameters_loop_sm_no_b_mass.h: In function ‘const Parameters_loop_sm_no_b_mass_dependentCouplings::DependentCouplings_sv Parameters_loop_sm_no_b_mass_dependentCouplings::computeDependentCouplings_fromG(const fptype_sv&)’: Parameters_loop_sm_no_b_mass.h:291:46: error: ‘COND’ was not declared in this scope 291 | const fptype_sv mdl_GWcft_UV_t_1EPS_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF ); | ^~~~ Parameters_loop_sm_no_b_mass.h:300:138: error: ‘reglog’ was not declared in this scope 300 | const fptype_sv mdl_G_UVt_FIN_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF * reglog( mdl_MT__exp__2 / mdl_MU_R__exp__2 ) ); | ^~~~~~ make[1]: *** [cudacpp_src.mk:241: Parameters_loop_sm_no_b_mass.o] Error 1 make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src' make: *** [makefile:520: ../../lib/libmg5amc_common.so] Error 2
…5#696 (NB: this was tested in June, but I am only committing this in July) Generation succeeds: ./CODEGEN/generateAndCompare.sh nobm_pp_ttW --mad Builds fail during launch: HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW ccache /usr/local/cuda-12.0/bin/nvcc -O3 -lineinfo -I. -I../../src -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -use_fast_math -std=c++17 -ccbin /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -Xcompiler -fPIC -c gMatrixElementKernels.cu -o gMatrixElementKernels.o ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -O3 -std=c++17 -I. -fPIC -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -c Parameters_loop_sm_no_b_mass.cc -o Parameters_loop_sm_no_b_mass.o In file included from Parameters_loop_sm_no_b_mass.cc:15: Parameters_loop_sm_no_b_mass.h: In function ‘const Parameters_loop_sm_no_b_mass_dependentCouplings::DependentCouplings_sv Parameters_loop_sm_no_b_mass_dependentCouplings::computeDependentCouplings_fromG(const fptype_sv&)’: Parameters_loop_sm_no_b_mass.h:291:46: error: ‘COND’ was not declared in this scope 291 | const fptype_sv mdl_GWcft_UV_t_1EPS_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF ); | ^~~~ Parameters_loop_sm_no_b_mass.h:300:138: error: ‘reglog’ was not declared in this scope 300 | const fptype_sv mdl_G_UVt_FIN_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF * reglog( mdl_MT__exp__2 / mdl_MU_R__exp__2 ) ); | ^~~~~~ make[2]: *** [cudacpp_src.mk:240: Parameters_loop_sm_no_b_mass.o] Error 1
… ttW and ttZ production
… loop_) This fixes madgraph5#696 for this process (COND and reglog are not needed - see madgraph5#697)
…o_b_mass
  for f in $(find nobm_pp_ttW.mad -name '*loop*'); do git mv $f ${f/loop_sm/sm}; done
…ut loop_) (Note: two .py files have been added in nobm_pp_ttW.mad/bin/internal/ufomodel) This fixes madgraph5#696 in the build for this process (COND and reglog are not needed) However execution fails with IEEE floating point exceptions (FPE madgraph5#701) HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW INFO: Running Survey Creating Jobs Working on SubProcesses INFO: P1_gu_ttxwpd INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements INFO: P1_gd_ttxwmu Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
…xt (for debugging madgraph5#701)
  cp dump_SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU_MadgraphTest.CompareMomentaAndME_0.txt ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt
This is necessary because runTest was failing otherwise
  pushd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
  make cleanall; HRDCOD=1 make -j
  ./runTest.exe
Before this succeeds however, it is necessary to rebuild
The runTest now succeeds - i.e. this is not enough to debug madgraph5#701
  pushd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
  make cleanall; HRDCOD=1 make -j
  ./runTest.exe
In fact, lauX.sh still fails
  HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW
  INFO: Running Survey
  Creating Jobs
  Working on SubProcesses
  INFO: P1_gu_ttxwpd
  INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
  INFO: P1_gd_ttxwmu
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
To make sure that this is coming from CPP and not CUDA, retry the same without CUDA: it fails as above
  CUDA_HOME=none HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW
  INFO: Running Survey
  Creating Jobs
  Working on SubProcesses
  INFO: P1_gu_ttxwpd
  INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
  INFO: P1_gd_ttxwmu
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
NB: note in particular that there are FOUR floating point exceptions
  IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
The next step will be to merge the fpe branch here and try again...
(this is the merge of fpe as of commit 3658f3f, before fixing madgraph5#730 and madgraph5#731)
…he fixes for madgraph5#701
Now launching fails with a new build error (in cuda)
(this was later filed as madgraph5#730 and fixed in a later commit of branch fpe)
  HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW
  ccache /usr/local/cuda-12.0/bin/nvcc -Xcompiler -fPIC -c -x cu Parameters_sm_no_b_mass.cc -o Parameters_sm_no_b_mass_cu.o
  In file included from Parameters_sm_no_b_mass.cc:15:
  Parameters_sm_no_b_mass.h:26:2: error: #error This non-SM physics process only supports MGONGPU_HARDCODE_PARAM builds (madgraph5#439): please run "make HRDCOD=1"
     26 | #error This non-SM physics process only supports MGONGPU_HARDCODE_PARAM builds (madgraph5#439): please run "make HRDCOD=1"
        |  ^~~~~
Since I want to use CPP only, I retry disabling also CUDA:
  CUDA_HOME=none HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW
And... this fixes the IEEE division by zero, but unfortunately it still finds other IEEE exceptions!
  INFO: Running Survey
  Creating Jobs
  Working on SubProcesses
  INFO: P1_gu_ttxwpd
  INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
  INFO: P1_gd_ttxwmu
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
In summary: the IEEE_DIVIDE_BY_ZERO part of madgraph5#701 has been fixed, but not the other FPEs...
There are THREE IEEE FPEs still pending in pp_ttW.mad
  IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
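The `IEEE_*` notes in the logs above list floating-point status flags that were raised during the run. As a minimal sketch of the same bookkeeping in C++, the standard `<cfenv>` functions can clear the flags, run one operation, and report which flags it raised. The helper name `raisedFlags` is hypothetical and not part of cudacpp. There is no portable C++ macro for the denormal flag, so only the three standard flags seen in the logs are covered.

```cpp
#include <cassert>
#include <cfenv>
#include <string>

// Hypothetical helper (not part of cudacpp): run `op` and report which of the
// standard IEEE status flags it raised, with names similar to those in the logs.
template<typename F>
std::string raisedFlags( F&& op )
{
  std::feclearexcept( FE_ALL_EXCEPT );               // start from a clean state
  assert( std::fetestexcept( FE_ALL_EXCEPT ) == 0 ); // sanity check: nothing pending
  op();
  std::string out;
  if( std::fetestexcept( FE_INVALID ) ) out += "INVALID ";
  if( std::fetestexcept( FE_DIVBYZERO ) ) out += "DIVIDE_BY_ZERO ";
  if( std::fetestexcept( FE_UNDERFLOW ) ) out += "UNDERFLOW ";
  return out;
}
```

Wrapping a suspect computation in a lambda and passing it to `raisedFlags` narrows down which step sets a flag. The operands should be runtime values (for instance read through `volatile` variables), otherwise the compiler may fold the operation at compile time and no flag is raised. Builds with `-ffast-math`, as used in the logs above, may also change which flags are observed.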
…raph5#683
(This is the result of cherry-picking ce995d8 and fixing conflicts)
The syntax for launching has now changed - must add the trailing .mad
  CUDA_HOME=none HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW.mad
This now fails with the usual three IEEE FPEs (all except division by zero)
(this is the merge of fpe as of commit 49f9d3f, which will be merged to master in madgraph5#723)
… fpe with the fixes for madgraph5#730 and madgraph5#731
Now the CUDA build of nobm_pp_ttW works - but the SIMD execution still fails with three FPEs madgraph5#733
  HRDCOD=1 tlau/lauX.sh -CPP nobm_pp_ttW.mad
  INFO: Running Survey
  Creating Jobs
  Working on SubProcesses
  INFO: P1_gu_ttxwpd
  INFO: Building madevent in madevent_interface.py with 'CPP' matrix elements
  INFO: P1_gd_ttxwmu
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
  Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
…m_pp_ttW.mad and the three FPEs in madgraph5#733
Quick update: this WIP MR is where I am testing the nobm_pp_ttW process suggested by ATLAS. This was initially affected by four floating point exceptions (#701). In MR #723 one of those (IEEE_DIVIDE_BY_ZERO) has been fixed. There are still three FPEs, to be followed up in #733. (PS 25 Nov 2023: FPE #733 was fixed by Stefan ~one month ago. Only #783 was pending and I fixed it yesterday in this MR)
…heck_sa.cc to debug madgraph5#733 ./check.exe -p 1 8 1 (ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread (ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1 INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it Floating Point Exception (CPU) And even more [avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu> gdb --args ./check.exe -p 1 8 1 ... (gdb) run Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1 ... Program received signal SIGFPE, Arithmetic exception. 0x00007ffff7f19ee6 in void mg5amcCpu::VVV1P0_1<mg5amcCpu::KernelAccessWavefunctions<false>, mg5amcCpu::KernelAccessCouplings<false> >(double const*, double const*, double const*, double, double, double*) () from /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/../../lib/libmg5amc_gd_ttxwmu_cpp.so Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-225.el8.x86_64 (gdb) where 0 0x00007ffff7f19ee6 in void mg5amcCpu::VVV1P0_1<mg5amcCpu::KernelAccessWavefunctions<false>, mg5amcCpu::KernelAccessCouplings<false> >(double const*, double const*, double const*, double, double, double*) () from /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/../../lib/libmg5amc_gd_ttxwmu_cpp.so 1 0x00007ffff7f14d3d in mg5amcCpu::calculate_wavefunctions(int, double const*, double const*, double*, unsigned int, double*, double*, double __vector(4)*, int) () from /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/../../lib/libmg5amc_gd_ttxwmu_cpp.so 2 0x00007ffff7f168ba in mg5amcCpu::sigmaKin_getGoodHel(double const*, double const*, double*, double*, double*, bool*, int) 
() from /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/../../lib/libmg5amc_gd_ttxwmu_cpp.so 3 0x00007ffff7f1a42d in mg5amcCpu::MatrixElementKernelHost::computeGoodHelicities() () from /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/../../lib/libmg5amc_gd_ttxwmu_cpp.so 4 0x000000000040ae4b in main () (gdb)
…s.h to debug madgraph5#733
This clearly shows that the problem here is in the coupling COUP
  CUDA_HOME=none HRDCOD=1 make -j -f cudacpp.mk
  ./check.exe -p 1 8 1
  (ompnumthreadsNotSetMeansOneThread) DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
  (ompnumthreadsNotSetMeansOneThread) omp_get_max_threads() = 1
  INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
  INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
  Compute denom = COUP / x
  x = { (-757714,0), (-1.08408e+06,0), (-514531,0), (-510399,0) }
  COUP = { (0,2.122e-314), (0.461905,2.122e-314), (0,7.29112e-304), (0,7.29112e-304) }
  Floating Point Exception (CPU)
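In the printout above, the imaginary parts of COUP (2.122e-314 and 7.29112e-304) are either subnormal or implausibly small for a physics coupling, while x looks sane. A minimal sketch of a check that would flag such values automatically is given below. The helper names and the `tiny` threshold are hypothetical, chosen only for illustration, and are not part of cudacpp. Reading such values as a sign of unset memory is an assumption, not something the commit states.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical helper (not part of cudacpp): true if `v` is NaN, infinite,
// subnormal, or nonzero with a magnitude below the (arbitrary) threshold `tiny`.
inline bool suspiciousComponent( double v, double tiny = 1e-290 )
{
  assert( tiny > 0 ); // the threshold must be a positive magnitude
  const int cls = std::fpclassify( v );
  if( cls == FP_NAN || cls == FP_INFINITE || cls == FP_SUBNORMAL ) return true;
  return cls != FP_ZERO && std::fabs( v ) < tiny;
}

// Hypothetical helper: a complex coupling is suspicious if either component is.
inline bool suspiciousCoupling( const std::complex<double>& c, double tiny = 1e-290 )
{
  return suspiciousComponent( c.real(), tiny ) || suspiciousComponent( c.imag(), tiny );
}
```

With this threshold all four COUP entries printed above are flagged, while the x entries are not, since an exact zero imaginary part is treated as legitimate. Note that `std::fpclassify` may not behave reliably in a `-ffast-math` build, so a check like this belongs in a debug build without that flag.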
… failures
In my current setup on itscrd80 (broken repo with missing mg5amcnlo submodule) this gives
  *** ERROR! Code generation failed
  python3: can't open file '/data/avalassi/GPU2023/madgraph4gpuX/MG5aMC/mg5amcnlo/./bin/mg5_aMC': [Errno 2] No such file or directory
  ...
  *** ERROR! Code generation failed
…ation and additional checks
Code generation takes ~1 minute (~2 minutes in total including additional checks). Source code is 40 MB. There are 40 P3 subdirectories with O(50-100) diagrams each.
This is the most complex subprocess from nobm_pp_eejjjj (there are 856 Feynman diagrams). I checked that there is no need to use the nobmass model, it gives essentially the same code (as far as complexity is concerned). Code generation takes half a minute including additional checks.
NB1: the code builds ok for HRDCOD=0 (so madgraph5#695 does NOT affect this!), and runTest is ok in all P* (the ref files are there)
NB2: a full tlau test is now also ok on this process, showing that all issues madgraph5#701 madgraph5#733 madgraph5#783 etc have been fixed
  tlau/lauX.sh -CPP nobm_pp_ttW.mad
  ...
  Cross-section : 1.276 +- 0.007916 pb
In summary: this process should now be fully usable by ATLAS and other experiments.
…implicity (it can always be added back) All issues in this process have been fixed and the process seems to be fully functional
Hi @roiser @oliviermattelaer @hageboeck this is now complete as far as I can tell. All CI tests are ok too.

Apart from a number of logs and tests (e.g. for FPEs #701 and FPEs #733, which I initially identified here and are now fixed) and things that were removed and cleaned up (in particular, I tested nobm_pp_ttW for ATLAS and had added it to the repo, but I have now removed it and only added it to the CI, including codegen), there is really one important fix: the fix for FPEs #783. This is actually a protection that disables the evt-by-evt color choice if channelID==0.

By the way, HRDCOD=0 builds of nobm_pp_ttW are also ok (#696 does not affect this).

Rephrasing: I think that nobm_pp_ttW is now fully debugged and usable by ATLAS, which was a big blocker in my opinion.

Can you please check if it all looks ok for you? In addition, think about when you want to add it (especially @roiser, as this conflicts a bit with your channelId arrays). Maybe the easiest is that we merge Stefan's channelId arrays first, and then I update this one and fix conflicts. Let me know.

Thanks
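To illustrate the kind of protection described for #783, here is a minimal sketch, under stated assumptions, of a guard that skips an event-by-event colour choice when the channel id is 0. It is not the actual change in commit 6ac6cf9: the function name `selectColor`, its arguments, and the weighting scheme are all hypothetical, and only the `channelId == 0` guard itself comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not the cudacpp implementation): pick a colour flow with
// probability proportional to its weight, using a random number rnd in [0,1).
// Returns the 0-based index of the chosen colour, or -1 if no choice is made.
int selectColor( unsigned int channelId, const std::vector<double>& weights, double rnd )
{
  assert( rnd >= 0.0 && rnd < 1.0 );
  if( channelId == 0 ) return -1; // guard: no event-by-event colour choice for channel 0
  double total = 0;
  for( double w : weights ) total += w;
  if( !( total > 0 ) ) return -1; // also avoid normalising by zero (or by NaN)
  double cumulative = 0;
  for( std::size_t i = 0; i < weights.size(); ++i )
  {
    cumulative += weights[i] / total;
    if( rnd < cumulative ) return static_cast<int>( i );
  }
  return static_cast<int>( weights.size() ) - 1; // rounding fallback: last colour
}
```

The point of the early return is that none of the arithmetic inside the guarded region runs at all for channel 0, so it cannot raise floating-point exceptions there. Moving a whole block inside such an `if` also changes its indentation, which would match the remark that the diff looks larger than it is.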
PS again, the main change I suggest you review is 6ac6cf9 (it seems like a lot of changes, but it is only because the indentation changes! I had to move a lot of code inside an if(){} bracket).
For this, try "ignore whitespace" when you view the commit. It's indeed a rather small change, so for me it looks alright.
Hi @roiser @oliviermattelaer I am now merging this as agreed yesterday after the meeting with ATLAS - this should allow ATLAS to test pp_ttW
… PR madgraph5#706) ** regenerate all processes
…adgraph5#706) into mch
Fix conflicts in CODEGEN logs by checking them out from upstream/master
  git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…pstream/master including PR madgraph5#706)
…pstream/master including PR madgraph5#706) - ok, changes are only in codegen logs
…er including PR madgraph5#706) ** regenerate all processes, removing g*.cu symlinks
…adgraph5#706) into makefiles
Fix conflicts in CODEGEN logs by checking them out from upstream/master
  git checkout upstream/master $(git ls-tree --name-only HEAD */CODEGEN*txt)
…merging in upstream/master including PR madgraph5#706) - ok, changes are only in codegen logs
…adgraph5#706) into jtmk
Fix conflicts in CODEGEN log by checking it out from upstream/master
  git checkout upstream/master gg_tt.mad/CODEGEN_mad_gg_tt_log.txt
WIP: add sm-no_b_mass processes for ATLAS
These are affected by #695 #696 #701
(PS 25 Nov 2023: the first two issues do not apply to nobm_pp_ttW, only to loop-nobm models; the third issue has been fixed)