Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') #845

Closed
valassi opened this issue May 16, 2024 · 10 comments · Fixed by #874

@valassi
Member

valassi commented May 16, 2024

While rerunning tests in PR #841 I came across a new FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad tests.

This is very surprising because I think that there is actually no change in the code (just some makefile changes leading to file name changes). I will try to rerun the test.

Anyway, for reference the issue is here in tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt

...
*** (2-512z) EXECUTE MADEVENT_CPP x10 (create events.lhe) ***
--------------------
CUDACPP_RUNTIME_FBRIDGEMODE = (not set)
CUDACPP_RUNTIME_VECSIZEUSED = 8192
--------------------
81920 1 1 ! Number of events and max and min iterations
0.000001 ! Accuracy (ignored because max iterations = min iterations)
0 ! Grid Adjustment 0=none, 2=adjust (NB if = 0, ftn26 will still be used if present)
1 ! Suppress Amplitude 1=yes (i.e. use MadEvent single-diagram enhancement)
0 ! Helicity Sum/event 0=exact
1 ! Channel number (1-N) for single-diagram enhancement multi-channel (NB used even if suppress amplitude is 0!)
--------------------
Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f2a1a623860 in ???
#1  0x7f2a1a622a05 in ???
#2  0x7f2a1a254def in ???
#3  0x7f2a1ae20acc in ???
#4  0x7f2a1acc4575 in ???
#5  0x7f2a1ae1d4c9 in ???
#6  0x7f2a1ae2570d in ???
#7  0x7f2a1ae2afa1 in ???
#8  0x43008b in ???
#9  0x431c10 in ???
#10  0x432d47 in ???
#11  0x433b1e in ???
#12  0x44a921 in ???
#13  0x42ebbf in ???
#14  0x40371e in ???
#15  0x7f2a1a23feaf in ???
#16  0x7f2a1a23ff5f in ???
#17  0x403844 in ???
#18  0xffffffffffffffff in ???
./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
 PDF set = nn23lo1
 alpha_s(Mz)= 0.1300 running at 2 loops.
 alpha_s(Mz)= 0.1300 running at 2 loops.
 Renormalization scale set on event-by-event basis
 Factorization   scale set on event-by-event basis


 getting user params
Enter number of events and max and min iterations: 
 Number of events and iterations        81920           1           1
@valassi
Member Author

valassi commented May 16, 2024

Very strange. I have rerun the test and the FPE has disappeared. Closing as not reproducible.

@valassi valassi closed this as completed May 16, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue madgraph5#845 is the following
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…ll close it as not reproducible

./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
@valassi
Member Author

valassi commented May 30, 2024

Note: similar issues have resurfaced in susy_gg_t1t1, being debugged in #826

@valassi valassi reopened this Jun 3, 2024
@valassi
Member Author

valassi commented Jun 3, 2024

I have found this SIGFPE again in gqttq for FPTYPE=f, while running the code with Olivier's patch #850 for the susy xsec mismatch #825.

My impression (or hope) is that this is the same issue as #855, i.e. a SIGFPE in Fortran aloha_functions.f that can be fixed with volatile (see PR #857). I will test that too.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
…adgraph5#825) is fixed as expected, but a SIGFPE crash in gqttq (madgraph5#845) reappears

Also still pending: missing xsec in susyggt1t1 (madgraph5#826), LHE mismatch for FPTYPE=f in heftggbb (madgraph5#833)

STARTED  AT Mon Jun  3 10:16:45 AM CEST 2024
(SM tests)
ENDED(1) AT Mon Jun  3 02:56:24 PM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Mon Jun  3 03:06:33 PM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
@valassi valassi changed the title FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad test Intermittent FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad test Jun 3, 2024
@valassi valassi changed the title Intermittent FPE "Floating-point exception - erroneous arithmetic operation" in gqttq tmad test Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in random color selection within sigmakin) Jun 3, 2024
@valassi
Member Author

valassi commented Jun 3, 2024

I have debugged this further.

First, this is intermittent. Rerunning the same executable multiple times, the code sometimes succeeds and sometimes fails. It fails perhaps half of the time, perhaps less.

Second, this is NOT RELATED to the other SIGFPE #855 in rotxxx. So it will not be fixed by #857.

This crash happens deep inside cudacpp, within the color selection. I managed to create a gdb trace after rebuilding everything with -g. This is in 19a2e0c from PR #860

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp

and

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> gdb ./madevent_cpp 
...
(gdb) run < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98db1 in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1193
1193                if( okcol )
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p okcol
$1 = <optimized out>
(gdb) p allrndcol
$3 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p ievt
$4 = <optimized out>
(gdb) p ieppV
$5 = <optimized out>
(gdb) p targetamp
$6 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {
    3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
(gdb) p neppV
$7 = 16
(gdb) p icolC
$8 = <optimized out>
(gdb) p ncolor
$9 = 4
(gdb) w
Missing arguments.
(gdb) l
1188    #if defined MGONGPU_CPPSIMD
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190    #else
1191                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192    #endif
1193                if( okcol )
1194                {
1195                  allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1196                  break;
1197                }

I suspect that this is instead related to the iconfig-channel mapping issues that @oliviermattelaer investigated in #852.

Anyway, keep this open.
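To make the failing statement easier to reason about, here is a minimal standalone sketch of the per-event color selection from the listing above (lines 1186-1197). It is not the cudacpp code: `selectColor` and its `std::vector` argument are hypothetical, and the `volatile` on the ratio is only an assumption about the kind of change the issue title refers to. The actual fix is not shown in this thread excerpt.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (NOT the cudacpp implementation): select a color for one
// event from cumulative color weights, mirroring the loop at CPPProcess.cc:1186.
// Returns a Fortran-style index in [1,ncolor], or 0 if no color was selected.
inline int selectColor( const std::vector<float>& targetamp, float rndcol )
{
  const int ncolor = static_cast<int>( targetamp.size() );
  if( ncolor == 0 ) return 0;
  for( int icolC = 0; icolC < ncolor; icolC++ )
  {
    // ASSUMPTION: 'volatile' makes the compiler evaluate and store this ratio
    // once per iteration, for this event only, instead of merging it with
    // divisions for neighbouring values.
    volatile float ratio = targetamp[icolC] / targetamp[ncolor - 1];
    const bool okcol = rndcol < ratio;
    if( okcol ) return icolC + 1; // NB Fortran [1,ncolor], C++ [0,ncolor-1]
  }
  return 0;
}
```

With four identical cumulative weights, as in the `targetamp` dump above where all four rows are equal, the ratio is 1 for every color, so any random number below 1 selects color 1.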

To use the debugger, I added these patches

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu> git diff  --no-ext-diff
diff --git a/epochX/cudacpp/gq_ttq.mad/Source/make_opts b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
index e4b87ee6a..6ccc273c1 100644
--- a/epochX/cudacpp/gq_ttq.mad/Source/make_opts
+++ b/epochX/cudacpp/gq_ttq.mad/Source/make_opts
@@ -1,7 +1,7 @@
 DEFAULT_CPP_COMPILER=g++
 DEFAULT_F2PY_COMPILER=f2py3
 DEFAULT_F_COMPILER=gfortran
-GLOBAL_FLAG=-O3 -ffast-math -fbounds-check
+GLOBAL_FLAG=-g -O3 -ffast-math -fbounds-check
 MACFLAG=
 MG5AMC_VERSION=SpecifiedByMG5aMCAtRunTime
 PYTHIA8_PATH=NotInstalled
diff --git a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
index 89da34009..b8fa4e131 100644
--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/cudacpp.mk
@@ -387,6 +387,10 @@ else
   ###override OMPFLAGS = # disable OpenMP MT on all other platforms (default before #575)
 endif
 
+# Debug SIGFPE crash #845
+override OMPFLAGS=
+override OPTFLAGS=-g -O3
+
 #-------------------------------------------------------------------------------
 
 #=== Configure defaults and check if user-defined choices exist for RNDGEN (legacy!), HASCURAND, HASHIPRAND

@valassi valassi self-assigned this Jun 3, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 3, 2024
@valassi
Member Author

valassi commented Jun 3, 2024

I retried exactly the same recipe on 19a2e0c

This time it seems to crash at line 1189 instead of 1193, but it looks like the same crash

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp
...
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f7a7cc23860 in ???
#1  0x7f7a7cc22a05 in ???
#2  0x7f7a7c854def in ???
#3  0x7f7a7d2f0d6f in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1189
#4  0x7f7a7d2f7a3d in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#5  0x7f7a7d2fd2d1 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#6  0x7f7a7d2fd2d1 in fbridgesequence_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#7  0x43008b in smatrix1_multi_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#8  0x431c10 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#9  0x432d47 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#10  0x433b1e in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#11  0x44a921 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#12  0x42ebbf in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
#13  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
Floating point exception (core dumped)

and through gdb

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgfortran-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libquadmath-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) where
#0  0x00007ffff7f98d6f in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, 
    allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, 
    allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1189
#1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1)
    at MatrixElementKernels.cc:115
#2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cb50, selhel=0x7fffffc2cb50, 
    mes=0x7fffffc3cb50, channelId=1, rndcol=0x7fffffc9ceb0, rndhel=0x7fffffcbceb0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, 
    this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
#3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbceb0, 
    rndcol=0x7fffffc9ceb0, pchannelId=<optimized out>, mes=0x7fffffc3cb50, selhel=0x7fffffc2cb50, selcol=0x7fffffc1cb50) at fbridge.cc:106
#4  0x000000000043008c in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, 
    hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, 
    out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., 
    vecsize_used=16384) at auto_dsig1.f:618
#5  0x0000000000431c11 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, 
    all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
#6  0x0000000000432d48 in dsigproc_vec (all_p=..., 
    all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, 
    all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, 
    symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, 
    all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
#7  0x0000000000433b1f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, 
    all_out=..., vecsize_used=16384) at auto_dsig.f:327
#8  0x000000000044a922 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d10 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384)
    at dsample.f:208
#9  0x000000000042ebc0 in driver () at driver.f:256
#10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:301
#11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
#12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
#13 0x0000000000403845 in _start ()
(gdb) l
1184              const int ievt = ievt00 + ieppV;
1185              //printf( "sigmaKin: ievt=%4d rndcol=%f\n", ievt, allrndcol[ievt] );
1186              for( int icolC = 0; icolC < ncolor; icolC++ )
1187              {
1188    #if defined MGONGPU_CPPSIMD
1189                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1190    #else
1191                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1192    #endif
1193                if( okcol )
(gdb) p okcol
$1 = <optimized out>
(gdb) p ievt
$2 = <optimized out>
(gdb) p ieppV
$3 = <optimized out>
(gdb) p neppV
$4 = 16
(gdb) p icolC
$5 = <optimized out>
(gdb) p ncolor
$6 = 4
(gdb) p allrndcol
$7 = (const mgOnGpu::fptype *) 0x6300d00
(gdb) p targetamp
$8 = {{3.6287187e-05, 0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 
    0.000313476485, 5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 
    0.00301690097, 9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 
    5.8289319e-05, 0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097, 
    9.26938374e-05, 0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05, 
    0.00402065413, 0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}, {3.6287187e-05, 0.00301690097, 9.26938374e-05, 
    0.000283826666, 0.000752002583, 0.000340251776, 1.23037262e-06, 0.00068957475, 0.000313476485, 5.8289319e-05, 0.00402065413, 
    0.00385404564, 3.3489403e-05, 0.0149808694, 0.000554771861, 0.00240404671}}
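
Both traces point at the division `targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV]` on line 1189, yet none of the `targetamp[ncolor - 1]` values printed above is zero. As a hedged illustration of what a single float division can and cannot raise, the following standalone sketch reads the standard `<cfenv>` exception flags after one division. The helper name `flagsRaisedByDivision` is hypothetical, and the sketch only inspects flags instead of enabling traps, so it does not reproduce the crash itself.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper: return the subset of FE_DIVBYZERO | FE_INVALID raised
// by computing num / den as a single-precision division.
inline int flagsRaisedByDivision( float num, float den )
{
  // 'volatile' operands keep the division at run time (no constant folding).
  volatile float n = num;
  volatile float d = den;
  std::feclearexcept( FE_ALL_EXCEPT );
  volatile float q = n / d; // the division under test
  (void)q;
  return std::fetestexcept( FE_DIVBYZERO | FE_INVALID );
}
```

A finite, nonzero denominator such as those in the dump raises neither flag in this sketch, whereas `1/0` sets `FE_DIVBYZERO` and `0/0` sets `FE_INVALID`. If those traps are enabled, a trapping division of that kind would therefore have to involve values other than the 16 printed per row.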

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#850

The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fce5ec23860 in ???
   1  0x7fce5ec22a05 in ???
   2  0x7fce5e854def in ???
   3  0x44b5ff in ???
   4  0x4087df in ???
   5  0x409848 in ???
   6  0x40bb83 in ???
   7  0x40d1a9 in ???
   8  0x45c804 in ???
   9  0x434269 in ???
   10  0x40371e in ???
   11  0x7fce5e83feaf in ???
   12  0x7fce5e83ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f9f03423860 in ???
   1  0x7f9f03422a05 in ???
   2  0x7f9f03054def in ???
   3  0x43809f in ???
   4  0x40581f in ???
   5  0x4067b1 in ???
   6  0x408c71 in ???
   7  0x40a0a9 in ???
   8  0x444fdf in ???
   9  0x42bb38 in ???
   10  0x40371e in ???
   11  0x7f9f0303feaf in ???
   12  0x7f9f0303ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fbafa623860 in ???
   1  0x7fbafa622a05 in ???
   2  0x7fbafa254def in ???
   3  0x7fbafad24034 in ???
   4  0x7fbafa9a1575 in ???
   5  0x7fbafad20c89 in ???
   6  0x7fbafad2abfd in ???
   7  0x7fbafad30491 in ???
   8  0x43008b in ???
   9  0x431c10 in ???
   10  0x432d47 in ???
   11  0x433b1e in ???
   12  0x44a921 in ???
   13  0x42ebbf in ???
   14  0x40371e in ???
   15  0x7fbafa23feaf in ???
   16  0x7fbafa23ff5f in ???
   17  0x403844 in ???
   18  0xffffffffffffffff in ???
  ./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…nd cudacpp.mk to improve the crash dumps

The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
   5  0x4067b1 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
   6  0x408c71 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
   7  0x40a0a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
   8  0x444fdf in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
   9  0x42bb38 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
  ./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed

The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb141c23860 in ???
   1  0x7fb141c22a05 in ???
   2  0x7fb141854def in ???
   3  0x44b5ff in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x4087df in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
   5  0x409848 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
   6  0x40bb83 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
   7  0x40d1a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
   8  0x45c804 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
   9  0x434269 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
  ./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f607ee23860 in ???
   1  0x7f607ee22a05 in ???
   2  0x7f607ea54def in ???
   3  0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
   4  0x7f607f4ab575 in ???
   5  0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
   6  0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
   7  0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
   8  0x7f607f613491 in fbridgesequence_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
   9  0x43008b in smatrix1_multi_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
   10  0x431c10 in dsig1_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
   11  0x432d47 in dsigproc_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
   12  0x433b1e in dsig_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
   13  0x44a921 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
   14  0x42ebbf in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
   15  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
  ./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#852

The gg_ttgg test still crashes (rotxxx madgraph5#855?)
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fce5ec23860 in ???
   1  0x7fce5ec22a05 in ???
   2  0x7fce5e854def in ???
   3  0x44b5ff in ???
   4  0x4087df in ???
   5  0x409848 in ???
   6  0x40bb83 in ???
   7  0x40d1a9 in ???
   8  0x45c804 in ???
   9  0x434269 in ???
   10  0x40371e in ???
   11  0x7fce5e83feaf in ???
   12  0x7fce5e83ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f9f03423860 in ???
   1  0x7f9f03422a05 in ???
   2  0x7f9f03054def in ???
   3  0x43809f in ???
   4  0x40581f in ???
   5  0x4067b1 in ???
   6  0x408c71 in ???
   7  0x40a0a9 in ???
   8  0x444fdf in ???
   9  0x42bb38 in ???
   10  0x40371e in ???
   11  0x7f9f0303feaf in ???
   12  0x7f9f0303ff5f in ???
   13  0x403844 in ???
   14  0xffffffffffffffff in ???
  ./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}

The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?)
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fbafa623860 in ???
   1  0x7fbafa622a05 in ???
   2  0x7fbafa254def in ???
   3  0x7fbafad24034 in ???
   4  0x7fbafa9a1575 in ???
   5  0x7fbafad20c89 in ???
   6  0x7fbafad2abfd in ???
   7  0x7fbafad30491 in ???
   8  0x43008b in ???
   9  0x431c10 in ???
   10  0x432d47 in ???
   11  0x433b1e in ???
   12  0x44a921 in ???
   13  0x42ebbf in ???
   14  0x40371e in ???
   15  0x7fbafa23feaf in ???
   16  0x7fbafa23ff5f in ???
   17  0x403844 in ???
   18  0xffffffffffffffff in ???
  ./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 26, 2024
…nd cudacpp.mk to improve the crash dumps

The susyggt1t1 test clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb7e1223860 in ???
   1  0x7fb7e1222a05 in ???
   2  0x7fb7e0e54def in ???
   3  0x43809f in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x40581f in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1480
   5  0x4067b1 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:1167
   6  0x408c71 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:68
   7  0x40a0a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/genps.f:60
   8  0x444fdf in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/Source/dsample.f:172
   9  0x42bb38 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/driver.f:301
  ./tmad/madX.sh: line 387: 3928626 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_susyggt1t1_x1_cudacpp > /tmp/avalassi/output_susyggt1t1_x1_cudacpp' failed

The ggttgg test also clearly crashes in rotxxx (madgraph5#855):
./tmad/madX.sh -ggttgg -iconfig 104 -makeclean
  *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) ***
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7fb141c23860 in ???
   1  0x7fb141c22a05 in ???
   2  0x7fb141854def in ???
   3  0x44b5ff in rotxxx_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/DHELAS/aloha_functions.f:1247
   4  0x4087df in gentcms_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1480
   5  0x409848 in one_tree_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:1167
   6  0x40bb83 in gen_mom_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:68
   7  0x40d1a9 in x_to_f_arg_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/genps.f:60
   8  0x45c804 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/Source/dsample.f:172
   9  0x434269 in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:256
   10  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/driver.f:301
  ./tmad/madX.sh: line 387: 3933302 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttgg_x1_cudacpp > /tmp/avalassi/output_ggttgg_x1_cudacpp' failed

The gqttq test instead clearly crashes in sigmaKin (madgraph5#845):
./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean
./tmad/teeMadX.sh -gqttq +10x -fltonly
  Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp'
  Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
  Backtrace for this error:
   0  0x7f607ee23860 in ???
   1  0x7f607ee22a05 in ???
   2  0x7f607ea54def in ???
   3  0x7f607f607008 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1190
   4  0x7f607f4ab575 in ???
   5  0x7f607f603c89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1093
   6  0x7f607f60dbfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
   7  0x7f607f613491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
   8  0x7f607f613491 in fbridgesequence_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
   9  0x43008b in smatrix1_multi_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
   10  0x431c10 in dsig1_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
   11  0x432d47 in dsigproc_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
   12  0x433b1e in dsig_vec_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
   13  0x44a921 in sample_full_
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
   14  0x42ebbf in driver
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:256
   15  0x40371e in main
          at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:301
  ./madX.sh: line 387: 3941122 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
  ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed

Conclusion: I would not merge 852 as it does not fix issues yet.
Instead I would merge 857 to fix the rotxxx crash 855 using volatile, and reassess from there...
@valassi valassi changed the title Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in random color selection within sigmakin) Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in sigmakin random color selection - iconfig-channel mapping issues?) Jun 27, 2024
@valassi
Member Author

valassi commented Jun 27, 2024

I changed the name to indicate that this crash is most likely related to iconfig-channel mapping issues.

I will instead remove "iconfig-channel mapping issues" from the name of #855, which is ONLY about the rotxxx crash, most likely unrelated to iconfig-channel mapping issues.

@valassi
Member Author

valassi commented Jun 27, 2024

Note:

  • I can still reproduce this sort of intermittent crash in sigmaKin even after fixing rotxxx.
  • It seems too erratic to be put in the CI: it does not always happen on the second execution, it is really random (and rare). The error may occasionally show up in the CI too, but I see no way to force it.
  • I investigated this code through valgrind. See valgrind issues #868 (comment). Initially I thought I had invalid reads/writes, but these disappear when using a max stack trace. Eventually I got NO ERRORS FROM VALGRIND. So valgrind does not seem useful for investigating this specific issue.

@valassi
Member Author

valassi commented Jun 28, 2024

I have almost completed MR #873 which fixes the channelid-iconfig mapping and icolamp issues in #856.

Unfortunately, however, this does NOT fix this intermittent crash #845.

I have reproduced it again

cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
make cleanall
make -j FPTYPE=f BACKEND=cpp512z 
./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp

Using '-g' in make_opts and cudacpp.mk, this sometimes crashes as follows

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7f04f8a23860 in ???
#1  0x7f04f8a22a05 in ???
#2  0x7f04f8654def in ???
#3  0x7f04f91f200c in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1193
#4  0x7f04f9096575 in ???
#5  0x7f04f91eec89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1095
#6  0x7f04f91f8bfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
#7  0x7f04f91fe491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
#8  0x7f04f91fe491 in fbridgesequence_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
#9  0x4300eb in smatrix1_multi_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
#10  0x431c70 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
#11  0x432da7 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
#12  0x433b7e in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
#13  0x44a9c1 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
#14  0x42ebdf in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:257
#15  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:302
Floating point exception (core dumped)
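
As far as is known, the gfortran runtime prints "erroneous arithmetic operation" for an IEEE invalid operation (for instance 0/0), as opposed to a plain division by zero or an overflow. One way to check which kind of exception a given division raises, without trapping, is to inspect the IEEE status flags around it. The sketch below is a minimal standalone illustration under that assumption: `divisionFlags` is a hypothetical helper, not part of cudacpp, and it says nothing about which operands actually occur at CPPProcess.cc:1193.

```cpp
#include <cfenv>

// Hypothetical helper (not part of the cudacpp code base): returns the IEEE
// exception flags (invalid, divide-by-zero, overflow) raised by num / den.
int divisionFlags( float num, float den )
{
  // 'volatile' keeps the operands and the division at run time, so that the
  // compiler cannot fold the result at compile time
  volatile float n = num;
  volatile float d = den;
  std::feclearexcept( FE_ALL_EXCEPT ); // start from a clean set of flags
  volatile float ratio = n / d;
  (void)ratio; // the value itself is not needed, only the flags
  return std::fetestexcept( FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW );
}
```

With this helper, `divisionFlags( 0.f, 0.f )` has `FE_INVALID` set, `divisionFlags( 1.f, 0.f )` has `FE_DIVBYZERO` set, and `divisionFlags( 1.f, 2.f )` returns 0.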

@valassi valassi changed the title Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in sigmakin random color selection - iconfig-channel mapping issues?) Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test Jun 28, 2024
@valassi
Member Author

valassi commented Jun 28, 2024

I have changed the name (previously "Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (in sigmakin random color selection - iconfig-channel mapping issues?)") because I no longer see a connection to color selection...

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…dgraph5#845

Unfortunately the intermittent crash is still happening (maybe once every 5-10 executions?)
  cd gq_ttq.mad/SubProcesses/P1_gu_ttxu
  make cleanall
  make -j FPTYPE=f BACKEND=cpp512z
  ./madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
 0  0x7f04f8a23860 in ???
 1  0x7f04f8a22a05 in ???
 2  0x7f04f8654def in ???
 3  0x7f04f91f200c in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i._omp_fn.0
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1193
 4  0x7f04f9096575 in ???
 5  0x7f04f91eec89 in _ZN9mg5amcCpu8sigmaKinEPKfS1_S1_S1_PfjS2_S2_PiS3_i
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc:1095
 6  0x7f04f91f8bfd in _ZN9mg5amcCpu23MatrixElementKernelHost21computeMatrixElementsEj
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/MatrixElementKernels.cc:115
 7  0x7f04f91fe491 in _ZN9mg5amcCpu6BridgeIdE12cpu_sequenceEPKdS3_S3_S3_jPdPiS5_b
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/Bridge.h:390
 8  0x7f04f91fe491 in fbridgesequence_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/fbridge.cc:106
 9  0x4300eb in smatrix1_multi_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:618
 10  0x431c70 in dsig1_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig1.f:445
 11  0x432da7 in dsigproc_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:1034
 12  0x433b7e in dsig_vec_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/auto_dsig.f:327
 13  0x44a9c1 in sample_full_
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/Source/dsample.f:208
 14  0x42ebdf in driver
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:257
 15  0x40371e in main
        at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/driver.f:302
Floating point exception (core dumped)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
… now it seems to fail all the time (not just intermittently)

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
1197                if( okcol )
+where
 0  0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
 1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115
 2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cb30, selhel=0x7fffffc2cb30, mes=0x7fffffc3cb30, channelId=1, rndcol=0x7fffffc9ce90, rndhel=0x7fffffcbce90, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
 3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbce90, rndcol=0x7fffffc9ce90, pchannelId=<optimized out>, mes=0x7fffffc3cb30, selhel=0x7fffffc2cb30, selcol=0x7fffffc1cb30) at fbridge.cc:106
 4  0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618
 5  0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
 6  0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
 7  0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327
 8  0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208
 9  0x000000000042ebe0 in driver () at driver.f:257
 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
 13 0x0000000000403845 in _start ()
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…processes, will now start large scale manual tests

Revert "[color] in P1_gu_ttxu, disable OMP and retry debugging madgraph5#845, now it seems to fail all the time (not just intermittently)"
This reverts commit f1e0d42.

Revert "[color] in gq_ttq.mad, add -g to make_opts and cudacpp.mk to debug madgraph5#845"
This reverts commit d88b6d3.
@valassi valassi changed the title Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') Jun 28, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…ff OpenMP, to debug madgraph5#845

As previously observed, this crashes immediately (NB: it only crashes with AVX512 in '512z' mode!)

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit

Program received signal SIGFPE, Arithmetic exception.
0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
1197                if( okcol )
+where
 0  0x00007ffff7f98dbd in mg5amcCpu::sigmaKin (allmomenta=0x7ffff76bf040, allcouplings=0x7ffff7b57040, allrndhel=<optimized out>, allrndcol=0x6300d00, allMEs=0x6310d80, channelId=channelId@entry=1, allNumerators=0x6341000, allDenominators=0x6351080, allselhel=0x6320e00, allselcol=0x6330e80, nevt=16384) at CPPProcess.cc:1197
 1  0x00007ffff7f9fa3e in mg5amcCpu::MatrixElementKernelHost::computeMatrixElements (this=0x6340ee0, channelId=channelId@entry=1) at MatrixElementKernels.cc:115
 2  0x00007ffff7fa52d2 in mg5amcCpu::Bridge<double>::cpu_sequence (goodHelOnly=false, selcol=0x7fffffc1cc70, selhel=0x7fffffc2cc70, mes=0x7fffffc3cc70, channelId=1, rndcol=0x7fffffc9cfd0, rndhel=0x7fffffcbcfd0, gs=0x1d35a68 <strong_+8>, momenta=<optimized out>, this=0x62e0a70) at /usr/include/c++/11/bits/unique_ptr.h:173
 3  fbridgesequence_ (ppbridge=<optimized out>, momenta=<optimized out>, gs=0x1d35a68 <strong_+8>, rndhel=0x7fffffcbcfd0, rndcol=0x7fffffc9cfd0, pchannelId=<optimized out>, mes=0x7fffffc3cc70, selhel=0x7fffffc2cc70, selcol=0x7fffffc1cc70) at fbridge.cc:106
 4  0x00000000004300ec in smatrix1_multi (p_multi=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, hel_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, col_rand=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, channel=1, out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, selected_hel=..., selected_col=..., vecsize_used=16384) at auto_dsig1.f:618
 5  0x0000000000431c71 in dsig1_vec (all_pp=<error reading variable: value requires 2621440 bytes, which is more than max-value-size>, all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig1.f:445
 6  0x0000000000432da8 in dsigproc_vec (all_p=..., all_xbk=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_q2fact=<error reading variable: value requires 262144 bytes, which is more than max-value-size>, all_cm_rap=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, iconf=1, iproc=1, imirror=1, symconf=..., confsub=..., all_wgt=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, imode=0, all_out=<error reading variable: value requires 131072 bytes, which is more than max-value-size>, vecsize_used=16384) at auto_dsig.f:1034
 7  0x0000000000433b7f in dsig_vec (all_p=..., all_wgt=..., all_xbk=..., all_q2fact=..., all_cm_rap=..., iconf=1, iproc=1, imirror=1, all_out=..., vecsize_used=16384) at auto_dsig.f:327
 8  0x000000000044a9c2 in sample_full (ndim=7, ncall=8192, itmax=1, itmin=1, dsig=0x433d70 <dsig>, ninvar=7, nconfigs=1, vecsize_used=16384) at dsample.f:208
 9  0x000000000042ebe0 in driver () at driver.f:257
 10 0x000000000040371f in main (argc=<optimized out>, argv=<optimized out>) at driver.f:302
 11 0x00007ffff743feb0 in __libc_start_call_main () from /lib64/libc.so.6
 12 0x00007ffff743ff60 in __libc_start_main_impl () from /lib64/libc.so.6
 13 0x0000000000403845 in _start ()
+l
1192    #if defined MGONGPU_CPPSIMD
1193                const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
1194    #else
1195                const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
1196    #endif
1197                if( okcol )
1198                {
1199                  allselcol[ievt] = icolC + 1; // NB Fortran [1,ncolor], cudacpp [0,ncolor-1]
1200                  break;
1201                }
+p okcol
$1 = <optimized out>
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…adgraph5#845) by adding 'volatile', again

This no longer crashes

gdb ./madevent_cpp -ex 'set pagination off' -ex 'set confirm off' -ex 'set trace-commands on' \
  -ex 'run < /tmp/avalassi/input_gqttq_x1_cudacpp' -ex where -ex l -ex 'p okcol' -ex quit
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 28, 2024
…FPE crashes madgraph5#845 in 512z/f (and the option to hardcode away OMP)
@valassi
Member Author

valassi commented Jun 28, 2024

I have renamed this issue to mention "(for cpp512z with FPTYPE=f only: fix it with 'volatile')"

Indeed, I checked that this only happens for cpp512z with FPTYPE=f. So it clearly looks like a SIMD-specific optimization issue, like those that I fixed with 'volatile' in many other parts of the code. And indeed I just created a patch that fixes the issue:

--- a/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
+++ b/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gu_ttxu/CPPProcess.cc
@@ -1190,7 +1190,8 @@ namespace mg5amcCpu
           for( int icolC = 0; icolC < ncolor; icolC++ )
           {
 #if defined MGONGPU_CPPSIMD
-            const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
+            // Add volatile here to avoid SIGFPE crashes in FPTYPE=f cpp512z builds (#845)
+            volatile const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
 #else
             const bool okcol = allrndcol[ievt] < ( targetamp[icolC] / targetamp[ncolor - 1] );
 #endif
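
For readers who want to experiment with the pattern outside the generated code, here is a minimal scalar sketch of the same loop. It is not the CPPProcess.cc implementation: the function name `selectColor`, the plain `float` array and the return convention are hypothetical, and it assumes that `targetamp` holds running sums so that its last element is the normalisation. The only point it illustrates is where `volatile` sits: the comparison result must be written to and read back from a real object, which constrains what the optimizer may do with the division that feeds it.

```cpp
#include <cstddef>

// Hypothetical scalar stand-in for the color selection loop in the patch above.
// Assumption: targetamp[i] is a running sum, so targetamp[ncolor-1] is the total.
// Returns the selected color in Fortran numbering [1,ncolor], or 0 if none is selected.
inline int selectColor( const float* targetamp, std::size_t ncolor, float rndcol )
{
  for( std::size_t icolC = 0; icolC < ncolor; icolC++ )
  {
    // 'volatile' makes the store and the reload of okcol observable accesses,
    // so the compiler cannot simply fold this variable away
    volatile const bool okcol = rndcol < ( targetamp[icolC] / targetamp[ncolor - 1] );
    if( okcol ) return static_cast<int>( icolC ) + 1; // NB Fortran [1,ncolor], C++ [0,ncolor-1]
  }
  return 0; // no color selected (rndcol is not below 1)
}
```

With running sums `{1, 3, 4}`, a random number of 0.5 is not below 1/4 but is below 3/4, so the second color is selected. Whether `volatile` is the right long-term fix or only hides the compiler behaviour is not something this sketch can tell.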

By the way, note this interesting post on SIMD and float: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90993

This is now fixed in CODEGEN in #874. I think this can be closed when that PR is merged.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 29, 2024
…n heft madgraph5#833, susy madgraph5#826 and also gqttq madgraph5#845 - but ggttgg madgraph5#856 is fixed)

Note two points:
- gqttq madgraph5#845 is normally intermittent, so it is interesting that it showed up here (even without OMP)
- the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet

STARTED  AT Fri Jun 28 09:14:39 PM CEST 2024
(SM tests)
ENDED(1) AT Sat Jun 29 01:37:39 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Sat Jun 29 01:47:20 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 1, 2024
…ected (failures in heft madgraph5#833 and susy madgraph5#826 - but intermittent gqttq madgraph5#845 is fixed)

Note two points:
- gqttq madgraph5#845 was intermittent, so the fact that it has disappeared could be a coincidence: but I actually think it is fixed
- the tmad CI also shows pptt012j madgraph5#872, but I am not running pptt012j tests in the tmad suite yet

STARTED  AT Sat Jun 29 03:23:34 PM CEST 2024
(SM tests)
ENDED(1) AT Sat Jun 29 07:44:46 PM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Sat Jun 29 07:54:26 PM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 1, 2024
…eplacing madgraph5#873)

Fix conflicts:
	MG5aMC/mg5amcnlo
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/coloramps.h
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/madgraph/iolibs/template_files/gpu/process_sigmaKin_function.inc
	epochX/cudacpp/CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/model_handling.py

In all four cases, simply take the code version from branch color.
In particular, fix the MG5AMC conflict by setting it to ba54a4153 (valassi_icolamp114 in mg5amcnlo/mg5amcnlo#115, before more recent changes)

Note: the content of this branch is now identical to color

git log color --oneline -n5
  93a547f (origin/color, color) [color] ** COMPLETE COLOR ** add a tmad/gitdifftmad.sh for easier diffs of tmad logs
  643466f [color] add a tput/gitdifftput.sh for easier diffs of tput logs
  46356d6 [color] rerun 30 tmad tests on itscrd90 - all as expected (failures in heft madgraph5#833, susy madgraph5#826 and also gqttq madgraph5#845 - but ggttgg madgraph5#856 is fixed)
  2194e83 [color] rerun 102 tput tests on itscrd90 - all ok (after fixing madgraph5#856 in tmad)
  b3046e1 [color] in .github/workflows/testsuite_oneprocess.sh, temporarely reenable bypasses for know issues madgraph5#826 in susy and madgraph5#872 in pp_tt012j - the CI tests should pass now

git diff 93a547f
  [NO DIFF]
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 17, 2024
… FPTYPE=f 512z builds (madgraph5#845) by adding 'volatile', again
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 31, 2024
…expected (failures only in heft madgraph5#833, but susy madgraph5#826 and pptt madgraph5#872 and gqttq madgraph5#845 are fixed)

STARTED  AT Mon Jul 29 10:02:50 PM CEST 2024
(SM tests)
ENDED(1) AT Tue Jul 30 02:28:18 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Tue Jul 30 02:39:01 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_pptt_mad/log_pptt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_pptt_mad/log_pptt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_pptt_mad/log_pptt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 9, 2024
…heft madgraph5#833, but gqttq madgraph5#845 crash is fixed)

STARTED  AT Thu Aug  8 08:42:09 PM CEST 2024
(SM tests)
ENDED(1) AT Fri Aug  9 12:48:36 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Fri Aug  9 12:58:52 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt