-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Fix 826" (actually: fix iconfig-channel mapping) #852
Conversation
Hi @oliviermattelaer thanks a lot, I am running a few tests. I have a couple of commits to add for fixing clang formatting. More importantly, the build fails for cuda
I am having a look how to fix this too |
…h5#852 for susy_gg_t1t1 issue madgraph5#826) into susy
… in Olivier's patch PR madgraph5#852 for issue madgraph5#826 in susy_gg_t1t1
…raph5#826 is NOT fixed) after Olivier's PR madgraph5#852 patch ./tmad/teeMadX.sh -susyggt1t1 +10x -makeclean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Olivier, thanks for asking the review. Various issues below.
Minor issue: clang formatting needed tweaks. Fixed in my #853.
Less minor issue: the std::map implementation does not work on cuda in device code. I tried to add constexpr but that is even worse. In the end I added a simple implementation based on arrays. Also in #853.
Major issue however: I still see no cross section. What is this PR fixing, exactly? Is there some tests that you and Stefan did which was fixed by this? Can you document it please? I see no improvement from this so far.
Related to that: I am puzzled by the naming, and by the code produced. The naming says "diag_to_channel", but actually the diag in the map goes up to 6, while there are only 6 diagrams. Is this Fortran numbering starting at 1? Do we need to remove 1? Specifically, your code gives
https://github.com/valassi/madgraph4gpu/blob/ade95e86ffe24c9adbb2a624b78f7ef6dca43bd1/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/coloramps.h
__device__ std::map<int, int> diag_to_channel = {
{ 2, 0 },
{ 3, 1 },
{ 4, 2 },
{ 5, 3 },
{ 6, 4 }, // note: a trailing comma in the initializer list is allowed
};
Note that my coe equivalently gives this
https://github.com/valassi/madgraph4gpu/blob/a04a900b0ae6c36a103c85f10cb53c4e3f2c4382/epochX/cudacpp/susy_gg_t1t1.mad/SubProcesses/P1_gg_t1t1x/coloramps.h
__device__ constexpr int diag_to_channel[7] = {
-1, // 0 --> None
-1, // 1 --> None
+0, // 2 --> 0
+1, // 3 --> 1
+2, // 4 --> 2
+3, // 5 --> 3
+4 // 6 --> 4
};
which fixes the std::map device, but gives wrong results. The fact that we need 7 entries looks weird to me.
Thanks!
Andrea
Thanks Andrea, (I thought that you were in Holliday) Yes in my case in a haswell node, the code was compiling and producing (wrong) cross-section. (I did run it from the "madgraph" interface). The (wrong) cross-section was similar to the (wrong) cross-section on my mac that should be related to an ordering issue of the coupling that we still have to understand/fix. The channelId indeed starts to count at 1 since 0 is used for "no multi-channel", so indeed if you want to use an array, you need to go up to seven. But in itself this is not directly related to fortran numbering scheme. Now even if this does not merge #826 for you, I guess we should try to merge your #853 and then try to understand what it is still crashing for you? |
Hi Olivier, thanks to you. No actually I am here also on Monday and probably Tuesday (but I would really prefer to get other things done rather than debug this).
If you have a test that fails, can you try to make a reproducer please? Otherwise it is really difficult to communicate. Tell me the exact sequence that fails and where. Then I can try to find a haswell node for instance if it only fails there. (Similarly, it would be great if you could tell me if my test scripts fail for you and why, and then we open a ticket to debug. Anyway, it is now quite obvious for me why my script fails: I call it for channel 1, and CPPProcess has no enhancement of channel 1, see below. I would like to fix that first, before any sigfpe crashes, which may actually be related. This is to say: if you do not manage to run the tmad scripts now, not a big problem, but eventually we should try to fix these).
Anyway, let's go to the heart of this. I think I understand the problems a bit better (but I am still not sure if your patch makes sense or not). In my opinion there is certainly also a problem with fortran numbering. To get the complete picture, there are many arrays and indices and objects involved here. I am talking about susy gg to t1t1.
Anyway, as I read I kind of understand better your patch. It seems to work under CASE2. That is to say, channels are called 2-6 in Fortran and in input.txt, but then they must be translated to 0-4 in icolamp. @oliviermattelaer can you confirm that this is the case? Then in that case to solve my #826 I need to test channel 2 rather than channel 1. (Remember my tmad tests just run madevent with input.txt on one specific channel, and I have always used channel 1). (And then I also need to understand if G directories are called G1-G5 or G2-G6).
Before any merge, I'd like to understand the logic of the code and numbering, as above. Then coding it will become easier ;-) By the way my PR 853 still has some CI failures, but maybe this is trivial clang formatting. I must have a look though.
I repeat, I have two issues
Ah ok this is something else/new, interesting. Should we add this to all cuda/cpp compile options in your opionion? Thanks again and sorry for the long post, but its'a c omplex issue with many variables. Andrea |
PS Looking more at the code in Fortran (this is master before any changes)
This is AMP2(MAXAMPS)
This is AMP(2-6) so channels seem to be 2-6 (CASE2), using AMP(3-7), remember AMP(1-2) are from the first diagram
This is the numerator which seems to use AMP2(CHANNEL), so I would say that channel is indeed 2-6 (CASE2)
So I would indeed conclude that case2 is right, and I ned to use channel>1 to solve #826 |
Hi again, more debugging but it does not really advance. I moved my tmad scripts from channel 1 to elsewhere
I also realise that in any case the code must be better dcocumented in the diag_to_channel. With my current implementation, there are 7 array elements. Maybe better use only 6 (the number of diagrams), and keep the fact that fortran and c indexing are different. I mean this thing does two mappings: it skips one diagram with no channel, AND it also maps fortran to c indexing. Or maybe I have a bug there too, to be checked. |
Ok I guessed another piece I think. The fortran part of mg5amc uses TWO DIFFERENT NOTATIONS internally
Specific example here
On current uupstream master
The above is a printout which comes from fortran madevent So it is neither case1 nor case2, it is something in between. And by the way ggttgg was working fine all the time |
Note, channel id printed in XSECTION by madX.sh is what auto_dsig1.f prints as CHANNEL_ID
This means that a 1 in the input.txt file may become a 2 in CHANNEL_ID in auto_dsig1.f:
|
Ok so "1" in input.txt is iconfig. This is driver.f
Notye: here it says diag_number. But this is NOT the diag number. |
Anyway this diag_number only happens in addmothers.f, so nnot very relevant
|
And channel comes from subdiag, which is somehow related to inconfig
|
Ok now I doscovered this one. Probably the relevant one (for ggttgg it is interesting but much longer)
The question is, is this something that we should mimick in cudacpp? @oliviermattelaer is this confsub what your patch is emulating in diag_to_channel? Anyway: I think we really need to clarify which variables exist and what their meaning is, before going on... And we should try to use the same names in fortran and cudacpp... |
Last comment for the evening. I updated the branch for #853.
|
I propose to clean up the code to document the mappings better as follows, see #853 In coloramps.h
In CPPProcess.cc for instance
I think that this is needed and is in my understanding what @oliviermattelaer you were correctly fixing. Does th eabove make sense? (At least: did I get the meaning of the numbers right?) Thanks. The problem is however that, in the specific case of susy_gg_t1t1, this makes no difference, and I still get a zero cross section with iconfig=1 (ie channelId=2), there must be something else to fix. And I have not tried codegen backports so I am not sure what it will give to other processes |
I had a quick look. I am not sure this is possible in C++ at compile time. But valgrind should be good enough for similar runtime checks. I have run some tests for this in #855. |
…SIGFPE madgraph5#855, and add volatile in aloha_functions.f to try to fix it The SIGFPE crash madgraph5#855 does seem to disappear in ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean However, there is now a DIFFERENT issue, an lhe file mismatch between fortran and cpp (madgraph5#856) This is probably due to the iconfig/channel mapping issue reported by Olivier in madgraph5#852
Hi Andrea, Following our discussion, I have changed the name to match your convention and put your change in the auto-generated code mechanism.
With this change, the cross-section of G1 is still zero (same for G2) and the cross-section reported by G3 is wrong. If possible, would you be able to comment on the above change? (In particular if you found them so bad that you do not want us to merge it in master.) PS: I will check the testsuite to check the failure (I guess a lot of clang formatting will not go through) |
I have tried to fix the clang formatting
|
Let me reply point by point to your "need change":
Ok I have moved to the auto-generation generating the arrays mode
So, this does fix some SIGFPE (especially for G3) and likely fix some missmatch in the color of events that you report in other bug (but this needs to be confirmed).
Given our conversation today, I have changed the naming scheme to yours.
That's what i have kept since this is the range of channel_Id. And I would like to avoid to use a channel_IdC (as long as, they are no array associated to it) --which seems to be the case-- So I think that this is good for your to re-review it. (sorry to send so many answer to this thread in a row) |
…g AS-IS Olivier's patches from the latest fix_826 branch for PR madgraph5#852 The gg_ttgg test still crashes (rotxxx madgraph5#855?) ./tmad/madX.sh -ggttgg -iconfig 104 -makeclean *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) *** Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation. Backtrace for this error: 0 0x7fce5ec23860 in ??? 1 0x7fce5ec22a05 in ??? 2 0x7fce5e854def in ??? 3 0x44b5ff in ??? 4 0x4087df in ??? 5 0x409848 in ??? 6 0x40bb83 in ??? 7 0x40d1a9 in ??? 8 0x45c804 in ??? 9 0x434269 in ??? 10 0x40371e in ??? 11 0x7fce5e83feaf in ??? 12 0x7fce5e83ff5f in ??? 13 0x403844 in ??? 14 0xffffffffffffffff in ??? ./tmad/madX.sh: line 387: 3913008 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp} The susy_gg_t1t1 test also still crashes (see madgraph5#826?), this looks like the same crash as ggttgg above ./tmad/madX.sh -susyggt1t1 -iconfig 2 -makeclean *** (2-none) EXECUTE MADEVENT_CPP x1 (create events.lhe) *** Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation. Backtrace for this error: 0 0x7f9f03423860 in ??? 1 0x7f9f03422a05 in ??? 2 0x7f9f03054def in ??? 3 0x43809f in ??? 4 0x40581f in ??? 5 0x4067b1 in ??? 6 0x408c71 in ??? 7 0x40a0a9 in ??? 8 0x444fdf in ??? 9 0x42bb38 in ??? 10 0x40371e in ??? 11 0x7f9f0303feaf in ??? 12 0x7f9f0303ff5f in ??? 13 0x403844 in ??? 14 0xffffffffffffffff in ??? ./tmad/madX.sh: line 387: 3907179 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp} The gqttq test also still crashes intermittently, i.e. only on the second execution (madgraph5#845?) ./tmad/teeMadX.sh -gqttq +10x -fltonly -makeclean ./tmad/teeMadX.sh -gqttq +10x -fltonly Executing ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation. Backtrace for this error: 0 0x7fbafa623860 in ??? 1 0x7fbafa622a05 in ??? 2 0x7fbafa254def in ??? 3 0x7fbafad24034 in ??? 4 0x7fbafa9a1575 in ??? 5 0x7fbafad20c89 in ??? 6 0x7fbafad2abfd in ??? 7 0x7fbafad30491 in ??? 8 0x43008b in ??? 9 0x431c10 in ??? 10 0x432d47 in ??? 11 0x433b1e in ??? 12 0x44a921 in ??? 13 0x42ebbf in ??? 14 0x40371e in ??? 15 0x7fbafa23feaf in ??? 16 0x7fbafa23ff5f in ??? 17 0x403844 in ??? 18 0xffffffffffffffff in ??? ./madX.sh: line 387: 3922797 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp} ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x1_cudacpp > /tmp/avalassi/output_gqttq_x1_cudacpp' failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @oliviermattelaer thanks for asking the re-review :-)
I have done more tests, as discussed yesterday at the meeting.
The results are in PR #870 (which is NOT meant to be merged, it just includes code and test logs for reference).
The summary: NO GOOD.
- This patch still does not fix the pending issues I am aware of. In particular, it does NOT fix the rotxxx crash that we discussed yesterday,
- That said I am convinced that a patch like this is needed to fix other issues (in particular MAJOR ISSUE: color mismatch fortran/cpp in LHE file for iconfig 104 in SM gg_ttgg (channel/iconfig mapping AND icolamp issues) #856), but I would FIRST apply my patch Fix SIGFPE crash (855) in rotxxx by adding volatile in aloha_functions.f #857 to get rid of the tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) #855 crash. (Note by the way that I am unable to even see the issue MAJOR ISSUE: color mismatch fortran/cpp in LHE file for iconfig 104 in SM gg_ttgg (channel/iconfig mapping AND icolamp issues) #856 unless I first fix the rotxxx crash).
- And again, I do not see ANY issue directly fixed here. You mentioned a haswell crash but I have no preproducer for that, so I cannot even check if that is fixed independently of rotxxx.
Details in d8df7ce
- my tests of susy_gg_t1t1 and gg_ttgg both crash in rotxxx (tmad test crashes in rotxxx (SIGFPE erroneous arithmetic operation) #855)
- my tests of gq_ttq crash in sigmaKin (Intermittent FPE "erroneous arithmetic operation" in gqttq tmad test (for cpp512z with FPTYPE=f only: fix it with 'volatile') #845)
My conclusion:
- Please let's first merge Fix SIGFPE crash (855) in rotxxx by adding volatile in aloha_functions.f #857, and then let's see what happens. I have done all sort of tests including valgrind for rotxxx, and I am out of new things to try out. And I already spent days and weekends on that, a solution exists, let's use it. Sounds ok?
Thanks
Andrea
… in Olivier's patch PR madgraph5#852 for issue madgraph5#826 in susy_gg_t1t1
…graph5#826 is NOT fixed) after Olivier's first two commits in PR madgraph5#852 ./tmad/teeMadX.sh -susyggt1t1 +10x -makeclean (PS: it is NOW quite obvious to me that PR madgraph5#852 should be tested against different issues like ggttgg madgraph5#856, NOT against susy madgraph5#826) (PS: will now therefore slowly move to different tests - first, however, recover my coloramps.h proposal in susy and Olivier's later changes)
…n and tmad logs Revert "[color] temporarely regenerate susy_gg_t1t1 again, just to check all is ok - will revert" This reverts commit 28a0177. Revert "[color] rerun tmad test for susy_gg_t1t1: still no cross section (madgraph5#826 is NOT fixed) after Olivier's first two commits in PR madgraph5#852" This reverts commit 6baf004.
…Formatting.sh ** NB: the code is now at the same level as Olivier's PR madgraph5#852, but also includes my relevant commit history from PR madgraph5#853 ** ** NB: in particular, susy_gg_t1t1.mad includes my proposal for the coloramps.h mappings ** Revert "[susy] in CODEGEN/checkFormatting.sh, add coloramps.h to the list of files checked with clang formatting" This reverts commit 08f4ab2.
I have now almost completed MR #873, which includes all of your commits here in PR #852, with additional modifcations and fixes. I suggest to close this one, or otherwise leave it open until we merge PR #873 (at that point, this one will normally be automatically closed... I am not 100% sure because the submodule may complicate this, butnormally I took care of this I hope). I suggest we move the discussion to #873 instead (and mg5amcnlo/mg5amcnlo#115) Thanks again for the help on this! PS I took the liberty of renaming this, because I think this will not fix 826. It will normally fix 856 (together with my other patches), instead. |
I will finally merge this now at last. Thanks Olivier for the patch and for the patience discussing further changes to this. Further fixes and cosmetic improvements to this patch are included in #877, such as
Note a couple of points
Thanks again Olivier |
…adgraph5#852 and color2 PR madgraph5#877, instead of color PR madgraph5#873) into fpe
Thanks Andrea! |
This is a proposal to fix the segfault issue for issue #826.
This will not fix the missmatch of cross-section (due to the coupling ordering missmatch) --but maybe we can open a different issue for that missmatch--.
Note that the issue was an out-of-bound access within the CPP part of the code. Related to the fact that the color information is designed to be for the channel information but that channelId contains the diagram number (which can be not identical).
Finally this require also a change in the handling of the second exporter on the mainstream madgraph to have an easy access to additional information to be sure that both cpp and madevent use the same convention.