
Add support for nprocesses>2 (i.e. beyond mirror processes) in cudacpp to speed up directory handling? #951

Open
valassi opened this issue Aug 5, 2024 · 3 comments

Comments

@valassi
Member

valassi commented Aug 5, 2024

This is a follow-up to the old and recent discussions about nprocesses==1 (or at most nprocesses==2 with mirror processes) in cudacpp.

So far cudacpp always treats one subprocess at a time and splits subprocesses explicitly: for instance uux_xxx, uu_xxx and uc_xxx instead of a generic qq_xxx. This has allowed an implementation without arrays(nprocesses). From a functionality point of view this works (modulo the few recent tweaks for mirror processes, e.g. #872, which are being sorted out).
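
As a minimal illustration of what "arrays(nprocesses)" would mean in practice (all names below are hypothetical, this is not the actual generated CPPProcess code): data that today can be a single scalar per subprocess directory would become an array indexed by a process id once several subprocesses share one directory.

```cuda
#include <array>
#include <cstdio>

constexpr int nprocesses = 3; // e.g. uux_xxx, uu_xxx and uc_xxx merged into one "qq_xxx" directory

// Per-process data that could no longer be hardcoded as a single scalar.
struct MergedProcessData
{
  std::array<const char*, nprocesses> names = { "uux_xxx", "uu_xxx", "uc_xxx" };
  std::array<double, nprocesses> sumOfWeights{};  // one accumulator per process
  std::array<int, nprocesses> nGoodHelicities{};  // one helicity filter per process
};

int main()
{
  MergedProcessData data;
  for( int iproc = 0; iproc < nprocesses; iproc++ )
    std::printf( "%s: sumw=%f ngoodhel=%d\n", data.names[iproc], data.sumOfWeights[iproc], data.nGoodHelicities[iproc] );
  return 0;
}
```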

From a usability point of view, however, this is a nuisance.

This is a relatively big chunk of work that touches essentially all the code we have. I am opening this to have it on the todo list...

PS For the full list of related issues see https://github.com/madgraph5/madgraph4gpu/issues?q=nprocesses
This includes for instance #272, #343, #534, #635...

@oliviermattelaer
Member

oliviermattelaer commented Aug 6, 2024

For SIMD, it should be easy to fix the issue, since the only thing to do is to be able to link to multiple cpp libraries simultaneously, and fortran can (already) automatically switch between those matrix elements. However, if you link to CUDA this will not work, since each of those calls will correspond to a kernel call, and therefore you cannot scale this method to CUDA.
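
A toy sketch of that contrast, with entirely hypothetical names (smatrix_uux etc. are placeholders, not the real generated symbols): on the SIMD/C++ side several matrix-element libraries can be linked at once and the driver simply switches between entry points, while on the GPU side each entry would become a separate kernel launch.

```cuda
#include <cstdio>

// Stand-ins for entry points that would live in separate shared libraries.
double smatrix_uux( const double* p ) { return 1.0 * p[0]; } // e.g. from a libuux_xxx.so
double smatrix_uu( const double* p ) { return 2.0 * p[0]; }  // e.g. from a libuu_xxx.so
double smatrix_uc( const double* p ) { return 3.0 * p[0]; }  // e.g. from a libuc_xxx.so

// Dispatch table: on the CPU/SIMD side switching process is just a cheap
// indirect call, so one driver can serve many processes. On the GPU side
// each entry would instead be its own kernel launch, which does not scale.
using MEFunc = double ( * )( const double* );
static const MEFunc me_table[] = { smatrix_uux, smatrix_uu, smatrix_uc };

int main()
{
  const double momenta[1] = { 0.5 }; // toy input standing in for real momenta
  for( int iproc = 0; iproc < 3; iproc++ )
    std::printf( "process %d: ME=%f\n", iproc, me_table[iproc]( momenta ) );
  return 0;
}
```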

One method, discussed at the meeting today, was to change the way the Z interaction is handled (at the model level), in order to make the 285 matrix elements much more similar within the SU(2) doublet. This should allow (subsets of) those 285 matrix elements to be identical up to the coupling definition (so if this works like in my dream, this will be 12 directories). This would then allow a flag, like the channel id, which swaps the couplings into the correct place (for each warp/thread) in order to have a single kernel. (Note that this method will also speed up build time for "normal" fortran.)
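
If I understand the idea correctly, it could look roughly like the following sketch (all names and numbers are made up, this is not the real sigmaKin interface): a single kernel contains the shared diagram code, and a per-event flavour flag selects which coupling set each warp/thread uses, much like the channel id is passed in today.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

constexpr int nevents = 8;
constexpr int nflavours = 4; // e.g. members of one merged SU(2) doublet set

// One toy coupling value per merged flavour combination.
__constant__ double d_couplings[nflavours] = { 0.10, 0.20, 0.30, 0.40 };

__global__ void sigmaKinMerged( const int* flavourId, const double* momenta, double* me )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if( ievt >= nevents ) return;
  const double coup = d_couplings[flavourId[ievt]]; // per-thread coupling swap
  me[ievt] = coup * momenta[ievt];                  // placeholder for the shared diagram code
}

int main()
{
  int h_flav[nevents] = { 0, 1, 2, 3, 0, 1, 2, 3 };
  double h_mom[nevents] = { 1, 2, 3, 4, 5, 6, 7, 8 }, h_me[nevents];
  int* d_flav; double *d_mom, *d_me;
  cudaMalloc( &d_flav, sizeof( h_flav ) );
  cudaMalloc( &d_mom, sizeof( h_mom ) );
  cudaMalloc( &d_me, sizeof( h_me ) );
  cudaMemcpy( d_flav, h_flav, sizeof( h_flav ), cudaMemcpyHostToDevice );
  cudaMemcpy( d_mom, h_mom, sizeof( h_mom ), cudaMemcpyHostToDevice );
  sigmaKinMerged<<<1, nevents>>>( d_flav, d_mom, d_me ); // one kernel for all flavours
  cudaMemcpy( h_me, d_me, sizeof( h_me ), cudaMemcpyDeviceToHost );
  for( int ievt = 0; ievt < nevents; ievt++ ) std::printf( "evt %d: ME=%f\n", ievt, h_me[ievt] );
  cudaFree( d_flav ); cudaFree( d_mom ); cudaFree( d_me );
  return 0;
}
```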

Note that pushing this approach to the extreme (allowing for "zero" couplings in some of the matrix elements) could allow the exact same number of directories as fortran, with still only one matrix file. But this is not a super easy thing to implement (i.e. it is for after the release).

@oliviermattelaer
Member

oliviermattelaer commented Aug 7, 2024

I have done some additional pre-investigation here.
The coupling idea is not enough to merge all the directories: I have checked for p p > e+ e- j j,
and this will reduce the number of directories by a factor slightly higher than two.

There are two reasons that I was missing yesterday:

  1. u u~ > u u~ and d d > d d cannot be merged, because the assignment of incoming fermions and outgoing anti-fermions differs between them.
  2. u u~ > d d~ and u c~ > u c~ have the same number of diagrams, but in a "permutated" way, so again this is more complex than coupling flipping. Those "permutations" could in principle be used (a bit like imirror), but this would complexify quite a lot all the special handling of the S/T channel information (and forbid any associated s/t restriction on the diagrams).

This is still a nice factor of two, but maybe not worth it. So if we go in that direction, we should also include interactions with zero coupling, such that we can merge processes like
u u~ > u u~
and
u u~ > d d~
(the fake vertex will allow both to have the same number of diagrams and therefore allow the merging, but how to do that in practice still needs to be checked).
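
As a toy illustration of this fake-vertex / zero-coupling idea (nothing below is real generated code, and the diagram content is deliberately schematic): if one process has an extra class of diagrams that the other lacks, keeping that code path but multiplying it by a coupling that is zero for the second flavour assignment gives both processes the same diagram list, so they can share a single matrix-element file.

```cuda
#include <cstdio>

// Toy per-flavour coupling set: gExtra is the coupling of the "fake" vertex,
// set to zero for the process that lacks the corresponding diagrams.
struct Couplings { double gCommon; double gExtra; };

// Shared "matrix element" with one fixed diagram structure for all flavours.
__host__ __device__ double sharedME( const Couplings& c, double commonAmp, double extraAmp )
{
  return c.gCommon * commonAmp + c.gExtra * extraAmp; // gExtra==0 switches the extra diagrams off
}

int main()
{
  const Couplings uux_uux = { 1.0, 1.0 };       // e.g. u u~ > u u~ : all diagrams active
  const Couplings uux_ddx = { 1.0, 0.0 };       // e.g. u u~ > d d~ : extra diagrams zeroed out
  const double commonAmp = 0.7, extraAmp = 0.3; // placeholder amplitude values
  std::printf( "u u~ > u u~ : %f\n", sharedME( uux_uux, commonAmp, extraAmp ) );
  std::printf( "u u~ > d d~ : %f\n", sharedME( uux_ddx, commonAmp, extraAmp ) );
  return 0;
}
```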

@valassi
Member Author

valassi commented Aug 8, 2024

Thanks Olivier!

From a usability point of view, however, this is a nuisance.

* First, compilation time increases, the number of directories increases, and so on.
* Second, I wonder what is the effect of this on runtime, as there are many many more directories to combine.

Just one comment here. From my recent tests I do not have evidence yet that this causes runtime performance issues. The build is clearly slow, but the event generation bottleneck seems to be inside the madevent fortran (pdfs and random-to-momenta, I guess), not in the python combine events. (Well, for cuda the combine events does maybe appear.) So, mainly build time for now?
