
Add support for nprocesses>2 (i.e. beyond mirror processes) in cudacpp to speed up directory handling? #951

Open
valassi opened this issue Aug 5, 2024 · 3 comments

Comments

@valassi
Member

valassi commented Aug 5, 2024

This is a follow-up to the old and recent discussions about nprocesses==1 (or at most nprocesses==2 with mirror processes) in cudacpp.

So far cudacpp always treats one subprocess at a time and splits subprocesses explicitly: for instance uux_xxx, uu_xxx and uc_xxx instead of a generic qq_xxx. This has allowed an implementation without arrays(nprocesses). From a functionality point of view this works (modulo the few recent tweaks for mirror processes, e.g. #872, which are being sorted out).
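
As a minimal illustration of what "arrays(nprocesses)" would mean in practice (all names below are hypothetical, this is not the actual generated CPPProcess code): data that today can be a single scalar per subprocess directory would become an array indexed by a process id once several subprocesses share one directory.

```cuda
#include <array>
#include <cstdio>

constexpr int nprocesses = 3; // e.g. uux_xxx, uu_xxx and uc_xxx merged into one "qq_xxx" directory

// Per-process data that could no longer be hardcoded as a single scalar.
struct MergedProcessData
{
  std::array<const char*, nprocesses> names = { "uux_xxx", "uu_xxx", "uc_xxx" };
  std::array<double, nprocesses> sumOfWeights{};  // one accumulator per process
  std::array<int, nprocesses> nGoodHelicities{};  // one helicity filter per process
};

int main()
{
  MergedProcessData data;
  for( int iproc = 0; iproc < nprocesses; iproc++ )
    std::printf( "%s: sumw=%f ngoodhel=%d\n", data.names[iproc], data.sumOfWeights[iproc], data.nGoodHelicities[iproc] );
  return 0;
}
```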

From a usability point of view, however, this is a nuisance.

This is a relatively big chunk of work that touches essentially all the code we have. I am opening this to have it on the todo list...

PS For the full list of related issues see https://github.com/madgraph5/madgraph4gpu/issues?q=nprocesses
This includes for instance #272, #343, #534, #635...

@oliviermattelaer
Member

oliviermattelaer commented Aug 6, 2024

For SIMD, it should be easy to fix the issue, since the only thing to do is to be able to link to multiple cpp libraries simultaneously, and fortran can (already) automatically switch between those matrix elements. However, if you link to CUDA this will not work, since each of those calls will correspond to a kernel call, and therefore you cannot scale this method to CUDA.
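
A toy sketch of that contrast, with entirely hypothetical names (smatrix_uux etc. are placeholders, not the real generated symbols): on the SIMD/C++ side several matrix-element libraries can be linked at once and the driver simply switches between entry points, while on the GPU side each entry would become a separate kernel launch.

```cuda
#include <cstdio>

// Stand-ins for entry points that would live in separate shared libraries.
double smatrix_uux( const double* p ) { return 1.0 * p[0]; } // e.g. from a libuux_xxx.so
double smatrix_uu( const double* p ) { return 2.0 * p[0]; }  // e.g. from a libuu_xxx.so
double smatrix_uc( const double* p ) { return 3.0 * p[0]; }  // e.g. from a libuc_xxx.so

// Dispatch table: on the CPU/SIMD side switching process is just a cheap
// indirect call, so one driver can serve many processes. On the GPU side
// each entry would instead be its own kernel launch, which does not scale.
using MEFunc = double ( * )( const double* );
static const MEFunc me_table[] = { smatrix_uux, smatrix_uu, smatrix_uc };

int main()
{
  const double momenta[1] = { 0.5 }; // toy input standing in for real momenta
  for( int iproc = 0; iproc < 3; iproc++ )
    std::printf( "process %d: ME=%f\n", iproc, me_table[iproc]( momenta ) );
  return 0;
}
```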

One method, discussed at the meeting today, was to change the way the Z interaction is handled (at the model level), in order to make the 285 matrix elements much more similar within the SU(2) doublet. This should allow (subsets of) those 285 matrix elements to be identical up to the coupling definition (so if this works like in my dream, this will be 12 directories). This would then allow a flag, like the channel id, which swaps the couplings into the correct place (for each warp/thread) in order to have a single kernel. (Note that this method will also speed up build time for "normal" fortran.)
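
If I understand the idea correctly, it could look roughly like the following sketch (all names and numbers are made up, this is not the real sigmaKin interface): a single kernel contains the shared diagram code, and a per-event flavour flag selects which coupling set each warp/thread uses, much like the channel id is passed in today.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

constexpr int nevents = 8;
constexpr int nflavours = 4; // e.g. members of one merged SU(2) doublet set

// One toy coupling value per merged flavour combination.
__constant__ double d_couplings[nflavours] = { 0.10, 0.20, 0.30, 0.40 };

__global__ void sigmaKinMerged( const int* flavourId, const double* momenta, double* me )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if( ievt >= nevents ) return;
  const double coup = d_couplings[flavourId[ievt]]; // per-thread coupling swap
  me[ievt] = coup * momenta[ievt];                  // placeholder for the shared diagram code
}

int main()
{
  int h_flav[nevents] = { 0, 1, 2, 3, 0, 1, 2, 3 };
  double h_mom[nevents] = { 1, 2, 3, 4, 5, 6, 7, 8 }, h_me[nevents];
  int* d_flav; double *d_mom, *d_me;
  cudaMalloc( &d_flav, sizeof( h_flav ) );
  cudaMalloc( &d_mom, sizeof( h_mom ) );
  cudaMalloc( &d_me, sizeof( h_me ) );
  cudaMemcpy( d_flav, h_flav, sizeof( h_flav ), cudaMemcpyHostToDevice );
  cudaMemcpy( d_mom, h_mom, sizeof( h_mom ), cudaMemcpyHostToDevice );
  sigmaKinMerged<<<1, nevents>>>( d_flav, d_mom, d_me ); // one kernel for all flavours
  cudaMemcpy( h_me, d_me, sizeof( h_me ), cudaMemcpyDeviceToHost );
  for( int ievt = 0; ievt < nevents; ievt++ ) std::printf( "evt %d: ME=%f\n", ievt, h_me[ievt] );
  cudaFree( d_flav ); cudaFree( d_mom ); cudaFree( d_me );
  return 0;
}
```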

Note that pushing this approach to the extreme (allowing for "zero" couplings in some of the matrix elements) could allow the exact same number of directories as fortran, with still only one matrix file. But this is not a super easy thing to implement (i.e. it is for after the release).

@oliviermattelaer
Member

oliviermattelaer commented Aug 7, 2024

I have done some additional pre-investigation here.
The coupling idea is not enough to merge all the directories: I have checked for p p > e+ e- j j,
and this will reduce the number of directories by a factor slightly higher than two.

There are two reasons that I was missing yesterday:

  1. u u~ > u u~ and d d > d d cannot be merged, because the assignment of incoming fermions and outgoing anti-fermions differs between them.
  2. u u~ > d d~ and u c~ > u c~ have the same number of diagrams, but in a "permutated" way, so again this is more complex than coupling flipping. Those "permutations" could in principle be used (a bit like imirror), but this would complexify quite a lot all the special handling of the S/T channel information (and forbid any associated s/t restriction on the diagrams).

This is still a nice factor of two, but maybe not worth it. So if we go in that direction, we should also include interactions with zero coupling, such that we can merge processes like
u u~ > u u~
and
u u~ > d d~
(the fake vertex will allow both to have the same number of diagrams and therefore allow the merging, but how to do that in practice still needs to be checked).
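
As a toy illustration of this fake-vertex / zero-coupling idea (nothing below is real generated code, and the diagram content is deliberately schematic): if one process has an extra class of diagrams that the other lacks, keeping that code path but multiplying it by a coupling that is zero for the second flavour assignment gives both processes the same diagram list, so they can share a single matrix-element file.

```cuda
#include <cstdio>

// Toy per-flavour coupling set: gExtra is the coupling of the "fake" vertex,
// set to zero for the process that lacks the corresponding diagrams.
struct Couplings { double gCommon; double gExtra; };

// Shared "matrix element" with one fixed diagram structure for all flavours.
__host__ __device__ double sharedME( const Couplings& c, double commonAmp, double extraAmp )
{
  return c.gCommon * commonAmp + c.gExtra * extraAmp; // gExtra==0 switches the extra diagrams off
}

int main()
{
  const Couplings uux_uux = { 1.0, 1.0 };       // e.g. u u~ > u u~ : all diagrams active
  const Couplings uux_ddx = { 1.0, 0.0 };       // e.g. u u~ > d d~ : extra diagrams zeroed out
  const double commonAmp = 0.7, extraAmp = 0.3; // placeholder amplitude values
  std::printf( "u u~ > u u~ : %f\n", sharedME( uux_uux, commonAmp, extraAmp ) );
  std::printf( "u u~ > d d~ : %f\n", sharedME( uux_ddx, commonAmp, extraAmp ) );
  return 0;
}
```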

@valassi
Member Author

valassi commented Aug 8, 2024

Thanks Olivier!

From a usability point of view, however, this is a nuisance.

* First, compilation time increases, the number of directories increases, and so on.
* Second, I wonder what is the effect of this on runtime, as there are many many more directories to combine.

Just one comment here. From my recent tests I do not have evidence yet that this causes runtime performance issues. The build is clearly slow, but the event generation bottleneck seems to be inside the madevent fortran (pdfs and random-to-momenta, I guess), not in the python combine events. (Well, for cuda the combine events does maybe appear.) So, mainly build time for now?
