DY+3 jets cross section decreases by a factor 10 when changing vector size from 16384 to 32? #959

Open
valassi opened this issue Aug 8, 2024 · 9 comments

Comments

@valassi
Member

valassi commented Aug 8, 2024

I am investigating why CMS does not see a SIMD speedup in DY+3jets, i.e. #943

Specifically I am investigating why the 'Fortran overhead' is still so large and why it varies with SIMD flags in C++, i.e. #958

One of the points here, as discussed in #546, is trying to understand if vector_size has an impact on speed, and particularly on the speed of the 'Fortran overhead'.

On itgold91 (Intel Gold, nproc=32, no GPU) I had initially done some tests with vector_size=16384. Now I am doing the same tests with vector_size=32. I recreated the gridpacks (which was faster because the C++ builds were in ccache).
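
For reference, the only intended difference between the two tests is the vector_size value in the run card; everything else (process, cuts, gridpack creation command) is kept identical. A minimal sketch of the relevant run_card.dat line, with only the value changing between the two runs (the surrounding card text is omitted and may differ between versions):

  16384 = vector_size ! first test
     32 = vector_size ! second test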

However, the first very surprising effect is that the cross section has varied by one order of magnitude?

< START: Wed Aug  7 08:53:34 PM CEST 2024
---
> START: Thu Aug  8 09:05:29 AM CEST 2024
290,299c290,295
< INFO:  Idle: 39,  Running: 32,  Completed: 1753 [ current time: 21h07 ] 
< INFO:  Idle: 38,  Running: 32,  Completed: 1754 [ current time: 21h07 ] 
< INFO:  Idle: 31,  Running: 32,  Completed: 1761 [  3.5s  ] 
< INFO:  Idle: 18,  Running: 32,  Completed: 1774 [  6.7s  ] 
< INFO:  Idle: 10,  Running: 32,  Completed: 1782 [  9.8s  ] 
< INFO:  Idle: 0,  Running: 29,  Completed: 1795 [  13.4s  ] 
< INFO:  Idle: 0,  Running: 19,  Completed: 1805 [  16.4s  ] 
< INFO:  Idle: 0,  Running: 11,  Completed: 1813 [  19.6s  ] 
< INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  21.9s  ] 
< sum of cpu time of last step: 5h59m08s
---
> INFO:  Idle: 20,  Running: 31,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 19,  Running: 32,  Completed: 1773 [ current time: 09h11 ] 
> INFO:  Idle: 0,  Running: 31,  Completed: 1793 [  3s  ] 
> INFO:  Idle: 0,  Running: 14,  Completed: 1810 [  6.1s  ] 
> INFO:  Idle: 0,  Running: 0,  Completed: 1824 [  7.8s  ] 
> sum of cpu time of last step: 3h13m56s
302c298
<      Cross-section :   1.069e+04 +- 27.84 pb
---
>      Cross-section :   139.4 +- 0.6185 pb
308c304
< combination of events done in 0.41349196434020996 s 
---
> combination of events done in 0.3937568664550781 s 
405,408c401,404
< 26470.32user 549.43system 16:19.34elapsed 2758%CPU (0avgtext+0avgdata 1119336maxresident)k
< 251256inputs+31085672outputs (6402major+219118214minor)pagefaults 0swaps
< END: Wed Aug  7 09:09:53 PM CEST 2024
< ELAPSED: 979 seconds
---
> 11662.02user 249.60system 8:06.23elapsed 2449%CPU (0avgtext+0avgdata 76640maxresident)k
> 289688inputs+22151728outputs (3133major+71995085minor)pagefaults 0swaps
> END: Thu Aug  8 09:13:35 AM CEST 2024
> ELAPSED: 486 seconds

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

Or, is it that this process diverges and one has to put some physics cuts? @choij1589 do you have some physics cuts in your DY+3jets?

Thanks Andrea

@valassi
Member Author

valassi commented Aug 8, 2024

(Then there is always the possibility that I am doing something really stupid...)

@choij1589
Collaborator

@valassi No, in the old version presented in the meetings only the mll = 50 cut is applied for CMS in the run_card, but since xqcut had been specified there would be automatic cuts on ptj due to auto_ptj_mjj?

@oliviermattelaer
Member

@choij1589 Do you have ickkw=1? (I am not sure xqcut is supported if ickkw=0.)
But yes, in that mode you do have sensible cuts.

Now, given the small statistical error, this is likely not related to the cuts (if you had a singularity, the error on the cross-section should be bigger than that).

@oliviermattelaer is this something you would expect because of problems covering the phase space with large vector sizes? Or does this sound like a bug?

I would not expect an issue at the cross-section level (more at the distribution level) when setting a large vector size (which is the issue that the "channelId" branch is fixing). So it sounds like you have identified a new bug here.

@choij1589
Collaborator

@oliviermattelaer yes, ickkw=1 is turned on for the CMS default (though we are not thinking about merging at this step)

@oliviermattelaer
Member

oliviermattelaer commented Aug 8, 2024

So I have made some pure Fortran comparisons for the following script:

generate p p  > l+ l- 3j
output 
launch
set mmll 50
set ickkw 1
set xqcut 20

For different branches of MG5aMC (no plugin impact here, pure MG5 Fortran):

  • LTS: 125.1 +- 0.2817 pb
  • 3.6.0: 124.1 +- 0.3227 pb (this I consider compatible with LTS)
  • gpucpp_for360: 122.7 +- 0.3098 pb (note: failing in the helicity selection for helicity recycling)
  • gpucpp_goodhel: 122.74 +- 0.309 pb (note: failing in the helicity selection for helicity recycling + failing in combine events)
  • gpucpp: 122.7 +- 0.3098 pb (note: failing in the helicity selection for helicity recycling)

So conclusions here:

  • the cross-section is modified between gpucpp and the "official" branches of MG5aMC.
  • there is an issue in determining the helicity recycling (only for the 3j case actually), which is related to the black box and the clustering of such events (see the log of the helicity recycling code):
 cluster.f: Error. Invalid combination.
 error for clustering
At line 669 of file cluster.f
Fortran runtime error: Index '-1562495544' of dimension 1 of array 'imap' below lower bound of 1

At this stage, it is not clear:

  • whether what I found is related to your 10x factor
  • whether the helicity recycling crash is related to the mismatch of cross-sections (it does fall back to no helicity recycling)

So for the moment, I will investigate the helicity recycling point and then the mismatch in the MG5aMC status issue: MG5aMC status #867 (comment)

@valassi
Member Author

valassi commented Aug 8, 2024

So it sounds like you have identified a new bug here.

Ouf that does not sound good :-(

Again, I might be doing something silly, but I repeated the test and I seem to see this again. Maybe @choij1589 you can also try this in your setup, please? Run once with vector_size=16384 and once with 32 in the run cards.

@oliviermattelaer note a few points:

  • unless I am mistaken, this is all using Fortran MEs, so if it is a bug it is probably in the madevent Fortran infrastructure, nothing to do with cudacpp?
  • I think (though I am not completely sure) that I am not using any VECSIZE_USED here, so hopefully that is not to blame
  • I noticed in generation from gridpacks (not in gridpack creation, which is what I describe here) that many more events are processed when a large vector size is set... this is not surprising, but what I mean is that the results above are in any case based on very different numbers of MC points
  • last, again I find it strange that event generation from a gridpack prints out a zero cross section ("gridpack run.sh produces events but prints out a zero cross section (clarified: this will always happen)" #716); I am not sure if this is normal?

valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 8, 2024
…, with vector_size=32 instead of 16384

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack

The build is faster (C++ is all in ccache).

But the cross section has changed by a factor 10?? See madgraph5#959
@choij1589
Collaborator

@valassi Sorry, I missed this issue; I will come back after testing with different vector_size configurations.

@choij1589
Collaborator

Hi @valassi, I have checked DY+3j and the cross sections are different:
vector_size=32: 1357 \pm 1.473 pb
vector_size=16384: 1369 \pm 2.333 pb

but within a 5 sigma difference, ~ (1369-1357)/(1.473+2.333), so I am not sure these are actually different cross sections like in DY+4j (as @oliviermattelaer quoted?)
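
For reference, a quick significance estimate for the two numbers above, adding the statistical errors in quadrature instead of linearly (just an illustrative check on the quoted values):

  |1369 - 1357| / sqrt(1.473^2 + 2.333^2) = 12 / 2.76 ≈ 4.3

so the two cross sections differ by roughly 4.3 sigma, still within the 5 sigma mentioned above.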

@oliviermattelaer
Member

Those two indeed sound compatible with each other.
