More efficient unweighting using the GPU #642

Draft · wants to merge 4 commits into base: master

Conversation

@hageboeck (Member) commented Apr 14, 2023

Here, the unweighting in gg --> ggtt is improved. By computing the maximum event weight for each batch on the GPU, the unweighting function can reject candidate events based on their weights much earlier.

This speeds up the FORTRAN part by almost 3x with -O2 and 2x with -O3.

Some details might still need to be ironed out, so keeping this as draft for now.
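To illustrate the flow (names and signatures are just a sketch, not the actual bridge code in this PR; the two kernels are sketched further down under the commit messages):

```cuda
// Illustrative host-side flow only; names and signatures are assumptions,
// not the bridge API added by this PR.
#include <cuda_runtime.h>

__global__ void computeEventWeights(const double* me, const double* jac, const double* pdf,
                                     double* w, int nEvents);                               // sketched further below
__global__ void maxWeightPerBatch(const double* w, double* batchMax, int eventsPerBatch);   // sketched further below

void updateBatchMaxima(const double* d_me, const double* d_jac, const double* d_pdf,
                       double* d_weights, double* d_batchMax, double* h_batchMax,
                       int nBatches, int eventsPerBatch)
{
  const int nEvents = nBatches * eventsPerBatch;
  const int threads = 256;

  // 1) Combine matrix element, Jacobian and PDF weights per event, in device memory.
  computeEventWeights<<<(nEvents + threads - 1) / threads, threads>>>(d_me, d_jac, d_pdf, d_weights, nEvents);

  // 2) Reduce each batch to its maximum weight (one block per batch).
  maxWeightPerBatch<<<nBatches, threads, threads * sizeof(double)>>>(d_weights, d_batchMax, eventsPerBatch);

  // 3) Only the per-batch maxima are copied back; the Fortran unweighting
  //    can then reject candidates immediately against them.
  cudaMemcpy(h_batchMax, d_batchMax, nBatches * sizeof(double), cudaMemcpyDeviceToHost);
}
```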

Here is a diff of the output before vs. after with -O2 -g:

   Iteration  1   Mean: 0.1613E-03 Abs mean: 0.1613E-03   Fluctuation:  0.244E-05   0.132E+00    39.0%
   1    0.1613E-03 0.1613E-03 +- 0.2439E-05      7.74
  Relative summed weights:
@@ -215,13 +180,13 @@
                        Cross sec =  0.1613E-03
              Chi**2 per DoF.     =      0.0000
  -------------------------------------------------------------------------------
- Found        89923  events.
- Wrote          870  events.
+ Found         1475  events.
+ Wrote         1053  events.
  Actual xsec    1.6128154403965636E-004
  Correct abs xsec    1.6128154403965693E-004
- Event xsec    1.6128154403965384E-004
+ Event xsec    1.6128154403966026E-004
  Events wgts > 1:           32
- % Cross section > 1:    1.6703807456466780E-006   1.0356924318854401     
+ % Cross section > 1:    1.5015103183693524E-006  0.93098706817943189     
 -------------------------------------------------
 ---------------------------
  Results Last   1 iters: Integral =   0.1613E-03
@@ -232,6 +197,6 @@
 ---------------------------
  Status   9.9999999999999995E-007           2           1
 __CudaRuntime: calling cudaDeviceReset()
- [COUNTERS] PROGRAM TOTAL          :   16.0084s
- [COUNTERS] Fortran Overhead ( 0 ) :   15.4073s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.6011s for   278528 events => throughput is 4.63E+05 events/s
+ [COUNTERS] PROGRAM TOTAL          :    6.2821s
+ [COUNTERS] Fortran Overhead ( 0 ) :    5.6987s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5834s for   278528 events => throughput is 4.77E+05 events/s

And with -O3:

- [COUNTERS] PROGRAM TOTAL          :   8.46s
- [COUNTERS] Fortran Overhead ( 0 ) :   7.84s
+ [COUNTERS] PROGRAM TOTAL          :    4.69s
+ [COUNTERS] Fortran Overhead ( 0 ) :    5.6987s

Commit messages:

std::copy implementations are supposed to use memmove where possible (depending on the template parameters). Therefore, a manual check of the copied types is unnecessary. When the Fortran type and the C++ type are identical, std::copy automatically decays to memcpy.
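As a hedged illustration of that point (not code from this PR): when the element types on both sides are identical and trivially copyable, a plain std::copy is enough, and the library handles the bulk-copy optimisation:

```cuda
// Illustrative only: for trivially copyable, identical element types,
// typical standard-library implementations lower std::copy to a bulk
// memmove/memcpy, so no manual check of the copied types is needed.
#include <algorithm>
#include <cstddef>
#include <vector>

void copyFromFortran(const double* fortranBuffer, std::size_t n, std::vector<double>& cppBuffer)
{
  cppBuffer.resize(n);
  std::copy(fortranBuffer, fortranBuffer + n, cppBuffer.begin());
}
```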
Add kernels and bridge code to compute event weights on the GPU. Using the Jacobian and PDF weights from Fortran, the GPU can compute the total event weight in device memory. A second kernel computes the maximum of each batch and returns it to the host.
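A minimal sketch of what such a weight kernel can look like (hypothetical names; the actual kernel and bridge code may differ):

```cuda
// Illustrative kernel: combine per-event matrix element, Jacobian and PDF
// weights into a total event weight that stays in device memory.
__global__ void computeEventWeights(const double* me,   // matrix elements, already on the device
                                    const double* jac,  // Jacobian weights copied from Fortran
                                    const double* pdf,  // PDF weights copied from Fortran
                                    double* w,          // output: total event weights (device memory)
                                    int nEvents)
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if (ievt < nEvents) w[ievt] = me[ievt] * jac[ievt] * pdf[ievt];
}
```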
- For each batch, compute the maximum event weight on the GPU (see the reduction sketch after this list)
- Transfer this maximum into a common block for the unweighting steps
- This allows events to be rejected much earlier, instead of first writing them to tmp
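Roughly, the reduction in the first bullet could look like this (a sketch with assumed names; one block per batch, a power-of-two block size, and non-negative weights):

```cuda
// Illustrative per-batch maximum reduction. Assumes non-negative event
// weights and a power-of-two blockDim.x.
__global__ void maxWeightPerBatch(const double* w, double* batchMax, int eventsPerBatch)
{
  extern __shared__ double sdata[];
  const int tid = threadIdx.x;
  const double* batch = w + blockIdx.x * eventsPerBatch;

  // Each thread folds a strided subset of the batch into a local maximum.
  double localMax = 0.0;
  for (int i = tid; i < eventsPerBatch; i += blockDim.x)
    localMax = fmax(localMax, batch[i]);
  sdata[tid] = localMax;
  __syncthreads();

  // Standard tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
  {
    if (tid < stride) sdata[tid] = fmax(sdata[tid], sdata[tid + stride]);
    __syncthreads();
  }

  // One result per batch; only these few doubles travel back to the host,
  // where the PR stores them in a common block for the unweighting.
  if (tid == 0) batchMax[blockIdx.x] = sdata[0];
}
```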
Now that the maximum event weight can be computed for each batch, the unweighting fudge factor for accepting or rejecting an event can be chosen much closer to one. Here we stay on the conservative side and accept about twice as many events as end up in the final sample.
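As a sketch of where that fudge factor enters (placeholder names and value, not the ones used in this PR):

```cuda
// Illustrative only. With just an estimate of the maximum weight, a large
// safety factor is needed so that weights above the threshold stay rare;
// with the exact per-batch maximum from the GPU, the factor can sit close
// to 1. A conservative choice keeps roughly twice as many candidates at
// this stage as end up in the final sample.
inline double acceptanceProbability(double eventWeight, double batchMaxWeight,
                                    double fudge = 2.0 /* placeholder value */)
{
  const double p = eventWeight / (fudge * batchMaxWeight);
  return p < 1.0 ? p : 1.0; // candidates above fudge*max would be kept with weight > 1
}
```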