More efficient unweighting using the GPU #642

Draft · wants to merge 4 commits into base: master

Conversation

@hageboeck (Member) commented Apr 14, 2023

Here, the unweighting in gg --> ggtt is improved. By computing the maximum event weight for each batch on the GPU, the unweighting function can reject candidate events based on their weights much earlier.

This speeds up the FORTRAN part by almost 3x with -O2 and 2x with -O3.

Some details might still need to be ironed out, so keeping this as draft for now.
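To illustrate the flow (names and signatures are just a sketch, not the actual bridge code in this PR; the two kernels are sketched further down under the commit messages):

```cuda
// Illustrative host-side flow only; names and signatures are assumptions,
// not the bridge API added by this PR.
#include <cuda_runtime.h>

__global__ void computeEventWeights(const double* me, const double* jac, const double* pdf,
                                     double* w, int nEvents);                               // sketched further below
__global__ void maxWeightPerBatch(const double* w, double* batchMax, int eventsPerBatch);   // sketched further below

void updateBatchMaxima(const double* d_me, const double* d_jac, const double* d_pdf,
                       double* d_weights, double* d_batchMax, double* h_batchMax,
                       int nBatches, int eventsPerBatch)
{
  const int nEvents = nBatches * eventsPerBatch;
  const int threads = 256;

  // 1) Combine matrix element, Jacobian and PDF weights per event, in device memory.
  computeEventWeights<<<(nEvents + threads - 1) / threads, threads>>>(d_me, d_jac, d_pdf, d_weights, nEvents);

  // 2) Reduce each batch to its maximum weight (one block per batch).
  maxWeightPerBatch<<<nBatches, threads, threads * sizeof(double)>>>(d_weights, d_batchMax, eventsPerBatch);

  // 3) Only the per-batch maxima are copied back; the Fortran unweighting
  //    can then reject candidates immediately against them.
  cudaMemcpy(h_batchMax, d_batchMax, nBatches * sizeof(double), cudaMemcpyDeviceToHost);
}
```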

Here is a diff of the output before vs. after with -O2 -g:

   Iteration  1   Mean: 0.1613E-03 Abs mean: 0.1613E-03   Fluctuation:  0.244E-05   0.132E+00    39.0%
   1    0.1613E-03 0.1613E-03 +- 0.2439E-05      7.74
  Relative summed weights:
@@ -215,13 +180,13 @@
                        Cross sec =  0.1613E-03
              Chi**2 per DoF.     =      0.0000
  -------------------------------------------------------------------------------
- Found        89923  events.
- Wrote          870  events.
+ Found         1475  events.
+ Wrote         1053  events.
  Actual xsec    1.6128154403965636E-004
  Correct abs xsec    1.6128154403965693E-004
- Event xsec    1.6128154403965384E-004
+ Event xsec    1.6128154403966026E-004
  Events wgts > 1:           32
- % Cross section > 1:    1.6703807456466780E-006   1.0356924318854401     
+ % Cross section > 1:    1.5015103183693524E-006  0.93098706817943189     
 -------------------------------------------------
 ---------------------------
  Results Last   1 iters: Integral =   0.1613E-03
@@ -232,6 +197,6 @@
 ---------------------------
  Status   9.9999999999999995E-007           2           1
 __CudaRuntime: calling cudaDeviceReset()
- [COUNTERS] PROGRAM TOTAL          :   16.0084s
- [COUNTERS] Fortran Overhead ( 0 ) :   15.4073s
- [COUNTERS] CudaCpp MEs      ( 2 ) :    0.6011s for   278528 events => throughput is 4.63E+05 events/s
+ [COUNTERS] PROGRAM TOTAL          :    6.2821s
+ [COUNTERS] Fortran Overhead ( 0 ) :    5.6987s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :    0.5834s for   278528 events => throughput is 4.77E+05 events/s

And with -O3:

- [COUNTERS] PROGRAM TOTAL          :   8.46s
- [COUNTERS] Fortran Overhead ( 0 ) :   7.84s
+ [COUNTERS] PROGRAM TOTAL          :    4.69s
+ [COUNTERS] Fortran Overhead ( 0 ) :    5.6987s

Commit messages:

std::copy implementations are supposed to use memmove where possible (depending on the template parameters). Therefore, a manual check of the copied types is unnecessary. When the Fortran type and the C++ type are identical, std::copy automatically decays to memcpy.
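As a hedged illustration of that point (not code from this PR): when the element types on both sides are identical and trivially copyable, a plain std::copy is enough, and the library handles the bulk-copy optimisation:

```cuda
// Illustrative only: for trivially copyable, identical element types,
// typical standard-library implementations lower std::copy to a bulk
// memmove/memcpy, so no manual check of the copied types is needed.
#include <algorithm>
#include <cstddef>
#include <vector>

void copyFromFortran(const double* fortranBuffer, std::size_t n, std::vector<double>& cppBuffer)
{
  cppBuffer.resize(n);
  std::copy(fortranBuffer, fortranBuffer + n, cppBuffer.begin());
}
```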
Add kernels and bridge code to compute event weights on the GPU. Using the Jacobian and PDF weights from Fortran, the GPU can compute the total event weight in device memory. A second kernel computes the maximum of each batch and returns it to the host.
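A minimal sketch of what such a weight kernel can look like (hypothetical names; the actual kernel and bridge code may differ):

```cuda
// Illustrative kernel: combine per-event matrix element, Jacobian and PDF
// weights into a total event weight that stays in device memory.
__global__ void computeEventWeights(const double* me,   // matrix elements, already on the device
                                    const double* jac,  // Jacobian weights copied from Fortran
                                    const double* pdf,  // PDF weights copied from Fortran
                                    double* w,          // output: total event weights (device memory)
                                    int nEvents)
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  if (ievt < nEvents) w[ievt] = me[ievt] * jac[ievt] * pdf[ievt];
}
```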
- For each batch, compute the maximum event weight on the GPU (see the reduction sketch after this list)
- Transfer this maximum into a common block for the unweighting steps
- This allows events to be rejected much earlier, instead of first writing them to tmp
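Roughly, the reduction in the first bullet could look like this (a sketch with assumed names; one block per batch, a power-of-two block size, and non-negative weights):

```cuda
// Illustrative per-batch maximum reduction. Assumes non-negative event
// weights and a power-of-two blockDim.x.
__global__ void maxWeightPerBatch(const double* w, double* batchMax, int eventsPerBatch)
{
  extern __shared__ double sdata[];
  const int tid = threadIdx.x;
  const double* batch = w + blockIdx.x * eventsPerBatch;

  // Each thread folds a strided subset of the batch into a local maximum.
  double localMax = 0.0;
  for (int i = tid; i < eventsPerBatch; i += blockDim.x)
    localMax = fmax(localMax, batch[i]);
  sdata[tid] = localMax;
  __syncthreads();

  // Standard tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
  {
    if (tid < stride) sdata[tid] = fmax(sdata[tid], sdata[tid + stride]);
    __syncthreads();
  }

  // One result per batch; only these few doubles travel back to the host,
  // where the PR stores them in a common block for the unweighting.
  if (tid == 0) batchMax[blockIdx.x] = sdata[0];
}
```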
Now that the maximum event weight can be computed for each batch, the unweighting fudge factor for accepting or rejecting an event can be chosen much closer to one. Here we stay on the conservative side and accept about twice as many events as end up in the final sample.
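As a sketch of where that fudge factor enters (placeholder names and value, not the ones used in this PR):

```cuda
// Illustrative only. With just an estimate of the maximum weight, a large
// safety factor is needed so that weights above the threshold stay rare;
// with the exact per-batch maximum from the GPU, the factor can sit close
// to 1. A conservative choice keeps roughly twice as many candidates at
// this stage as end up in the final sample.
inline double acceptanceProbability(double eventWeight, double batchMaxWeight,
                                    double fudge = 2.0 /* placeholder value */)
{
  const double p = eventWeight / (fudge * batchMaxWeight);
  return p < 1.0 ? p : 1.0; // candidates above fudge*max would be kept with weight > 1
}
```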