Trivial improvements for xbin_min and xbin_max may lead to speedups in sample_get_x #969

valassi · 2024-08-15T15:15:46Z

I am running a few tests on sample_get_x with a view to vectorising it, see #963.

Apart from the issue reported in #968, I think I have identified two more trivial but useful improvements in sample_get_x:

One, some minor streamlining of the xbin_min and xbin_max calculations seems to be useful.
Two, I checked that in a case like CMS DY+3j (see #943, "Understand why CMS sees a speedup in DY+4jets but not DY+3 jets"), the function is most often called with xmin=0 or xmax=1, and it is possible to cache xbin_min for xmin=0 and xbin_max for xmax=1.

This is WIP to be confirmed.

valassi · 2024-08-15T15:21:37Z

Two, I checked that in a case like CMS DY+3j, the function is most often called with xmin=0 or xmax=1, and it is possible to cache these values

This is 291bcf5
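As a rough illustration of the caching idea (not the actual Fortran in dsample.f, which is not shown in this thread), the sketch below memoises a bin lookup only for the two boundary values 0 and 1. The class `BinGrid`, its `edges` layout and the `xbin` interpolation are all hypothetical stand-ins chosen for the example; only the idea of caching the results for xmin=0 and xmax=1 comes from the discussion above.

```python
from bisect import bisect_right


class BinGrid:
    """Hypothetical stand-in for one dimension of a sampling grid."""

    def __init__(self, edges):
        # Assumption: edges are increasing bin boundaries running from 0.0 to 1.0.
        self.edges = list(edges)
        self.lookups = 0   # number of full (uncached) lookups performed
        self._cache = {}   # cached results, used only for x = 0.0 and x = 1.0

    def _xbin_slow(self, x):
        """Fractional bin position of x: bin index plus the offset inside that bin."""
        self.lookups += 1
        i = bisect_right(self.edges, x) - 1
        i = max(0, min(i, len(self.edges) - 2))  # clamp to a valid bin index
        lo, hi = self.edges[i], self.edges[i + 1]
        return i + (x - lo) / (hi - lo)

    def xbin(self, x):
        # Only the two boundary values are cached: they are the common case.
        if x == 0.0 or x == 1.0:
            if x not in self._cache:
                self._cache[x] = self._xbin_slow(x)
            return self._cache[x]
        return self._xbin_slow(x)

    def set_edges(self, edges):
        # Any change of the grid must invalidate the cached boundary values.
        self.edges = list(edges)
        self._cache.clear()


grid = BinGrid([0.0, 0.1, 0.4, 1.0])
for _ in range(1000):
    grid.xbin(0.0)
    grid.xbin(1.0)
print(grid.lookups)  # only two full lookups were needed for 2000 calls
```

The one design point worth noting is the invalidation in `set_edges`: a cache like this is only safe if it is cleared whenever the grid is updated.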

valassi · 2024-08-15T15:23:29Z

One, some minor streamlining of xbin_min and xbin_max calculations seems to be useful

This might be the change in 23a1358, but it seems too silly to have an effect; maybe the improvement came from elsewhere.

…ode for xbin_min and xbin_max (part1 of madgraph5#969)

There is indeed a small but clear improvement

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.5494s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1688s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.2830s for  1170103 events => throughput is 2.81E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1061s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1361s for    16384 events => throughput is 8.31E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0519s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0649s for    16384 events => throughput is 3.96E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1366s for  1170103 events => throughput is 1.17E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4745s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.5145s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

… for xmin=0 and xbin_max for xmax=1 (part2 of madgraph5#969)

There is indeed another clear and not too small improvement

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.2184s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1695s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.9293s for  1170103 events => throughput is 2.50E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1094s for    49152 events => throughput is 2.23E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1379s for    16384 events => throughput is 8.42E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0560s for    16384 events => throughput is 3.42E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0707s for    16384 events => throughput is 4.31E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1447s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1834s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s

valassi · 2024-08-15T16:13:21Z

Compare the timings of the three versions, starting from the default 079207d

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 Found          997  events.
 Wrote           59  events.
 Actual xsec    5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                         :    4.6537s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1603s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0673s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.4183s for  1170103 events => throughput is 2.92E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1002s for    49152 events => throughput is 2.04E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1307s for    16384 events => throughput is 7.98E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0505s for    16384 events => throughput is 3.08E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0657s for    16384 events => throughput is 4.01E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1321s for  1170103 events => throughput is 1.13E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4682s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.6191s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
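When reading these counters it helps to recompute the ratios by hand. The snippet below is a small sketch (the regular expression and the helper name `parse_counter` are my own, not part of madevent) that parses one counter line from the output above. For this line, the value printed after "throughput is" appears to equal the time divided by the number of events, that is seconds per event, even though it is labelled "events/s".

```python
import re

# One line copied from the output of the default build above.
LINE = (" [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.4183s"
        " for  1170103 events => throughput is 2.92E-06 events/s")

# Hypothetical pattern for a counter line: name, counter id, seconds, optional event count.
PATTERN = re.compile(
    r"\[COUNTERS\]\s+(?P<name>.+?)\s+\(\s*(?P<id>\d+)\s*\)\s*:\s*"
    r"(?P<secs>[\d.]+)s(?:\s+for\s+(?P<events>\d+)\s+events)?"
)


def parse_counter(line):
    """Return (name, seconds, events); events is None when no count is printed."""
    match = PATTERN.search(line)
    if match is None:
        return None
    events = int(match.group("events")) if match.group("events") else None
    return match.group("name"), float(match.group("secs")), events


name, secs, events = parse_counter(LINE)
seconds_per_event = secs / events   # matches the printed 2.92E-06
events_per_second = events / secs   # the actual rate, a few hundred thousand per second
print(f"{name}: {seconds_per_event:.2E} s/event, {events_per_second:.3E} events/s")
```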

Then change 1, removing a few xbin calls (b69c61c)

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.5494s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1688s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.2830s for  1170103 events => throughput is 2.81E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1061s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1361s for    16384 events => throughput is 8.31E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0519s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0649s for    16384 events => throughput is 3.96E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1366s for  1170103 events => throughput is 1.17E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4745s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.5145s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

Then change 2, caching the xbin values (a6d57a8)

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.2184s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1695s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.9293s for  1170103 events => throughput is 2.50E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1094s for    49152 events => throughput is 2.23E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1379s for    16384 events => throughput is 8.42E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0560s for    16384 events => throughput is 3.42E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0707s for    16384 events => throughput is 4.31E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1447s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1834s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s

I think this could become a small standalone PR. To discuss with @oliviermattelaer
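To put numbers on the three runs above, here is a short worked calculation of the relative reduction in the Random2Momenta counter. The timings are copied from the logs in this comment; the helper `reduction_percent` is just an illustrative name.

```python
# Random2Momenta timings in seconds, copied from the three runs above.
random2momenta = {
    "default (079207d)": 3.4183,
    "change 1, fewer xbin calls (b69c61c)": 3.2830,
    "change 2, cached xbin values (a6d57a8)": 2.9293,
}


def reduction_percent(before, after):
    """Relative reduction of 'after' with respect to 'before', in percent."""
    return 100.0 * (before - after) / before


default, change1, change2 = random2momenta.values()
step1 = reduction_percent(default, change1)
step2 = reduction_percent(change1, change2)
total = reduction_percent(default, change2)
print(f"change 1 alone: {step1:.1f}%")
print(f"change 2 on top of change 1: {step2:.1f}%")
print(f"both changes: {total:.1f}%")
```

On these single runs the two changes give reductions of about 4% and 11%, or about 14% combined. These are single measurements, so small differences should be read with some caution.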

… gg_tt.mad), comment out dead if/then branches (for warnings that are commented out)

This is another minor component of madgraph5#969. It gives almost insignificant performance improvements, but it simplifies the code.

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.1574s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1706s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8950s for  1170103 events => throughput is 2.47E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1021s for    49152 events => throughput is 2.08E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1360s for    16384 events => throughput is 8.30E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0518s for    16384 events => throughput is 3.16E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0679s for    16384 events => throughput is 4.15E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1401s for  1170103 events => throughput is 1.20E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4658s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1227s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s

… gg_tt.mad), skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set (part3 of madgraph5#969)

This is a very large improvement, but it may be more controversial, hence it is disabled by default...

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.1142s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1610s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8821s for  1170103 events => throughput is 2.46E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0962s for    49152 events => throughput is 1.96E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1278s for    16384 events => throughput is 7.80E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0485s for    16384 events => throughput is 2.96E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0670s for    16384 events => throughput is 4.09E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1355s for  1170103 events => throughput is 1.16E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4683s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0262s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0348s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.0794s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0348s for    16384 events => throughput is 2.13E-06 events/s

CUDACPP_RUNTIME_SKIPXBINCHECKS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    3.2969s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1726s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.0464s for  1170103 events => throughput is 1.75E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0958s for    49152 events => throughput is 1.95E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1298s for    16384 events => throughput is 7.92E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0482s for    16384 events => throughput is 2.94E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0656s for    16384 events => throughput is 4.00E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1412s for  1170103 events => throughput is 1.21E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4685s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    3.2620s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
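The opt-in pattern described in this commit message can be sketched as follows. This is only an illustration in Python, not the Fortran patch itself: the environment variable name comes from the commit message, but the class `Sampler`, the function `checks_enabled` and the range check used as a stand-in for the "xbin checks" are hypothetical, as is the assumption that any non-empty value counts as "set".

```python
import os

ENV_VAR = "CUDACPP_RUNTIME_SKIPXBINCHECKS"  # name taken from the commit message


def checks_enabled(environ=None):
    """Checks stay enabled unless the variable is set (assumed: to any non-empty value)."""
    env = os.environ if environ is None else environ
    return not env.get(ENV_VAR)


class Sampler:
    """Hypothetical sampler that validates its inputs unless told to skip the checks."""

    def __init__(self, environ=None):
        # Read the environment once at start-up, not once per sampled point.
        self.check = checks_enabled(environ)

    def get_x(self, x, xmin, xmax):
        # Stand-in for the real xbin checks, which are not shown in this thread.
        if self.check and not (xmin <= x <= xmax):
            raise ValueError(f"x={x} outside [{xmin}, {xmax}]")
        return x


safe = Sampler(environ={})              # default: checks are performed
fast = Sampler(environ={ENV_VAR: "1"})  # opt-in: checks are skipped
print(safe.check, fast.check)
```

Keeping the default on the safe side, as the commit message does, means that a user has to ask explicitly for the faster but less defensive behaviour.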

…5#969 performance improvements in sample_get_x in dsample.f

This includes
- simplify the code for xbin_min and xbin_max (remove dead code)
- cache xbin_min for xmin=0 and xbin_max for xmax=1
- comment out dead if/then branches (for warnings that were already commented out)
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad is ok)

…graph5#969 improvements in dsample.f) on itscrd90

Code generation completed in 245 seconds
Code generation and additional checks completed in 372 seconds

…5#969 performance improvements in sample_get_x in dsample.f

This includes
- simplify the code for xbin_min and xbin_max (remove dead code)
- cache xbin_min for xmin=0 and xbin_max for xmax=1
- comment out dead if/then branches (for warnings that were already commented out)
- [NOT YET INCLUDED! I forgot this...] optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad is ok)

…graph5#969 improvements in dsample.f) on itscrd90 [NB: CUDACPP_RUNTIME_SKIPXBINCHECKS is still missing here!]

Code generation completed in 245 seconds
Code generation and additional checks completed in 372 seconds

…cluding the latest timers/counters and madgraph5#969 sample_get_x speedups [NB: CUDACPP_RUNTIME_SKIPXBINCHECKS still missing!]

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack

… CUDACPP_RUNTIME_SKIPXBINCHECKS patch madgraph5#968 (on top of madgraph5#969)

This includes
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad
(Later checked that regenerating gg_tt.mad is ok)

…UDACPP_RUNTIME_SKIPXBINCHECKS set madgraph5#968 : big improvement!)

For the cuda backend is now, skipping xbin checks madgraph5#968
Phase space sampling in dy+3j has decreased from 78s to 53s (down by 30%) thanks to removal of xbin checks
> [GridPackCmd.launch] GRIDPCK TOTAL 135.1144
> [madevent COUNTERS] PROGRAM TOTAL 130.8140s
> [madevent COUNTERS] Fortran PhaseSpaceSampling 53.0338s for 44652395 events
> ...
> [madevent COUNTERS] CudaCpp MEs 35.4908s for 1769472 events
> [madevent COUNTERS] OVERALL NON-MEs 95.3232s
> [madevent COUNTERS] OVERALL MEs 35.4908s for 1769472 events

For the cuda backend was, including xbin checks but including trivial improvements madgraph5#969
Phase space sampling in dy+3j has decreased from 93s to 78s (down by 15%) thanks to the trivial improvements madgraph5#969
< [GridPackCmd.launch] GRIDPCK TOTAL 160.1718
< [madevent COUNTERS] PROGRAM TOTAL 155.8605s
< [madevent COUNTERS] Fortran PhaseSpaceSampling 78.1023s for 44652395 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4320s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 120.4290s
< [madevent COUNTERS] OVERALL MEs 35.4320s for 1769472 events

For the cuda backend was in 2e59eca, without trivial improvements
< [GridPackCmd.launch] GRIDPCK TOTAL 176.8891
< [madevent COUNTERS] PROGRAM TOTAL 172.6370s
< [madevent COUNTERS] Fortran Random2Momenta 93.2907s for 44651014 events
< ...
< [madevent COUNTERS] CudaCpp MEs 35.4557s for 1769472 events
< [madevent COUNTERS] OVERALL NON-MEs 137.1806s
< [madevent COUNTERS] OVERALL MEs 35.4557s for 1769472 events
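A complementary way to read the gridpack numbers in this commit message is as the share of PROGRAM TOTAL spent in phase space sampling. The snippet below is a small worked calculation using only the timings quoted above; the dictionary labels are my own shorthand for the three configurations.

```python
# (phase space sampling seconds, PROGRAM TOTAL seconds), copied from the commit message.
runs = {
    "2e59eca, without trivial improvements": (93.2907, 172.6370),
    "with trivial improvements (#969)": (78.1023, 155.8605),
    "with #969 and skipped xbin checks (#968)": (53.0338, 130.8140),
}

shares = {}
for label, (sampling, total) in runs.items():
    # Fraction of the total madevent time spent in phase space sampling, in percent.
    shares[label] = 100.0 * sampling / total
    print(f"{label}: {shares[label]:.1f}% of PROGRAM TOTAL")
```

The sampling share goes from about 54% to about 50% and then to about 41%, while the matrix element time stays at roughly 35.5s in all three runs.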

…ts - but not yet the latest upstream/master) into cmsdyps Fix conflicts in patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)

…ts - but not yet the latest upstream/master) into cmsdyps Fix conflicts in patch.P1 and patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)

valassi self-assigned this Aug 15, 2024

valassi linked a pull request Aug 15, 2024 that will close this issue

WIP: studies on CMS DY with phase space optimizations #970

Draft

valassi mentioned this issue Aug 15, 2024

Vectorise phase space sampling (port x_to_f_arg to cudacpp with SIMD and GPU support - starting with sample_get_x?) #963

Open

valassi linked a pull request Aug 19, 2024 that will close this issue

WIP: studies on CMS DY #946

Draft
