A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
- on Windows: compile with Visual Studio Community
- on Linux: run `chmod +x make.sh` and `./make.sh path/to/kernel.ptx`
- Generate a `.ptx` file from your application; this works only with an Nvidia GPU. With the OpenCL-Wrapper, you can simply uncomment `#define PTX` in `src/opencl.hpp` and compile and run. A file `kernel.ptx` is created, containing the PTX assembly code.
- Run `bin/PTXprofiler.exe path/to/kernel.ptx`. For FluidX3D, for example, this table is generated:
kernel name |flops (float int bit )|copy |branch|cache (load store)|memory (load cached store)
--------------------------------|---------------------------|------|------|--------------------|---------------------------
initialize | 283 129 61 93| 33| 6| 0 0 0| 135 35 0 100
stream_collide | 363 261 35 67| 23| 2| 0 0 0| 153 77 0 76
update_fields | 160 56 37 67| 21| 2| 0 0 0| 93 77 0 16
voxelize_mesh | 170 91 34 45| 40| 11| 84 48 36| 37 36 0 1
transfer_extract_fi | 460 0 221 239| 122| 63| 0 0 0| 180 80 20 80
transfer__insert_fi | 483 0 247 236| 115| 47| 0 0 0| 180 80 20 80
transfer_extract_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34
transfer__insert_rho_u_flags | 47 0 39 8| 23| 1| 0 0 0| 68 34 0 34
- For each OpenCL/CUDA kernel, instructions are counted and listed:
  - GPUs compute floating-point, integer and bit manipulation operations on the same ALUs, so they are counted combined as `flops`, but also listed separately as `float`, `int` and `bit`.
  - Data movement operations are listed under `copy`.
  - Branches are listed under `branch`.
  - Total shared/local memory (L1 cache) accesses in bytes are listed under `cache`, with separate counters for `load` and `store`.
  - Total global memory (VRAM) accesses in bytes are listed under `memory`, with separate counters for `load`, `cached` (load from VRAM or L2 cache) and `store`.
- You can use the counted `flops` and `memory` accesses, together with the measured execution time of the kernel, to place it in a roofline model diagram.
- Matrix/tensor operations are not yet supported.
- Non-unrolled loops are only counted for one iteration, but may be executed multiple times, so the number of instructions actually executed inside the loop is a multiple of the counted value.
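
To illustrate how the counted `flops` and `memory` values can be turned into a point in a roofline diagram, here is a minimal Python sketch. It is not part of PTXprofiler. The helper names `parse_row` and `roofline_point` are hypothetical, and it assumes that the counters are per thread, that the first number in each column group is the group total (which is consistent with the table above), and that the thread count and execution time are placeholder values you would replace with your own measurements.

```python
# Sketch only: turn one row of the PTXprofiler table into roofline coordinates.
# Assumptions (not stated in the README): counters are per thread; the thread
# count and the execution time used below are made-up placeholder values.

def parse_row(row: str) -> dict:
    """Split one table row into named counters."""
    name, flops, copy, branch, cache, memory = (cell.strip() for cell in row.split("|"))
    f_total, f_float, f_int, f_bit = map(int, flops.split())
    c_total, c_load, c_store = map(int, cache.split())
    m_total, m_load, m_cached, m_store = map(int, memory.split())
    return {
        "name": name,
        "flops": f_total, "float": f_float, "int": f_int, "bit": f_bit,
        "copy": int(copy), "branch": int(branch),
        "cache": c_total, "cache_load": c_load, "cache_store": c_store,
        "memory": m_total, "memory_load": m_load,
        "memory_cached": m_cached, "memory_store": m_store,
    }

def roofline_point(flops_per_thread, bytes_per_thread, threads, seconds):
    """Return (arithmetic intensity in flops/byte, performance in flops/s)."""
    intensity = flops_per_thread / bytes_per_thread
    performance = flops_per_thread * threads / seconds
    return intensity, performance

if __name__ == "__main__":
    row = "stream_collide | 363 261 35 67| 23| 2| 0 0 0| 153 77 0 76"
    k = parse_row(row)
    # Placeholder values: 256^3 threads, 5 ms measured kernel execution time.
    intensity, performance = roofline_point(k["flops"], k["memory"],
                                            threads=256**3, seconds=5e-3)
    print(f"{k['name']}: {intensity:.2f} flops/byte, {performance / 1e12:.2f} Tflops/s")
```

With these placeholder values, `stream_collide` lands at about 2.37 flops/byte on the horizontal axis and about 1.22 Tflops/s on the vertical axis. Because non-unrolled loops are counted only once, the per-thread counts (and thus the computed performance) are a lower bound for kernels that contain such loops.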