A library for layer-by-layer profiling of PyTorch models. It also supports annotating regions, functions, and iterators with NVTX ranges, suitable for tools such as Nsight Systems.
All metrics are derived using the PyTorch autograd profiler.
Originally based on awwong1's torchprof, but it has been completely rewritten since.
- Profile non-leaf layers
- Annotate arbitrary regions/functions and profile them
- Filter by node name or % of total time
- Sort the table by a chosen metric
- Colored-by-level printing to make the table easy to read in the terminal
- NVTX support
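The region and function annotation listed above can be pictured with a small standard-library sketch. The `Region` class and the `timings` dictionary are hypothetical names used for illustration only; they are not torchprof's API, and wall-clock timing stands in here for the autograd profiler and NVTX ranges.

```python
import time
from collections import defaultdict
from contextlib import ContextDecorator

# Hypothetical store: region name -> list of elapsed seconds.
timings = defaultdict(list)


class Region(ContextDecorator):
    """Time a named region; usable as a context manager or a decorator."""

    def __init__(self, name):
        self.name = name
        self._starts = []  # stack of start times, so nested use works

    def __enter__(self):
        self._starts.append(time.perf_counter())
        return self

    def __exit__(self, exc_type, exc, tb):
        elapsed = time.perf_counter() - self._starts.pop()
        timings[self.name].append(elapsed)
        return False  # never swallow exceptions


@Region("square")  # annotate a whole function
def square(x):
    return x * x


with Region("loop"):  # annotate an arbitrary block
    total = sum(square(i) for i in range(3))
```

After this runs, `timings["square"]` holds one entry per call and `timings["loop"]` holds a single entry for the enclosing block, which is the kind of nesting the profile tables in this README display.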
import torch
import torchvision
import torchprof
model = torchvision.models.alexnet(pretrained=False).cuda()
x = torch.rand([1, 3, 224, 224]).cuda()
with torchprof.profile(model, use_cuda=True) as prof:
    _ = model(x)
prof.display(min_pct=0)
+---------------+---------------+---------------+-------------+---------------+---------+
| Node          | Self CPU      | CPU           | Self CUDA   | CUDA          | Count   |
|---------------+---------------+---------------+-------------+---------------+---------|
| AlexNet       | 92.1us (3%)   | 3.0ms (99%)   | 42.0us (0%) | 11.6ms (100%) | 1       |
| ├──classifier | 136.9us (4%)  | 681.1us (22%) | 29.7us (0%) | 7.1ms (61%)   | 1       |
| │ ├──1        | 33.3us (1%)   | 102.3us (3%)  | 4.1us (0%)  | 4.5ms (39%)   | 1       |
| │ ├──4        | 26.9us (1%)   | 77.8us (3%)   | 3.1us (0%)  | 1.9ms (16%)   | 1       |
| │ ├──6        | 25.8us (1%)   | 74.2us (2%)   | 3.1us (0%)  | 506.9us (4%)  | 1       |
| │ ├──2        | 13.9us (0%)   | 36.2us (1%)   | 4.1us (0%)  | 25.6us (0%)   | 1       |
| │ ├──0        | 22.0us (1%)   | 69.6us (2%)   | 4.1us (0%)  | 24.6us (0%)   | 1       |
| │ ├──3        | 18.9us (1%)   | 57.5us (2%)   | 4.1us (0%)  | 19.5us (0%)   | 1       |
| │ └──5        | 13.7us (0%)   | 36.1us (1%)   | 6.1us (0%)  | 18.4us (0%)   | 1       |
| ├──features   | 312.1us (10%) | 2.1ms (69%)   | 69.0us (1%) | 4.4ms (37%)   | 1       |
| │ ├──3        | 23.0us (1%)   | 231.4us (8%)  | 4.1us (0%)  | 1.2ms (10%)   | 1       |
| │ ├──8        | 21.2us (1%)   | 130.6us (4%)  | 4.1us (0%)  | 809.0us (7%)  | 1       |
| │ ├──6        | 26.5us (1%)   | 291.8us (10%) | 4.1us (0%)  | 636.9us (5%)  | 1       |
| │ ├──10       | 45.1us (1%)   | 224.3us (7%)  | 4.1us (0%)  | 576.5us (5%)  | 1       |
| │ ├──0        | 28.4us (1%)   | 212.6us (7%)  | 20.0us (0%) | 548.2us (5%)  | 1       |
| │ ├──2        | 21.3us (1%)   | 69.6us (2%)   | 4.1us (0%)  | 82.9us (1%)   | 1       |
| │ ├──5        | 17.5us (1%)   | 58.8us (2%)   | 4.1us (0%)  | 65.5us (1%)   | 1       |
| │ ├──1        | 18.2us (1%)   | 47.5us (2%)   | 3.1us (0%)  | 59.4us (1%)   | 1       |
| │ ├──4        | 16.1us (1%)   | 39.9us (1%)   | 4.1us (0%)  | 51.2us (0%)   | 1       |
| │ ├──9        | 33.5us (1%)   | 58.0us (2%)   | 4.1us (0%)  | 32.8us (0%)   | 1       |
| │ ├──12       | 42.3us (1%)   | 113.4us (4%)  | 3.1us (0%)  | 31.7us (0%)   | 1       |
| │ ├──7        | 14.7us (0%)   | 36.4us (1%)   | 3.1us (0%)  | 20.5us (0%)   | 1       |
| │ └──11       | 23.5us (1%)   | 68.9us (2%)   | 4.1us (0%)  | 19.5us (0%)   | 1       |
| └──avgpool    | 27.5us (1%)   | 81.9us (3%)   | 4.1us (0%)  | 77.8us (1%)   | 1       |
| aten::zeros   | 9.7us (0%)    | 23.7us (1%)   | 12.4us (0%) | 23.2us (0%)   | 1       |
+---------------+---------------+---------------+-------------+---------------+---------+
prof.display(min_pct=1)
+---------------+--------------+--------------+-------------+---------------+---------+
| Node          | Self CPU     | CPU          | Self CUDA   | CUDA          | Count   |
|---------------+--------------+--------------+-------------+---------------+---------|
| AlexNet       | 109.8us (2%) | 6.1ms (100%) |             | 12.5ms (100%) | 1       |
| ├──classifier | 143.5us (2%) | 2.9ms (47%)  |             | 7.2ms (57%)   | 1       |
| │ ├──1        |              | 2.2ms (37%)  |             | 4.7ms (37%)   | 1       |
| │ ├──4        |              | 83.4us (1%)  |             | 1.9ms (15%)   | 1       |
| │ ├──6        |              | 74.2us (1%)  |             | 499.7us (4%)  | 1       |
| │ ├──0        |              | 100.9us (2%) |             |               | 1       |
| ├──features   | 265.2us (4%) | 3.0ms (49%)  |             | 5.2ms (41%)   | 1       |
| │ ├──0        | 77.4us (1%)  | 1.4ms (23%)  |             | 1.6ms (13%)   | 1       |
| │ ├──3        |              | 243.4us (4%) |             | 1.1ms (9%)    | 1       |
| │ ├──8        |              | 132.5us (2%) |             | 775.2us (6%)  | 1       |
| │ ├──6        |              | 166.3us (3%) |             | 611.3us (5%)  | 1       |
| │ ├──10       |              | 127.4us (2%) |             | 544.8us (4%)  | 1       |
| │ ├──2        |              | 107.7us (2%) |             |               | 1       |
| │ ├──1        |              | 77.4us (1%)  |             |               | 1       |
| └──avgpool    |              | 85.9us (1%)  |             |               | 1       |
+---------------+--------------+--------------+-------------+---------------+---------+
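Judging only from the sample output above, `min_pct` appears to blank any cell that falls below the given percentage of its column total and to hide rows that end up with no timing cells. The sketch below illustrates that reading on flat rows; `filter_rows` and its data layout are hypothetical and are not taken from torchprof's source.

```python
def filter_rows(rows, totals, min_pct):
    """Blank cells below min_pct of the column total; drop rows left empty.

    rows:   list of (name, {metric: microseconds})
    totals: {metric: microseconds} for the whole model
    """
    kept = []
    for name, metrics in rows:
        cells = {}
        for metric, value in metrics.items():
            total = totals.get(metric, 0.0)
            pct = 100.0 * value / total if total else 0.0
            # Keep the cell only if it reaches the threshold.
            cells[metric] = value if pct >= min_pct else None
        if any(v is not None for v in cells.values()):
            kept.append((name, cells))
    return kept


# Illustrative values only (microseconds).
rows = [
    ("features.0", {"cpu": 1400.0, "cuda": 1600.0}),
    ("classifier.0", {"cpu": 100.9, "cuda": 22.5}),
    ("features.9", {"cpu": 34.9, "cuda": 17.4}),
]
totals = {"cpu": 6100.0, "cuda": 12500.0}
result = filter_rows(rows, totals, min_pct=1)
```

With these numbers, `features.9` is dropped entirely, `classifier.0` keeps its CPU cell but loses its CUDA cell, and `min_pct=0` keeps every row.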
Turn off the default filtering (by default, only nn.Module nodes and torchprof regions are shown):
prof.display(min_pct=1, allow=[], block=[])
+----------------------------------------+--------------+---------------+--------------+---------------+---------+
| Node                                   | Self CPU     | CPU           | Self CUDA    | CUDA          | Count   |
|----------------------------------------+--------------+---------------+--------------+---------------+---------|
| AlexNet                                | 118.3us (4%) | 2.9ms (99%)   |              | 10.7ms (100%) | 1       |
| ├──classifier                          | 137.7us (5%) | 682.0us (23%) |              | 6.9ms (65%)   | 1       |
| │ ├──1                                 | 32.8us (1%)  | 102.5us (3%)  |              | 4.4ms (41%)   | 1       |
| │ │ ├──aten::addmm                     | 48.7us (2%)  | 56.4us (2%)   | 4.4ms (41%)  | 4.4ms (41%)   | 1       |
| │ ├──4                                 |              | 76.8us (3%)   |              | 1.9ms (18%)   | 1       |
| │ │ ├──aten::addmm                     | 34.1us (1%)  | 40.7us (1%)   | 1.9ms (18%)  | 1.9ms (18%)   | 1       |
| │ ├──6                                 |              | 74.0us (3%)   |              | 498.7us (5%)  | 1       |
| │ │ ├──aten::addmm                     | 33.0us (1%)  | 39.5us (1%)   | 494.6us (5%) | 494.6us (5%)  | 1       |
| │ ├──aten::zeros                       | 37.8us (1%)  | 90.8us (3%)   |              |               | 7       |
| │ │ ├──aten::zero_                     |              | 43.6us (1%)   |              |               | 7       |
| │ ├──0                                 |              | 71.4us (2%)   |              |               | 1       |
| │ │ ├──aten::dropout                   |              | 47.4us (2%)   |              |               | 1       |
| │ │ │ └──aten::_fused_dropout          | 31.1us (1%)  | 40.6us (1%)   |              |               | 1       |
| │ ├──3                                 |              | 57.2us (2%)   |              |               | 1       |
| │ │ ├──aten::dropout                   |              | 38.0us (1%)   |              |               | 1       |
| │ │ │ └──aten::_fused_dropout          |              | 32.5us (1%)   |              |               | 1       |
| │ ├──5                                 |              | 35.0us (1%)   |              |               | 1       |
| │ ├──2                                 |              | 35.7us (1%)   |              |               | 1       |
| ├──features                            | 273.9us (9%) | 2.0ms (67%)   |              | 3.6ms (33%)   | 1       |
| │ ├──3                                 |              | 135.9us (5%)  |              | 745.5us (7%)  | 1       |
| │ │ ├──aten::conv2d                    |              | 112.0us (4%)  |              | 742.4us (7%)  | 1       |
| │ │ │ └──aten::convolution             |              | 106.8us (4%)  |              | 738.3us (7%)  | 1       |
...
prof.display(sort_by=["self_cuda_time"], min_pct=0)
+---------------+--------------+--------------+-------------+---------------+---------+
| Node          | Self CPU     | CPU          | Self CUDA   | CUDA          | Count   |
|---------------+--------------+--------------+-------------+---------------+---------|
| AlexNet       | 110.4us (2%) | 6.1ms (100%) | 39.3us (0%) | 12.6ms (100%) | 1       |
| ├──features   | 265.5us (4%) | 3.0ms (48%)  | 67.7us (1%) | 5.2ms (41%)   | 1       |
| │ ├──0        | 79.8us (1%)  | 1.4ms (23%)  | 40.4us (0%) | 1.6ms (13%)   | 1       |
| │ ├──10       | 19.9us (0%)  | 127.8us (2%) | 4.1us (0%)  | 548.9us (4%)  | 1       |
| │ ├──5        | 17.3us (0%)  | 57.7us (1%)  | 4.1us (0%)  | 59.4us (0%)   | 1       |
| │ ├──12       | 16.8us (0%)  | 56.7us (1%)  | 4.1us (0%)  | 28.7us (0%)   | 1       |
| │ ├──2        | 44.0us (1%)  | 107.3us (2%) | 4.1us (0%)  | 74.8us (1%)   | 1       |
| │ ├──11       | 13.8us (0%)  | 34.7us (1%)  | 4.1us (0%)  | 19.5us (0%)   | 1       |
| │ ├──3        | 24.2us (0%)  | 238.5us (4%) | 4.1us (0%)  | 1.1ms (9%)    | 1       |
| │ ├──6        | 22.1us (0%)  | 169.6us (3%) | 4.1us (0%)  | 612.4us (5%)  | 1       |
| │ ├──9        | 13.9us (0%)  | 34.9us (1%)  | 4.1us (0%)  | 17.4us (0%)   | 1       |
| │ ├──4        | 14.9us (0%)  | 37.2us (1%)  | 3.1us (0%)  | 45.1us (0%)   | 1       |
| │ ├──1        | 28.7us (0%)  | 76.7us (1%)  | 3.1us (0%)  | 58.4us (0%)   | 1       |
| │ ├──7        | 14.5us (0%)  | 35.9us (1%)  | 3.1us (0%)  | 32.8us (0%)   | 1       |
| │ └──8        | 20.7us (0%)  | 132.3us (2%) | 3.1us (0%)  | 791.6us (6%)  | 1       |
| ├──classifier | 144.0us (2%) | 2.9ms (47%)  | 27.6us (0%) | 7.2ms (57%)   | 1       |
| │ ├──2        | 16.0us (0%)  | 39.8us (1%)  | 4.1us (0%)  | 16.4us (0%)   | 1       |
| │ ├──1        | 62.7us (1%)  | 2.3ms (37%)  | 4.1us (0%)  | 4.7ms (37%)   | 1       |
| │ ├──6        | 26.8us (0%)  | 76.0us (1%)  | 4.1us (0%)  | 503.8us (4%)  | 1       |
| │ ├──0        | 35.9us (1%)  | 102.4us (2%) | 4.1us (0%)  | 22.5us (0%)   | 1       |
| │ ├──4        | 28.7us (0%)  | 81.9us (1%)  | 3.1us (0%)  | 1.9ms (15%)   | 1       |
| │ ├──5        | 14.4us (0%)  | 35.9us (1%)  | 3.1us (0%)  | 15.4us (0%)   | 1       |
| │ └──3        | 20.1us (0%)  | 60.8us (1%)  | 3.1us (0%)  | 17.4us (0%)   | 1       |
| └──avgpool    | 38.5us (1%)  | 79.9us (1%)  | 4.1us (0%)  | 67.6us (1%)   | 1       |
| aten::zeros   | 9.7us (0%)   | 29.0us (0%)  | 11.6us (0%) | 28.5us (0%)   | 1       |
+---------------+--------------+--------------+-------------+---------------+---------+
This method of profiling does not work inside a JIT-compiled module, i.e. the submodules of a module scripted with torch.jit.script are not displayed in the profile breakdown. This is probably because the forward methods are not “late bound”, so we can't wrap them on the scripted modules and have the wrapped versions be invoked.
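The late-binding explanation above can be pictured with a plain-Python analogy. `Layer`, `FrozenLayer`, and `wrap_forward` are hypothetical names; the sketch does not model TorchScript internals and only shows why a wrapper installed on an instance is skipped when the caller already holds an earlier reference to `forward`.

```python
import functools


class Layer:
    def forward(self, x):
        return x + 1

    def __call__(self, x):
        # Late bound: `forward` is looked up on the instance at call time.
        return self.forward(x)


class FrozenLayer(Layer):
    """Stand-in for a compiled module: captures forward once, up front."""

    def __init__(self):
        self._compiled = self.forward  # early-bound reference

    def __call__(self, x):
        return self._compiled(x)


calls = []  # names of layers whose wrapped forward actually ran


def wrap_forward(layer, name):
    """Replace layer.forward with a version that records each call."""
    original = layer.forward

    @functools.wraps(original)
    def wrapped(x):
        calls.append(name)
        return original(x)

    layer.forward = wrapped  # instance attribute shadows the class method


eager, frozen = Layer(), FrozenLayer()
wrap_forward(eager, "eager")
wrap_forward(frozen, "frozen")
eager(1)
frozen(1)
```

Only the eager layer shows up in `calls`: the frozen layer still computes the right result, but it never goes through the wrapper, so a profiler built on wrapping would not see it.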
- [X] Fix up tests (replaced with demos)
- [X] Add indentation coloring in the table using rich
- [X] Merge region profiler stuff into here (but be careful: region_profiler might be used for memory profiling)
- [X] Add a flag to no-op tp.region, tp.func etc.
- [X] Add iterator annotation (@func() on next())
- [ ] Add tp.genfunc to wrap the iterable returned from generator as well
- [ ] Add memory profiling (PyTorch already has tensor size, shape, and code location info)
- [ ] See Kineto orphan events bug: pytorch/pytorch#54267
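The iterator annotation item in the list above (annotating each `next()` call) can be sketched with the standard library alone. `TimedIterator` is a hypothetical name, not part of torchprof, and plain wall-clock timing stands in for a profiler region or NVTX range.

```python
import time


class TimedIterator:
    """Wrap an iterable and record how long each next() call takes."""

    def __init__(self, iterable, name):
        self._it = iter(iterable)
        self.name = name
        self.durations = []  # seconds per successful next()

    def __iter__(self):
        return self

    def __next__(self):
        start = time.perf_counter()
        item = next(self._it)  # StopIteration propagates and is not timed
        self.durations.append(time.perf_counter() - start)
        return item


def batches():
    # Stand-in for a data loader that yields batches.
    for i in range(3):
        yield [i, i + 1]


timed = TimedIterator(batches(), "loader")
items = list(timed)
```

Wrapping the iterator rather than the loop body isolates the time spent producing each item, which is usually the quantity of interest when a data loader is the suspected bottleneck.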