Optimized video reader backend
This minor release introduces an optimized video_reader backend for torchvision. It is implemented in C++ and uses FFmpeg internally.
The new video_reader backend can be up to 6 times faster than the pyav backend.
- When decoding all video/audio frames in a video, the new video_reader is 1.2x to 6x faster, depending on the codec and video length.
- When decoding a fixed number of video frames (e.g. [4, 8, 16, 32, 64, 128]), video_reader runs equally fast for small values (i.e. [4, 8, 16]) and up to 3x faster for large values (i.e. [32, 64, 128]).
Using the optimized video backend
To switch to the new backend, call the torchvision.set_video_backend('video_reader') function. By default, torchvision uses a backend built on top of PyAV.
Due to packaging issues with FFmpeg, the video_reader backend is only available if you first have ffmpeg installed on the system and then compile torchvision from source, following the instructions at https://github.com/pytorch/vision#installation
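Because the video_reader backend depends on a source build, it can be useful to confirm which backend is actually active after requesting it. The following is a minimal sketch, not part of torchvision: request_video_backend is a hypothetical helper name, and the tv parameter exists only so the helper can be exercised with a stand-in object. It relies on torchvision.set_video_backend and torchvision.get_video_backend; how set_video_backend reacts when video_reader is unavailable may differ between torchvision versions, so the helper simply compares the requested and reported names.

```python
import warnings


def request_video_backend(name="video_reader", tv=None):
    """Request a video backend and return the name torchvision reports as active.

    Hypothetical helper for illustration; `tv` defaults to the installed
    torchvision module and can be replaced by a stand-in for testing.
    """
    if tv is None:
        # Imported lazily so defining the helper does not require torchvision.
        import torchvision as tv

    tv.set_video_backend(name)
    active = tv.get_video_backend()

    if active != name:
        # Assumption: a build without video_reader keeps the previous backend.
        warnings.warn(
            "requested video backend %r but %r is active" % (name, active)
        )
    return active
```

With torchvision installed, calling request_video_backend() with no arguments requests 'video_reader' and returns the backend that ended up selected.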
Deprecations
In torchvision 0.4.0, the read_video and read_video_timestamps functions used pts relative to the video stream. This could lead to unaligned video and audio being returned in some cases.
torchvision now lets you specify a pts_unit argument in those functions. The default value is 'pts' (with the same behavior as before), and the user can now specify pts_unit='sec', which produces consistently aligned results for both video and audio. The 'pts' value is now deprecated and is kept only for backwards compatibility.
In the next release, the default value of pts_unit will change to 'sec', so that calling read_video without specifying pts_unit returns consistently aligned audio-video results. This will require users to update their VideoClips checkpoints, which used to store the information in pts by default.
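To illustrate why seconds are easier to align than raw pts values: in FFmpeg, each stream expresses its pts in units of that stream's own time base, so a timestamp in seconds is pts multiplied by time_base. The sketch below uses only the Python standard library. The time bases (1/15360 for video, 1/44100 for audio) and the pts values are made-up numbers chosen for illustration, and pts_to_seconds is a hypothetical helper, not a torchvision function.

```python
from fractions import Fraction


def pts_to_seconds(pts, time_base):
    """Convert a presentation timestamp to seconds: seconds = pts * time_base."""
    return pts * time_base


# Illustrative (assumed) time bases; real files report their own values.
video_time_base = Fraction(1, 15360)
audio_time_base = Fraction(1, 44100)

# Different raw pts values in each stream can refer to the same instant.
video_pts = 30720
audio_pts = 88200

print(float(pts_to_seconds(video_pts, video_time_base)))  # 2.0
print(float(pts_to_seconds(audio_pts, audio_time_base)))  # 2.0

# Reusing the video pts against the audio time base lands on a different
# instant, which is the kind of mismatch that seconds avoid.
mismatched = pts_to_seconds(video_pts, audio_time_base)
print(mismatched == pts_to_seconds(video_pts, video_time_base))  # False
```

Expressed in seconds, both streams share one timeline, which is consistent with pts_unit='sec' giving aligned video and audio.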
Changelog
- [video reader] inception commit (#1303) 31fad34
- Expose frame-rate and cache to video datasets (#1356) 85ffd93
- Expose num_workers in VideoClips (#1359) 02a8c0a
- Fix randomresized params flaky (#1282) 7c9bbf5
- Video transforms (#1353) 64917bc
- add _backend argument to init() of class VideoClips (#1363) 7874374
- Video clips workers (#1369) 0982395
- modified code of io.read_video and io.read_video_timestamps to interpret pts values in seconds (#1331) 17e355f
- add metadata to video dataset classes. bug fix. more robustness (#1376) 49b01e3
- move sampler into TV core. Update UniformClipSampler (#1408) f0d3daa
- remove hardcoded video extension in kinetics400 dataset (#1418) 929c81d
- Fix hmdb51 and ucf101 typo (#1420) b13931a
- fix a bug related to audio_end_pts (#1431) 1258bb7
- expose more io api (#1423) e48b958
- Make video transforms private (#1429) 79daca1
- extend video reader to support fast video probing (#1437) ed5b2dc
- Better handle corrupted videos (#1463) da89dad
- Temporary fix to remove ffmpeg from build time (#1475) ed04dee
- fix a bug when video decoding fails and empty frames are returned (#1506) 2804c12
- extend DistributedSampler to support group_size (#1512) 355e9d2
- Unify video backend (#1514) 97b53f9
- Unify video metadata in VideoClips (#1527) 7d509c5
- Fixed compute_clips docstring (#1543) b438d32