This minor release introduces an optimized video_reader backend for torchvision. It is implemented in C++, and uses FFmpeg internally.

The new video_reader backend can be up to 6 times faster compared to the pyav backend.

When decoding all video/audio frames in a video, the new video_reader is 1.2x to 6x faster, depending on the codec and video length.
When decoding a fixed number of video frames (e.g. 4, 8, 16, 32, 64 or 128), video_reader is about as fast as pyav for small values (4, 8, 16) and up to 3x faster for large values (32, 64, 128).

Using the optimized video backend

Switch to the new backend by calling the torchvision.set_video_backend('video_reader') function. By default, torchvision uses a backend built on top of PyAV.

Due to packaging issues with FFmpeg, the video_reader backend is not included in the prebuilt packages. To use it, you first need to have ffmpeg available on the system, and then compile torchvision from source following the instructions at https://github.com/pytorch/vision#installation

Deprecations

In torchvision 0.4.0, the read_video and read_video_timestamps functions used pts relative to the video stream. This could lead to unaligned video-audio being returned in some cases.

torchvision now accepts a pts_unit argument in those functions. The default value is 'pts' (same behavior as before), and the user can now specify pts_unit='sec', which produces consistently aligned results for both video and audio. The 'pts' value is now deprecated and is kept only for backwards-compatibility.

In the next release, the default value of pts_unit will change to 'sec', so that calling read_video without specifying pts_unit returns consistently aligned audio-video results. This will require users to update their VideoClips checkpoints, which used to store the information in pts by default.
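To see why seconds are a safer unit than stream-relative pts, recall that a pts value only becomes a time once it is multiplied by its stream's time base. The sketch below uses pure Python and two hypothetical time bases (1/90000 for video, 1/44100 for audio). These numbers are illustrative assumptions, not values read from torchvision or from any particular file.

```python
from fractions import Fraction

# Hypothetical time bases (seconds per pts tick), chosen for illustration.
VIDEO_TIME_BASE = Fraction(1, 90000)
AUDIO_TIME_BASE = Fraction(1, 44100)


def pts_to_seconds(pts, time_base):
    """Convert a stream-relative pts value to seconds."""
    return pts * time_base


def seconds_to_pts(seconds, time_base):
    """Convert seconds to a pts value in the given stream's units."""
    return int(Fraction(seconds) / time_base)


video_pts = 180000
seconds = pts_to_seconds(video_pts, VIDEO_TIME_BASE)
audio_pts = seconds_to_pts(seconds, AUDIO_TIME_BASE)

# The same instant has different pts values in each stream, so a range
# expressed in video pts does not select the matching audio samples.
print(seconds, video_pts, audio_pts)
```

With pts_unit='sec', the start and end of the requested range are expressed in a unit shared by both streams, which is why the returned audio and video line up.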

Changelog

[video reader] inception commit (#1303) 31fad34
Expose frame-rate and cache to video datasets (#1356) 85ffd93
Expose num_workers in VideoClips (#1359) 02a8c0a
Fix randomresized params flaky (#1282) 7c9bbf5
Video transforms (#1353) 64917bc
add _backend argument to init() of class VideoClips (#1363) 7874374
Video clips workers (#1369) 0982395
modified code of io.read_video and io.read_video_timestamps to intepret pts values in seconds (#1331) 17e355f
add metadata to video dataset classes. bug fix. more robustness (#1376) 49b01e3
move sampler into TV core. Update UniformClipSampler (#1408) f0d3daa
remove hardcoded video extension in kinetics400 dataset (#1418) 929c81d
Fix hmdb51 and ucf101 typo (#1420) b13931a
fix a bug related to audio_end_pts (#1431) 1258bb7
expose more io api (#1423) e48b958
Make video transforms private (#1429) 79daca1
extend video reader to support fast video probing (#1437) ed5b2dc
Better handle corrupted videos (#1463) da89dad
Temporary fix to remove ffmpeg from build time (#1475) ed04dee
fix a bug when video decoding fails and empty frames are returned (#1506) 2804c12
extend DistributedSampler to support group_size (#1512) 355e9d2
Unify video backend (#1514) 97b53f9
Unify video metadata in VideoClips (#1527) 7d509c5
Fixed compute_clips docstring (#1543) b438d32
