
Optimized video reader backend

@fmassa fmassa released this 07 Nov 16:33
efb0b26

This minor release introduces an optimized video_reader backend for torchvision. It is implemented in C++, and uses FFmpeg internally.

The new video_reader backend can be up to 6 times faster than the pyav backend.

  • When decoding all video/audio frames in the video, the new video_reader is 1.2x - 6x faster, depending on the codec and video length.
  • When decoding a fixed number of video frames (e.g. [4, 8, 16, 32, 64, 128]), video_reader is about as fast as pyav for small values (i.e. [4, 8, 16]) and up to 3x faster for large values (e.g. [32, 64, 128]).

Using the optimized video backend

To switch to the new backend, call torchvision.set_video_backend('video_reader'). By default, torchvision uses a backend built on top of PyAV.

Due to packaging issues with FFmpeg, the video_reader backend is not included in the prebuilt packages. To use it, you first need to have ffmpeg available on the system, and then compile torchvision from source following the instructions at https://github.com/pytorch/vision#installation
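
As a rough sketch of the switch described above, the snippet below requests the video_reader backend and then reads back which backend is actually active. The helper name choose_video_backend is hypothetical and not part of torchvision. Depending on the torchvision version, requesting a backend that was not compiled in may only emit a warning and leave the previous backend selected, so the sketch checks the result rather than assuming the switch succeeded.

```python
def choose_video_backend(preferred="video_reader"):
    """Hypothetical helper: request `preferred` and return the active backend.

    Returns None if torchvision is not installed or if this build does not
    expose the video backend functions.
    """
    try:
        import torchvision
    except ImportError:
        return None

    set_backend = getattr(torchvision, "set_video_backend", None)
    get_backend = getattr(torchvision, "get_video_backend", None)
    if set_backend is None or get_backend is None:
        return None

    # Ask for the optimized backend; some builds only warn if it is missing.
    set_backend(preferred)
    # Read back what torchvision is really using.
    return get_backend()


print("active video backend:", choose_video_backend())
```

If this prints pyav after requesting video_reader, the installed torchvision was most likely not compiled with FFmpeg support.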

Deprecations

In torchvision 0.4.0, the read_video and read_video_timestamps functions used pts relative to the video stream. This could lead to unaligned video-audio being returned in some cases.
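
One way to picture the misalignment is with a little arithmetic. A pts is an integer count of time-base ticks, and in FFmpeg-style containers each stream carries its own time base. The time bases and pts values below are made-up illustrative numbers, not taken from torchvision, and pts_to_seconds is a hypothetical helper.

```python
from fractions import Fraction


def pts_to_seconds(pts, time_base):
    """Convert an integer pts to seconds: seconds = pts * time_base."""
    return float(pts * time_base)


# Illustrative (assumed) time bases for one video and one audio stream.
video_time_base = Fraction(1, 15360)
audio_time_base = Fraction(1, 44100)

# Both of these pts values point at the 2-second mark of their own stream.
video_pts = 30720
audio_pts = 88200
video_sec = pts_to_seconds(video_pts, video_time_base)
audio_sec = pts_to_seconds(audio_pts, audio_time_base)

# Reusing the video pts against the audio time base lands somewhere else.
misread_sec = pts_to_seconds(video_pts, audio_time_base)

print(video_sec, audio_sec, misread_sec)
```

The same integer means different instants in different streams, whereas a value in seconds means the same instant in both.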

torchvision now lets you specify a pts_unit argument in those functions. The default value is 'pts' (same behavior as before), and you can now pass pts_unit='sec', which produces consistently aligned results for both video and audio. The 'pts' value is now deprecated and is kept only for backward compatibility.
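
A minimal sketch of opting in to the new unit, assuming a torchvision version that accepts pts_unit in torchvision.io.read_video. The helper name read_clip_in_seconds and the file name in the usage comment are hypothetical.

```python
def read_clip_in_seconds(path, start_sec=0.0, end_sec=None):
    """Hypothetical helper: read a clip with start/end given in seconds."""
    # Imported lazily so that defining the helper does not require torchvision.
    from torchvision.io import read_video

    # With pts_unit="sec", start_pts and end_pts are interpreted as seconds.
    video, audio, info = read_video(
        path, start_pts=start_sec, end_pts=end_sec, pts_unit="sec"
    )
    return video, audio, info


# Example usage (needs a real video file on disk):
# video, audio, info = read_clip_in_seconds("clip.mp4", start_sec=1.0, end_sec=3.0)
```

Passing pts_unit explicitly also keeps the call's behavior stable when the default changes.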

In the next release, the default value of pts_unit will change to 'sec', so that calling read_video without specifying pts_unit returns consistently aligned audio-video results. This will require users to update their VideoClips checkpoints, which used to store the information in pts by default.

Changelog