Basics of media containers, muxers and codecs

Alexander Zakusylo edited this page Dec 24, 2019 · 1 revision

Containers, media streams and packets.

A container is a file (or in-memory storage) in one of the standard video/audio formats. The format is usually indicated by the file extension, for example mkv, mov, avi, mp4, mp3 or wav.

A container consists of media streams. Information about the streams is usually stored in the container header (or in the trailer).

A media stream holds a sequence of data packets representing some media; the most used media stream types are video, audio or subtitles.

The body of a container is an interleaved sequence of media stream packets; packets are normally sorted more-or-less in time order.

Example container composition:

[Header. Streams information:  1.video 2.audio]
[packet for stream 1]
[packet for stream 2]
[packet for stream 2]
[packet for stream 2]
[packet for stream 1]
[packet for stream 2]
....
[Trailer]
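
The layout above can be modelled with a few plain data structures. The following is a minimal sketch and does not follow any real container format: the names Packet, Container and interleave are hypothetical, and the timestamps are made-up values in milliseconds.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Packet:
    stream_index: int  # which stream the packet belongs to
    timestamp_ms: int  # time position, used only for ordering here
    data: bytes        # codec-compressed payload (placeholder bytes)


@dataclass
class Container:
    streams: dict                                # header: stream index -> type
    packets: list = field(default_factory=list)  # body: interleaved packets


def interleave(*per_stream_packets):
    """Merge per-stream packet lists (each already in time order)
    into a single time-ordered sequence."""
    return list(heapq.merge(*per_stream_packets, key=lambda p: p.timestamp_ms))


# Hypothetical input: two video packets and four shorter audio packets.
video = [Packet(1, t, b"v") for t in (0, 40)]
audio = [Packet(2, t, b"a") for t in (0, 13, 26, 40)]

container = Container(streams={1: "video", 2: "audio"},
                      packets=interleave(video, audio))
order = [p.stream_index for p in container.packets]
print(order)
```

With these timestamps the stream order of the merged packets is the same as in the example composition above: 1, 2, 2, 2, 1, 2.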

Codecs

A packet contains data compressed by a codec. Usually, one video packet corresponds to one encoded video frame; one audio packet (also sometimes called a frame) corresponds to some short duration of the audio.

The body of a data packet is defined by the codec specification. A codec is a combination of an encoder and a decoder: encoders compress raw data into packets, and decoders restore raw audio/video from packets.
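
The encoder/decoder pairing can be illustrated with a toy round trip. This is only a sketch: zlib from the Python standard library stands in for a media codec, and the class names ToyEncoder and ToyDecoder are hypothetical. Unlike zlib, most real audio/video codecs are lossy, so the decoded data is only an approximation of the input.

```python
import zlib


class ToyEncoder:
    """Compresses raw bytes into a packet payload (zlib is a stand-in codec)."""

    def encode(self, raw: bytes) -> bytes:
        return zlib.compress(raw)


class ToyDecoder:
    """Restores the raw bytes from a packet payload."""

    def decode(self, payload: bytes) -> bytes:
        return zlib.decompress(payload)


# A hypothetical raw frame: 640x480 pixels, 8-bit grayscale, all black.
raw_frame = bytes(640 * 480)

packet_payload = ToyEncoder().encode(raw_frame)
restored = ToyDecoder().decode(packet_payload)

print(len(raw_frame), len(packet_payload), restored == raw_frame)
```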

Muxing/demuxing and encoding/decoding

The process of extracting media packets from a container is called demuxing. A demuxer reads the container's header, finds the media stream information, and determines which decoder is needed for each stream. Then the demuxer reads the media packets and passes them to the corresponding decoders. The video decoder restores frame images from the video packet data; the audio decoder restores playable raw sound samples from the audio packet data.
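
The demuxing steps described above can be sketched as follows. Everything here is an assumption made for illustration: the container is a plain dictionary, and decode_video and decode_audio are stand-in functions that only label the payload instead of decoding it.

```python
from collections import defaultdict

# Hypothetical in-memory container: a header plus interleaved packets,
# each packet being a (stream index, payload) pair.
container = {
    "header": {1: "video", 2: "audio"},
    "packets": [(1, b"V0"), (2, b"A0"), (2, b"A1"), (1, b"V1")],
}


def decode_video(payload: bytes) -> str:
    # Stand-in for a video decoder: a real one would return a frame image.
    return "frame:" + payload.decode()


def decode_audio(payload: bytes) -> str:
    # Stand-in for an audio decoder: a real one would return raw samples.
    return "samples:" + payload.decode()


DECODERS = {"video": decode_video, "audio": decode_audio}


def demux_and_decode(container: dict) -> dict:
    # Step 1: read the header and pick a decoder for each stream.
    decoders = {index: DECODERS[kind]
                for index, kind in container["header"].items()}
    # Step 2: read the packets and pass each one to its stream's decoder.
    output = defaultdict(list)
    for stream_index, payload in container["packets"]:
        output[stream_index].append(decoders[stream_index](payload))
    return dict(output)


decoded = demux_and_decode(container)
print(decoded)
```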

To create a media container from video/audio data, we first initialize a muxer for a specific file format. Then we initialize the media streams for that muxer; each stream is accompanied by an appropriate encoder. After that, video frames and audio samples are sent to the stream encoders; the encoders form media packets, and the muxer appends the packets to the container.
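
The muxing flow can be sketched in the same toy style. ToyMuxer and its methods add_stream, write and finish are hypothetical names, not the API of any real library, and zlib.compress again stands in for a real encoder.

```python
import zlib


class ToyMuxer:
    """A toy muxer: keeps stream information and appends encoded packets."""

    def __init__(self, format_name: str):
        self.format_name = format_name
        self.streams = {}  # stream index -> (stream type, encoder function)
        self.packets = []  # container body: (index, timestamp, payload)

    def add_stream(self, kind: str, encoder) -> int:
        index = len(self.streams) + 1
        self.streams[index] = (kind, encoder)
        return index

    def write(self, index: int, timestamp_ms: int, raw: bytes) -> None:
        _, encoder = self.streams[index]
        # The stream's encoder forms the packet; the muxer appends it.
        self.packets.append((index, timestamp_ms, encoder(raw)))

    def finish(self) -> dict:
        header = {i: kind for i, (kind, _) in self.streams.items()}
        return {"format": self.format_name,
                "header": header,
                "packets": self.packets}


muxer = ToyMuxer("toy")                            # 1. initialize the muxer
video = muxer.add_stream("video", zlib.compress)   # 2. streams with encoders
audio = muxer.add_stream("audio", zlib.compress)

muxer.write(video, 0, b"frame-0")                  # 3. send raw data
muxer.write(audio, 0, b"samples-0")
muxer.write(audio, 20, b"samples-1")

container = muxer.finish()
print(container["header"], len(container["packets"]))
```

In this sketch the caller is responsible for writing packets in time order; the toy muxer does not reorder them.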