Here is a (rough and incomplete) proposal for doing adaptive streaming using open video formats. Key components of the proposal:
- Videos are served as separate, small chunks.
- Accompanying manifest files provide metadata.
- The user-agent parses manifests and switches between stream levels.
- An API provides QOS metrics and enables custom switching logic.
Today, most video on the internet is delivered as progressive download (e.g. YouTube). While this works fine in most cases, there are limitations when it comes to more advanced uses of video:
- Long-form video (long downloads, waste of bandwidth if user doesn't watch)
- Live/DVR video (hard to do as progressive download, unstable)
- Delivery to mobile devices (lots of buffering due to changing network conditions)
Adaptive streaming aims to solve these issues by:
- Offering multiple versions of a video, at different bitrates / quality levels (e.g. from 100 kbps to 2 Mbps).
- Transporting the video not as one big file, but as separate, distinct chunks (e.g. by cutting up the video in small files, or by using range-requests).
- Allowing user-agents to seamlessly switch between quality levels (e.g. based upon changing device or network conditions), simply by downloading the next chunk from a different level.
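The switching logic in the last bullet can be sketched as a simple heuristic: pick the highest quality level whose bitrate fits within the measured throughput. This is an illustrative sketch, not part of the proposal; the level bitrates and safety margin are invented for the example.

```python
# Hypothetical user-agent heuristic: choose the best quality level for the
# currently measured bandwidth, keeping a safety margin for fluctuations.

LEVELS_BPS = [100_000, 500_000, 2_000_000]  # example levels: 100 kbps .. 2 Mbps
SAFETY = 0.8  # only assume 80% of measured bandwidth is sustainable

def pick_level(measured_bps):
    """Return the index of the highest level that fits the usable bandwidth."""
    usable = measured_bps * SAFETY
    best = 0
    for i, rate in enumerate(LEVELS_BPS):
        if rate <= usable:
            best = i
    return best

print(pick_level(700_000))  # enough for 500 kbps, not for 2 Mbps -> 1
```

The next chunk would then simply be requested from level `pick_level(...)`, which is what makes switching seamless.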
There are currently three widely used implementations of adaptive HTTP streaming:
- Microsoft Smooth Streaming, used by Silverlight.
- Adobe HTTP Dynamic Streaming, used by Flash.
- Apple HTTP Live Streaming, used by QuickTime X.
There are, of course, still dedicated streaming protocols (e.g. RTSP over UDP). Adaptive HTTP streaming does not aim to replace those, as various cases are much better served with a real streaming approach. However, adaptive HTTP streaming could become the approach for the majority of online video delivery, for many of the same reasons that made progressive HTTP so popular:
- It is easy to understand and implement.
- It builds upon existing HTTP infrastructure.
- It centralizes all intelligence (and control) in the client.
Every chunk should be a valid video file (header, video track, audio track). Every chunk should also contain at least one keyframe (at the start). This implies every single chunk can be played back by itself.
Beyond validity, the amount of metadata should be kept as small as possible (single-digit kbps overhead).
Codec parameters that can vary between the different quality levels of an adaptive stream are:
- The datarate, dimensions (pixel+display) and framerate of the video track.
- The datarate, number of channels and sample frequency of the audio track.
In order for quality level switches to occur without artifacts, the start positions of all chunks should align between the various quality levels. If this isn't the case, user-agents will display artifacts (ticks, skips, black) when a quality level switch occurs. Syncing should not be a requirement though. This allows legacy content to be used for adaptive streaming with little effort (e.g. remuxing or using a smart server) and few issues (in practice, most keyframes are aligned between different transcodes of a video).
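A user-agent (or packaging tool) could verify this alignment with a check along the following lines; this helper and its tolerance value are hypothetical, not part of the proposal.

```python
# Hypothetical helper: check whether chunk start times (in seconds) align
# across quality levels, which is what allows artifact-free switching.

def starts_aligned(levels, tolerance=0.001):
    """levels: one list of chunk start times per quality level."""
    reference = levels[0]
    for other in levels[1:]:
        if len(other) != len(reference):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(reference, other)):
            return False
    return True

low = [0.0, 10.0, 20.0]
high = [0.0, 10.0, 20.0]
print(starts_aligned([low, high]))  # True
```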
In its most low-tech form, chunks can be stored as separate files on disk on a webserver. This poses issues around transcoding (no ecosystem yet) and file management (not everybody loves hundreds of files). There are at least two solutions:
- A serverside module accepts chunk requests, pulls the correct GOP(s) from a video and wraps the necessary metadata.
- Clients can accept and process video without headers and use HTTP range-requests to directly get the video data.
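The second approach can be sketched as follows: the client keeps a byte-offset index into one big file per quality level and fetches each chunk with an HTTP range request. The offsets below are invented for illustration; a real index would come from the file's metadata.

```python
# Hypothetical byte-offset index for the chunks of one quality level.
CHUNK_OFFSETS = [0, 1_250_000, 2_480_000, 3_700_000]

def range_header(chunk_index):
    """Build the HTTP Range header value for a chunk (last chunk: open-ended)."""
    start = CHUNK_OFFSETS[chunk_index]
    if chunk_index + 1 < len(CHUNK_OFFSETS):
        end = CHUNK_OFFSETS[chunk_index + 1] - 1  # Range end is inclusive
        return f"bytes={start}-{end}"
    return f"bytes={start}-"

print(range_header(1))  # bytes=1250000-2479999
```

The resulting value would be sent as `Range: bytes=...` with a regular GET request, so no serverside module is needed.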
The M3U8 manifest format that Apple specified is adopted. Generally, both an overall manifest (linking to the various quality levels) and a quality level manifest (linking to the various chunks) are used. (Though, especially for live streaming, a single quality level may be used.)
Here's an example of such an overall manifest. It specifies three quality levels, each with its own datarate, codecs and dimensions:
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1280000,CODECS="vp8,vorbis",RESOLUTION=240x135
http://example.com/low.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2560000,CODECS="vp8,vorbis",RESOLUTION=640x360
http://example.com/mid.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=7680000,CODECS="vp8,vorbis",RESOLUTION=1280x720
http://example.com/hi.m3u8
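A user-agent parses this by pairing each #EXT-X-STREAM-INF line with the URL on the following line. A minimal sketch (not a complete M3U8 parser) that extracts the BANDWIDTH attribute could look like this:

```python
import re

# Two levels of the example manifest above, inlined for the sketch.
MANIFEST = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1280000,CODECS="vp8,vorbis",RESOLUTION=240x135
http://example.com/low.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2560000,CODECS="vp8,vorbis",RESOLUTION=640x360
http://example.com/mid.m3u8
"""

def parse_streams(text):
    """Pair each #EXT-X-STREAM-INF line with the URL on the next line."""
    streams = []
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF:"):
            bw = int(re.search(r"BANDWIDTH=(\d+)", line).group(1))
            streams.append({"bandwidth": bw, "url": lines[i + 1]})
    return streams

for s in parse_streams(MANIFEST):
    print(s["bandwidth"], s["url"])
```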
Here's an example manifest for one such quality level. It contains a full URL listing of all chunks for this quality level:
#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
http://media.example.com/segment1.webm
#EXTINF:10,
http://media.example.com/segment2.webm
#EXTINF:10,
http://media.example.com/segment3.webm
#EXT-X-ENDLIST
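From the #EXTINF durations, a client can derive the total duration of the stream (e.g. for the seek bar) without fetching any media. A small sketch of that computation:

```python
# The example quality-level manifest above, inlined for the sketch.
PLAYLIST = """#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
http://media.example.com/segment1.webm
#EXTINF:10,
http://media.example.com/segment2.webm
#EXTINF:10,
http://media.example.com/segment3.webm
#EXT-X-ENDLIST
"""

def total_duration(text):
    """Sum the #EXTINF chunk durations (in seconds)."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#EXTINF:"):
            total += float(line[len("#EXTINF:"):].split(",")[0])
    return total

print(total_duration(PLAYLIST))  # 30.0
```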
The video framerate, audio sample frequency and number of audio channels cannot be listed here according to the specs. In formats like WebM, Ogg and MPEG-TS (the container Apple specifies), this can be retrieved during demuxing.
The #EXT-X-ENDLIST tag defines the end of a video. If this tag is present, the manifest is supposed to be fixed and the client will not re-load it during playback. If this tag is not present (generally only during live streams), the client should periodically re-fetch the manifest to get additional chunks.
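The reload rule can be expressed directly: re-fetch the manifest only while #EXT-X-ENDLIST is absent. In this sketch, `fetch` is a stand-in for an actual HTTP request, not a real API.

```python
def is_live(manifest_text):
    """A manifest without #EXT-X-ENDLIST is still growing (live stream)."""
    return "#EXT-X-ENDLIST" not in manifest_text

def maybe_reload(manifest_text, fetch):
    """Re-fetch the manifest if the stream is still live; else keep it."""
    if is_live(manifest_text):
        return fetch()
    return manifest_text

vod = "#EXTM3U\n#EXTINF:10,\nseg1.webm\n#EXT-X-ENDLIST\n"
print(is_live(vod))  # False
```

In a real client, `maybe_reload` would run on a timer (e.g. once per target duration) during live playback.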
The M3U8 playlist format also provides a mechanism for stream interruptions (#EXT-X-DISCONTINUITY) and for encryption (#EXT-X-KEY). Moreover, regular ID3 tags can be used to enrich the manifest with metadata.
The root manifest serves as the single, unique reference point for an adaptive stream. Therefore, user agents in theory need only its URL to play back the stream. Here's an example of loading a root manifest through the *src* attribute of the <video> tag in an HTML page:
<video width="480" height="270" src="http://example.com/video.m3u8">
  <a href="http://example.com/video_low.webm">Download the video</a>
</video>

In this variation, the manifest is loaded through the