Adaptive Streaming

This document is obsolete.

For the current specification, see: Media Source Extensions

Here is a (rough and incomplete) proposal for doing adaptive streaming using open video formats. Key components of the proposal:

Videos are served as separate, small chunks.
Accompanying manifest files provide metadata.
The user-agent parses manifests and switches between stream levels.
An API provides QOS metrics and enables custom switching logic.

Introduction

Today, most video on the internet is delivered as progressive download (e.g. YouTube). While this works fine in most cases, it has limitations when it comes to more advanced uses of video:

Long-form video (long downloads, waste of bandwidth if user doesn't watch)
Live/DVR video (hard to do as progressive download, unstable)
Delivery to mobile devices (lots of buffering due to changing network conditions)

Adaptive streaming aims to solve these issues by:

Offering multiple versions of a video, at different bitrates / quality levels (e.g. from 100 kbps to 2 Mbps).
Transporting the video not as one big file, but as separate, distinct chunks (e.g. by cutting up the video in small files, or by using range-requests).
Allowing user-agents to seamlessly switch between quality levels (e.g. based upon changing device or network conditions), simply by downloading the next chunk from a different level.

There are currently three widely used implementations of adaptive HTTP streaming:

Microsoft Smooth Streaming, used by Silverlight.
Adobe HTTP Dynamic Streaming, used by Flash.
Apple HTTP Live Streaming, used by QuickTime X.

There are of course still the dedicated streaming protocols (e.g. RTSP over UDP). Adaptive HTTP streaming does not aim to replace those, as there are various cases that are much better served with a real streaming approach. However, adaptive HTTP streaming could become the approach for the majority of online video delivery, for many of the same reasons that made progressive HTTP so popular:

It is easy to understand and implement.
It builds upon existing HTTP infrastructure.
It centralizes all intelligence (and control) in the client.

Note: good article on what matters in your encoding parameters for HTTP adaptive streaming

Chunks

Every chunk should be a valid video file (header, videotrack, audiotrack). Every chunk should also contain at least 1 keyframe (at the start). This implies every single chunk can be played back by itself.

Beyond validity, the amount of metadata should be kept as small as possible (single-digit kbps overhead).

Codec parameters that can vary between the different quality levels of an adaptive stream are:

The datarate, dimensions (pixel+display) and framerate of the video track.
The datarate, number of channels and sample frequency of the audio track.

In order for quality level switches to occur without artifacts, the start positions of all chunks should align between the various quality levels. If this isn't the case, user-agents will display artifacts (ticks, skips, black frames) when a quality level switch occurs. Syncing should not be a requirement though. This will allow legacy content to be used for dynamic streaming with little effort (e.g. remuxing or using a smart server) and few issues (in practice, most keyframes are aligned between different transcodes of a video).
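
As a rough illustration of the alignment check, the sketch below compares chunk start times between two quality levels. The function name findMisalignedChunks, the arrays of start times (in seconds) and the tolerance are assumptions made for this example; they are not part of the proposal.

```javascript
// Hypothetical helper: returns the indexes of chunks whose start times
// differ between two quality levels by more than a tolerance (in seconds).
function findMisalignedChunks(startsA, startsB, tolerance = 0.001) {
  const misaligned = [];
  const count = Math.min(startsA.length, startsB.length);
  for (let i = 0; i < count; i++) {
    if (Math.abs(startsA[i] - startsB[i]) > tolerance) {
      misaligned.push(i);
    }
  }
  return misaligned;
}

// Example: the third chunk of the second level starts half a second late.
const lowStarts = [0, 10, 20, 30];
const highStarts = [0, 10, 20.5, 30];
const misalignedChunks = findMisalignedChunks(lowStarts, highStarts);
```

A switch at any of the reported indexes is where a user-agent could expect visible artifacts.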

In its most low-tech form, chunks can be stored as separate files on disk on a webserver. This poses issues around transcoding (no ecosystem yet) and file management (not everybody loves hundreds of files). There are at least two solutions:

A serverside module accepts chunk requests, pulls the correct GOP(s) from a video and wraps the necessary metadata.
Clients can accept and process video without headers and use HTTP range-requests to directly get the video data.

Manifests

The M3U8 manifest format that Apple specified is adopted. Generally, both an overall manifest (linking to the various quality levels) and a quality level manifest (linking to the chunks of that level) are used. (Though, especially for live streaming, a single quality level may be used.)

Here's an example of such an overall manifest. It specifies three quality levels, each with its own datarate, codecs and dimensions:

#EXTM3U 
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1280000,CODECS="vp8,vorbis",RESOLUTION=240x135
http://example.com/low.m3u8 
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2560000,CODECS="vp8,vorbis",RESOLUTION=640x360
http://example.com/mid.m3u8 
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=7680000,CODECS="vp8,vorbis",RESOLUTION=1280x720
http://example.com/hi.m3u8
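
A minimal sketch of how a script might read such an overall manifest into a list of levels. The function names parseAttributes and parseOverallManifest and the shape of the returned objects are assumptions for illustration; only the tags and attributes shown in the example above are handled.

```javascript
// Parses an attribute list such as BANDWIDTH=1280000,CODECS="vp8,vorbis".
function parseAttributes(text) {
  const attrs = {};
  // Matches KEY=value or KEY="quoted,value"; quoted values may contain commas.
  const pattern = /([A-Z0-9-]+)=("[^"]*"|[^,]*)/g;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    attrs[match[1]] = match[2].replace(/^"|"$/g, "");
  }
  return attrs;
}

// Pairs every #EXT-X-STREAM-INF line with the URL line that follows it.
function parseOverallManifest(text) {
  const lines = text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
  const levels = [];
  let pending = null;
  for (const line of lines) {
    if (line.startsWith("#EXT-X-STREAM-INF:")) {
      pending = parseAttributes(line.slice("#EXT-X-STREAM-INF:".length));
    } else if (!line.startsWith("#") && pending) {
      // A missing RESOLUTION yields width and height of 0 in this sketch.
      const [width, height] = (pending.RESOLUTION || "x").split("x").map(Number);
      levels.push({
        bitrate: Number(pending.BANDWIDTH),
        codecs: pending.CODECS,
        width,
        height,
        url: line,
      });
      pending = null;
    }
  }
  return levels;
}

const overallManifest = [
  "#EXTM3U",
  '#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1280000,CODECS="vp8,vorbis",RESOLUTION=240x135',
  "http://example.com/low.m3u8",
  '#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2560000,CODECS="vp8,vorbis",RESOLUTION=640x360',
  "http://example.com/mid.m3u8",
  '#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=7680000,CODECS="vp8,vorbis",RESOLUTION=1280x720',
  "http://example.com/hi.m3u8",
].join("\n");

const parsedLevels = parseOverallManifest(overallManifest);
```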

Here's an example manifest for one such quality level. It contains a full URL listing of all chunks for this quality level:

#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
http://media.example.com/segment1.webm
#EXTINF:10,
http://media.example.com/segment2.webm
#EXTINF:10,
http://media.example.com/segment3.webm
#EXT-X-ENDLIST

The video framerate, audio sample frequency and number of audio channels cannot be listed here according to the specs. In formats like WebM, Ogg and MPEG-TS (the container Apple specifies), this can be retrieved during demuxing.

The #EXT-X-ENDLIST tag defines the end of a video. If this tag is present, the manifest is supposed to be fixed and the client will not re-load it during playback. If this tag is not present (generally only during live streams), the client should periodically re-fetch the manifest to get additional chunks.
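
The quality level manifest can be read in a similar way. The sketch below is one possible reading, assuming a hypothetical parseLevelManifest function; it collects the chunk URLs with their durations and records whether #EXT-X-ENDLIST was seen, which is what tells a client whether to re-fetch the manifest.

```javascript
// Hypothetical parser for a single quality level manifest.
function parseLevelManifest(text) {
  const lines = text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
  const result = { targetDuration: null, chunks: [], ended: false };
  let duration = null;
  for (const line of lines) {
    if (line.startsWith("#EXT-X-TARGETDURATION:")) {
      result.targetDuration = Number(line.split(":")[1]);
    } else if (line.startsWith("#EXTINF:")) {
      // "#EXTINF:10," gives a duration of 10; any title after the comma is ignored.
      duration = parseFloat(line.slice("#EXTINF:".length));
    } else if (line === "#EXT-X-ENDLIST") {
      result.ended = true;
    } else if (!line.startsWith("#") && duration !== null) {
      result.chunks.push({ url: line, duration });
      duration = null;
    }
  }
  return result;
}

const levelManifest = [
  "#EXTM3U",
  "#EXT-X-MEDIA-SEQUENCE:0",
  "#EXT-X-TARGETDURATION:10",
  "#EXTINF:10,",
  "http://media.example.com/segment1.webm",
  "#EXTINF:10,",
  "http://media.example.com/segment2.webm",
  "#EXTINF:10,",
  "http://media.example.com/segment3.webm",
  "#EXT-X-ENDLIST",
].join("\n");

const parsedLevel = parseLevelManifest(levelManifest);
// A live manifest has no #EXT-X-ENDLIST yet, so the client should re-fetch it.
const liveLevel = parseLevelManifest(levelManifest.replace("#EXT-X-ENDLIST", ""));
const shouldRefetch = !liveLevel.ended;
```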

The M3U8 playlist format also provides a mechanism for stream interruptions (#EXT-X-DISCONTINUITY) and for encryption (#EXT-X-KEY). Moreover, regular ID3 tags can be used to enrich the manifest with metadata.

User-agents

The root manifest serves as the single, unique reference point for an adaptive stream. Therefore, user-agents in theory need only its URL to play back the stream. Here's an example of loading a root manifest through the *src* attribute of the <video> tag in an HTML page:

<video width="480" height="270" src="http://example.com/video.m3u8">
  <a href="http://example.com/video_low.webm">Download the video</a>
</video>

In this variation, the manifest is loaded through the <source> tag, to provide fallback logic:

<video width="480" height="270" >
  <source src="http://example.com/video-webm.m3u8" type="manifest/webm">
  <source src="http://example.com/video-apple.m3u8" type="manifest/m2ts">
  <source src="http://example.com/video-plain.webm" type="video/webm">
  <a href="http://example.com/video-plain.webm">Download the video</a>
</video>

Here's another example of loading the manifest, through the *enclosure* element in an RSS feed:

<rss version="2.0">
<channel>
  <title>Example feed</title>
  <link>http://example.com/</link>
  <description>Example feed with a single adaptive stream.</description>
  <item>
    <title>Example stream</title>
    <enclosure length="1487" type="manifest/webm"
      url="http://example.com/video.m3u8" />
  </item>
</channel>
</rss>

Like the manifest parsing, the switching heuristics are up to the user-agent. They can be somewhat of a *secret sauce*. As a basic example, a user-agent can select a quality level if:

The bitrate of the level is < 90% of the server-to-client downloadRate.
The videoWidth of the level is < 120% of the video element width.
The delta in droppedFrames is < 25% of the delta in decodedFrames for this level.

Since droppedFrames are only known after a level has started playing, it is generally only a reason for switching down. Based upon the growth rate of droppedFrames, a user-agent might choose to blacklist the quality level for a certain amount of time, or discard it altogether for this playback session.

The quality level selection occurs at the start of every chunk URL fetch. Given an array of levels, the user-agent starts with the highest quality level first and then walks down the list. If the lowest-quality level does not match the criteria, the user-agent still uses it (else there would be no video).
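
The selection walk described above could look roughly like the sketch below. The selectLevel function, the metrics object and its dropRatios map (dropped-to-decoded frame ratio per level URL) are assumptions for illustration; the thresholds are the example values from the list above.

```javascript
// Hypothetical level selection: walks from the highest to the lowest
// bitrate and returns the first level that meets all three criteria.
function selectLevel(levels, metrics) {
  // Sort a copy so that the caller's array is left untouched.
  const sorted = [...levels].sort((a, b) => b.bitrate - a.bitrate);
  for (const level of sorted) {
    const fitsBandwidth = level.bitrate < 0.9 * metrics.downloadRate;
    const fitsDisplay = level.width < 1.2 * metrics.elementWidth;
    // The drop ratio is only known for levels that have played; assume 0 otherwise.
    const dropRatio = metrics.dropRatios[level.url] || 0;
    const decodesWell = dropRatio < 0.25;
    if (fitsBandwidth && fitsDisplay && decodesWell) {
      return level;
    }
  }
  // Nothing matched: still use the lowest level, else there would be no video.
  return sorted[sorted.length - 1];
}

const candidateLevels = [
  { bitrate: 100000, width: 240, url: "manifest_100.m3u8" },
  { bitrate: 500000, width: 640, url: "manifest_500.m3u8" },
];

const onFastLink = selectLevel(candidateLevels, {
  downloadRate: 800000, elementWidth: 640, dropRatios: {},
});
const onSlowLink = selectLevel(candidateLevels, {
  downloadRate: 50000, elementWidth: 640, dropRatios: {},
});
const onSlowDecoder = selectLevel(candidateLevels, {
  downloadRate: 800000, elementWidth: 640, dropRatios: { "manifest_500.m3u8": 0.4 },
});
```

Keeping the drop ratios in a separate map is one way to model the point that droppedFrames is only a reason for switching down.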

A user-agent typically tries to maintain X (3, 10, 20) seconds of video ready for decoding (buffered). If less than X seconds is available, the user-agent runs its quality level selection and requests another chunk.

There is a tie-in between the length of a chunk, the bufferLength and the speed with which a user-agent adapts to changing conditions. For example, should the bandwidth drop dramatically, 1 or 2 high-quality chunks will still be played from buffer before the first lower-quality chunk is shown. The other way around is also true: should a user go fullscreen, it will take some time until the stream switches to high quality. Lower bufferLength values increase responsiveness but also increase the possibility of buffer underruns.
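
The buffer rule and the tie-in with chunk length can be put into numbers with a small sketch. Both function names are hypothetical, and the second function is only a rough estimate that assumes all chunks have the same length.

```javascript
// True when less than the target amount of video (in seconds) is buffered,
// i.e. when the user-agent should select a level and request another chunk.
function needsAnotherChunk(bufferedSeconds, targetSeconds) {
  return bufferedSeconds < targetSeconds;
}

// Rough estimate of how many already buffered chunks are still played
// (in part or in full) before a newly selected level becomes visible.
function chunksPlayedBeforeSwitch(bufferedSeconds, chunkSeconds) {
  return Math.ceil(bufferedSeconds / chunkSeconds);
}

// With 10 second chunks: a 20 second buffer delays a switch by 2 chunks,
// a 3 second buffer by only 1 (at a higher risk of buffer underruns).
const delayWithLargeBuffer = chunksPlayedBeforeSwitch(20, 10);
const delayWithSmallBuffer = chunksPlayedBeforeSwitch(3, 10);
```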

Scripting

Certain user-agents might not offer access to adaptive streaming heuristics. Other user-agents might, or even should, do so. The obvious case is a web browser supporting the <video> element and a JavaScript engine:

QOS Metrics

The video element should provide accessors for retrieving quality of service metrics:

downloadRate: The current server-client bandwidth (read-only).
videoBitrate: The current video bitrate (read-only).
droppedFrames: The total number of frames dropped for this playback session (read-only).
decodedFrames: The total number of frames decoded for this playback session (read-only).
height: The current height of the video element (already exists).
videoHeight: The current height of the videofile (already exists).
width: The current width of the video element (already exists).
videoWidth: The current width of the videofile (already exists).
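
Since droppedFrames and decodedFrames are running totals, a script would compare two samples to get a recent drop ratio. A minimal sketch, assuming a hypothetical droppedFrameRatio helper and plain objects that hold the two counters as sampled from the video element:

```javascript
// Ratio of dropped to decoded frames between two samples of the counters.
function droppedFrameRatio(previous, current) {
  const decodedDelta = current.decodedFrames - previous.decodedFrames;
  const droppedDelta = current.droppedFrames - previous.droppedFrames;
  if (decodedDelta <= 0) {
    return 0; // Nothing was decoded in between, so there is nothing to judge.
  }
  return droppedDelta / decodedDelta;
}

// Example: 30 of the last 100 decoded frames were dropped.
const earlierSample = { droppedFrames: 10, decodedFrames: 1000 };
const laterSample = { droppedFrames: 40, decodedFrames: 1100 };
const recentDropRatio = droppedFrameRatio(earlierSample, laterSample);
const shouldSwitchDown = recentDropRatio >= 0.25;
```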

Native adaptive streaming

In case a user-agent has manifest parsing / level switching heuristics built-in, the video element can provide access to the stream levels:

currentLevel: The currently playing stream level.
levels: An array of all stream levels (as parsed from the manifests). Example:

[{
  bitrate: 100000,
  codecs: 'vp8,vorbis',
  duration: 132,
  height: 180,
  url: 'manifest_100.m3u8',
  width: 240
},{
  bitrate: 500000,
  codecs: 'vp8,vorbis',
  duration: 132,
  height: 360,
  url: 'manifest_500.m3u8',
  width: 640
}]

In addition to this, the video element provides an event to notify scripts of changes in the current stream level:

levelChange: the currentLevel attribute has just been updated.

Last, the video element provides functionality to override the user agent's built-in heuristics:

setLevel(level): This method forces the user-agent to switch to another stream level. Invoking this method disables a user-agent's adaptive streaming heuristics. Use *setLevel(-1)* to enable heuristics again.
bufferLength: This attribute controls how much videodata (in seconds) a user-agent should strive to keep buffered.

An important example for bufferLength: a website owner might set this to a very high value to enable viewers on a low bandwidth to wait for buffering and still see a high-quality video.
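
Because no user-agent implements this API, the sketch below uses a mock object to show how a script might use it. The MockAdaptiveVideo class, its heuristicsEnabled flag and the reading of level as an index into the levels array are assumptions for illustration only.

```javascript
// Mock of the proposed video element API, for illustration only.
class MockAdaptiveVideo {
  constructor(levels) {
    this.levels = levels;
    this.currentLevel = 0;
    this.bufferLength = 10;
    this.heuristicsEnabled = true; // Internal to the mock, not in the proposal.
    this.listeners = [];
  }

  addEventListener(type, listener) {
    if (type === "levelChange") {
      this.listeners.push(listener);
    }
  }

  setLevel(level) {
    if (level === -1) {
      this.heuristicsEnabled = true; // Hand control back to the user-agent.
      return;
    }
    this.heuristicsEnabled = false;
    this.currentLevel = level;
    this.listeners.forEach((listener) => listener({ type: "levelChange" }));
  }
}

// A page forces the highest level and asks for a large buffer, so that
// viewers on a low bandwidth can wait and still see high quality video.
const demoVideo = new MockAdaptiveVideo([{ bitrate: 100000 }, { bitrate: 500000 }]);
const seenLevels = [];
demoVideo.addEventListener("levelChange", () => seenLevels.push(demoVideo.currentLevel));
demoVideo.bufferLength = 120;
demoVideo.setLevel(demoVideo.levels.length - 1);
```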

API adaptive streaming

In case a user-agent does not have manifest parsing and level switching heuristics built-in, the video element can still accommodate adaptive streaming through a small stream API:

appendVideo(url,[range]): fetch the URL and append the video to the currently playing video.
bufferlength (read-only): get the current video buffer amount, in seconds.

This single call allows developers to build adaptive HTTP streaming inside the JavaScript layer. Manifest parsing and stream level APIs are not needed. The quality of service metrics are still needed though:

When the video plays fine, chunks from the same quality level are constantly appended.
When a switch to a different quality level is made, chunks from a different quality level are appended.
When the user seeks to a different position in the video, its src is simply set to the appropriate chunk at that position.

The bufferlength getter reports upon the actual amount of data that's in the buffer. Scripts cannot presume any video they append to the videoElement is immediately available: the URL has to be resolved, the data has to be fetched and the data has to be demuxed.
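
A script-side loop built on appendVideo might look like the sketch below. The createChunkScheduler function, the chunk URLs and the mock video object are hypothetical. The mock only records the calls and never updates bufferlength itself, which matches the point that appended video is not immediately available.

```javascript
// Hypothetical scheduler: on every tick it appends the next chunk from the
// requested level, unless enough video (in seconds) is already buffered.
function createChunkScheduler(levelChunks, targetBuffer) {
  let nextIndex = 0;
  return function tick(video, levelName) {
    const chunks = levelChunks[levelName];
    if (video.bufferlength >= targetBuffer || nextIndex >= chunks.length) {
      return null;
    }
    const url = chunks[nextIndex];
    nextIndex += 1;
    video.appendVideo(url);
    return url;
  };
}

// Mock video element that only records what was appended.
const mockVideo = {
  bufferlength: 0,
  appended: [],
  appendVideo(url) {
    this.appended.push(url);
  },
};

const tick = createChunkScheduler(
  {
    low: ["low/segment1.webm", "low/segment2.webm", "low/segment3.webm"],
    high: ["high/segment1.webm", "high/segment2.webm", "high/segment3.webm"],
  },
  20
);

const firstTick = tick(mockVideo, "low"); // Empty buffer: first low chunk is appended.
mockVideo.bufferlength = 25;
const secondTick = tick(mockVideo, "low"); // Buffer is full: nothing is appended.
mockVideo.bufferlength = 5;
const thirdTick = tick(mockVideo, "high"); // Switch: next chunk comes from the high level.
```

The switch is nothing more than passing a different level name for the next chunk; the chunk index keeps counting across levels.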

The appendVideo call implies that properties such as duration, videoHeight and videoWidth may change during a playback session.

A number of rules have to be set up as to how the concatenation will actually work. For example:

In order to allow user-agents to use a single decoding pipeline, the current video and the one that's appended should contain the same container format and A/V codecs.
Video is appended on a frame-by-frame basis (no bytedata).
[Audio is appended by slightly extending the data and applying a crossfade?]

The optional range parameter instructs the user-agent to only request a certain byterange.

Rationale

Finally, some rationale for the choices made in this proposal. Why chunks and a manifest? Why not, for example, range-requests and <source> tags?

First and foremost, we need a format that works not only in HTML5 browsers, but also in, for example, mobile apps (Android/BlackBerry/iOS), desktop players (Miro/QuickTime/VLC) and big screen devices (Roku, Boxee, PS3). Especially for the very small screens (3G network) and large screens (full HD), adaptive streaming is incredibly valuable. Tailoring a solution too much towards the HTML5 syntax and browser environment will hinder broad adoption of an open video standard. Adaptive streaming and HTML5 should work nicely together, but adaptive streaming should not rely on HTML5.

That said:

Providing the low-tech scenario of storing chunks as separate files on the webserver enables adaptive streaming in cases where either the server, the user-agent (apps / players / settops) or the network (firewalls, cellulars) does not support something like range-requests. As an example, implementing adaptive streaming using range-requests in Adobe Flash (e.g. as temporary fallback) would not be possible, since the range-request header is blocked.
Ecosystem partners (CDNs, encoding providers, landmark publishers, etc.) are already getting used to and building tools around the concept of *chunked* video streams. Examples are log aggregators that roll up chunk servings into a single logline, or encoders that simultaneously build multiple stream levels, chunk them up and render their manifests.
With just the QOS metrics (*downloadRate* and *decodedFrames*) in place, it will be possible to build adaptive-streaming-like solutions (using range-requests) in JavaScript. In Flash, this same functionality is supported (and very popular) within both Flowplayer and JW Player. True adaptive streaming (continuous switching without buffering) won't be possible, but the experience is good enough to suit people that don't have the encoder or browser (yet) to build or play back adaptive streams.
