Video Metrics

From WHATWG Wiki

Related HTML WG bugs: http://www.w3.org/Bugs/Public/show_bug.cgi?id=12399 , http://www.w3.org/Bugs/Public/show_bug.cgi?id=14310


Requirements

For several reasons, we need to expose the performance of media elements to JavaScript.

(1) One concrete use case is that content publishers want to understand the quality of their content as played back by their users, and how much of it users actually play back. For example, if a video always goes into buffering mode after 1 min for all users, there may be a problem in the encoding, or the video may be too big for the typical bandwidth/CPU combination.

Service providers, especially commercial service providers, need to manage and monitor the performance of their service, for example to detect problems before they cause a deluge of Customer Service calls or identify whether, where and when the service is providing the quality of experience expected by customers.

(2) Also, publishers want to track the metrics of how much of their video and audio files is actually being watched.

(3) A related use case is HTTP adaptive streaming, where an author wants to manually implement an algorithm for switching between different resources of different bandwidth or screen size. For example, if the user goes full screen and the user's machine and bandwidth allow for it, the author might want to switch to a higher resolution video.


Note that whenever bitrates are reported, it needs to be clear how the bitrate is calculated. For example, if it is an average, then over what time interval is it averaged? If it is some kind of peak bitrate, then over what window size was the peak calculated, or how else is it defined? A raw bitrate alone is not very meaningful.
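
To illustrate why the averaging window has to be stated, here is a minimal sketch that computes an average bitrate over an explicit window from cumulative byte counts. The `ByteSample` type and the `averageBitrate` function are hypothetical names used only for this illustration; they are not part of any proposal on this page.

```typescript
// Hypothetical sample: cumulative bytes observed at a given time.
interface ByteSample {
  timeSeconds: number; // when the sample was taken
  totalBytes: number;  // cumulative byte count at that time
}

// Average bitrate in bits per second over the most recent `windowSeconds`.
// Returns null when the window does not span two samples with positive elapsed time.
function averageBitrate(samples: ByteSample[], windowSeconds: number): number | null {
  if (samples.length < 2) return null;
  const last = samples[samples.length - 1];
  const cutoff = last.timeSeconds - windowSeconds;
  // Find the oldest sample (other than the last one) that lies inside the window.
  for (let i = 0; i < samples.length - 1; i++) {
    const first = samples[i];
    if (first.timeSeconds >= cutoff) {
      const elapsed = last.timeSeconds - first.timeSeconds;
      return elapsed > 0 ? ((last.totalBytes - first.totalBytes) * 8) / elapsed : null;
    }
  }
  return null;
}
```

The same samples can yield very different numbers for a 1 second window and a 3 second window, which is why a reported bitrate should always carry its window definition.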

Further note: Measuring the performance of media elements should include error cases. The HTMLMediaElement (and XMLHttpRequest for that matter) are rather light on reporting of network errors (just a boolean). It would be good to get more detailed information about what errors occurred.

Collection of Proposals/Implementations

Mozilla have implemented the following statistics into Firefox:

  • mozParsedFrames - number of frames that have been demuxed and extracted out of the media.
  • mozDecodedFrames - number of frames that have been decoded - converted into YCbCr.
  • mozPresentedFrames - number of frames that have been presented to the rendering pipeline for rendering - were "set as the current image".
  • mozPaintedFrames - number of frames which were presented to the rendering pipeline and ended up being painted on the screen. Note that if the video is not on screen (e.g. in another tab or scrolled off screen), this counter will not increase.
  • mozFrameDelay - the time delay between presenting the last frame and it being painted on screen (approximately).
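
As a rough illustration of how the counters above relate to each other, the sketch below computes how many frames were lost between successive pipeline stages. It works on a plain object holding the counter values; the `MozFrameStats` and `StageLosses` types and the `stageLosses` function are hypothetical names for this example.

```typescript
// Snapshot of the Firefox counters listed above, copied into a plain object.
interface MozFrameStats {
  mozParsedFrames: number;
  mozDecodedFrames: number;
  mozPresentedFrames: number;
  mozPaintedFrames: number;
}

// Frames that did not (or not yet) make it from one stage to the next.
interface StageLosses {
  notDecoded: number;   // parsed but not decoded
  notPresented: number; // decoded but not presented
  notPainted: number;   // presented but not painted
}

function stageLosses(stats: MozFrameStats): StageLosses {
  return {
    notDecoded: stats.mozParsedFrames - stats.mozDecodedFrames,
    notPresented: stats.mozDecodedFrames - stats.mozPresentedFrames,
    notPainted: stats.mozPresentedFrames - stats.mozPaintedFrames,
  };
}
```

These differences also include frames that are still in flight, and `notPainted` grows whenever the video is off screen, so they are only an approximate indicator of dropped frames.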

Mozilla are also working on some of the statistics listed here.

WebKit have implemented these:

  • webkitAudioDecodedByteCount - number of audio bytes that have been decoded.
  • webkitVideoDecodedByteCount - number of video bytes that have been decoded.
  • webkitDecodedFrameCount - number of frames that have been demuxed and extracted out of the media.
  • webkitDroppedFrameCount - number of frames that were decoded but not displayed due to performance issues.


Adobe Flash player has many statistics:

  • audioBufferByteLength - [read-only] Provides the NetStream audio buffer size in bytes.
  • audioBufferLength - [read-only] Provides NetStream audio buffer size in seconds.
  • audioByteCount - [read-only] Specifies the total number of audio bytes that have arrived in the queue, regardless of how many have been played or flushed.
  • audioBytesPerSecond - [read-only] Specifies the rate at which the NetStream audio buffer is filled in bytes per second.
  • audioLossRate - [read-only] Specifies the audio loss for the NetStream session.
  • byteCount - [read-only] Specifies the total number of bytes that have arrived into the queue, regardless of how many have been played or flushed.
  • currentBytesPerSecond - [read-only] Specifies the rate at which the NetStream buffer is filled in bytes per second.
  • dataBufferByteLength - [read-only] Provides the NetStream data buffer size in bytes.
  • dataBufferLength - [read-only] Provides NetStream data buffer size in seconds.
  • dataByteCount - [read-only] Specifies the total number of bytes of data messages that have arrived in the queue, regardless of how many have been played or flushed.
  • dataBytesPerSecond - [read-only] Specifies the rate at which the NetStream data buffer is filled in bytes per second.
  • droppedFrames - [read-only] Returns the number of video frames dropped in the current NetStream playback session.
  • isLive - [read-only] Returns whether the media being played is recorded or live.
  • maxBytesPerSecond - [read-only] Specifies the maximum rate at which the NetStream buffer is filled in bytes per second.
  • metaData - [read-only] Retrieve last meta data object associated with media being played.
  • playbackBytesPerSecond - [read-only] Returns the stream playback rate in bytes per second.
  • SRTT - [read-only] The smoothed round trip time (SRTT) for the NetStream session, in milliseconds.
  • videoBufferByteLength - [read-only] Provides the NetStream video buffer size in bytes.
  • videoBufferLength - [read-only] Provides NetStream video buffer size in seconds.
  • videoByteCount - [read-only] Specifies the total number of video bytes that have arrived in the queue, regardless of how many have been played or flushed.
  • videoBytesPerSecond - [read-only] Specifies the rate at which the NetStream video buffer is filled in bytes per second.
  • videoLossRate - [read-only] Provides the NetStream video loss rate (ratio of lost messages to total messages).


Silverlight advanced logging relevant stats:

  • c-buffercount: Number of times rebuffering is required, i.e. how many times the buffer underflows. This is calculated at the frame level.
  • c-bytes: Number of bytes received by the client from the server. The value does not include any overhead that is added by the network stack. However, HTTP may introduce some overhead. Therefore, the same content streamed by using different protocols may result in different values. If c-bytes and sc-bytes(server-side) are not identical, packet loss occurred.
  • c-starttime: The point where the client began watching the stream (in seconds, no fraction). For true live streaming, we need to calculate time offset using wallclock time.
  • x-duration: Duration (in seconds) of the data rendered by the client from c-starttime.
  • startupTimeMs: From play to render first frame (in milliseconds).
  • bandwidthMax: Maximum perceived bandwidth
  • bandwidthMin: Minimum perceived bandwidth
  • bandwidthAvg: Average perceived bandwidth
  • droppedFramesPerSecond: Dropped frames per second (provided by Silverlight)
  • renderedFramesPerSecond: Rendered frames per second (provided by Silverlight)
  • audioResponseTimeAvg: Average response time to get audio chunks. This is time from request to last byte.
  • audioResponseTimeMax: Maximum response time to get audio chunks. This is time from request to last byte.
  • audioResponseTimeMin: Minimum response time to get audio chunks. This is time from request to last byte.
  • videoResponseTimeAvg: Average response time to get video chunks. This is time from request to last byte.
  • videoResponseTimeMax: Maximum response time to get video chunks. This is time from request to last byte.
  • videoResponseTimeMin: Minimum response time to get video chunks. This is time from request to last byte.
  • audioDownloadErrors: Total number of missing audio chunks (for example, 404s). This is a semicolon-separated list of starttime/chunk IDs.
  • videoDownloadErrors: Total number of missing video chunks (for example, 404s). This is a semicolon-separated list of starttime/chunk IDs
  • audioPlaybackBitrates: An ordered list of the audio bit-rates played during playback. This is a semicolon-separated list. This list is in the order of playback. There may be duplicate entries.
  • videoPlaybackBitrates: An ordered list of the video bit-rates played during playback. This is a semicolon-separated list. This list is in the order of playback. There may be duplicate entries.
  • audioBandwidthAvg: Average audio bit rate for the downloaded chunks
  • videoBandwidthAvg: Average video bit rate for the downloaded chunks
  • audioBufferSizeAvg: Average audio buffer size (in seconds) during playback
  • audioBufferSizeMax: Maximum audio buffer size (in seconds) during playback
  • videoBufferSizeAvg: Average video buffer size (in seconds) during playback
  • videoBufferSizeMax: Maximum video buffer size (in seconds) during playback


JW Player (using ActionScript) broadcasts the following QoS metrics for both RTMP dynamic and HTTP adaptive streaming:

  • bandwidth: server-client data rate, in kilobytes per second.
  • latency: client-server-client roundtrip time, in milliseconds.
  • frameDropRate: number of frames not presented to the viewer, in frames per second.
  • screenWidth / screenHeight: dimensions of the video viewport, in pixels. Changes e.g. when the viewer jumps fullscreen.
  • qualityLevel: index of the currently playing quality level (see below).

Bandwidth and droprate are running metrics (averaged out). Latency and dimensions are sampled (taken once). For RTMP dynamic, the metrics are broadcast at a settable interval (default 2s). For HTTP adaptive, metrics are calculated and broadcast upon completion of a fragment load.

Separately, JW Player broadcasts a SWITCH event at the painting of a frame that has a different qualityLevel than the preceding frame(s). While the metrics.qualityLevel tells developers the qualityLevel of the currently downloading buffer/fragment, the SWITCH event tells developers the exact point in time where the viewer experiences a jump in video quality. This event also helps developers correlate the value of frameDropRate to the currently playing qualityLevel (as opposed to the currently loading one). Depending upon buffer, fragment and GOP size, the time delta between a change in metrics.qualityLevel and SWITCH.qualityLevel may vary from a few seconds to a few minutes.

Finally, JW Player accepts and exposes per video an array with quality levels (the distinct streams of a video between which the player can switch). For each quality level, properties like bitrate, framerate, height and width are available. The plain mapping using qualityLevel works because JW Player to date solely supports single A/V muxed dynamic/adaptive videos, with no multi-track support.


For HTTP adaptive streaming the following statistics have been proposed:

  • downloadRate: The current server-client bandwidth (read-only).
  • videoBitrate: The current video bitrate (read-only).
  • droppedFrames: The total number of frames dropped for this playback session (read-only).
  • decodedFrames: The total number of frames decoded for this playback session (read-only).
  • height: The current height of the video element (already exists).
  • videoHeight: The current height of the video file (already exists).
  • width: The current width of the video element (already exists).
  • videoWidth: The current width of the video file (already exists).


Further, a requirement to expose playback rate statistics has come out of issue-147:

  • currentPlaybackRate: the rate at which the video/audio is currently playing back


Here are a few metrics that measure the QoS that a user receives:

  • playerLoadTime
  • streamBitrate

(user interaction and playthrough can be measured using existing events)


MPEG DASH defines quality metrics for adaptive streaming at several levels:

  • What is presented to the user i.e. which portions of which versions of the streams and when they were presented. This information implicitly includes within it information like startup delay, timing and duration of pauses due to buffer exhaustion and overall quality (since it includes when the rate adaptation changes happen)
    • This could be very simply represented as a sequence of ( Time, StreamId, Playback Rate ) tuples, one for every point in time at which the playback rate or stream changed
  • Buffer levels over time within the player
  • Performance of the network stack. This includes, for each HTTP request
    • The URL and byte range requested
    • The time when the request was sent, when the response started to arrive and when the response was completed
    • The amount of data received
    • HTTP response code and, if applicable, redirect URL
    • A more detailed trace of data arrival rate, for example bytes received in each 1s or 100ms interval
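
The ( Time, StreamId, Playback Rate ) tuple representation mentioned above can be sketched as follows. This is only an illustration of how an application might summarize such a sequence; the `PlaybackEvent` type, the `summarize` function and the stream names in the usage note are assumptions, not something defined by DASH or by this page.

```typescript
// One entry per point in time at which the stream or playback rate changed:
// [time in seconds, stream id, playback rate].
type PlaybackEvent = [number, string, number];

interface PlaybackSummary {
  perStream: { [streamId: string]: number }; // seconds played per stream
  zeroRate: number;                          // seconds spent at playback rate 0
}

// Sums the wall-clock time spent on each stream while the playback rate was
// non-zero, and the time spent at rate 0. `endTime` closes the last interval.
function summarize(events: PlaybackEvent[], endTime: number): PlaybackSummary {
  const perStream: { [streamId: string]: number } = {};
  let zeroRate = 0;
  for (let i = 0; i < events.length; i++) {
    const time = events[i][0];
    const streamId = events[i][1];
    const rate = events[i][2];
    const next = i + 1 < events.length ? events[i + 1][0] : endTime;
    const span = next - time;
    if (rate === 0) {
      zeroRate += span;
    } else {
      perStream[streamId] = (perStream[streamId] || 0) + span;
    }
  }
  return { perStream, zeroRate };
}
```

For the sequence (0, "low", 1), (10, "low", 0), (12, "high", 1) with an end time of 30, this gives 10 seconds on "low", 18 seconds on "high" and 2 seconds at rate 0. Note that the tuples alone do not say whether a rate of 0 was a user pause or a buffer exhaustion.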

All of this information is intended for performance monitoring purposes rather than to inform real-time action by the player. It's useful to separate these kinds of information. Performance monitoring information can be reported in batches to the application for reporting back to the server.

HTTP performance information is sometimes collected at the server side. However, with adaptive streaming where streams are constructed at the client from many HTTP requests this becomes more difficult since HTTP requests for a single viewing session may be spread across multiple servers (or even multiple CDNs). It becomes more important as streaming services evolve to collect this information from the client.

The DASH specification also includes information about the TCP level: what TCP connections were established and which HTTP requests were sent on which connection.

The first kind of information above (what is presented) is already almost entirely available from video element events (changes in the current playback rate). The exception is rate adaptation changes.


Network error codes

It's difficult to define an exhaustive, implementation-independent list of errors. A common solution is to report an "error chain" which is a sequence of increasingly-specific error codes, each of which is the "cause" of the more general error preceding it in the chain. At the end of the chain, implementation-specific error codes can be used. Simple applications can interpret the earlier, standardized, high-level errors. Commercial applications may have an incentive to interpret some of the implementation-specific ones - at least the ones they see often.
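
A minimal sketch of such an error chain as a data structure is shown below. The `ChainedError` type, the `standardized` flag, the helper function and the example error codes in the usage note are all hypothetical; no such codes are defined here.

```typescript
// One link in an error chain. The chain is ordered from the most general
// error to the most specific cause.
interface ChainedError {
  code: string;          // hypothetical error code
  standardized: boolean; // false for implementation-specific codes
}

// Returns the most specific error that a simple application can still
// interpret, i.e. the last standardized entry in the chain.
function mostSpecificStandardError(chain: ChainedError[]): ChainedError | null {
  let result: ChainedError | null = null;
  for (let i = 0; i < chain.length; i++) {
    if (chain[i].standardized) {
      result = chain[i];
    }
  }
  return result;
}
```

Given a chain such as "NETWORK", then "HTTP_NOT_FOUND", then a vendor-specific "x-vendor-1234", a simple application would act on "HTTP_NOT_FOUND", while a commercial application could additionally log the vendor code.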

Proposal

The attributes specified here are intended to include only those that the application cannot calculate from any other source (or derive from the other attributes). E.g. there is no need for a ‘frames per second’ attribute.

To be more specific, there are no time-based derivatives, since no single sampling window would suit all applications. An application developer can instead calculate a rate from an attribute value by creating a timer and taking the difference between the values in successive timer calls.
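
The timer-based approach described above could look roughly like this. The `RateSampler` class is a hypothetical helper; it is deliberately independent of any media element so that the application chooses its own sampling interval.

```typescript
// Turns successive readings of a cumulative counter into a per-second rate.
class RateSampler {
  private lastValue: number | null = null;
  private lastTimeMs = 0;

  // Returns the change per second since the previous call, or null on the
  // first call or when no time has elapsed.
  sample(value: number, timeMs: number): number | null {
    const prevValue = this.lastValue;
    const prevTimeMs = this.lastTimeMs;
    this.lastValue = value;
    this.lastTimeMs = timeMs;
    if (prevValue === null || timeMs <= prevTimeMs) return null;
    return ((value - prevValue) * 1000) / (timeMs - prevTimeMs);
  }
}
```

In a page, an application might call `sample()` from a `setInterval` callback, passing the current value of one of the attributes proposed below together with `Date.now()`. The interval length is the sampling window, and it is the application's choice.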

Open issue: Do these values reset when the source media has changed?

bytesReceived

Applicable to: <video>, <audio>

The raw bytes received from the network for decode. Together with the downloadTime, this can be used to calculate the effective bandwidth. Includes container data.

Use case: This measures the network performance. This would be used to report the effective video bandwidth received by the browser to the server (i.e. the content publisher) to better understand what bandwidth is typically available to users and encode the videos on the server in the right set of bitrates (e.g. for what bandwidths are typically received in mobile / in desktop environments in different countries, so what should the edge servers be seeded with).

downloadTime

Applicable to: <video>, <audio>

The time from when the first HTTP request is sent until now or until the download stops/finishes/terminates (whichever is earlier).

Use case: Together with "bytesReceived", this allows network performance to be determined, because it gives the time actually spent downloading the resource (or parts of it). This is better than just starting a timer when putting the resource URL into the video element, since times when the browser is not trying to download more parts of the resource are not part of this downloadTime.


networkWaitTime

Applicable to: <video>, <audio>

The total time during which playback is stalled, waiting to receive more data from the network.

Use case: This shows how much of the downloadTime is actually waiting time. It thus makes it possible to identify network congestion and delays, and to separate them from the bandwidth that is available to the user.
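
As a sketch of how the three network attributes above might be combined, the function below derives an effective bandwidth and the fraction of download time spent waiting. The proposal does not state units for the time values; this example assumes seconds. The `NetworkMetrics` type and `networkSummary` function are hypothetical.

```typescript
// Values read from the proposed attributes at one point in time.
interface NetworkMetrics {
  bytesReceived: number;   // bytes, including container data
  downloadTime: number;    // ASSUMPTION: seconds
  networkWaitTime: number; // ASSUMPTION: seconds
}

interface NetworkSummary {
  bitsPerSecond: number; // effective bandwidth over the whole download time
  waitFraction: number;  // networkWaitTime as a fraction of downloadTime
}

// Returns null while no download time has accumulated yet.
function networkSummary(m: NetworkMetrics): NetworkSummary | null {
  if (m.downloadTime <= 0) return null;
  return {
    bitsPerSecond: (m.bytesReceived * 8) / m.downloadTime,
    waitFraction: m.networkWaitTime / m.downloadTime,
  };
}
```

For example, 5,000,000 bytes received in 20 seconds of download time, of which 5 seconds were stalled, gives an effective bandwidth of 2,000,000 bits per second and a wait fraction of 0.25.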


videoBytesDecoded

Applicable to: <video>

The number of bytes of video data that have been decoded. This can be used to calculate the effective video bitrate. Does not include container data.

See suggestion below for a proposed alternative form.

Use case: This measures the decoding pipeline performance. This would be used to report whether the decoding pipeline keeps up with the playback position, given that the network feeds sufficient bytes into this pipeline. videoBytesDecoded would measure the performance of the video decoding pipeline.


audioBytesDecoded

Applicable to: <audio>, <video>

The number of bytes of audio data that have been decoded. This can be used to calculate the effective audio bitrate. Does not include container data.

Use case: This measures the decoding pipeline performance similar to videoBytesDecoded. audioBytesDecoded would measure the performance of the audio decoding pipeline.

Suggestion: Rename videoBytesDecoded and audioBytesDecoded to a simple bytesDecoded which is the sum of all decoded audio and video data (excluding container data) across all constituent tracks and potentially also expose bytesDecoded at the TrackList level (i.e. add getBytesDecoded(in unsigned long index)).

Use case: This measures the decoding pipeline performance similar to videoBytesDecoded and audioBytesDecoded, but would be applicable per track (and is therefore probably more generally applicable). It can e.g. help identify slow browsers, or plugins that slow down the decoding, or the optimal number of tracks to put into a file before the decoding engine gets overloaded.
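
To make the suggestion above concrete, here is a sketch of how an element-level bytesDecoded could relate to the per-track values. The `TrackListLike` interface mirrors only the suggested `getBytesDecoded(index)` method and is an assumption, as is the `totalBytesDecoded` helper; neither exists in any browser.

```typescript
// Minimal stand-in for a track list exposing the suggested per-track getter.
interface TrackListLike {
  length: number;
  getBytesDecoded(index: number): number;
}

// Sums decoded bytes (excluding container data) across all tracks of all
// given track lists, e.g. one audio list and one video list.
function totalBytesDecoded(lists: TrackListLike[]): number {
  let total = 0;
  for (let l = 0; l < lists.length; l++) {
    for (let i = 0; i < lists[l].length; i++) {
      total += lists[l].getBytesDecoded(i);
    }
  }
  return total;
}
```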


decodedFrames

Applicable to: <video>

The number of frames of video that have been decoded and made available for presentation.

Use case: This measures the decoding pipeline performance on a frame level rather than a byte level, which allows judgements to be made about the rendering engine when compared with droppedFrames.


droppedFrames

Applicable to: <video>

The number of frames of video that have been dropped due to performance reasons. This does not include (for example) frames dropped due to seeking.

Use case: This measures the number of frames that are already too late to be handed to the rendering pipeline after decoding and thus measures the speed of the decoding pipeline.


presentedFrames

Applicable to: <video>

The number of frames of video that have been presented for compositing into the page. This could be used to calculate frames per second.

Use case: This measures the display quality that the user perceives and can determine the performance of the rendering engine given the performance of the network and decoding pipeline. If, for example, the system receives sufficient data from the network (no droppedFrames), but the rate of presented frames per second is below 15, we can assume that the user gets a poor presentation because the rendering engine was too slow. The machine is likely overloaded or not capable of rendering this video at this quality, and we should probably move to a lower bitrate (resolution or framerate) resource for the video.
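
The decision described in this use case might be sketched as follows, using two readings of presentedFrames taken some time apart. The `FrameSnapshot` type and both function names are hypothetical, and the 15 frames per second threshold is simply the figure used in the example above.

```typescript
// A reading of the proposed presentedFrames counter at a given time.
interface FrameSnapshot {
  timeSeconds: number;
  presentedFrames: number;
}

// Presented frames per second between two snapshots, or null if no time elapsed.
function presentedFps(earlier: FrameSnapshot, later: FrameSnapshot): number | null {
  const elapsed = later.timeSeconds - earlier.timeSeconds;
  if (elapsed <= 0) return null;
  return (later.presentedFrames - earlier.presentedFrames) / elapsed;
}

// True when the presented frame rate fell below `minFps`, suggesting a
// switch to a lower bitrate resource.
function shouldSwitchDown(earlier: FrameSnapshot, later: FrameSnapshot, minFps: number = 15): boolean {
  const fps = presentedFps(earlier, later);
  return fps !== null && fps < minFps;
}
```

An application would normally also check the network and decoding metrics before acting on this, since a low presented frame rate alone does not say which stage is the bottleneck.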


playbackJitter

Applicable to: <video>

It is useful for applications to be able to obtain an overall metric for perceived playback quality and smoothness. This value is the sum of all duration errors for frames intended to be presented to the user, where:

  • Ei: Desired duration of frame i spent on the screen (to nearest microsecond?)
  • Ai: Actual duration frame i spent on the screen (if the frame is never presented to the user, then Ai == 0).

then:

playbackJitter = sum(abs(Ei - Ai))

The application could use this value by sampling it at the beginning and end of a five second window and dividing the difference by the expected number of frames delivered in that window (which would normally be presentedFrames + droppedFrames).
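
The formula and the windowed use described above can be sketched as follows. Durations are taken to be in microseconds, as tentatively suggested in the definition of Ei; the `FrameTiming` type and both function names are hypothetical.

```typescript
// Timing of one frame that was intended to be presented.
interface FrameTiming {
  desiredUs: number; // Ei: desired duration on screen, in microseconds
  actualUs: number;  // Ai: actual duration on screen, 0 if never presented
}

// playbackJitter = sum(abs(Ei - Ai)) over all frames.
function playbackJitter(frames: FrameTiming[]): number {
  let sum = 0;
  for (let i = 0; i < frames.length; i++) {
    sum += Math.abs(frames[i].desiredUs - frames[i].actualUs);
  }
  return sum;
}

// Average duration error per frame over a sampling window, given the
// playbackJitter value at the start and end of the window and the expected
// number of frames delivered in it. Returns null for an empty window.
function jitterPerFrame(jitterAtStart: number, jitterAtEnd: number, expectedFrames: number): number | null {
  if (expectedFrames <= 0) return null;
  return (jitterAtEnd - jitterAtStart) / expectedFrames;
}
```

For four frames with a desired duration of 40000 microseconds each and actual durations of 40000, 80000, 0 and 40000, the jitter is 80000 microseconds, or 20000 microseconds per frame. Here one frame was dropped and the preceding frame stayed on screen for twice as long.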