The purpose of this page is to lay out and hammer down a specification for implementing captioning, subtitling, and timed text support for the HTML media elements, both Video and Audio. This is a work in progress, and is being maintained by User:Millam.
- 1 Terms
- 2 Caption Formats
- 3 Overview
- 4 Typical Usage
- 5 The Caption Track Element
- 6 Problems, other thoughts
- Timed Text: Text that occurs at specified times during playback of the media element.
- Captions: A transcription of all spoken words and relevant sounds, targeted at deaf and hearing-impaired audiences. Also known as "Subtitles for the Deaf and Hard of Hearing" (SDH or SDHH).
- Subtitles: A transcription of spoken words. Frequently translated. May include translations of on-screen text (e.g., a "Hello, my name is Greg" tag -> "Hola, me llamo Greg").
- Visual Description: A written description of the visual scenes. May or may not include spoken words. Targeted at visually impaired users.
- Closed Captioning: When the captions are separate from the media stream, and can be toggled on and off.
- Open Captioning: When the captions are "burned" into the video, and thus can't be toggled on and off.
- External captions: When the captions are kept in a file separate from the media file (e.g., .srt, .ass).
- Included captions: When the captions are included in the video file, whether as a separate text track or as 'prerendered' captions. (Described in Caption Formats, below)
- Un-styled captions: Captions that have no styling information. Just plain text.
- Styled captions: Captions that have one or more bits of styling: Color, Positioning (relative to the video), 'Karaoke' color changing, animation effects, etc.
- Pop-on or Paint-on captions: From the television 608/708 standards. Pop-on and Paint-on captions are positioned, but otherwise unstyled, text.
- Rollup captions: Captions that 'roll up'. Typically, there's a window of 3 lines, and as captions are added to it, they are added to the bottom line, pushing older text up or out of frame. Most typically used with news stations, soap operas, and broadcast events that are captioned live.
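The rollup behavior described above can be sketched as a fixed-height window. This is only an illustration: the `RollupWindow` class and the sample lines are hypothetical, and the 3-line height is the typical value mentioned above, not a requirement.

```python
from collections import deque
from typing import List


class RollupWindow:
    """Hypothetical fixed-height caption window; new lines enter at the bottom."""

    def __init__(self, max_lines: int = 3) -> None:
        # deque(maxlen=...) discards the oldest (top) line once the window is full.
        self._lines = deque(maxlen=max_lines)

    def add(self, text: str) -> None:
        # New text is appended as the bottom line, pushing older lines up.
        self._lines.append(text)

    def visible(self) -> List[str]:
        # Lines in top-to-bottom order, oldest first.
        return list(self._lines)


window = RollupWindow(max_lines=3)
for line in ["GOOD EVENING.", "OUR TOP STORY", "TONIGHT:", "RAIN."]:
    window.add(line)
```

After the fourth line is added, the first line has rolled out of frame and only the last three remain visible.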
From the User Agent's perspective, Timed Text, Captions, and Subtitles are functionally identical; the only difference is the content they describe. For the purpose of this document, I will be using "captions" and "captioning" to refer to all of the above.
One of the largest barriers to adding captioning to standards in the digital age is the sheer number of formats available. The list below is a small sampling.
- 608/708, or "Line 21" captions: Designed for Television, caption data is encoded in scanline 21. Text here is often broken up, and drawn by command.
- "Prerendered" captions. Designed for DVDs, these are actually transparent video frames that are drawn on top of the video.
- Subrip (.srt). Plain text files, where each separate caption consists of three or more lines: The number of the caption (optional), the start time (and optional end time), and the caption text, unstyled. Very easy to read and write by hand.
- SSAV4 (.ssa or .ass). A very flexible, styled format. It is very verbose, so authors usually prefer to use caption editors to create and edit these files.
- DFXP: W3C's overly verbose (imho) timed text method.
- SAMI, CMML, SMIL, others: Described at [Subtitle_file_formats] on Wikipedia.
For HTML5, we should support all Included Captions for all containers that our media elements support, and two External Caption formats:
- Subrip: For simplicity, ease of creation, ease of use, and to allow authors to style their text with CSS.
- SSAV4: At the opposite end, for more complex uses: Karaoke, etc.
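To show how little is needed to read Subrip, here is a minimal parsing sketch. It assumes the common `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, treats both the caption number and the end time as optional (as described above), and ignores anything it cannot parse. The names `Cue` and `parse_subrip` are hypothetical and not part of this proposal.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One timestamp: "HH:MM:SS,mmm" (a "." before the milliseconds is tolerated).
_TIME = r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
# A timing line: a start time, optionally followed by "--> end time".
_TIMING = re.compile(rf"^\s*{_TIME}(?:\s*-->\s*{_TIME})?\s*$")


@dataclass
class Cue:
    start: float          # seconds
    end: Optional[float]  # seconds, or None when no end time was given
    text: str             # unstyled caption text, possibly multi-line


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_subrip(data: str) -> List[Cue]:
    cues = []
    # Captions are separated from each other by blank lines.
    for block in re.split(r"\n\s*\n", data.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        # Skip the optional caption number.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        match = _TIMING.match(lines[0])
        if match is None:
            continue  # not a recognizable caption block; ignore it
        g = match.groups()
        start = _seconds(*g[0:4])
        end = _seconds(*g[4:8]) if g[4] is not None else None
        cues.append(Cue(start, end, "\n".join(lines[1:])))
    return cues


sample = """1
00:00:01,000 --> 00:00:04,000
Hello, my name is Greg

2
00:00:05,500 --> 00:00:07,250
Two-line
caption
"""
cues = parse_subrip(sample)
```

The text of each cue is kept as plain text, which leaves styling to CSS on the page, as suggested above.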
This section describes how the page author and the page viewer utilize captions.
The author writes an HTML5 page that includes a media element (<video>...</video> or <audio>...</audio>) and wishes to add captioning using external captions in one or more languages. For a single language in a standalone media tag, the author will include a 'subs="..."' attribute to define the location of a single captioning file. For multiple languages, the author will use a new element, <captiontrack>, to define a caption track. (Name of <captiontrack> element to be decided.)
The user then visits the author's page. The User Agent loads the video, then fetches the external caption track. The User Agent then determines whether to turn captions on: either captions default to on (preferred), or the user has expressed a preference to enable captioning. If captions are on, they are rendered on top of the video. Whether captions are on or off, the User Agent should provide a method to enable or disable the caption track(s).
The Caption Track Element
This is big enough to deserve its own section. The largest problem with caption elements is adding the ability to deal with all three major types of captions: unstyled, styled, and prerendered.
Implicit and Explicit tracks
Implicit tracks are Included Captions: They are part of the media stream. The User Agent is very unlikely to know the entirety of the caption track until the entire media file has been received and parsed.
Explicit tracks are External Captions: They are fetched from a separate URI by the User Agent, and have their timings tied to the media element.
Because the video tag supports streaming, Implicit Tracks mean that the User Agent cannot treat a caption track as if its entirety were known.
From this point on, "Caption Stream" refers to either an implicit or explicit track.
- Name: This is the name of the caption track. e.g: "Default", "Comments", "Auto-translated", "Translated by Cervantes". If null, it defaults to "default".
- Language: This is the language (*NOT* the text encoding!) of the caption track. This is the language _code_ that describes which language the caption file is in: "en", "pt", "fr", etc. If null, defaults to the language of the page, if known, or "en".
- captiontype: "caption", "subtitle", "description", "other", or null.
The User Agent should consider Type and Language when deciding whether to enable or disable a track. Name is displayed to the user for track selection, and is effectively cosmetic. The (Language, Name) tuple should be unique across all tracks.
Any or all of the three may be null, because the largest use case is an author adding a single caption track to their video. This also handles the case where tracks are "implicit".
- format: "styled", "unstyled", or "prerendered". If null, the format of the caption stream is used: Subrip is unstyled, SSAV4 is styled, and DVD-style tracks are prerendered.
- encoding: Only relevant to styled and unstyled text. Describes the character encoding of the content.
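The attributes and their null defaults can be modeled roughly as follows. This is a sketch under assumptions: the `CaptionTrack` class, the `resolved` and `has_unique_keys` helpers, and the stream-kind labels in `DEFAULT_FORMAT` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

CAPTION_TYPES = {"caption", "subtitle", "description", "other", None}

# Caption stream kind -> format, following the defaults listed above.
# The keys are hypothetical labels, not part of the proposal.
DEFAULT_FORMAT = {"subrip": "unstyled", "ssav4": "styled", "dvd": "prerendered"}


@dataclass
class CaptionTrack:
    name: Optional[str] = None
    language: Optional[str] = None
    captiontype: Optional[str] = None
    format: Optional[str] = None
    encoding: Optional[str] = None

    def resolved(self, stream_kind: str,
                 page_language: Optional[str] = None) -> "CaptionTrack":
        """Return a copy with null attributes replaced by their defaults."""
        if self.captiontype not in CAPTION_TYPES:
            raise ValueError(f"unknown captiontype: {self.captiontype!r}")
        return CaptionTrack(
            name=self.name or "default",
            # Fall back to the page language if known, otherwise "en".
            language=self.language or page_language or "en",
            captiontype=self.captiontype,
            format=self.format or DEFAULT_FORMAT[stream_kind],
            encoding=self.encoding,
        )


def has_unique_keys(tracks: Iterable[CaptionTrack]) -> bool:
    """Check that the (language, name) tuple is unique across all tracks."""
    keys = [(t.language, t.name) for t in tracks]
    return len(keys) == len(set(keys))


# The single-track use case: every attribute left null by the author.
track = CaptionTrack().resolved("subrip", page_language="pt")
```

Here the single unnamed track ends up named "default", in the page's language, with the unstyled format implied by Subrip.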
Lifetime of the track elements
Enabling and disabling
If the User Agent displays media controls (pause, start, stop, rewind, seek, etc), they must also include a caption control - if there is at least one caption track. If there is no caption track, the User Agent can elect to not display the caption control or to display a disabled caption control.
If the User Agent does not display media controls, and caption tracks are associated with the media, then the User Agent must include a caption control in its context menu or in an easily accessible toolbar menu.
User Agents are recommended, but not required, to provide a configuration option in the browser accessibility settings for four caption settings:
- Use visual descriptions. (Subject to rules for captions, below).
- And the following, grouped: (Selection box? 0 = disabled, 1 = my, 2 = unknown, 3 = any?)
- Auto-enable captions in my language. (Most specific)
- Auto-enable captions of unknown language.
- Auto-enable captions of any language. (Least specific)
When there are multiple caption tracks, one matching the user's language is the first choice. Type priority is "caption" > "subtitle", unless the user has chosen visual descriptions, in which case visual descriptions have top priority.
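One possible reading of these selection rules is sketched below. Several points are assumptions on my part rather than settled parts of the proposal: the auto-enable modes are treated as cumulative (mode 2 also accepts the user's language, mode 3 accepts everything), a null language stands for "unknown", language match is ranked ahead of type, and the `Track` and `pick_track` names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Track(NamedTuple):
    name: str
    language: Optional[str]     # None means the language is unknown
    captiontype: Optional[str]  # "caption", "subtitle", "description", "other", or None


# Auto-enable modes, using the selection-box numbering floated above.
DISABLED, MY_LANGUAGE, UNKNOWN_LANGUAGE, ANY_LANGUAGE = 0, 1, 2, 3


def pick_track(tracks: List[Track], user_language: str, mode: int,
               use_descriptions: bool = False) -> Optional[Track]:
    """Return the track to auto-enable, or None if none qualifies."""
    if mode == DISABLED:
        return None

    def language_rank(track: Track) -> Optional[int]:
        # Lower is better; None means the track is not eligible in this mode.
        if track.language == user_language:
            return 0
        if track.language is None:
            return 1 if mode >= UNKNOWN_LANGUAGE else None
        return 2 if mode >= ANY_LANGUAGE else None

    # "caption" > "subtitle", with descriptions on top only if the user asked.
    type_order = ["caption", "subtitle"]
    if use_descriptions:
        type_order.insert(0, "description")

    def type_rank(track: Track) -> int:
        if track.captiontype in type_order:
            return type_order.index(track.captiontype)
        return len(type_order)

    candidates = []
    for index, track in enumerate(tracks):
        rank = language_rank(track)
        if rank is not None:
            # The index breaks ties in favor of the earliest track.
            candidates.append((rank, type_rank(track), index, track))
    return min(candidates)[3] if candidates else None


tracks = [
    Track("Commentary", "fr", "subtitle"),
    Track("English subs", "en", "subtitle"),
    Track("English SDH", "en", "caption"),
    Track("Described", "en", "description"),
]
chosen = pick_track(tracks, "en", MY_LANGUAGE)
```

With these sample tracks, an English-speaking user in "my language" mode gets the caption track rather than the subtitle track, and turning on visual descriptions switches the choice to the description track.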
There is an implicit block-level element, <captionblock> (Name undecided, see #BlockElement) overlaying the video element, and of the same size and location. If it is a video element, the User Agent may choose to represent the block level element with varying sizes. This implicit block level element is used to place text (styled and unstyled) inside. This block level element allows the web author or the user agent to style the caption text.
Undecided: Should prerendered frames be treated as image elements within the block level element, or just be considered part of the video, and enabled/disabled?
When the User Agent enables a Caption Stream, it triggers a caption_enable event. Disabling the Caption Stream triggers the corresponding caption_disable event.
When enabled, the User Agent ties the Caption Stream to the video player. At the appropriate start and end times, or during a seek, the User Agent will trigger caption_add and caption_clear events. (Within the User Agent)
caption_enable
Prerendered captions: Instruct the video decoder to display the selected prerendered frames.
Styled and unstyled text: Add a listener or a poller to the media element to trigger caption_add and caption_clear events at appropriate times.
caption_disable
Prerendered captions: Instruct the decoder to stop displaying the prerendered frames.
Styled and unstyled text: Remove the listener or poller from the media element.
caption_add
Run when a particular caption needs to be displayed.
Prerendered: Invalid, prerendered is handled by the decoder. caption_add should not be called.
Unstyled text: placed in the implicit <captionblock> element.
Styled text: Converted to HTML5, then placed in the implicit <captionblock> element.
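For styled and unstyled text, the poller approach might look something like the sketch below. It is an illustration under assumptions, not the proposed mechanism: the `CaptionPoller` class and its callbacks are hypothetical, and the media element is reduced to a playback time passed to `poll`. Because the active set is recomputed on every poll, a seek is handled the same way as normal playback, and cues appended later (as with implicit tracks) are picked up automatically.

```python
from typing import Callable, List, NamedTuple, Set


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


class CaptionPoller:
    """Hypothetical poller that fires caption_add / caption_clear callbacks."""

    def __init__(self, cues: List[Cue],
                 on_add: Callable[[Cue], None],
                 on_clear: Callable[[Cue], None]) -> None:
        self.cues = cues  # may keep growing while the media is still streaming
        self.on_add = on_add
        self.on_clear = on_clear
        self._active: Set[int] = set()  # indexes of cues currently displayed

    def poll(self, now: float) -> None:
        """Compare what is shown with what should be shown at time `now`."""
        should_show = {i for i, cue in enumerate(self.cues)
                       if cue.start <= now < cue.end}
        for i in sorted(self._active - should_show):
            self.on_clear(self.cues[i])
        for i in sorted(should_show - self._active):
            self.on_add(self.cues[i])
        self._active = should_show

    def disable(self) -> None:
        """On caption_disable: clear whatever is still on screen."""
        for i in sorted(self._active):
            self.on_clear(self.cues[i])
        self._active = set()


log = []
poller = CaptionPoller(
    [Cue(1.0, 4.0, "Hello"), Cue(5.0, 7.0, "Goodbye")],
    on_add=lambda cue: log.append(("caption_add", cue.text)),
    on_clear=lambda cue: log.append(("caption_clear", cue.text)),
)
for playback_time in (0.0, 2.0, 6.0, 8.0):
    poller.poll(playback_time)
```

In a real User Agent the `on_add` callback would place the text in the implicit caption block and `on_clear` would remove it; here they only record which event fired.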
Problems, other thoughts
- What can we do about Karaoke captions? These are captions that gradually change from one color to another, going from left to right.