The purpose of this page is to lay out and hammer down a specification for implementing captioning, subtitling, and timed text support for the HTML media elements, both video and audio. This is a work in progress, and is being maintained by User:Millam.
- Timed Text: Text that occurs at specified times during playback of the media element.
- Captions: A transcription of all spoken words and relevant sounds, targeted at deaf and hearing-impaired audiences. Also known as "Subtitles for the Deaf and Hard of Hearing" (SDH or SDHH).
- Subtitles: A transcription of spoken words, frequently translated. May include translations of on-screen text (e.g., a "Hello, my name is Greg" tag -> "Hola, me llamo Greg").
- Closed Captioning: When the captions are separate from the media stream, and can be toggled on and off.
- Open Captioning: When the captions are "burned" into the video, and thus can't be toggled on and off.
- External captions: When the captions are kept in a file separate from the media file (e.g., .srt, .ass).
- Included captions: When the captions are included in the video file, whether as a separate text track or as 'prerendered' captions. (Described in Caption Formats, below)
- Un-styled captions: Captions that have no styling information. Just plain text.
- Styled captions: Captions that have one or more bits of styling: Color, Positioning (relative to the video), 'Karaoke' color changing, animation effects, etc.
- Rollup captions: Captions that 'roll up'. Typically, there's a window of 3 lines, and as captions are added to it, they are added to the bottom line, pushing older text up, or out of frame. Most typically used with news stations, soap operas, and broadcast events that are captioned live.
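
The rollup behavior described in the last definition can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `RollupWindow` and its methods are hypothetical, and the default height of 3 simply mirrors the "typical" window mentioned above.

```python
from collections import deque
from typing import List


class RollupWindow:
    """Hypothetical fixed-height caption window.

    New lines enter at the bottom; once the window is full, the oldest
    line is pushed out of frame at the top.
    """

    def __init__(self, height: int = 3) -> None:
        # A bounded deque silently discards the oldest entry when full.
        self._lines = deque(maxlen=height)

    def add_line(self, text: str) -> None:
        """Append a caption line at the bottom of the window."""
        self._lines.append(text)

    def visible(self) -> List[str]:
        """Return the lines currently on screen, top to bottom."""
        return list(self._lines)
```

Adding a fourth line to a three-line window drops the first one, which is the "pushing older text up, or out of frame" effect.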
From the user agent's perspective, Timed Text, Captions, and Subtitles are functionally identical; the only difference is the content they describe. For the purpose of this document, I will be using "captions" and "captioning" to refer to all of the above.
One of the largest barriers to adding captioning to standards in the digital age is the sheer number of formats available. The list below is a small sampling.
- 608/708, or "Line 21" captions: Designed for Television, caption data is encoded in scanline 21. Text here is often broken up, and drawn by command.
- "Prerendered" captions. Designed for DVDs, these are actually transparent video frames that are drawn on top of the video.
- Subrip (.sub or .srt). Plain text files, where each separate caption consists of three or more lines: The number of the caption (optional), the start time (and optional end time), and the caption text, unstyled. Very easy to read and write by hand.
- SSA v4 (.ssa or .ass). A very flexible, styled format. It is very verbose, so authors usually prefer to use caption editors to create and edit these files.
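
To give a feel for how simple the Subrip layout is to consume, here is a minimal parsing sketch. It follows the description in the list above (optional caption number, a start time with an optional end time, then the caption text) and assumes the common `HH:MM:SS,mmm --> HH:MM:SS,mmm` timestamp notation with blank lines between captions. The names `Cue` and `parse_subrip` are hypothetical, and real-world files have quirks this does not handle.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# One timestamp, e.g. "00:01:02,500" (a "." separator is tolerated too).
_TIME = r"(\d+):(\d{2}):(\d{2})[,.](\d{3})"
# A timing line: a start time, optionally followed by "--> end time".
_TIMING_RE = re.compile(r"^\s*" + _TIME + r"(?:\s*-->\s*" + _TIME + r")?\s*$")


@dataclass
class Cue:
    start: float            # seconds from the start of the media
    end: Optional[float]    # None when the file gives no end time
    text: str               # unstyled caption text, may span lines


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_subrip(data: str) -> List[Cue]:
    """Parse Subrip-style text into a list of cues (sketch only)."""
    cues: List[Cue] = []
    normalized = data.replace("\r\n", "\n").strip()
    # Captions are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # Skip the optional caption number when it precedes the timing line.
        if lines and not _TIMING_RE.match(lines[0]):
            lines = lines[1:]
        if not lines:
            continue
        match = _TIMING_RE.match(lines[0])
        if match is None:
            continue  # not a recognizable caption; ignore the block
        groups = match.groups()
        start = _seconds(*groups[:4])
        end = _seconds(*groups[4:]) if groups[4] is not None else None
        cues.append(Cue(start, end, "\n".join(lines[1:])))
    return cues
```

Blocks that do not contain a recognizable timing line are skipped rather than raising an error, on the assumption that a user agent would prefer to show the captions it can read.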
This section describes how the page author and the page viewer utilize captions.
The author writes an HTML5 page that includes a media element (<video>...</video> or <audio>...</audio>) and wishes to add captioning from external caption files in one or more languages. For a single language in a standalone media tag, the author will include a 'subs="..."' attribute to define the location of a single captioning file. For multiple languages, the author will use a new element, <track>, to define each caption track.
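
As a rough illustration of the two authoring styles just described, the sketch below embeds sample markup and shows how caption sources might be collected from it. Everything beyond the `subs` attribute and the `<track>` element is an assumption made for the example: the `src` and `lang` attributes on `<track>`, the file names, and the `CaptionSourceFinder` class are hypothetical and not part of this specification.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Single language: a 'subs' attribute on the media element itself.
SINGLE = '<video src="movie.ogv" subs="movie.en.srt"></video>'

# Multiple languages: one <track> per caption file.
# The 'src' and 'lang' attribute names are assumptions for this sketch.
MULTI = (
    '<video src="movie.ogv">'
    '<track src="movie.en.srt" lang="en">'
    '<track src="movie.es.srt" lang="es">'
    "</video>"
)


class CaptionSourceFinder(HTMLParser):
    """Collect (language, URL) pairs for captions attached to media elements."""

    MEDIA_TAGS = ("video", "audio")

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[Tuple[Optional[str], str]] = []
        self._in_media = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.MEDIA_TAGS:
            self._in_media = True
            if attributes.get("subs"):
                # The single-file form carries no language information.
                self.sources.append((None, attributes["subs"]))
        elif tag == "track" and self._in_media and attributes.get("src"):
            self.sources.append((attributes.get("lang"), attributes["src"]))

    def handle_endtag(self, tag):
        if tag in self.MEDIA_TAGS:
            self._in_media = False


def find_caption_sources(markup: str) -> List[Tuple[Optional[str], str]]:
    """Return every caption source found in the given markup."""
    finder = CaptionSourceFinder()
    finder.feed(markup)
    finder.close()
    return finder.sources
```

Calling `find_caption_sources(SINGLE)` yields one source with no language, while `find_caption_sources(MULTI)` yields one entry per `<track>`, in document order.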
The user then visits the author's page. The user agent loads the