Timed tracks

This page contains notes for the development of the first version of timed track features in HTML.

PLEASE DO NOT DIRECTLY MODIFY THIS PAGE, AS IT IS JUST HIXIE'S NOTES. IF YOU WANT TO CONTRIBUTE TO THIS PAGE, EITHER ADD EXAMPLES OF REAL-WORLD USE CASES TO THE TWO PAGES ABOVE, OR E-MAIL HIXIE OR THE LIST.

Requirements

Subtitle/Caption/Karaoke File Format

per-cue in/out times
- relative timings would be useful while editing, but may not be necessary in the published format
inline time cues for karaoke
bidi, newlines, ruby, italics [there's been no evidence provided that there's any need for more fine-grained control at a per-cue level]
voice selection (so that e.g. sfx descriptions and each character can be a different colour)
per cue vertical position: % of vertical video height (default 100%)
per cue horizontal position: % of horizontal video width (default 50%)
per cue direction: horizontal/vertical (default horizontal)
per cue width/height: % (default is remaining space on line given alignment)
per cue text alignment: start/middle/end (default middle for subtitles, start for captions)
multiple cues placed in adjacent places (e.g. from different voices or with slightly different times) would need to automatically stack so they don't overlap
- but should support multiple cues from multiple voices on the same "line", e.g. when two people both utter something at the same time (need an example of this).

(Percentage positions would work like background-position in CSS.)
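
As a rough illustration of the background-position analogy: in CSS, a percentage aligns the point that far across the image with the point that far across the container. The sketch below applies the same arithmetic to a cue box. The function name and the pixel sizes are made up for the example and are not part of any proposed API.

```typescript
// Hypothetical helper, not part of any proposed API.
// Resolves a percentage position the way CSS background-position does:
// the point P% across the cue box lines up with the point P% across the video.
function resolveCueOffset(videoSize: number, cueSize: number, percent: number): number {
  return (videoSize - cueSize) * (percent / 100);
}

// Horizontal default (50%): a 200px-wide cue box in a 640px-wide video.
const centred = resolveCueOffset(640, 200, 50); // 220, so the box spans 220..420

// Vertical default (100%): a 40px-tall cue box in a 480px-tall video.
const atBottom = resolveCueOffset(480, 40, 100); // 440, so the box spans 440..480
```

With this model the defaults (50% horizontal, 100% vertical) put a cue centred along the bottom edge of the video without it ever overflowing the frame.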

Formatting

Inline

text should be bidi-aware
some cases use ruby
some cases use italics

Global

color of background/text/outline is needed for readability on different types of video.
webfonts are needed to provide high quality subtitles in some non-Latin languages (e.g. Chinese, where a suitable font is unlikely to be available even on Chinese computer systems).
providing a pseudo-element to style each voice would likely be sufficient for authors who want overall formatting control (this would also allow user overrides conveniently)

HTML

an API and UI for exposing what timed tracks exist and selectively enabling/disabling them

format for external subtitles/captions
format for external audio descriptions
some mechanism for text in the page to be used instead of external files, for subtitles/captions or audio description
an API to allow a segment to be dynamically inserted into the rendering on the fly

an API for exposing what the currently relevant segments of each timed track are
a way to hook into this mechanism to advance slides

native rendering of subtitles
native rendering of audio descriptions
native rendering of multiple audio or video tracks, to allow pre-recorded audio descriptions to be mixed in and sign language video to be overlaid
a way to hook into this to manually render timed tracks

Architecture

Declaring timed tracks

Each timed track is either:

enabled, in which case it is downloaded, triggers events, and if appropriate is rendered by the user agent; or
disabled, in which case it does nothing

The enabled/disabled state is by default based on user preferences and the kind of timed track as described below, but can be overridden on a per-track basis.

Each timed track has a kind which is one of:

for visual display (subtitles, captions, translations), enabled based on user preferences, shows in video playback area
for audio playback (text audio descriptions), enabled based on user preferences, renders as audio
for navigation (chapter titles), enabled by default, shows in UA UI
for off-video display (lyrics), disabled by default in this version, not shown by UA
for metadata (slide timings, annotation data for app-rendered annotations), enabled by default, not shown by UA

Tracks that are for visual display or audio playback additionally have a user-facing label and a language.

Tracks that are for visual display have an additional boolean indicating if they include sound effects and speaker identification (intended for the deaf, hard of hearing, or people with sound muted) or not (i.e. translations intended for people with audio enabled but who cannot understand the language, or karaoke lyrics).

Each timed track associated with a media resource, like the media resource itself, can have multiple sources.

Each source for a timed track has:

URL
type (if there are multiple sources)
media

The media resource can also imply certain timed tracks based on data in the media resource.

The script can also add "virtual" timed tracks dynamically.

Markup

<track src="" enabled="true" kind="" label="" lang=""></track>

<track enabled="true" kind="" label="" lang="">
 <source src="" type="" media="">
 ...
</track>

enabled="" is true or false.

Values for kind="":

subtitles (includes karaoke) - default
captions
description (text audio descriptions)
chapters
lyrics
metadata
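
A minimal sketch of how the per-kind defaults from "Declaring timed tracks" might combine with an explicit enabled="" value. It assumes user preferences can be reduced to two booleans, which is a simplification made for this example. The names UserPrefs and isTrackEnabled are hypothetical.

```typescript
type TrackKind = "subtitles" | "captions" | "description" | "chapters" | "lyrics" | "metadata";

// Assumption: user preferences are reduced to two booleans for this sketch.
interface UserPrefs {
  wantsVisualTitles: boolean;      // subtitles and captions
  wantsAudioDescriptions: boolean; // text audio descriptions
}

// enabledAttr is the parsed enabled="" attribute, or undefined if absent.
function isTrackEnabled(kind: TrackKind, prefs: UserPrefs, enabledAttr?: boolean): boolean {
  // An explicit per-track value overrides the per-kind default.
  if (enabledAttr !== undefined) return enabledAttr;
  switch (kind) {
    case "subtitles":
    case "captions":
      return prefs.wantsVisualTitles;      // visual display: user preference
    case "description":
      return prefs.wantsAudioDescriptions; // audio playback: user preference
    case "chapters":
    case "metadata":
      return true;                         // enabled by default
    case "lyrics":
      return false;                        // disabled by default in this version
  }
}
```

Under this model, a lyrics track only becomes active when the author writes enabled="true", and a captions track with no attribute follows the user's preference.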

Questions:

Should we use lang="", hreflang="", srclang=""?

Is there a better solution to enabled=false for disabling tracks by default? Do we ever need to disable a track that might be enabled by default?

Visual titles

File format

Should be backwards-compatible with an existing format, ideally SRT given the huge volume of subtitles available in SRT format on the Web today.
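
For reference, SRT gives each cue's in/out times on a single line of the form "00:01:02,500 --> 00:01:05,000". The sketch below converts such a line into start and end times in seconds. It only covers the timing line, it is not a proposal for the eventual format, and the function name is made up for the example.

```typescript
// Parses an SRT timing line such as "00:01:02,500 --> 00:01:05,000" into
// start/end times in seconds. Returns null if the line does not match.
function parseSrtTiming(line: string): { startTime: number; endTime: number } | null {
  const m = /^(\d{2}):(\d{2}):(\d{2}),(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2}),(\d{3})/.exec(line.trim());
  if (m === null) return null;
  // Converts hours, minutes, seconds and milliseconds to fractional seconds.
  const toSeconds = (h: string, min: string, s: string, ms: string): number =>
    Number(h) * 3600 + Number(min) * 60 + Number(s) + Number(ms) / 1000;
  return {
    startTime: toSeconds(m[1], m[2], m[3], m[4]),
    endTime: toSeconds(m[5], m[6], m[7], m[8]),
  };
}

const timing = parseSrtTiming("00:01:02,500 --> 00:01:05,000");
// timing is { startTime: 62.5, endTime: 65 }
```

Anything that follows the second timestamp on the same line is ignored by this regular expression, which leaves room for per-cue settings to be appended there. Whether the format should do that is a separate question.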

...

Processing model

...

CSS extensions

...

DOM API

HTMLMediaElement
 attribute MediaTrack[] tracks;
 MutableMediaTrack addTrack(label, kind, language);

MediaTrack
 readonly attribute DOMString label;
 readonly attribute DOMString kind; // subtitles, captions, descriptions, chapters, lyrics, metadata
 readonly attribute DOMString language;
 readonly attribute unsigned short mode;
   const unsigned short TRACK_OFF = 0; // not firing events, may not even be downloaded yet
   const unsigned short TRACK_HIDDEN = 1; // firing events but otherwise ignored by UA - intended for scripts
   const unsigned short TRACK_SHOWING = 2; // browser is handling it
 readonly attribute MediaCue[] cues; // sorted in startTime order
 readonly attribute MediaCue[] activeCues; // sorted in endTime order?
 readonly attribute Function onentercue; // fires CueEvent
 readonly attribute Function onexitcue; // fires CueEvent
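
A sketch of how activeCues and the enter/exit events might relate, using a cut-down stand-in for MediaCue. It assumes a cue is active while startTime <= t < endTime and follows the tentative endTime ordering noted above. CueLike, activeCuesAt and diffActiveCues are hypothetical names.

```typescript
// Minimal stand-in for MediaCue; only the fields needed for this sketch.
interface CueLike {
  id: string;
  startTime: number;
  endTime: number;
}

// Assumption: a cue is active while startTime <= t < endTime.
// The result is a new array sorted in endTime order; the input is not modified.
function activeCuesAt(cues: CueLike[], t: number): CueLike[] {
  return cues
    .filter((cue) => cue.startTime <= t && t < cue.endTime)
    .sort((a, b) => a.endTime - b.endTime);
}

// Compares the active list before and after a time update to decide which
// cues would get an entercue event and which an exitcue event.
function diffActiveCues(
  before: CueLike[],
  after: CueLike[]
): { entered: CueLike[]; exited: CueLike[] } {
  return {
    entered: after.filter((cue) => before.indexOf(cue) === -1),
    exited: before.filter((cue) => after.indexOf(cue) === -1),
  };
}
```

A user agent would presumably recompute the active list as the playback position changes and dispatch one CueEvent per entry in entered and exited.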

MutableMediaTrack: MediaTrack
 void addCue(cue); // throws if cue.track != null
 void removeCue(cue); // throws if cue isn't in this track

MediaCue
 readonly attribute MediaTrack track; // null if newly created and not yet added to a track
 readonly attribute DOMString id; // empty string if not applicable
 readonly attribute float startTime;
 readonly attribute float endTime;
 readonly attribute unsigned short horizontalPosition;
 readonly attribute unsigned short verticalPosition;
 readonly attribute unsigned short size;
 readonly attribute DOMString direction; // horizontal, vertical
 readonly attribute DOMString alignment; // start, middle, end
 readonly attribute DOMString voice; // for styling purposes
 DocumentFragment getCueAsHTML(); // returns a copy of the cue as HTML, with the current position in the case of karaoke lyrics annotated using a ProcessingInstruction or some such

 Constructor for MediaCue: new MediaCue(id, startTime, endTime, hPos, vPos, size, dir, align, voice, text); // text gets parsed like the cues in the main format, whatever that ends up being

CueEvent
 readonly attribute MediaCue cue;

HTMLTrackElement
 readonly attribute MediaTrack track;

Other minor things

We need to make sure that media playback is paused until all enabled timed tracks are locally available.
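
One way to read this requirement, as a sketch only: playback may proceed once every enabled track has finished loading, and disabled tracks never hold it up. The loaded flag and the names below are hypothetical and do not appear in the API above.

```typescript
interface TrackLoadState {
  enabled: boolean;
  loaded: boolean; // hypothetical flag: true once the track data is locally available
}

// Playback is allowed only when no enabled track is still waiting for its data.
function canStartPlayback(tracks: TrackLoadState[]): boolean {
  return tracks.every((t) => !t.enabled || t.loaded);
}
```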

We need to block cross-origin tracks (eventually blocking only those that aren't CORS-enabled).

Open issues

Synchronised media

For now, sign-language and alternate or additive audio tracks (e.g. audio description tracks) have to be in-band, because UA vendors are refusing to implement synchronisation of external media tracks.

However, we should bear it in mind. Adding that kind of thing to the API is going to be non-trivial. The simplest way is probably just to require that authors use multiple <video>/<audio> elements that we link somehow, with one designated as the "sync clock" and the others all syncing to it, rather than having each <video> element expose multiple "buffered", "seekable", etc.

Streaming

Do we need to handle live transcription and streaming titles in external files? If so, how?

For now, it's not clear if there are any use cases for streaming external timed track resources.
