Timed tracks
This page contains notes for the development of the first version of timed track features in HTML.
See also: use cases for timed tracks rendered over video by the UA, and use cases for API-level access to timed tracks.
PLEASE DO NOT DIRECTLY MODIFY THIS PAGE, AS IT IS JUST HIXIE'S NOTES. IF YOU WANT TO CONTRIBUTE TO THIS PAGE, EITHER ADD EXAMPLES OF REAL-WORLD USE CASES TO THE TWO PAGES ABOVE, OR E-MAIL HIXIE OR THE LIST.
Requirements
Subtitle/Caption/Karaoke File Format
- per-cue in/out times
- inline time cues for karaoke
- bidi, newlines, ruby, italics [there's been no evidence provided that there's any need for more fine-grained control at a per-cue level]
- voice selection (so that e.g. sfx descriptions and each character can be a different colour)
- per-cue vertical position: top/middle/bottom/% (default bottom)
- per-cue horizontal position: left/center/right/% (default center for subtitles, left for captions)
- per-cue text alignment: horizontal text: left, right, center; vertical text: top, middle, bottom
- display modes
- replace previous text (pop-up)
- scroll previous text up and add to bottom (roll-up)
- incremental text for live captions, if we support those
- multiple cues placed in adjacent places (e.g. from different voices or with slightly different times) would need to automatically stack so they don't overlap
(Percentage positions would work like background-position in CSS.)
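As a rough illustration of the background-position analogy: in CSS, a percentage aligns the point that far across the box with the point that far across the container, so the offset is a fraction of the leftover space rather than of the container size. The function name and the pixel sizes below are hypothetical and only serve the example.

```python
def cue_offset(container_size: float, box_size: float, percent: float) -> float:
    """Offset of the cue box's leading edge from the container's leading edge.

    Follows the CSS background-position rule: the percentage is applied to
    the space left over once the box is placed inside the container.
    """
    return (container_size - box_size) * percent / 100.0


# Hypothetical sizes: a 200px-wide cue box in a 640px-wide video.
print(cue_offset(640, 200, 0))    # 0.0   (flush left)
print(cue_offset(640, 200, 50))   # 220.0 (centered)
print(cue_offset(640, 200, 100))  # 440.0 (flush right)
```

With this rule 0% and 100% always keep the box fully inside the video, which is presumably why the analogy is attractive for cue positioning.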
Formatting
Inline
- text should be bidi-aware
- some cases use ruby
- some cases use italics
Global
- control over the color of the background/text/outline is needed for readability on different types of video.
- webfonts are needed to provide high quality subtitles in some non-Latin languages (e.g. Chinese, where a suitable font is unlikely to be available even on Chinese computer systems).
- providing a pseudo-element to style each voice would likely be sufficient for authors who want overall formatting control (this would also allow user overrides conveniently)
HTML
- an API and UI for exposing what timed tracks exist and selectively enabling/disabling them
- format for external subtitles/captions
- format for external audio descriptions
- some mechanism for text in the page to be used instead of external files, for subtitles/captions or audio description
- an API to allow a segment to be dynamically inserted into the rendering on the fly
- an API for exposing what the currently relevant segments of each timed track are
- a way to hook into this mechanism to advance slides
- native rendering of subtitles
- native rendering of audio descriptions
- native rendering of multiple audio or video tracks, to allow pre-recorded audio descriptions to be mixed in and sign language video to be overlaid
- a way to hook into this to manually render timed tracks
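The "currently relevant segments" requirement can be sketched as a lookup over per-cue in/out times. This is only a model of the idea, not a proposed API: the Cue class, the active_cues name, and the choice of a half-open interval (a cue stops being relevant exactly at its out time) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    start: float  # in time, in seconds
    end: float    # out time, in seconds
    text: str


def active_cues(cues: List[Cue], t: float) -> List[Cue]:
    """Return the cues relevant at time t, in their original order.

    Assumption: a cue covers the half-open interval [start, end).
    """
    return [c for c in cues if c.start <= t < c.end]


# Two overlapping cues, e.g. from different voices.
cues = [Cue(0.0, 2.0, "Hello."), Cue(1.0, 3.0, "[door slams]")]
print([c.text for c in active_cues(cues, 1.5)])
```

A script that advances slides or renders tracks manually would call something like this whenever the playback position changes.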
Architecture
<img src="http://docs.google.com/drawings/pub?id=1GR6Pzq0GY2n1sx_ZjDfuICM2LnXxLVxzvyl4kuQy-48&w=640&h=480">
Declaring timed tracks
Each timed track is either:
- enabled, in which case it is downloaded, triggers events, and if appropriate is rendered by the user agent; or
- disabled, in which case it does nothing
The enabled/disabled state is by default based on user preferences and the kind of timed track as described below, but can be overridden on a per-track basis.
Each timed track has a kind which is one of:
- for visual display (subtitles, captions, translations), enabled based on user preferences, shows in video playback area
- for audio playback (text audio descriptions), enabled based on user preferences, renders as audio
- for navigation (chapter titles), enabled by default, shows in UA UI
- for off-video display (lyrics), disabled by default in this version, not shown by UA
- for metadata (slide timings, annotation data for app-rendered annotations), enabled by default, not shown by UA
Tracks that are for visual display or audio playback additionally have a user-facing label and a language.
Tracks that are for visual display have an additional boolean indicating if they include sound effects and speaker identification (intended for the deaf, hard of hearing, or people with sound muted) or not (i.e. translations intended for people with audio enabled but who cannot understand the language, or karaoke lyrics).
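The default enabled/disabled rules above, plus the per-track override, could be modelled roughly as follows. The kind labels used as dictionary keys, the is_enabled name, and the reduction of "user preferences" to a single boolean are all simplifications invented for this sketch; real preferences would presumably also consider language and whether the track includes sound effects.

```python
from typing import Optional

# Default policy per kind of track, following the list above.
# "prefs" means the default is taken from user preferences.
DEFAULT_POLICY = {
    "visual": "prefs",      # subtitles, captions, translations
    "audio": "prefs",       # text audio descriptions
    "navigation": True,     # chapter titles
    "off-video": False,     # lyrics
    "metadata": True,       # slide timings, annotation data
}


def is_enabled(kind: str, user_wants: bool = False,
               override: Optional[bool] = None) -> bool:
    """Decide whether a track is enabled.

    A per-track override, when present, wins over the default for the kind.
    """
    if override is not None:
        return override
    policy = DEFAULT_POLICY[kind]
    return user_wants if policy == "prefs" else bool(policy)


print(is_enabled("navigation"))                # True
print(is_enabled("visual", user_wants=True))   # True
print(is_enabled("off-video", override=True))  # True
```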
Each timed track associated with a media resource, like the media resource itself, can have multiple sources.
Each source for a timed track has:
- URL
- type (if there are multiple sources)
- media
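One plausible way to choose among multiple sources, by analogy with how sources are chosen for the media resource itself, is to take the first one whose type is supported and whose media query matches. The notes do not specify a selection algorithm, so the first-match rule, the TrackSource and pick_source names, and the placeholder types and URLs are assumptions. Media query evaluation is out of scope here and is passed in as a callback.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass
class TrackSource:
    url: str
    type: Optional[str] = None   # omitted when there is only one source
    media: Optional[str] = None  # media query, if any


def pick_source(sources: Iterable[TrackSource],
                supported_types: Set[str],
                media_matches: Callable[[str], bool]) -> Optional[TrackSource]:
    """Return the first usable source, or None if none qualifies.

    Assumption: a missing type or media value never disqualifies a source.
    """
    for s in sources:
        if s.type is not None and s.type not in supported_types:
            continue
        if s.media is not None and not media_matches(s.media):
            continue
        return s
    return None


# Placeholder URLs and types, purely for illustration.
sources = [
    TrackSource("a.example", type="text/x-example-a"),
    TrackSource("b.example", type="text/x-example-b", media="screen"),
]
chosen = pick_source(sources, {"text/x-example-b"}, lambda q: q == "screen")
print(chosen.url if chosen else None)
```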
The media resource can also imply certain timed tracks based on data in the media resource.
The script can also add "virtual" timed tracks dynamically.
Markup
<track src="" enabled="true" kind="" label="" lang=""></track>
<track enabled="true" kind="" label="" lang="">
<source src="" type="" media=""> ...
</track>
enabled="" is true or false.
Values for kind="":
- subtitles (includes karaoke) - default
- captions
- description (text audio descriptions)
- chapters
- lyrics
- metadata
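To make the proposed markup concrete, here is a small sketch that reads track elements out of a string using Python's standard html.parser module. It is not how a user agent would process the markup. The TrackParser name and the dictionary layout are invented, and the fallback to "subtitles" for an unrecognised kind is an assumption (the notes only say that subtitles is the default).

```python
from html.parser import HTMLParser

KINDS = {"subtitles", "captions", "description", "chapters", "lyrics", "metadata"}


class TrackParser(HTMLParser):
    """Collect <track> elements and any <source> children they contain."""

    def __init__(self):
        super().__init__()
        self.tracks = []
        self._open = None  # the <track> currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "track":
            kind = a.get("kind") or "subtitles"  # missing or empty: default
            if kind not in KINDS:
                kind = "subtitles"  # assumption: unknown values use the default
            self._open = {
                "kind": kind,
                # None means "no override": fall back to the default state.
                "enabled": {"true": True, "false": False}.get(a.get("enabled")),
                "label": a.get("label"),
                "lang": a.get("lang"),
                "sources": [],
            }
            if a.get("src"):  # single-source form: src on the track itself
                self._open["sources"].append(
                    {"src": a["src"], "type": None, "media": None})
            self.tracks.append(self._open)
        elif tag == "source" and self._open is not None:
            self._open["sources"].append(
                {"src": a.get("src"), "type": a.get("type"), "media": a.get("media")})

    def handle_endtag(self, tag):
        if tag == "track":
            self._open = None


parser = TrackParser()
parser.feed('<track kind="captions" enabled="false" label="English" lang="en">'
            '<source src="captions.example" type="text/x-example"></track>')
print(parser.tracks)
```

The file name and type in the usage lines are placeholders, not real formats.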
Questions:
- Should we use lang="", hreflang="", srclang=""?
- Is there a better solution to enabled=false for disabling tracks by default? Do we ever need to disable a track that might be enabled by default?
Format for visual titles
...
...
DOM API
...
Other minor things
We need to make sure that media playback is paused until all enabled timed tracks are locally available.
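As a sketch of that rule: playback may proceed only when every enabled track has been fetched, while disabled tracks never block. The may_play name and the "enabled"/"loaded" keys are hypothetical.

```python
from typing import Dict, Iterable


def may_play(tracks: Iterable[Dict[str, bool]]) -> bool:
    """True when every enabled track is locally available.

    Disabled tracks are ignored, since they are not downloaded at all.
    """
    return all(t["loaded"] for t in tracks if t["enabled"])


tracks = [
    {"enabled": True, "loaded": True},
    {"enabled": False, "loaded": False},  # disabled: does not hold up playback
]
print(may_play(tracks))  # True
```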
Open issues
How do we handle sign-language tracks?
Do we handle multiple alternate audio tracks through this mechanism, or is that restricted to in-band data (in the media resource)?
Do we need to handle live transcription and streaming titles in external files? If so, how?