Timed tracks

From WHATWG Wiki
Latest revision as of 15:14, 27 September 2014

This document is obsolete.

For the current specification, see: HTML: <track>


This page contains notes for the development of the first version of timed track features in HTML.

See also use cases for timed tracks rendered over video by the UA, use cases for API-level access to timed tracks.

Requirements

Subtitle/Caption/Karaoke File Format

  • per-cue in/out times
    • relative timings would be useful while editing, but may not be necessary in the published format
  • inline time cues for karaoke
  • bidi, newlines, ruby, italics, bold [there's been no evidence provided that there's any need for more fine-grained control at a per-cue level]
  • voice selection (so that e.g. sfx descriptions and each character can be a different colour)
  • per cue vertical position: % of vertical video height (default 100%)
  • per cue horizontal position: % of horizontal video width (default 50%)
  • per cue direction: horizontal/vertical (default horizontal)
  • per cue width/height: % (default is remaining space on line given alignment)
  • per cue text alignment: start/middle/end (default middle for subtitles, start for captions)
  • multiple cues placed in adjacent places (e.g. from different voices or with slightly different times) would need to automatically stack so they don't overlap
    • but should support multiple cues from multiple voices on the same "line", e.g. when two people both utter something at the same time (need an example of this).

(Percentage positions would work like background-position in CSS.)
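As a rough illustration of the background-position analogy, the offset of a cue box along one axis could be computed as below. The function name and the pixel sizes are made up for the example; only the percentage rule comes from CSS.

```typescript
// Hypothetical helper: offset of a cue box along one axis, in pixels.
// As with CSS background-position, P% lines up the point P% across the box
// with the point P% across the video, so the box never leaves the video.
function cueOffset(videoSize: number, boxSize: number, percent: number): number {
  return (videoSize - boxSize) * (percent / 100);
}

// Default vertical position of 100%: the box sits flush with the bottom edge.
const bottomOffset = cueOffset(480, 40, 100); // 440
// Default horizontal position of 50%: the box is centred.
const centredOffset = cueOffset(640, 200, 50); // 220
```

A consequence of this rule is that 0% and 100% always keep the whole box inside the video, which plain "offset from the edge" percentages would not.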

Track-wide formatting

  • color of background/text/outline is needed for readability on different types of video, unless UA default has clear contrasting outlines or an opaque background.
  • webfonts are needed to provide high quality subtitles in some non-Latin languages (e.g. Chinese where a suitable font is unlikely to be available even on Chinese computer systems).
  • providing a pseudo-element to style each voice would likely be sufficient for authors who want overall formatting control (this would also allow user overrides conveniently)

HTML

  • an API and UI for exposing what timed tracks exist and selectively enabling/disabling them
  • format for external subtitles/captions
  • format for external audio descriptions
  • some mechanism for text in the page to be used instead of external files, for subtitles/captions or audio description
  • an API to allow a segment to be dynamically inserted into the rendering on the fly
  • an API for exposing what the currently relevant segments of each timed track are
  • a way to hook into this mechanism to advance slides
  • native rendering of subtitles
  • native rendering of audio descriptions
  • native rendering of multiple audio or video tracks, to allow pre-recorded audio descriptions to be mixed in and sign language video to be overlaid
  • a way to hook into this to manually render timed tracks


Architecture

<img src="http://docs.google.com/drawings/pub?id=1GR6Pzq0GY2n1sx_ZjDfuICM2LnXxLVxzvyl4kuQy-48&w=640&h=480">

The caption format should be reasonable for either a web engine or a media engine to render, since implementation strategies may differ.

Declaring timed tracks

Each timed track is either:

  • enabled, in which case it is downloaded, triggers events, and if appropriate is rendered by the user agent; or
  • disabled, in which case it does nothing

The enabled/disabled state is by default based on user preferences and the kind of timed track as described below, but can be overridden on a per-track basis.

Each timed track has a kind which is one of:

  • for visual display (subtitles, captions, translations), enabled based on user preferences, shows in video playback area
  • for audio playback (text audio descriptions), enabled based on user preferences, renders as audio
  • for navigation (chapter titles), enabled by default, shows in UA UI
  • for off-video display (lyrics), disabled by default in this version, not shown by UA
  • for metadata (slide timings, annotation data for app-rendered annotations), enabled by default, not shown by UA

Tracks that are for visual display or audio playback additionally have a user-facing label and a language.

Tracks that are for visual display have an additional boolean indicating if they include sound effects and speaker identification (intended for the deaf, hard of hearing, or people with sound muted) or not (i.e. translations intended for people with audio enabled but who cannot understand the language, or karaoke lyrics).
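The default enabled/disabled rules above might be sketched as follows. The purpose names and the two preference flags are invented for this example, and the optional override stands in for whatever per-track mechanism ends up being used.

```typescript
// Hypothetical names for the five purposes listed above.
type TrackPurpose = "visual" | "audio" | "navigation" | "off-video" | "metadata";

// Hypothetical user preference flags; the notes do not define their shape.
interface UserPrefs {
  wantsVisualTitles: boolean;
  wantsAudioDescriptions: boolean;
}

// Returns whether a track is enabled, honouring a per-track override if given.
function isTrackEnabled(purpose: TrackPurpose, prefs: UserPrefs, override?: boolean): boolean {
  if (override !== undefined) return override;
  if (purpose === "visual") return prefs.wantsVisualTitles;
  if (purpose === "audio") return prefs.wantsAudioDescriptions;
  // Navigation and metadata are on by default; off-video display is off.
  return purpose === "navigation" || purpose === "metadata";
}
```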

Each timed track associated with a media resource, like the media resource itself, can have multiple sources.

Each source for a timed track has:

  • URL
  • type (if there are multiple sources)
  • media
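One plausible way to choose among several sources is to take the first one whose type can be parsed and whose media query matches, mirroring how media resources pick a source. This is only a sketch: the two callbacks are assumptions standing in for the UA's format support and media query evaluation.

```typescript
interface TrackSource {
  url: string;
  type?: string;  // optional, only needed when there are multiple sources
  media?: string; // optional media query
}

// Returns the first usable source, or undefined if none qualifies.
function pickSource(
  sources: TrackSource[],
  canParse: (type: string) => boolean,
  mediaMatches: (query: string) => boolean
): TrackSource | undefined {
  for (const source of sources) {
    const typeOk = source.type === undefined || canParse(source.type);
    const mediaOk = source.media === undefined || mediaMatches(source.media);
    if (typeOk && mediaOk) return source;
  }
  return undefined;
}
```

In a browser, the media callback could plausibly be backed by window.matchMedia(query).matches.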

The media resource can also imply certain timed tracks based on data in the media resource.

The script can also add "virtual" timed tracks dynamically.

Markup

 <track src="" enabled="true" kind="" label="" lang=""></track>
 
 <track enabled="true" kind="" label="" lang="">
  <source src="" type="" media="">
  ...
 </track>

enabled="" is true or false.

Values for kind="":

  • subtitles (includes karaoke) - default
  • captions
  • description (text audio descriptions)
  • chapters
  • lyrics
  • metadata
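Reading the kind="" attribute with its default could look like the sketch below. Falling back to "subtitles" for unrecognised values, not just for a missing attribute, is an assumption; the notes only say that subtitles is the default.

```typescript
// The six values listed above, in the same order.
const SKETCH_TRACK_KINDS = [
  "subtitles", "captions", "description", "chapters", "lyrics", "metadata"
] as const;
type SketchTrackKind = typeof SKETCH_TRACK_KINDS[number];

// Maps an attribute value (null when absent) to a track kind.
function parseKind(attr: string | null): SketchTrackKind {
  for (const kind of SKETCH_TRACK_KINDS) {
    if (kind === attr) return kind;
  }
  return "subtitles"; // default; assumed to also cover unknown values
}
```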

Questions:

  • Should we use lang="", hreflang="", srclang=""?
  • Is there a better solution to enabled=false for disabling tracks by default? Do we ever need to disable a track that might be enabled by default?

Visual titles

File format

Based on studying a broad range of Timed track formats, there does not appear to be a format that is easy to read and write, supports automatic positioning to avoid overlapping titles while still supporting some level of positioning control, supports temporally-overlapping titles, uses video-independent positioning instead of pixel-based (for visual) or frame-based (for temporal) positioning, and supports some inline structure for ruby, italics, bold, and karaoke.

The two formats that are the cleanest in terms of existing syntax, that are a subset of the above feature set, and that can be extended relatively cleanly in a backwards-compatible way are the FAB subtitler format and the SRT format. The former, however, lacks much documentation. The latter appears to be better known.

Proposal: http://damowmow.com/temp/srtspec
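For reference, the timing line of a plain SubRip cue (for example "00:00:20,000 --> 00:00:24,400") can be parsed roughly as below. This sketch covers existing SRT only and makes no claim about the syntax in the proposal linked above; the function names are made up.

```typescript
// Parses "HH:MM:SS,mmm" into seconds; throws on anything else.
function parseSrtTimestamp(text: string): number {
  const m = /^(\d{2,}):(\d{2}):(\d{2}),(\d{3})$/.exec(text.trim());
  if (m === null) throw new Error("bad timestamp: " + text);
  return Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000;
}

// Parses "start --> end" into per-cue in/out times in seconds.
function parseSrtTimingLine(line: string): { startTime: number; endTime: number } {
  const parts = line.split("-->");
  if (parts.length !== 2) throw new Error("bad timing line: " + line);
  const [startText = "", endText = ""] = parts;
  return { startTime: parseSrtTimestamp(startText), endTime: parseSrtTimestamp(endText) };
}
```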

CSS extensions

Cues are rendered as block boxes with inline boxes. Cues have a voice (identified by a keyword or a number). Cues can have a part that is before the current time and a part after the current time.

The block box is matched by the pseudo-element ::cue on the media element (<video>). Only visible cues are matched (those on tracks enabled and shown by the UA whose start/end time range contains the current time). The ::cue pseudo takes an optional argument that is the voice of cues that it is to match. The keyword "*", matching all voices, is assumed if the argument is absent.

video::cue { color: white; background: rgba(0,0,0,0.5); font: 900 sans-serif; text-transform: uppercase; }
video::cue(narrator) { color: white; font-style: italic; }
video::cue(1) { color: yellow; }
video::cue(2) { color: lime; }

The ::cue pseudo when given _two_ arguments matches all innermost inline boxes in the cue of the element that match its second argument. Its first argument is a voice; the keyword "*" matches all voices. Its second argument is one of "i", "b", "ruby", "rt" (matches inline boxes immediately inside one of those annotations), "before", "after" (fragments before/after the current time).

video::cue(*, i) { font-style: italic; }
video::cue(narrator, i) { font-weight: bold; }
video::cue(*, b) { font-size: larger; }
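The one-argument matching rule might be modelled like this. The cue shape is invented for the example, and treating the end time as exclusive is an assumption the notes do not settle.

```typescript
// Hypothetical cue shape, with just the fields the selector needs.
interface StyledCue {
  voice: string;
  startTime: number;
  endTime: number;
  trackShowing: boolean; // track is enabled and shown by the UA
}

// True if ::cue(voiceArg) would match this cue at the given time.
// An absent argument behaves like "*", which matches all voices.
function cueSelectorMatches(cue: StyledCue, currentTime: number, voiceArg: string = "*"): boolean {
  const visible = cue.trackShowing
    && cue.startTime <= currentTime
    && currentTime < cue.endTime; // end treated as exclusive (assumption)
  return visible && (voiceArg === "*" || voiceArg === cue.voice);
}
```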

[This isn't great. Any better ideas?]

DOM API

HTMLMediaElement
 attribute MediaTrack[] tracks;
 MutableMediaTrack addTrack(label, kind, language);
MediaTrack
 readonly attribute DOMString label;
 readonly attribute DOMString kind; // subtitles, captions, descriptions, chapters, lyrics, metadata
 readonly attribute DOMString language;
 readonly attribute unsigned short mode;
   const unsigned short TRACK_OFF = 0; // not firing events, may not even be downloaded yet
   const unsigned short TRACK_HIDDEN = 1; // firing events but otherwise ignored by UA - intended for scripts
   const unsigned short TRACK_SHOWING = 2; // browser is handling it
 readonly attribute MediaCue[] cues; // sorted in startTime order
 readonly attribute MediaCue[] activeCues; // sorted in endTime order?
 readonly attribute Function onentercue; // fires CueEvent
 readonly attribute Function onexitcue; // fires CueEvent
MutableMediaTrack: MediaTrack
 void addCue(cue); // throws if cue.track != null
 void removeCue(cue); // throws if cue isn't in this track
MediaCue
 readonly attribute MediaTrack track; // null if newly created and not yet added to a track
 readonly attribute DOMString id; // empty string if not applicable
 readonly attribute float startTime;
 readonly attribute float endTime;
 readonly attribute boolean snapToLines;
 readonly attribute long linePosition;
 readonly attribute long textPosition;
 readonly attribute long size;
 readonly attribute DOMString direction; // horizontal, vertical
 readonly attribute DOMString alignment; // start, middle, end
 readonly attribute DOMString voice; // for styling purposes
 DOMString getCueAsSource(); // returns the cue as it was expressed in the file (for XML formats, this reserializes, expressing all namespaces appropriately)
 DocumentFragment getCueAsHTML(); // returns a copy of the cue as HTML, with the current position in the case of karaoke lyrics annotated using a ProcessingInstruction or some such; throws if the format doesn't define a conversion to HTML
 Constructor for MediaCue: new MediaCue(id, startTime, endTime, settings, text); // settings and text get parsed like the cues in the main format, whatever that ends up being
CueEvent
 readonly attribute MediaCue cue;
HTMLTrackElement
 readonly attribute MediaTrack track;
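To make the cues/activeCues/onentercue/onexitcue relationship concrete, here is a small model of a track. It is not the proposed interface: the class name and the update() method are invented, addCue() skips the cue.track check, and activeCues is simply left in startTime order since the intended order is still an open question above.

```typescript
interface SketchCue { id: string; startTime: number; endTime: number; }

class SketchTrack {
  readonly cues: SketchCue[] = [];
  activeCues: SketchCue[] = [];
  onentercue: ((cue: SketchCue) => void) | null = null;
  onexitcue: ((cue: SketchCue) => void) | null = null;

  addCue(cue: SketchCue): void {
    this.cues.push(cue);
    this.cues.sort((a, b) => a.startTime - b.startTime); // keep startTime order
  }

  // Hypothetical hook: called whenever the playback position changes.
  update(currentTime: number): void {
    const nowActive = this.cues.filter(
      c => c.startTime <= currentTime && currentTime < c.endTime);
    for (const cue of this.activeCues) {
      if (nowActive.indexOf(cue) === -1 && this.onexitcue) this.onexitcue(cue);
    }
    for (const cue of nowActive) {
      if (this.activeCues.indexOf(cue) === -1 && this.onentercue) this.onentercue(cue);
    }
    this.activeCues = nowActive;
  }
}
```

A script advancing slides would set onentercue on a metadata track and ignore rendering entirely, which is the TRACK_HIDDEN case.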

Other minor things

We need to make sure that media playback is paused until all enabled timed tracks are locally available.
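A minimal sketch of that check, assuming a per-track load state that the notes do not define; letting a failed track stop blocking playback is also an assumption.

```typescript
type SketchLoadState = "loading" | "loaded" | "failed";

interface PendingTrack {
  enabled: boolean;
  state: SketchLoadState;
}

// True while playback should stay paused waiting for enabled tracks.
function mustStallPlayback(tracks: PendingTrack[]): boolean {
  // Disabled tracks never block; failed tracks are assumed not to block either.
  return tracks.some(t => t.enabled && t.state === "loading");
}
```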

We need to block cross-origin tracks (eventually blocking only those that aren't CORS-enabled).
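The two stages could be expressed as a single check, sketched here with the standard URL class. The function name and the corsApproved flag are hypothetical; the flag stands in for the result of a CORS check that the UA would perform.

```typescript
// True if a track at trackUrl may be loaded by a page at documentUrl.
// With corsApproved left false this is the initial "block all cross-origin"
// rule; passing true models the later CORS-enabled exception.
function mayLoadTrack(documentUrl: string, trackUrl: string, corsApproved: boolean = false): boolean {
  const trackOrigin = new URL(trackUrl, documentUrl).origin; // resolves relative URLs
  const sameOrigin = trackOrigin === new URL(documentUrl).origin;
  return sameOrigin || corsApproved;
}
```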

Open issues

Synchronised media

For now, sign-language and alternate or additive audio tracks (e.g. audio description tracks) have to be in-band, because UA vendors are refusing to implement synchronisation of external media tracks for now.

However, we should bear it in mind. Adding that kind of thing to the API is going to be non-trivial. The simplest way is probably just to require that authors use multiple <video>/<audio> elements and that we link them somehow, with one designated as the "sync clock" that all the others sync to, rather than having each <video> element expose multiple "buffered", "seekable", etc.
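The "sync clock" idea might amount to something like the following, run periodically. Everything here is hypothetical, including the function name and the drift tolerance; the only thing borrowed from media elements is the currentTime property.

```typescript
// Anything with a settable playback position, e.g. a media element.
interface HasCurrentTime { currentTime: number; }

// Snaps every follower that has drifted beyond the tolerance back to the
// clock's position. Returns how many followers were corrected.
function resyncFollowers(
  clock: HasCurrentTime,
  followers: HasCurrentTime[],
  toleranceSeconds: number = 0.05
): number {
  let corrected = 0;
  for (const follower of followers) {
    if (Math.abs(follower.currentTime - clock.currentTime) > toleranceSeconds) {
      follower.currentTime = clock.currentTime;
      corrected++;
    }
  }
  return corrected;
}
```

Seeking the followers rather than adjusting their playback rate is the crudest option; it is shown only because it is the shortest to write down.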

Streaming

Do we need to handle live transcription and streaming titles in external files? If so, how?

For now, it's not clear if there are any use cases for streaming external timed track resources.

Web-based radio might benefit from serving a live audio stream with song title and other details like an artist URL, but it's not clear that this needs to be a timed track (it could be a WebSocket or EventSource feed).


Specification approach

  1. Add <track> element
  2. Add concept of a media element's timed tracks list
  3. Add algorithms to update the timed tracks list (based on <track> elements and based on the media resource)
  4. Add a section defining WebSRT (backronymed to Web Subtitle Resource Tracks?); acknowledge SubRip in a history section - contact zuggy
    • file format — authoring requirements
    • internal data model
    • file format — downloading requirements, CORS, etc
    • file format — parsing requirements
    • processing model — rendering rules
    See also SRT research about compatibility with existing parsers.
  5. Define processing model for active timed tracks — events, display, etc
  6. Add requirements to pause playback while active tracks load
  7. Add DOM API
  8. Add CSS extensions — propose them to CSSWG