There appears to be consensus among the WCAG Samurai and the current WGAC 2.0 draft that the primary ways of making video with soundtrack accessible is to provide captioning for the deaf and audio description for the blind. (The WCAG 2.0 draft also mentions full-text alternative as an alternative to audio description.)
Presumably, the captioning and audio description need to be “closed” (off-by-default, available on request) as content providers might hesitate presenting captions to those who do hear the soundtrack or audio description to those who already see the video track.
Technically this is timed text presented in sync with the video track.
It is assumed to be in the same language as the main soundtrack. Content-wise, it is expected to mention semantically important non-verbal sounds and identify speakers when the video doesn't make it clear who is talking.
In terms of player app decisions, this track shouldn't be presented by default but it should play if the user has opted in (perhaps via a permanent setting) to showing captioning. Also, if the player app knows that audio output has been turned off either in the app or in the OS, it might make sense to turn on captioning in that case as well.
Video Format Support
CMML has been put forward as the timed text format for Ogg. (How to mark as closed captions?)
3GPP Timed Text aka. MPEG-4 part 17 is the timed text format for MP4. (How to mark as closed captions?)
Closed Audio Description
Technically this is a second sound track presented in sync with the main sound track.
It is assumed to be in the same language as the main soundtrack.
In terms of player app decisions, this track shouldn't be presented by default but it should play if the user has opted in (perhaps via a permanent setting) to playing audio descriptions. Also, if the player app knows that a screen reader is in use, it might make sense to use that as a cue of turning on audio descriptions.
Video Format Support
How to flag a second sound track (Speex?) as closed audio description?
How to flag a second sound track as closed audio description?
Data Placement in the Web Context
Should the above-mentioned tracks be muxed into the main video file (Pro: all tracks travel together; Con: off-by-default tracks take bandwidth)? Or should they be separate HTTP resources (Pro: bandwidth optimization; Con: Web-specific content assembly from many files may not survive downloading to disk, etc.)
Related Non-Accessibility Features
There are technically similar non-accessibility (i.e. not related to addressing needs arising from a disability) features related to translation.
A site in language A might want to embed a video with the soundtrack in language B but subtitles in language A. For example, a Finnish-language site embedding an English-language video would want to have Finnish subtitles. Unlike captions, these subtitles should be on by default and being able to suppress the subtitles is considered an additional nice-to-have feature.
There are also same-language subtitles (e.g. French subtitles with French-language soundtrack) for language learners. Unlike captions, same-language subtitles don't inform the reader about non-verbal sounds or identify speakers.
Subtitles need different track metadata so that they can be displayed by default. (Due to concerns about the reliability of subtitling technology, many content providers probably opt to burn the subtitles into the video track as part of the image data, even though this disturbs video compression.)
Alternative Dubbed Sound Tracks
Due to bandwidth concerns, Web content providers will probably opt to provide separate video files for dubbed languages.