CDATA Escapes

From WHATWG Wiki
Revision as of 20:05, 30 September 2009 by Zcorpan (→Proposal #3)

Requirements

Hard Requirements

  • It must be possible to have the string "</script>" in a string literal in inline JavaScript without having to use JS-level escapes. (This possibility may be limited to scripts that use the <!-- ... --> "Hide from old browsers" pattern.)
  • It must be possible to have "<!--" and "-->" in string literals in inline JavaScript without having to use JS-level escapes.
  • Must not rewind and reparse with different rules.

Medium Requirements

  • It should be possible to have the string <!-- in xmp without having the rest of the page eaten up into the xmp element.
  • It should be possible to have <!-- near the start of a script or style element without a matching -->, and the trailing part of the page still shouldn't get eaten up into the script or style element.
  • Pages authored naively for HTML5-parsing-enabled UAs shouldn't be XSS risks in legacy UAs.
  • When the author uses comment-like syntax in the fallback markup in iframe, noembed or noframes, the comment-like syntax should span the same character run that it would if it were parsed as markup.

Nice to Have Requirements

  • It would be nice for the rest of the page not to get eaten up when the author omits </title> accidentally or mistypes it as <title>.

Proposal #3

  • Maybe remove <!-- ... --> escapes from style, title, textarea and xmp. (I think only script needs it.)
    • Need to investigate whether this helps or hurts pages.
  • For script, when in an escaped text span, set a flag after having seen "<script" followed by whitespace or slash or greater-than. "</script" followed by whitespace or slash or greater-than only closes the element if the flag is not set, and otherwise emits the text and resets the flag.
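The flag rule for script can be sketched as follows. This is a minimal illustration, not a tokenizer: the function name proposal3_script_end is hypothetical, and the sketch assumes that tag names match case-insensitively, that "whitespace" means tab, LF, FF, CR or space, and that "-->" ends the escaped text span and clears the flag. None of these details are spelled out in the proposal.

```python
import re

# Tokens of interest. "<script" and "</script" only count when followed by
# whitespace, "/" or ">", as the proposal says. Case-insensitivity is assumed.
_TOKEN = re.compile(r"<!--|-->|</?script(?=[ \t\n\f\r/>])", re.IGNORECASE)


def proposal3_script_end(text: str) -> int:
    """Return the index of the "</script" that closes the element, or -1.

    `text` is the content that follows the script start tag.
    """
    escaped = False      # inside a <!-- ... --> escaped text span
    seen_script = False  # the flag described in the proposal
    for m in _TOKEN.finditer(text):
        tok = m.group().lower()
        if tok == "<!--":
            escaped = True
        elif tok == "-->":
            # Assumption: leaving the escaped span also clears the flag.
            escaped = False
            seen_script = False
        elif tok == "<script":
            if escaped:
                seen_script = True
        else:  # "</script"
            if escaped and seen_script:
                # Emitted as text; the flag is reset.
                seen_script = False
            else:
                return m.start()
    return -1
```

With content such as a document.write("<script src=x></script>") call wrapped in <!-- ... //-->, the inner "</script" is treated as text because the flag was set by the preceding "<script", and the element is closed by the first "</script" after the "-->".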

Proposal #2 (FAIL)

  • Remove <!-- ... --> escapes from title, textarea and xmp.
  • For script and style, limit when <!-- takes escaping effect so that it only takes escaping effect if there has been either nothing or only whitespace on the same line before it.
    • Or: limit when <!-- takes escaping effect so that it only takes escaping effect if there has been either nothing or only whitespace before it in the element.

This proposal fails for the pages listed at http://philip.html5.org/data/pages-with-unclosed-scripts-and-comment-stuff.txt because they use <!-- in the usual way but have no matching -->.
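The two variants of the "<!--" condition in Proposal #2 can be sketched as a single check. The function name escape_takes_effect and its whole_element parameter are hypothetical, and the sketch assumes that line breaks have already been normalized to LF.

```python
HTML_WHITESPACE = " \t\n\f\r"


def escape_takes_effect(content: str, pos: int, whole_element: bool = False) -> bool:
    """Say whether a "<!--" at content[pos] would start an escape.

    content        text inside the script or style element
    whole_element  False: only whitespace may precede "<!--" on its line
                   True:  only whitespace may precede "<!--" in the element
    """
    if not content.startswith("<!--", pos):
        return False
    if whole_element:
        start = 0
    else:
        # rfind returns -1 when there is no earlier line feed, so start is 0.
        start = content.rfind("\n", 0, pos) + 1
    return content[start:pos].strip(HTML_WHITESPACE) == ""
```

A "<!--" inside a string literal such as var s = "<!--"; is rejected by both variants because of the code before it on the line. The check says nothing about whether a matching "-->" ever follows, which is the case that makes the proposal fail.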

Proposal #1 (FAIL)

  • Remove <!-- ... --> escapes from title, textarea and xmp.
  • Make the closing condition for <!-- ... --> in iframe, noembed and noframes match the comment closing conditions exactly.
  • Remove <!-- ... --> escapes from script and style and introduce a novel string literal detector heuristic.

The heuristic below fails because, according to Philip, lone quotes appear in the wild inside regexp literals on the same line as the closing </script>.

String Literal Detector Heuristic (FAIL)

CDATA

  • < → TAG_OPEN_NON_PCDATA with CDATA as return state
  • / → CDATA_SLASH
  • " → CDATA_DOUBLE_QUOTED
  • ' → CDATA_SINGLE_QUOTED
  • Anything else → Stay

CDATA_SLASH

  • < → TAG_OPEN_NON_PCDATA with CDATA as return state
  • / → CDATA_LINE_COMMENT
  • * → CDATA_COMMENT
  • Anything else → CDATA

CDATA_DOUBLE_QUOTED

  • " or Line feed → CDATA
  • \ → CDATA_DOUBLE_QUOTED_BACKSLASH
  • Anything else → Stay

CDATA_SINGLE_QUOTED

  • ' or Line feed → CDATA
  • \ → CDATA_SINGLE_QUOTED_BACKSLASH
  • Anything else → Stay

CDATA_DOUBLE_QUOTED_BACKSLASH

  • Line feed → CDATA
  • Anything else → CDATA_DOUBLE_QUOTED

CDATA_SINGLE_QUOTED_BACKSLASH

  • Line feed → CDATA
  • Anything else → CDATA_SINGLE_QUOTED

CDATA_LINE_COMMENT

  • Line feed → CDATA
  • < → TAG_OPEN_NON_PCDATA with CDATA_LINE_COMMENT as return state
  • Anything else → Stay

CDATA_COMMENT

  • * → CDATA_COMMENT_ASTERISK
  • < → TAG_OPEN_NON_PCDATA with CDATA_COMMENT as return state
  • Anything else → Stay

CDATA_COMMENT_ASTERISK

  • / → CDATA
  • < → TAG_OPEN_NON_PCDATA with CDATA_COMMENT as return state
  • Anything else → CDATA_COMMENT

As Image

[Image: the transitions above drawn as a state diagram.]

Heuristic Design Notes

The heuristic doesn't attempt to detect single or double quotes appearing inside a regular expression literal, because detecting regular expression literals properly is complicated and it's unlikely that </script> would appear in a string literal on the same line as such a regular expression. (I'm assuming that inline JS minified with a minifier that can't escape </script> is a non-issue. This assumption was very bad and makes the whole thing fail.)
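The failure can be reproduced with a small transcription of the transition table. This is a sketch under stated assumptions: TAG_OPEN_NON_PCDATA is not defined on this page, so the code substitutes a hypothetical stand-in that closes the element when "</script" followed by whitespace, "/" or ">" starts at the "<", and otherwise consumes only the "<" and goes to the return state. The function name heuristic_script_end is also hypothetical.

```python
import re

LF = "\n"
ELSE = None  # key for the "Anything else" row

# Transition table, minus the "<" rows (those are in TAG_OPEN_RETURN below).
TRANSITIONS = {
    "CDATA": {"/": "CDATA_SLASH", '"': "CDATA_DOUBLE_QUOTED",
              "'": "CDATA_SINGLE_QUOTED", ELSE: "CDATA"},
    "CDATA_SLASH": {"/": "CDATA_LINE_COMMENT", "*": "CDATA_COMMENT",
                    ELSE: "CDATA"},
    "CDATA_DOUBLE_QUOTED": {'"': "CDATA", LF: "CDATA",
                            "\\": "CDATA_DOUBLE_QUOTED_BACKSLASH",
                            ELSE: "CDATA_DOUBLE_QUOTED"},
    "CDATA_SINGLE_QUOTED": {"'": "CDATA", LF: "CDATA",
                            "\\": "CDATA_SINGLE_QUOTED_BACKSLASH",
                            ELSE: "CDATA_SINGLE_QUOTED"},
    "CDATA_DOUBLE_QUOTED_BACKSLASH": {LF: "CDATA", ELSE: "CDATA_DOUBLE_QUOTED"},
    "CDATA_SINGLE_QUOTED_BACKSLASH": {LF: "CDATA", ELSE: "CDATA_SINGLE_QUOTED"},
    "CDATA_LINE_COMMENT": {LF: "CDATA", ELSE: "CDATA_LINE_COMMENT"},
    "CDATA_COMMENT": {"*": "CDATA_COMMENT_ASTERISK", ELSE: "CDATA_COMMENT"},
    "CDATA_COMMENT_ASTERISK": {"/": "CDATA", ELSE: "CDATA_COMMENT"},
}

# States in which "<" enters TAG_OPEN_NON_PCDATA, mapped to the return state.
TAG_OPEN_RETURN = {
    "CDATA": "CDATA",
    "CDATA_SLASH": "CDATA",
    "CDATA_LINE_COMMENT": "CDATA_LINE_COMMENT",
    "CDATA_COMMENT": "CDATA_COMMENT",
    "CDATA_COMMENT_ASTERISK": "CDATA_COMMENT",
}

# Assumed end-tag test used by the TAG_OPEN_NON_PCDATA stand-in.
_END_TAG = re.compile(r"</script[ \t\n\f\r/>]", re.IGNORECASE)


def heuristic_script_end(text: str) -> int:
    """Return the index of the "</script" that closes the element, or -1."""
    state = "CDATA"
    for i, ch in enumerate(text):
        if ch == "<" and state in TAG_OPEN_RETURN:
            if _END_TAG.match(text, i):
                return i
            state = TAG_OPEN_RETURN[state]
        else:
            row = TRANSITIONS[state]
            state = row.get(ch, row[ELSE])
    return -1


good = 'var s = "</script>";</script>'
bad = "var ok = /don't/.test(s);</script>"
print(heuristic_script_end(good))  # index of the second "</script"
print(heuristic_script_end(bad))   # -1: the real end tag is never seen
```

In the second input the lone quote inside the regexp literal puts the machine into CDATA_SINGLE_QUOTED, and with no line feed before the end tag the "<" is treated as string content. Putting a line feed before "</script>" resets the state to CDATA and the end tag is found again, which matches the "same line" condition described above.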