CDATA Escapes
Requirements
Hard Requirements
- It must be possible to have the string "</script>" in a string literal in inline JavaScript without having to use JS-level escapes. (This possibility may be limited to scripts that use the <!-- ... --> "Hide from old browsers" pattern.)
- It must be possible to have "<!--" and "-->" in string literals in inline JavaScript without having to use JS-level escapes.
- Must not rewind and reparse with different rules.
Medium Requirements
- It should be possible to have the string <!-- in xmp without having the rest of the page eaten up into the xmp element.
- It should be possible to have <!-- near the start of a script or style element without a matching -->, and the trailing part of the page still shouldn't get eaten up into the script or style element.
- Pages authored naively for HTML5-parsing-enabled UAs shouldn't be XSS risks in legacy UAs.
- When the author uses comment-like syntax in the fallback markup in iframe, noembed or noframes, the comment-like syntax should span the same character run that it would if it were parsed as markup.
Nice to Have Requirements
- It would be nice for the rest of the page not to get eaten up when the author omits </title> accidentally or mistypes it as <title>.
Proposal #4 (FAIL)
- For script, let document.write\s*( and document.writeln\s*( start an escaped text span, and let ) end it. Inside the escaped text span, </script does not close the element.
This proposal would break for
<script><!-- var s = '<script></script>'; document.write(s); //--></script>
This proposal would also break for
<script><!-- var w = document.write; w("<script></script>"); //--></script>
Without researching, I suspect that the two patterns above are very common.
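To make the two breakages concrete, here is a minimal sketch of the Proposal #4 rule. It assumes a simplified scanner that looks only at the text after the opening <script> tag; the name script_end_p4 and the regexes are hypothetical, not taken from any spec text.

```python
import re

# Hypothetical simplification of Proposal #4: an escaped text span starts at
# "document.write(" or "document.writeln(" and ends at the next ")".
SPAN_START = re.compile(r"document\.write(?:ln)?\s*\(")
END_TAG = re.compile(r"</script", re.IGNORECASE)


def script_end_p4(content: str):
    """Return the index of the "</script" that closes the element, or None.

    `content` is the text that follows the opening <script> tag.
    """
    pos = 0
    while True:
        end = END_TAG.search(content, pos)
        if end is None:
            return None
        start = SPAN_START.search(content, pos)
        if start is None or start.start() > end.start():
            return end.start()      # end tag seen outside any escaped span
        close = content.find(")", start.end())
        if close == -1:
            return None             # span never ends; rest of page is eaten
        pos = close + 1             # skip the whole escaped span
```

With this sketch, a string literal passed directly to document.write( is protected, but in both patterns above the "</script" inside the string literal is found outside any escaped span, so the element closes too early.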
Proposal #3
- Maybe remove <!-- ... --> escapes from style, title, textarea and xmp. (I think only script needs it.)
- Need to investigate whether this helps or hurts pages. Done.
- For script, when in an escaped text span, set a flag after having seen "<script". "</script" followed by whitespace or slash or greater-than only closes the element if the flag is not set, and otherwise emits the text and resets the flag. Exiting an escaped text span also resets the flag.
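The flag rule for script can be sketched roughly as follows. This is a simplification under stated assumptions: it works on the text after the opening <script> tag, treats any "<!--" as starting an escaped text span and any "-->" as ending it, and ignores edge cases such as "<!-->". The name script_end_p3 is hypothetical.

```python
import re

# "<script" sets the flag; "</script" only counts when followed by
# whitespace, "/" or ">".
TOKEN = re.compile(r"<!--|-->|<script|</script(?=[\s/>])", re.IGNORECASE)


def script_end_p3(content: str):
    """Return the index of the closing "</script", or None if never closed."""
    escaped = False      # inside a <!-- ... --> escaped text span
    seen_open = False    # the flag: "<script" was seen inside the span
    for m in TOKEN.finditer(content):
        tok = m.group().lower()
        if tok == "<!--":
            escaped = True
        elif tok == "-->":
            escaped = False
            seen_open = False        # leaving the span resets the flag
        elif tok == "<script":
            if escaped:
                seen_open = True
        elif seen_open:
            seen_open = False        # emit "</script" as text, reset the flag
        else:
            return m.start()         # this end tag closes the element
    return None
```

For a balanced literal such as <!-- document.write('<script></script>'); //--></script>, the sketch returns the index of the final "</script", because the one inside the string is paired with the preceding "<script" and emitted as text.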
This proposal would fail for
<script><!-- document.write('<scr'+'ipt></script>'); //--></script>
http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt shows that there are a few pages doing the above, e.g.
- www.grandparents.com/gp/content/expert-advice/family-matters/article/thatevildaughterinlaw.html
- www.celebrity-link.com/c106/showcelebrity_categoryid-10687.html
- me.yaplog.jp/viewBoard.blog?boardId=975
This proposal would also fail for
<script> <!-- document.write('<script></scr'+'ipt>'); </script>
This pattern also breaks with what is currently specced, though.
http://philip.html5.org/data/script-open-in-escape.txt shows that there is one site doing the above:
- www.jeuxactu.com/images-fiche-soul-calibur-legends-8219-4-6.html
Proposal adopted
http://html5.org/tools/web-apps-tracker?from=4177&to=4178
Matches requirements?
Let's see how well this proposal matches the stated requirements:
- It must be possible to have the string "</script>" in a string literal in inline JavaScript without having to use JS-level escapes. (This possibility may be limited to scripts that use the <!-- ... --> "Hide from old browsers" pattern.)
- Satisfied, provided that it's preceded by <!-- and a matching <script>.
- It must be possible to have "<!--" and "-->" in string literals in inline JavaScript without having to use JS-level escapes.
- Satisfied, provided that you don't also use a later "<script>" without a later "</script>" or "-->".
- Must not rewind and reparse with different rules.
- Satisfied.
- It should be possible to have the string <!-- in xmp without having the rest of the page eaten up into the xmp element.
- Satisfied.
- It should be possible to have <!-- near the start of a script or style element without a matching -->, and the trailing part of the page still shouldn't get eaten up into the script or style element.
- Satisfied.
- Pages authored naively for HTML5-parsing-enabled UAs shouldn't be XSS risks in legacy UAs.
- ?
- When the author uses comment-like syntax in the fallback markup in iframe, noembed or noframes, the comment-like syntax should span the same character run that it would if it were parsed as markup.
- Not satisfied. However, it seems it doesn't affect any pages either way.
- It would be nice for the rest of the page not to get eaten up when the author omits </title> accidentally or mistypes it as <title>.
- Not satisfied.
Proposal #2 (FAIL)
- Remove <!-- ... --> escapes from title, textarea and xmp.
- For script and style, limit when <!-- takes escaping effect so that it only takes escaping effect if there has been either nothing or only whitespace on the same line before it.
- Or: limit when <!-- takes escaping effect so that it only takes escaping effect if there has been either nothing or only whitespace before it in the element.
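A rough sketch of the first variant (same-line restriction) for script follows. It assumes, as my reading rather than something stated here, that inside an escaped span "</script" never closes the element and that "-->" ends the span. "Line" is measured within the text after the opening <script> tag, and the name script_end_p2 is hypothetical.

```python
import re

TOKEN = re.compile(r"<!--|-->|</script(?=[\s/>])", re.IGNORECASE)


def script_end_p2(content: str):
    """Return the index of the closing "</script", or None if never closed."""
    escaped = False
    for m in TOKEN.finditer(content):
        tok = m.group().lower()
        if tok == "<!--":
            # Only escape if nothing but whitespace precedes it on its line.
            line_start = content.rfind("\n", 0, m.start()) + 1
            if content[line_start:m.start()].strip() == "":
                escaped = True
        elif tok == "-->":
            escaped = False
        elif not escaped:
            return m.start()         # "</script" outside an escaped span
    return None
```

Under these assumptions a "<!--" in the middle of a line, for example inside a string literal, no longer swallows the end tag, but a "<!--" at the start of a line with no later "-->" still eats the rest of the page.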
This proposal fails for the pages listed at http://philip.html5.org/data/pages-with-unclosed-scripts-and-comment-stuff.txt because they use <!-- as normal but have no matching -->.
http://philip.html5.org/data/script-close-in-escape-without-script-open-2.txt has more examples of pages that break with this proposal.
Proposal #1 (FAIL)
- Remove <!-- ... --> escapes from title, textarea and xmp.
- Make the closing condition for <!-- ... --> in iframe, noembed and noframes match the comment closing conditions exactly.
- Remove <!-- ... --> escapes from script and style and introduce a novel string literal detector heuristic.
The heuristic below is a failure because, according to Philip, lone quotes appear in the wild inside regexp literals on the same line as the closing </script>.
String Literal Detector Heuristic (FAIL)
CDATA
- <: TAG_OPEN_NON_PCDATA with CDATA as return state
- /: CDATA_SLASH
- ": CDATA_DOUBLE_QUOTED
- ': CDATA_SINGLE_QUOTED
- Anything else: Stay
CDATA_SLASH
- <: TAG_OPEN_NON_PCDATA with CDATA as return state
- /: CDATA_LINE_COMMENT
- *: CDATA_COMMENT
- Anything else: CDATA
CDATA_DOUBLE_QUOTED
- " or Line feed: CDATA
- \: CDATA_DOUBLE_QUOTED_BACKSLASH
- Anything else: Stay
CDATA_SINGLE_QUOTED
- ' or Line feed: CDATA
- \: CDATA_SINGLE_QUOTED_BACKSLASH
- Anything else: Stay
CDATA_DOUBLE_QUOTED_BACKSLASH
- Line feed: CDATA
- Anything else: CDATA_DOUBLE_QUOTED
CDATA_SINGLE_QUOTED_BACKSLASH
- Line feed: CDATA
- Anything else: CDATA_SINGLE_QUOTED
CDATA_LINE_COMMENT
- Line feed: CDATA
- <: TAG_OPEN_NON_PCDATA with CDATA_LINE_COMMENT as return state
- Anything else: Stay
CDATA_COMMENT
- *: CDATA_COMMENT_ASTERISK
- <: TAG_OPEN_NON_PCDATA with CDATA_COMMENT as return state
- Anything else: Stay
CDATA_COMMENT_ASTERISK
- /: CDATA
- <: TAG_OPEN_NON_PCDATA with CDATA_COMMENT as return state
- Anything else: CDATA_COMMENT
Heuristic Design Notes
The heuristic doesn't attempt to detect single or double quotes appearing inside a regular expression literal, because detecting regular expression literals properly is complicated and it seemed unlikely that </script> would appear on the same line as such a regular expression. (I assumed that inline minified JS produced by a minifier that can't escape </script> was a non-issue. This assumption was very bad and makes the whole thing fail.)
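The state list above can be transcribed into a small table-driven sketch to show the regexp failure. Two things here are my assumptions, since TAG_OPEN_NON_PCDATA is not defined on this page: it closes the element when the "<" begins "</script" followed by whitespace, "/" or ">", and otherwise goes back to the return state without reconsuming the character. The names NEXT, ELSE, TAG_RETURN and script_end_heuristic are hypothetical.

```python
import re

END_TAG = re.compile(r"</script[\s/>]", re.IGNORECASE)

# Explicit transitions per state: state -> {character: next state}.
NEXT = {
    "CDATA": {"/": "CDATA_SLASH", '"': "CDATA_DOUBLE_QUOTED",
              "'": "CDATA_SINGLE_QUOTED"},
    "CDATA_SLASH": {"/": "CDATA_LINE_COMMENT", "*": "CDATA_COMMENT"},
    "CDATA_DOUBLE_QUOTED": {'"': "CDATA", "\n": "CDATA",
                            "\\": "CDATA_DOUBLE_QUOTED_BACKSLASH"},
    "CDATA_SINGLE_QUOTED": {"'": "CDATA", "\n": "CDATA",
                            "\\": "CDATA_SINGLE_QUOTED_BACKSLASH"},
    "CDATA_DOUBLE_QUOTED_BACKSLASH": {"\n": "CDATA"},
    "CDATA_SINGLE_QUOTED_BACKSLASH": {"\n": "CDATA"},
    "CDATA_LINE_COMMENT": {"\n": "CDATA"},
    "CDATA_COMMENT": {"*": "CDATA_COMMENT_ASTERISK"},
    "CDATA_COMMENT_ASTERISK": {"/": "CDATA"},
}
# "Anything else" targets; states missing here stay where they are.
ELSE = {
    "CDATA_SLASH": "CDATA",
    "CDATA_DOUBLE_QUOTED_BACKSLASH": "CDATA_DOUBLE_QUOTED",
    "CDATA_SINGLE_QUOTED_BACKSLASH": "CDATA_SINGLE_QUOTED",
    "CDATA_COMMENT_ASTERISK": "CDATA_COMMENT",
}
# States where "<" enters TAG_OPEN_NON_PCDATA, mapped to the return state.
TAG_RETURN = {
    "CDATA": "CDATA",
    "CDATA_SLASH": "CDATA",
    "CDATA_LINE_COMMENT": "CDATA_LINE_COMMENT",
    "CDATA_COMMENT": "CDATA_COMMENT",
    "CDATA_COMMENT_ASTERISK": "CDATA_COMMENT",
}


def script_end_heuristic(content: str):
    """Return the index of the closing "</script", or None if never closed."""
    state = "CDATA"
    for i, ch in enumerate(content):
        if ch == "<" and state in TAG_RETURN:
            if END_TAG.match(content, i):
                return i             # assumed TAG_OPEN_NON_PCDATA behaviour
            state = TAG_RETURN[state]
        else:
            state = NEXT[state].get(ch, ELSE.get(state, state))
    return None
```

In this sketch a string literal such as "</script>" is skipped as intended, but a line like var r = /a'/; </script> leaves the machine in CDATA_SINGLE_QUOTED until the next line feed, so the end tag on that line is missed.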