Common Subset

From WHATWG Wiki
Revision as of 13:43, 9 December 2006 by Michel Fortin (talk | contribs) (Some corrections.)

The common subset of HTML5 and XHTML5 is a subset of both syntaxes. It is only implicitly defined by the HTML and XHTML specifications, because the two syntaxes have many elements in common. A document is said to use the common subset when it parses correctly with both the XML parser and the HTML parser.

A document using the conforming common subset conforms to the specification whether it is interpreted as HTML or as XHTML. The conforming common subset rejects any element that is non-conforming in either of the two DOM variants.

A document using the common subset can be served as HTML (text/html media type) or as XHTML (with an XML media type). The media type is what the browser uses to decide whether the document is parsed as HTML or as XHTML, and which variant of the DOM is used.
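Because the same document works under either media type, a server can pick the type per request. The following is only a minimal sketch of that idea: chooseMediaType is a hypothetical helper, not something defined by any specification, and it assumes the server has the value of the HTTP Accept request header at hand. It ignores wildcards and preference ordering, and only treats q=0 as "not accepted".

```javascript
// Hypothetical helper: pick the media type for a common-subset document
// from the value of an HTTP Accept request header.
function chooseMediaType(acceptHeader) {
  var XHTML = "application/xhtml+xml";
  var entries = String(acceptHeader || "").split(",");
  for (var i = 0; i < entries.length; i++) {
    var parts = entries[i].split(";");
    var type = parts[0].trim().toLowerCase();
    if (type !== XHTML) continue;
    // Look for a q parameter; a missing q means q=1.
    var q = 1;
    for (var j = 1; j < parts.length; j++) {
      var m = /^\s*q\s*=\s*([0-9.]+)\s*$/i.exec(parts[j]);
      if (m) q = parseFloat(m[1]);
    }
    if (q > 0) return XHTML;
  }
  // Fall back to text/html when XHTML is not explicitly accepted.
  return "text/html";
}
```

Falling back to text/html is a conservative choice: a client that does not explicitly list the XML media type still gets a document it can parse.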


Common Syntax

[TBD]

Limitations from HTML

  • The doctype is required to be <!DOCTYPE html>
  • Tag and attribute names must be lowercase.
  • The XML-style trailing slash (/>) must not be used on non-void elements; the HTML parser ignores it, and it is invalid in HTML.
  • XML-style CDATA sections do not work.
  • HTML does not allow mixing with other XML dialects.
  • HTML does not support p elements that contain structured inline-level elements such as blockquote, dl, menu, ol, ul, pre and table.
  • script elements cannot contain </.
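The trailing-slash rule, combined with the XML requirement that every element be closed, means that code generating markup has to treat void and non-void elements differently. A minimal sketch, assuming a hypothetical helper named emptyElement and an illustrative list of void element names that may be incomplete:

```javascript
// Illustrative list of void elements; check the HTML specification for the
// authoritative set.
var VOID_ELEMENTS = ["area", "base", "br", "col", "embed", "hr", "img",
                     "input", "link", "meta", "source", "track", "wbr"];

// Hypothetical helper: serialize an element that has no content.
function emptyElement(name) {
  var tag = name.toLowerCase(); // tag names must be lowercase
  if (VOID_ELEMENTS.indexOf(tag) !== -1) {
    return "<" + tag + " />"; // void: trailing slash, no end tag
  }
  return "<" + tag + "></" + tag + ">"; // non-void: explicit end tag
}
```

For example, emptyElement("br") produces a self-closed tag, while emptyElement("script") produces a start tag followed by an end tag.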

Limitations from XML and XHTML

  • The doctype, optional in XHTML but mandatory in HTML, must match <!DOCTYPE html> exactly (case-sensitively) to be well-formed and valid in XHTML.
  • Well-formedness constraints; violating any of these causes a fatal error in XHTML:
    • Comments cannot contain double hyphens (--).
    • Start tags and end tags must be balanced correctly, except for void elements.
    • Void tags must always be closed by a trailing slash (/>).
    • All attribute values must be quoted. Attributes without a value are disallowed.
    • All < and & in text must be escaped. > should also be escaped inside attribute values, and must be escaped in text when preceded by ]] (where it would otherwise be read as a CDATA section end marker).
    • Only certain characters are legal in XML (U+0009, U+000A, U+000D, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF); all others are illegal. See XML charsets.
    • Other constraints defined in the XML specification.
  • script and style elements may not contain < or & in their unescaped form. External scripts and stylesheets are unaffected.
    • There is a trick to allow < and &, which involves placing CDATA section markers inside JavaScript comments. See the workarounds section.
  • noscript has no effect in XHTML.
  • document.write() does not work in XHTML.
  • Entity references cannot be used in XHTML (excluding the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;).
  • The namespace declaration (xmlns attribute) is required in XHTML. The xmlns attribute is also allowed on the html element in HTML, on the condition that it has the value "http://www.w3.org/1999/xhtml": <html xmlns="http://www.w3.org/1999/xhtml">
  • DOM APIs are case-sensitive in XHTML; scripts should always use lowercase names to be compatible.
  • Style rules match case-sensitively in XHTML; stylesheets should always use lowercase tag, attribute and class names to be compatible.
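The escaping and entity rules in this list are mechanical enough to sketch in code. The helpers below are hypothetical (escapeText, escapeAttribute and findNamedEntities are names chosen for illustration) and rely only on the five predefined entities. findNamedEntities is a rough regular-expression check, not a real parser.

```javascript
// Escape character data for use as element text. Escaping every ">" is more
// than XML requires, but it also covers the "]]>" case.
function escapeText(text) {
  return String(text)
    .replace(/&/g, "&amp;") // must run first so later output is not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Escape a value for use inside a double-quoted attribute.
function escapeAttribute(value) {
  return escapeText(value).replace(/"/g, "&quot;");
}

// Return the named entity references other than the five predefined ones.
function findNamedEntities(markup) {
  var predefined = ["amp", "lt", "gt", "quot", "apos"];
  var found = [];
  var re = /&([A-Za-z][A-Za-z0-9]*);/g;
  var m;
  while ((m = re.exec(markup)) !== null) {
    if (predefined.indexOf(m[1]) === -1) found.push(m[0]);
  }
  return found;
}
```

Numeric character references such as &#233; are not reported by findNamedEntities, since they are valid in both syntaxes.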

Markup Issues and Workarounds

Base URI

HTML:

<base href="uri">

XML / XHTML:

<html xml:base="uri">

Workaround: HTTP Content-Location header:

Content-Location: uri

HTML 4 Spec


Character Set

HTML

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

XHTML / XML

<?xml version="1.0" encoding="utf-8"?>

Workaround: HTTP Content-Type header with encoding specified:

Content-Type: text/html;charset=utf-8
Content-Type: application/xhtml+xml;charset=utf-8


Language

HTML

<html lang="en">

XML / XHTML

<html xml:lang="en">

Workaround: HTTP Content-Language header:

Content-Language: en

HTML 5 Spec HTML 4 Spec

Note that there is no conforming workaround for switching language in different parts of a document. One method does work in practice: if you use HTML's lang attribute instead of the conforming xml:lang, browsers will correctly deduce the language of the element. However, this makes the document non-conforming when it is served with an XML media type and interpreted as XHTML.


Scripts & Style

HTML

<script type="text/javascript">
if (!(a > 0 && a < 10)) alert("A not in range (0 < a < 10).")
</script>

XML / XHTML

<script type="text/javascript">
if (!(a > 0 &amp;&amp; a &lt; 10)) alert("A not in range (0 &lt; a &lt; 10).")
</script>

or

<script type="text/javascript">
<![CDATA[
if (!(a > 0 && a < 10)) alert("A not in range (0 < a < 10).")
]]>
</script>

Workaround: Commented CDATA block around the problematic part of the script, or the whole script:

<script type="text/javascript">
/* <![CDATA[ */
if (!(a > 0 && a < 10)) alert("A not in range (0 < a < 10).")
/* ]]> */
</script>

This works because the HTML parser treats the CDATA markers as ordinary script text, where they sit inside comments and have no effect, while the XML parser sees a real CDATA section and accepts the unescaped character data inside it. The same trick can be applied to the style element when it contains < or &.

Note that the element must not contain the string ]]>, or the XML would not be well-formed, and it must not contain </, or it would be non-conforming HTML. In any case, the workaround is only needed when the script or style element contains & or <.
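These conditions can be checked before the trick is applied. A minimal sketch, assuming a hypothetical helper named wrapForBothParsers that receives the inline script or style source as a string:

```javascript
// Hypothetical helper: prepare inline script or style source so that it
// can be parsed both as HTML and as XHTML.
function wrapForBothParsers(source) {
  if (source.indexOf("]]>") !== -1) {
    throw new Error("source contains ]]>, which would end the CDATA section");
  }
  if (source.indexOf("</") !== -1) {
    throw new Error("source contains </, which is non-conforming in HTML");
  }
  // Nothing to protect: the wrapper is only needed for & or <.
  if (!/[&<]/.test(source)) {
    return source;
  }
  return "/* <![CDATA[ */\n" + source + "\n/* ]]> */";
}
```

The helper rejects unsafe input rather than trying to rewrite it, since changing script or style source automatically could alter its meaning.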