Difference between revisions of "HTML vs. XHTML"

From WHATWG Wiki
m (HTML vs XHTML moved to HTML vs. XHTML: This is what I intended the first time.)
(Made the differences between HTML and XHTML much more comprehensive)
Line 1: Line 1:
 +
== Differences Between HTML and XHTML ==
 +
 +
Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.
 +
 +
'''Note''': As the current WHATWG document is a draft, this section will need to track to a moving target.
 +
 +
=== MIME Types ===
 +
 +
* XHTML must be served with an XML MIME type, such as <b>application/xml</b> or <b>application/xhtml+xml</b>.
 +
* HTML must be served as <b>text/html</b>. 
 +
 +
=== Parsing ===
 +
 +
XHTML uses XML parsing requirements. HTML uses its own which are defined much more closely to the way browsers actually handle HTML today.
 +
 +
* In XHTML, well-formedness errors are fatal. In HTML, error handling rules are much more graceful. Well-formedness errors, which are also syntax errors in HTML, include the following:
 +
** Unencoded ampersands (<code>&amp;</code>) and less than signs (<code>&lt;</code>) (This does not apply to <code>CDATA</code>).
 +
** Comments containing extra pairs of hyphens or ending with a hyphen. e.g. <code>&lt;!--<var> syntax -- error </var>--&gt;</code> or <code>&lt;!--<var> syntax error -</var>--&gt;</code>.
 +
** Mismatched end tags (does not apply to elements with optional tags)
 +
** Unclosed tags.
 +
** Unexpected characters occuring in or before attribute names.
 +
** Unexpected occurrence of EOF.
 +
** Unexpected characters before the DOCTYPE name.
 +
** Missing DOCTYPE name.
 +
** A <code>PUBLIC</code> identifer in a <code>DOCTYPE</code> without a <code>SYSTEM</code> identifier (Note: including either of these is a syntax error in HTML5; but, in XML only the <code>SYSTEM</code> identifier is allowed to occur on its own).
 +
** End tags with attributes.
 +
* The internal subset is permitted in XML, but forbidden in HTML.
 +
* The sequence of characters &quot;<code>]]&gt;</code>&quot; when it does not mark the end of a <code>CDATA</code> section is a well-formedness error in XHTML, but valid in HTML.
 +
* In XHTML: <code>&lt;![CDATA[...]]&gt;</code> is a <code>CDATA</code> section. In HTML, it's a bogus comment.
 +
* In XHTML, <code>&lt;?foo ...?&gt;</code> is a processing instruction. In HTML, it's a bogus comment.
 +
* In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
 +
* In HTML, the <code>script</code> and <code>style</code> elements are parsed as <code>CDATA</code>. (Note: the definition of <code>CDATA</code> differs from that in XML). In XML, they're parsed as CDATA (which means that comments are treated as <em>real</em> comments).
 +
* In HTML, the <code>title</code> and <code>textarea</code> elements are parsed as <code>RCDATA</code>. (Note: The definition of <code>RCDATA</code> differs from that in SGML and there is no <code>RCDATA</code> in XML).
 +
* In HTML, if scripting is enabled, the <code>noscript</code> element is parsed as <code>CDATA</code>. If scripting is disabled, it's parsed as <code>PCDATA</code>.
 +
* In HTML, the <code>iframe</code>, <code>noembed</code> and <code>noframes</code> elements are parsed as <code>CDATA</code>.
 +
* White space characters in attribute values are [http://www.w3.org/TR/REC-xml/#AVNormalize normalized] to spaces in XHTML.
 +
* Elements with optional tags are implied in certain conditions.
 +
* <code>base</code>, <code>link</code>, <code>meta</code>, <code>style</code> and <code>title</code> elements with tags occurring in the body are moved inserted into the head.
 +
* In HTML, tags for certain elements, which appear out of context, are ignored. This includes <code>caption</code>, <code>col</code>, <code>colgroup</code>, <code>frame</code>, <code>frameset</code>, <code>head</code>, <code>option</code>, <code>optgroup</code>,        <code>tbody</code>, <code>td</code>, <code>tfoot</code>, <code>th</code>, <code>thead</code>, <code>tr</code>.
 +
* The <code>plaintext</code> element has a special parsing requirement in HTML. (it is, however, forbidden).
 +
* <em>Many other special handling of edge cases and error conditions, not all of which are listed here, also occur in HTML.</em>
 +
 +
=== Syntax ===
 +
 +
* In HTML, [http://blog.whatwg.org/faq/#doctype the <code>doctype</code> is required]. In XHTML, it is optional.
 +
* In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
 +
* In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both:
 +
** <code>html</code> (both)
 +
** <code>head</code> (both)
 +
** <code>body</code> (both)
 +
** <code>li</code> (end tag)
 +
** <code>dt</code> (end tag)
 +
** <code>dd</code> (end tag)
 +
** <code>p</code> (end tag)
 +
** <code>colgroup</code> (both)
 +
** <code>thead</code> (end tag)
 +
** <code>tbody</code> (both)
 +
** <code>tfoot</code> (end tag)
 +
** <code>tr</code> (end tag)
 +
** <code>td</code> (end tag)
 +
** <code>th</code> (end tag)
 +
* In XHTML, empty elements may use either the empty element syntax (<code>&lt;br/&gt;</code>) or have an end tag immediately follow the start tag (<code>&lt;br&gt;&lt;/br&gt;</code>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and should be omitted. End tags for void elements are forbidden.
 +
** <code>base</code>,<code> link</code>, <code>meta</code>, <code>hr</code>, <code>br</code>, <code>img</code>, <code>embed</code>, <code>param</code>, <code>area</code>, <code>col</code> and <code>input</code>
 +
** Note: the following are treated as void elements for the purpose in the parsing requirements, but, as they are obsolete and non-standard, the trailing slash is not permitted:  <code>basefont</code>, <code>b</code><code>gsound</code>, <code>spacer</code>, <code>wbr</code>. (although, since these elements are not permitted anyway, it doesn't make much difference).
 +
* HTML allows attribute minimisation (i.e. omitting the value), XHTML does not.
 +
* HTML allows the use of unquoted attribute values, XHTML does not.
 +
* XHTML allows the use of <code>CDATA</code> sections, HTML does not.
 +
* XHTML allows the use of processing instructions, HTML does not.
 +
* In HTML, all entity references are predefined and do not require a DTD. But because there is no DTD for XHTML5, entity references cannot be used in XHTML. (excluding the 5 predefined entities: <code>&amp;amp;</code>, <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;apos;)</code>
 +
** You may provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
 +
*  The valid set of unicode characters  in XML 1.0 is limited beyond that in HTML.
 +
* Namespace prefixes are permitted in XHTML. They are forbidden in HTML.
 +
 +
=== Markup ===
 +
 +
* XHTML allows non XHTML elements and attributes (in different namespaces) to be used, HTML does not
 +
*  XHTML uses the <code>xml:lang</code> attribute, HTML uses <code>lang</code> instead,
 +
* XHTML allows the use of <code>xml:id</code>. Both HTML and XHTML allow the <code>id</code> attribute (recommended).
 +
* The [http://blog.whatwg.org/faq/#namespace-decl namespace declaration] (<code>xmlns</code> attribute) is required in XHTML, but forbidden in HTML
 +
* In HTML, the <code>noscript</code> element may be used. In XHTML, it is forbidden.
 +
* HTML uses the <code>base</code> element, XHTML uses <code>xml:base</code> instead.
 +
* In XHTML, <code>p</code> elements may contain structured inline level elements including <code>blockquote</code>, <code>dl</code>, <code>menu</code>, <code>ol</code>, <code>ul</code>, <code>pre</code> and <code>table</code>. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
 +
* In XHTML, <code>table</code> elements may contain child <code>tr</code> elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
 +
 +
=== Character Encoding ===
 +
 +
* In XHTML, the XML declaration may be used to [http://blog.whatwg.org/faq/#charset specify the character encoding]. In HTML, the xml declaration is forbidden
 +
* In HTML, the <code>meta</code> element may be used insted. The <code>http-equiv</code> attribute on the <code>meta</code> element is forbidden in XHTML and is ignored if included.
 +
* The default character encoding for XHTML is, according to XML rules, <code>UTF-8</code> or <code>UTF-16</code>. If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
 +
* In XHTML, it is not an error to rely on the default encoding, but in HTML, it is.
 +
 +
=== Scripts ===
 +
 +
* <code>document.write()</code> and <code>document.writeln()</code> cannot be used in XHTML, they can in HTML.
 +
* In XHTML, the use of the <code>innerHTML</code> property requires that the string be a well-formed fragment of XML.
 +
* DOM APIs are case sensitive in XHTML and some are case insensitive in HTML.  (This does not apply to elements which are not in the HTML namespace)
 +
** Element.tagName, Node.nodeName, and Node.localName return the value in uppercase.
 +
** Document.createElement() is case insensitive (the canonical form is lowercase).
 +
** Element.setAttributeNode() will change the attribute name to lowercase.
 +
** Element.setAttribute()  is case insensitive (the canonical form is lowercase).
 +
** Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
 +
** Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name must be lowercased before the rename takes place.
 +
 +
=== Stylesheets ===
 +
 +
* Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
 +
* CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which do not apply to XHTML.
 +
 +
== Other Information ==
 +
 +
'''Note: This section should probably be removed, tidied up or moved to the discussion page.''
 +
 
An often repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3.  And that the proper way to tell the two apart is via MIME types.
 
An often repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3.  And that the proper way to tell the two apart is via MIME types.
  
Line 7: Line 119:
 
* Both N3 and RDF/XML are used to express sets of RDF triples.  They are equally capable: every triple store can be dumped into either format.  The analogy here is the DOM.  It is not currently the case that every DOM tree can be dumped equally capably into either format.
 
* Both N3 and RDF/XML are used to express sets of RDF triples.  They are equally capable: every triple store can be dumped into either format.  The analogy here is the DOM.  It is not currently the case that every DOM tree can be dumped equally capably into either format.
 
* N3 and RDF/XML are not the same, nor do they even look similar.  They are different from top to bottom.  Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.
 
* N3 and RDF/XML are not the same, nor do they even look similar.  They are different from top to bottom.  Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.
 +
 +
Need to explain how RDF/N3 is relevant! --[[User:Lachlan Hunt|Lachlan Hunt]] 04:43, 4 December 2006 (UTC)
  
 
=== Mime Types ===
 
=== Mime Types ===
  
* People have consistently proven that they can't be trusted to configure and set MIME types correctly.  Most aren't even aware that MIME types exist.  The default setup with Apache is to not allow overrides.  One popular use case is for documentation that is served via <tt>file:///</tt> URIs directly from your hard disk.
+
* People have consistently proven that they can't be trusted to configure and set MIME types correctly.  Most aren't even aware that MIME types exist.  The default setup with Apache is to not allow overrides.  One popular use case is for documentation that is served via <code>file:///</code> URIs directly from your hard disk.
* HTTP as specified indicates that the the <tt>Content-Type</tt> header is authoritative - it trumps the XML prolog.  HTTP as practiced treats the MIME type as a hint.  Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.
+
* HTTP as specified indicates that the the <code>Content-Type</code> header is authoritative - it trumps the XML prolog.  HTTP as practiced treats the MIME type as a hint.  Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.
  
 
=== Ideals ===
 
=== Ideals ===
Line 17: Line 131:
 
In an ideal word:
 
In an ideal word:
 
* the syntax of XML and HTML would be either complete identical or completely different.
 
* the syntax of XML and HTML would be either complete identical or completely different.
 +
** The syntax of HTML and XHTML are completely different. The fact that they look similar on the surface is irrelevant. (see above). --[[User:Lachlan Hunt|Lachlan Hunt]] 04:43, 4 December 2006 (UTC)
 
* the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
 
* the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
* <tt>Content-Type</tt> would either aways be respected, or always be ignored.
+
* <code>Content-Type</code> would either aways be respected, or always be ignored.
 
* there would either be a fool-proof way to "sniff" whether the a given content was HTML or XHTML; or there would be no difference between XHTML and HTML in terms of syntax and range of DOM trees that could validly be serialized would also be identical.
 
* there would either be a fool-proof way to "sniff" whether the a given content was HTML or XHTML; or there would be no difference between XHTML and HTML in terms of syntax and range of DOM trees that could validly be serialized would also be identical.
  
Line 27: Line 142:
 
At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax.  Yet the situation is (nearly) reversed for the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.
 
At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax.  Yet the situation is (nearly) reversed for the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.
  
Having the syntaxes being substantially similar leads to confusion in some edge cases (e.g., <tt><p/></tt>) but also has some advantages.  Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5.  Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.
+
Having the syntaxes being substantially similar leads to confusion in some edge cases (e.g., <code><p/></code>) but also has some advantages.  Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5.  Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.
 
 
=== Known differences ===
 
 
 
'''Note''': As the current WHATWG document is a draft, this section will need to track to a moving target.
 
 
 
* A namespace declaration is required in XHTML5, and not allowed in HTML5.
 
* A <!DOCTYPE> is required in HTML5, optional in XHTML5.
 
* Empty element syntax is a parse error for non-void elements in HTML5, furthermore the prescribed error recovery for this parse error differs from the XML parsing rules.
 
* XHTML5 is a XML 1.0 vocabulary.  The valid set of unicode characters is limited in XML 1.0, more so than in HTML4/5.
 
* XML processes are, by design, unforgiving of parse errors.
 
* You can omit end tags for a number of elements in HTML, but not XML.
 
* You can use <![CDATA[ ... ]]> syntax of XML, but it means something else in HTML.
 
* You can use PIs in XML, but not in HTML.
 
* The set of known named characters entity references can be depended upon in HTML, but not in XHTML (beyond the [http://www.w3.org/TR/REC-xml/#sec-predefined-ent 5 predefined ones]).
 
* You can omit quotes on attribute values in HTML, but not in XML.
 
* If you forget to escape "&" or "<" characters in HTML, the error handling is different than in XML.
 
* You can include non-HTML elements and attributes in XML, but not in HTML.
 
* <noscript> works in HTML, not XML.
 
* <iframe> fallback content is parsed as text in HTML, but as markup in XML.
 
* Comments with "--" in them work in HTML, but fail in XML.
 
* The DTD internal subset works in XML, but is ignored (or worse) in HTML.
 
* HTML syntax is case-insensitive, XML syntax is case-sensitive.
 
* DOM APIs are case-sensitive in XML, case-insensitive in HTML.
 
* CSS is case-sensitive in XML, case-insensitive in HTML.
 
* You can use namespace prefixes in XML, not in HTML.
 
* The contents of <script> and <style> elements in XML are parsed differently than in HTML.
 
* document.write() works in HTML but not in XML.
 
* Things that look like XML comments are treated as XML comments in XHTML —even inside script or style elements.
 
* <tt>meta</tt> tags are not examined for character encoding information.
 
* <tt>White space</tt> characters in attribute values are [http://www.w3.org/TR/REC-xml/#AVNormalize normalized] to spaces in XHTML.
 
* xml:lang and xml:base are valid in XHTML, but not in HTML
 
  
 
=== Potential Strategies ===
 
=== Potential Strategies ===
Line 65: Line 149:
  
 
* Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver. (We're doing this already. -Hixie)
 
* Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver. (We're doing this already. -Hixie)
* The definition of HTML5 understandably and correctly puts a higher weight on HTML4 compatibility than XHTML migration.  But as a migration aid, identify some unlikely/invalid combination (example: use of the HTML5 DOCTYPE combined with <tt>xmlns</tt> attribute on the <tt>html</tt> element combined with the use of a non-xml MIME type) and adjust some (as of yet undefined) set of the HTML5 parsing rules.
+
* The definition of HTML5 understandably and correctly puts a higher weight on HTML4 compatibility than XHTML migration.  But as a migration aid, identify some unlikely/invalid combination (example: use of the HTML5 DOCTYPE combined with <code>xmlns</code> attribute on the <code>html</code> element combined with the use of a non-xml MIME type) and adjust some (as of yet undefined) set of the HTML5 parsing rules.

Revision as of 04:43, 4 December 2006

Differences Between HTML and XHTML

Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.

Note: As the current WHATWG document is a draft, this section will need to track to a moving target.

MIME Types

  • XHTML must be served with an XML MIME type, such as application/xml or application/xhtml+xml.
  • HTML must be served as text/html.
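
As a rough illustration of the rule above, the following sketch picks a parsing mode from a Content-Type header value. The function name parsing_mode_for is hypothetical, and treating text/xml and any "+xml" type as XML is an assumption that goes slightly beyond the two examples listed; real browsers apply further rules (such as content sniffing) that are not modelled here.

```python
def parsing_mode_for(content_type: str) -> str:
    """Return "html" or "xml" for a Content-Type header value (simplified)."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # Assumption: application/xml, text/xml and any "+xml" type mean XML.
    if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
        return "xml"
    raise ValueError(f"not an HTML or XML MIME type: {mime!r}")
```

For instance, parsing_mode_for("application/xhtml+xml") selects XML parsing, while the same markup served as text/html would be handled by the HTML parser.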

Parsing

XHTML uses XML parsing requirements. HTML uses its own parsing requirements, which are defined much more closely to the way browsers actually handle HTML today.

  • In XHTML, well-formedness errors are fatal. In HTML, error handling rules are much more graceful. Well-formedness errors, which are also syntax errors in HTML, include the following:
    • Unencoded ampersands (&) and less than signs (<) (This does not apply to CDATA).
    • Comments containing extra pairs of hyphens or ending with a hyphen. e.g. <!-- syntax -- error --> or <!-- syntax error --->.
    • Mismatched end tags (does not apply to elements with optional tags)
    • Unclosed tags.
    • Unexpected characters occurring in or before attribute names.
    • Unexpected occurrence of EOF.
    • Unexpected characters before the DOCTYPE name.
    • Missing DOCTYPE name.
    • A PUBLIC identifier in a DOCTYPE without a SYSTEM identifier (Note: including either of these is a syntax error in HTML5; but, in XML only the SYSTEM identifier is allowed to occur on its own).
    • End tags with attributes.
  • The internal subset is permitted in XML, but forbidden in HTML.
  • The sequence of characters "]]>" when it does not mark the end of a CDATA section is a well-formedness error in XHTML, but valid in HTML.
  • In XHTML: <![CDATA[...]]> is a CDATA section. In HTML, it's a bogus comment.
  • In XHTML, <?foo ...?> is a processing instruction. In HTML, it's a bogus comment.
  • In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
  • In HTML, the script and style elements are parsed as CDATA. (Note: the definition of CDATA differs from that in XML). In XML, they're parsed as PCDATA (which means that comments are treated as real comments).
  • In HTML, the title and textarea elements are parsed as RCDATA. (Note: The definition of RCDATA differs from that in SGML and there is no RCDATA in XML).
  • In HTML, if scripting is enabled, the noscript element is parsed as CDATA. If scripting is disabled, it's parsed as PCDATA.
  • In HTML, the iframe, noembed and noframes elements are parsed as CDATA.
  • White space characters in attribute values are normalized to spaces in XHTML.
  • In HTML, elements with optional tags are implied under certain conditions.
  • In HTML, base, link, meta, style and title elements whose tags occur in the body are moved into the head.
  • In HTML, tags for certain elements, which appear out of context, are ignored. This includes caption, col, colgroup, frame, frameset, head, option, optgroup, tbody, td, tfoot, th, thead, tr.
  • The plaintext element has a special parsing requirement in HTML (although the element itself is forbidden).
  • Many other special cases for handling edge cases and error conditions, not all of which are listed here, also occur in HTML.
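
The contrast between fatal and graceful error handling can be sketched with Python's standard library. This is only an approximation: xml.etree.ElementTree is a real XML parser, but html.parser is a lenient tokenizer standing in for an HTML parser, not an implementation of the parsing rules in the WHATWG draft. The helper names xml_text and html_text are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data, tolerating markup an XML parser rejects."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def xml_text(markup):
    """Parse as XML; any well-formedness error raises ET.ParseError."""
    return "".join(ET.fromstring(markup).itertext())


def html_text(markup):
    """Tokenize leniently and return the text content."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# An unencoded ampersand: fatal for the XML parser, recoverable for HTML.
SAMPLE = "<p>Fish & chips</p>"
```

Calling xml_text(SAMPLE) raises ET.ParseError, whereas html_text(SAMPLE) still returns the text. Escaping the ampersand as &amp;amp; makes the sample acceptable to both.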

Syntax

  • In HTML, the doctype is required. In XHTML, it is optional.
  • In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
  • In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both:
    • html (both)
    • head (both)
    • body (both)
    • li (end tag)
    • dt (end tag)
    • dd (end tag)
    • p (end tag)
    • colgroup (both)
    • thead (end tag)
    • tbody (both)
    • tfoot (end tag)
    • tr (end tag)
    • td (end tag)
    • th (end tag)
  • In XHTML, empty elements may use either the empty element syntax (<br/>) or have an end tag immediately follow the start tag (<br></br>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and should be omitted. End tags for void elements are forbidden.
    • base, link, meta, hr, br, img, embed, param, area, col and input
    • Note: the following are treated as void elements for the purposes of the parsing requirements, but, as they are obsolete and non-standard, the trailing slash is not permitted: basefont, bgsound, spacer, wbr (although, since these elements are not permitted anyway, it doesn't make much difference).
  • HTML allows attribute minimisation (i.e. omitting the value), XHTML does not.
  • HTML allows the use of unquoted attribute values, XHTML does not.
  • XHTML allows the use of CDATA sections, HTML does not.
  • XHTML allows the use of processing instructions, HTML does not.
  • In HTML, all entity references are predefined and do not require a DTD. But because there is no DTD for XHTML5, entity references other than the 5 predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;) cannot be used in XHTML.
    • You may provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
  • The valid set of Unicode characters in XML 1.0 is more limited than that in HTML.
  • Namespace prefixes are permitted in XHTML. They are forbidden in HTML.
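
A few of the syntax differences above (case insensitivity, attribute minimisation and unquoted attribute values) can be observed with Python's standard library. As before, html.parser is only a lenient stand-in for an HTML parser, and the helper names html_start_tags and xml_is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records each start tag with its attributes as the tokenizer sees them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def html_start_tags(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


def xml_is_well_formed(markup):
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# HTML: names are lowercased, the value may be unquoted or omitted entirely.
tags = html_start_tags("<INPUT Type=checkbox DISABLED>")

# XML: the same shortcuts are well-formedness errors.
minimised_ok = xml_is_well_formed("<input type=checkbox disabled/>")
explicit_ok = xml_is_well_formed('<input type="checkbox" disabled="disabled"/>')
```

The tokenizer reports the tag as input with a type attribute of checkbox and a disabled attribute with no value; the XML parser accepts only the fully quoted, fully specified form.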

Markup

  • XHTML allows non-XHTML elements and attributes (in different namespaces) to be used; HTML does not.
  • XHTML uses the xml:lang attribute; HTML uses lang instead.
  • XHTML allows the use of xml:id. Both HTML and XHTML allow the id attribute (recommended).
  • The namespace declaration (xmlns attribute) is required in XHTML, but forbidden in HTML.
  • In HTML, the noscript element may be used. In XHTML, it is forbidden.
  • HTML uses the base element, XHTML uses xml:base instead.
  • In XHTML, p elements may contain structured inline level elements including blockquote, dl, menu, ol, ul, pre and table. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
  • In XHTML, table elements may contain child tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
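
To illustrate the namespace points above, here is a small sketch that parses an XHTML document containing an element from another namespace and lists the namespaces in use. The sample document and the helper name namespaces_used are made up for the example; SVG is chosen simply as a familiar second vocabulary.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical XHTML document mixing two vocabularies.
DOCUMENT = (
    f'<html xmlns="{XHTML_NS}" xml:lang="en">'
    "<head><title>Mixed namespaces</title></head>"
    f'<body><p>A circle:</p><svg xmlns="{SVG_NS}"><circle r="10"/></svg></body>'
    "</html>"
)


def namespaces_used(markup):
    """Return the set of namespace URIs of all elements in the document."""
    root = ET.fromstring(markup)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    return {
        el.tag[1:].split("}", 1)[0]
        for el in root.iter()
        if el.tag.startswith("{")
    }


root = ET.fromstring(DOCUMENT)
language = root.get(f"{{{XML_NS}}}lang")  # value of the xml:lang attribute
```

Without the xmlns declaration on the root element, the same elements would be in no namespace at all and would not be recognised as XHTML.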

Character Encoding

  • In XHTML, the XML declaration may be used to specify the character encoding. In HTML, the XML declaration is forbidden.
  • In HTML, the meta element may be used instead. The http-equiv attribute on the meta element is forbidden in XHTML and is ignored if included.
  • The default character encoding for XHTML is, according to XML rules, UTF-8 or UTF-16. If the encoding is unspecified in HTML, it should be determined through implementation-specific heuristics or fall back to a default value (Note: this section of the spec is not yet finished).
  • In XHTML, it is not an error to rely on the default encoding, but in HTML, it is.
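
The XML side of these rules can be tried out with Python's standard XML parser: the encoding named in the XML declaration is honoured, and in its absence the bytes are read as UTF-8. This sketch does not model HTML's meta-based detection or heuristics, and the helper name xml_text_from_bytes is hypothetical.

```python
import xml.etree.ElementTree as ET


def xml_text_from_bytes(data: bytes) -> str:
    """Parse raw bytes as XML and return the text content."""
    return "".join(ET.fromstring(data).itertext())


# Encoding declared in the XML declaration: byte 0xE9 is read as Latin-1.
declared_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'

# No declaration: the parser relies on the default (UTF-8 here).
default_utf8 = b"<p>caf\xc3\xa9</p>"

# Latin-1 bytes without a declaration are not valid UTF-8, so parsing fails.
undeclared_latin1 = b"<p>caf\xe9</p>"
```

The first two inputs both yield the same text, while the third raises ET.ParseError because the undeclared bytes do not form valid UTF-8.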

Scripts

  • document.write() and document.writeln() cannot be used in XHTML; they can be used in HTML.
  • In XHTML, the use of the innerHTML property requires that the string be a well-formed fragment of XML.
  • DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace)
    • Element.tagName, Node.nodeName, and Node.localName return the value in uppercase.
    • Document.createElement() is case insensitive (the canonical form is lowercase).
    • Element.setAttributeNode() will change the attribute name to lowercase.
    • Element.setAttribute() is case insensitive (the canonical form is lowercase).
    • Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
    • Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name must be lowercased before the rename takes place.
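
The case-sensitive (XML) half of this behaviour can be observed with Python's xml.dom.minidom, which implements a subset of the DOM for XML documents only. The case-insensitive HTML behaviour listed above needs an HTML DOM and is not shown. The sample document and the helper name count_by_tag_name are made up for the example.

```python
from xml.dom.minidom import parseString

# Two elements whose names differ only by case are distinct in XML.
doc = parseString("<list><Item>a</Item><item>b</item></list>")


def count_by_tag_name(name):
    """Count the elements matched by a case-sensitive tag name lookup."""
    return len(doc.getElementsByTagName(name))


first_child_name = doc.documentElement.firstChild.tagName
```

Here a lookup for "item" and a lookup for "Item" each match exactly one element, a lookup for "ITEM" matches none, and tagName returns the name exactly as written in the source rather than in uppercase.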

Stylesheets

  • Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
  • CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which does not apply to XHTML.
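
The selector matching rule can be reduced to a toy model for type selectors. The function below is purely hypothetical and ignores everything a real selector engine does (namespaces, combinators, attribute selectors); it only shows where the case comparison differs.

```python
def type_selector_matches(selector: str, element_name: str, html_document: bool) -> bool:
    """Toy model: does a CSS type selector match an element name?"""
    if html_document:
        # HTML: element names match case-insensitively.
        return selector.lower() == element_name.lower()
    # XHTML (XML): element names match case-sensitively.
    return selector == element_name
```

Under this model a rule written as "DIV { ... }" applies to a div element in an HTML document but not in an XHTML one.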

Other Information

Note: This section should probably be removed, tidied up or moved to the discussion page.

An often repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3. And that the proper way to tell the two apart is via MIME types.

There are only two problems with that. XHTML is not as different from HTML as RDF/XML is from N3. And MIME types can't be relied on. Let's take each in turn.

Syntax

  • Both N3 and RDF/XML are used to express sets of RDF triples. They are equally capable: every triple store can be dumped into either format. The analogy here is the DOM. It is not currently the case that every DOM tree can be dumped equally capably into either format.
  • N3 and RDF/XML are not the same, nor do they even look similar. They are different from top to bottom. Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.

Need to explain how RDF/N3 is relevant! --Lachlan Hunt 04:43, 4 December 2006 (UTC)

Mime Types

  • People have consistently proven that they can't be trusted to configure and set MIME types correctly. Most aren't even aware that MIME types exist. The default setup with Apache is to not allow overrides. One popular use case is for documentation that is served via file:/// URIs directly from your hard disk.
  • HTTP as specified indicates that the Content-Type header is authoritative - it trumps the XML prolog. HTTP as practiced treats the MIME type as a hint. Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.

Ideals

In an ideal world:

  • the syntax of XML and HTML would be either completely identical or completely different.
    • The syntax of HTML and XHTML are completely different. The fact that they look similar on the surface is irrelevant. (see above). --Lachlan Hunt 04:43, 4 December 2006 (UTC)
  • the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
  • Content-Type would either always be respected, or always be ignored.
  • there would either be a fool-proof way to "sniff" whether given content was HTML or XHTML, or there would be no difference between XHTML and HTML in terms of syntax, and the range of DOM trees that could validly be serialized would also be identical.

Analysis

Obviously, the current situation is less than ideal. XML and HTML evolved from a common ancestor. XML isn't changing. And the constraint to be as backwards compatible with HTML4 as humanly possible places practical limits on what can be done. Neither being absolutely identical with the XML syntax nor being completely different are options.

At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax. Yet the situation is (nearly) reversed for DOM trees: the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.

Having the syntaxes be substantially similar leads to confusion in some edge cases (e.g., <p/>) but also has some advantages. Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5. Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.

Potential Strategies

Note: these strategies are not necessarily mutually-exclusive.
  • Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver. (We're doing this already. -Hixie)
  • The definition of HTML5 understandably and correctly puts a higher weight on HTML4 compatibility than XHTML migration. But as a migration aid, identify some unlikely/invalid combination (example: use of the HTML5 DOCTYPE combined with an xmlns attribute on the html element combined with the use of a non-XML MIME type) and adjust some (as yet undefined) set of the HTML5 parsing rules.