HTML vs. XHTML

An often-repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3, and that the proper way to tell the two apart is via MIME types.

There are only two problems with that. XHTML is not as different from HTML as RDF/XML is from N3. And MIME types can't be relied on. Let's take each in turn.

Syntax

Both N3 and RDF/XML are used to express sets of RDF triples. They are equally capable: every triple store can be dumped into either format. The analogue for HTML and XHTML is the DOM, and it is not currently the case that every DOM tree can be serialized equally well into either format.
N3 and RDF/XML are not the same, nor do they even look similar. They are different from top to bottom. Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.

MIME Types

People have consistently proven that they can't be trusted to configure and set MIME types correctly. Most aren't even aware that MIME types exist. Apache's default setup does not allow overrides. And one popular use case, documentation read via file:/// URIs directly from your hard disk, involves no HTTP headers at all.
HTTP as specified indicates that the Content-Type header is authoritative: it trumps the XML prolog. HTTP as practiced treats the MIME type as a hint. Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.
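
The "as specified" reading can be sketched in a few lines of Python: the parser is chosen from the Content-Type header alone, and the content is never consulted. The function name and the set of XML MIME types below are assumptions made for illustration; they are not taken from any specification text.

```python
# Sketch of "Content-Type is authoritative": the body is never inspected.
# XML_TYPES and parser_for are illustrative names, not part of any standard API.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}

def parser_for(content_type_header: str) -> str:
    """Return 'xml' or 'html' based only on the Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type_header.split(";", 1)[0].strip().lower()
    if mime in XML_TYPES or mime.endswith("+xml"):
        return "xml"
    return "html"

if __name__ == "__main__":
    print(parser_for("application/xhtml+xml; charset=utf-8"))
    print(parser_for("text/html"))
```

A browser that treats the MIME type as a hint would, in effect, add a second step that looks at the content and may overrule this answer, which is where the two readings diverge.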

Ideals

In an ideal world:

the syntax of XML and HTML would be either completely identical or completely different.
the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
Content-Type would either always be respected, or always be ignored.
there would either be a fool-proof way to "sniff" whether given content was HTML or XHTML, or there would be no difference between XHTML and HTML: the syntax and the range of DOM trees that could validly be serialized would both be identical.
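
To see why sniffing is hard to make fool-proof, consider a naive heuristic, sketched here in Python. The function name, the 1024-character window, and the two signals it checks (an XML declaration, or the XHTML namespace on the html element) are all assumptions chosen for illustration; no browser is claimed to work this way.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"

def sniff(content: str) -> str:
    """Guess 'xhtml' or 'html' from the start of the content (heuristic only)."""
    head = content.lstrip()[:1024]
    # Signal 1: an XML declaration at the very start.
    if head.startswith("<?xml"):
        return "xhtml"
    # Signal 2: the XHTML namespace declared on the <html> start tag.
    match = re.search(r"<html\b[^>]*>", head, re.IGNORECASE)
    if match and XHTML_NS in match.group(0):
        return "xhtml"
    return "html"

if __name__ == "__main__":
    tag_soup = '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><p>hi'
    # This content is not well-formed XML, yet the heuristic calls it XHTML.
    print(sniff(tag_soup))
```

The last example is the problem in miniature: content that carries an XHTML-looking marker but relies on omitted end tags would be handed to an XML parser and rejected.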

Analysis

Obviously, the current situation is less than ideal. XML and HTML evolved from a common ancestor. XML isn't changing. And the constraint to be as backwards compatible with HTML4 as humanly possible places practical limits on what can be done. Neither being absolutely identical to the XML syntax nor being completely different is an option.

At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax. Yet the situation is (nearly) reversed for DOM trees: the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.

Having substantially similar syntaxes leads to confusion in some edge cases (e.g.,

) but also has some advantages. Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5. Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.

Known differences

Note: As the current WHATWG document is a draft, this section will need to track a moving target.

A namespace declaration is required in XHTML5, and not allowed in HTML5.
A <!DOCTYPE> is required in HTML5, optional in XHTML5.
Empty element syntax is a parse error for non-void elements in HTML5; furthermore, the prescribed error recovery for this parse error differs from the XML parsing rules.
XHTML5 is an XML 1.0 vocabulary. The valid set of Unicode characters is limited in XML 1.0, more so than in HTML4/5.
XML processors are, by design, unforgiving of parse errors.
You can omit end tags for a number of elements in HTML, but not XML.
You can use the <![CDATA[ ... ]]> syntax in XML, but it means something else in HTML.
You can use PIs in XML, but not in HTML.
The set of known named character entity references can be depended upon in HTML, but not in XHTML (beyond the 5 predefined ones).
You can omit quotes on attribute values in HTML, but not in XML.
If you forget to escape "&" or "<" characters in HTML, the error handling is different than in XML.
You can include non-HTML elements and attributes in XML, but not in HTML.
<noscript> works in HTML, not XML.
<iframe> fallback content is parsed as text in HTML, but as markup in XML.
Comments with "--" in them work in HTML, but fail in XML.
The DTD internal subset works in XML, but is ignored (or worse) in HTML.
HTML syntax is case-insensitive; XML syntax is case-sensitive.
DOM APIs are case-sensitive in XML, case-insensitive in HTML.
CSS is case-sensitive in XML, case-insensitive in HTML.
You can use namespace prefixes in XML, not in HTML.
The contents of <script> and <style> elements in XML are parsed differently than in HTML.
document.write() works in HTML but not in XML.
Things that look like XML comments are treated as XML comments in XHTML, even inside script or style elements.
In XHTML, meta tags are not examined for character encoding information.
White space characters in attribute values are normalized to spaces in XHTML.
xml:lang and xml:base are valid in XHTML, but not in HTML.
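
Several of the differences listed above can be observed with nothing more than the Python standard library. The sketch below feeds the same fragments to an XML parser (xml.etree.ElementTree) and to a lenient HTML tokenizer (html.parser). Note the assumption: html.parser is not an implementation of the HTML5 parsing rules, so it stands in here only for "a forgiving HTML consumer". The helper names and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records the names of the start tags the HTML tokenizer reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def xml_accepts(markup: str) -> bool:
    """True if the markup is well-formed XML, False on any parse error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

def html_tags(markup: str) -> list:
    """Start tags seen by the lenient HTML tokenizer, in document order."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags

# Each fragment is acceptable to a forgiving HTML consumer,
# and each one is a fatal error for an XML parser.
samples = {
    "omitted end tag": "<ul><li>one<li>two</ul>",
    "unquoted attribute": "<p class=note>text</p>",
    "named entity beyond the 5 predefined": "<p>caf&eacute;</p>",
    "'--' inside a comment": "<p><!-- a -- b -->text</p>",
    "unescaped '&'": "<p>fish & chips</p>",
}

if __name__ == "__main__":
    for label, markup in samples.items():
        print(label, "| XML accepts:", xml_accepts(markup),
              "| HTML start tags:", html_tags(markup))
```

The XML parser stops at the first error in every sample, which is the "unforgiving by design" behavior noted in the list, while the HTML tokenizer keeps going and reports the tags it found.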

Potential Strategies

Note: these strategies are not necessarily mutually exclusive.

Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver. (We're doing this already. -Hixie)
The definition of HTML5 understandably and correctly puts a higher weight on HTML4 compatibility than on XHTML migration. But as a migration aid, identify some unlikely/invalid combination (example: use of the HTML5 DOCTYPE combined with an xmlns attribute on the html element combined with the use of a non-XML MIME type) and adjust some (as yet undefined) set of the HTML5 parsing rules.
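
As a rough illustration of the second strategy, the example combination could be detected along the following lines. This is only a sketch under stated assumptions: the function name, the regular expressions, and the list of XML MIME types are hypothetical, and which parsing rules would then be adjusted remains undefined, as noted above.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}

# Hypothetical patterns: the HTML5 DOCTYPE, the <html> start tag,
# and an xmlns attribute (quoted or unquoted) within that tag.
DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<html\b([^>]*)>", re.IGNORECASE)
XMLNS_RE = re.compile(r"""\bxmlns\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)

def is_migration_candidate(content_type: str, markup: str) -> bool:
    """True for: HTML5 DOCTYPE + XHTML xmlns on <html> + non-XML MIME type."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in XML_TYPES or mime.endswith("+xml"):
        return False  # served as XML, so the XML rules already apply
    if not DOCTYPE_RE.match(markup):
        return False  # no HTML5 DOCTYPE at the start of the document
    tag = HTML_TAG_RE.search(markup)
    if tag is None:
        return False  # no explicit <html> start tag to carry xmlns
    ns = XMLNS_RE.search(tag.group(1))
    return ns is not None and ns.group(1) == XHTML_NS

if __name__ == "__main__":
    page = '<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    print(is_migration_candidate("text/html; charset=utf-8", page))
```

Because the combination is unlikely in existing HTML4-era content, a check of this kind would leave backwards compatibility untouched for everything else.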
