HTML vs. XHTML

From WHATWG Wiki
Revision as of 18:57, 3 December 2006 by Rubys (talk | contribs)

An often-repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3, and that the proper way to tell the two apart is via MIME types.

There are only two problems with that. XHTML is not as different from HTML as RDF/XML is from N3. And MIME types can't be relied on. Let's take each in turn.

Syntax

  • Both N3 and RDF/XML are used to express sets of RDF triples. They are equally capable: every triple store can be dumped into either format. The analogy for HTML and XHTML is the DOM, but it is not currently the case that every DOM tree can be serialized equally well as either HTML or XHTML.
  • N3 and RDF/XML are not the same, nor do they even look similar. They are different from top to bottom. Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.

MIME Types

  • People have consistently proven that they can't be trusted to configure and set MIME types correctly. Most aren't even aware that MIME types exist. The default setup with Apache is to not allow overrides. One popular use case is documentation that is served via file:/// URIs directly from your hard disk, where no Content-Type header is sent at all.
  • HTTP as specified indicates that the Content-Type header is authoritative: it trumps the XML prolog. HTTP as practiced treats the MIME type as a hint. Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.
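
The "authoritative" reading of Content-Type can be sketched as a dispatch on the media type alone. This is a minimal illustration, not any browser's actual algorithm: the helper names `media_type` and `parser_for` and the returned labels are hypothetical, and the `None` case stands in for sources such as file:/// URIs that carry no header.

```python
from email.message import Message


def media_type(header_value):
    """Return the lowercased type/subtype from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type()


def parser_for(content_type):
    """Pick a parser label from the header alone (hypothetical helper)."""
    if content_type is None:
        # No header at all, e.g. a file:/// URI: the header cannot decide.
        return "unknown"
    mt = media_type(content_type)
    if mt == "text/html":
        return "html"
    if mt in ("application/xml", "text/xml") or mt.endswith("+xml"):
        return "xml"
    return "other"


print(parser_for("text/html; charset=utf-8"))  # html
print(parser_for("application/xhtml+xml"))     # xml
print(parser_for(None))                        # unknown
```

Treating the header as a mere hint would mean adding a second step that inspects the body whenever the label is "unknown" or looks wrong, which is exactly where the two readings diverge.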

Ideals

In an ideal world:

  • the syntax of XML and HTML would be either completely identical or completely different.
  • the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
  • Content-Type would either always be respected, or always be ignored.
  • there would either be a fool-proof way to "sniff" whether a given piece of content was HTML or XHTML, or there would be no difference between XHTML and HTML, either in syntax or in the range of DOM trees that can validly be serialized.

Analysis

Obviously, the current situation is less than ideal. XML and HTML evolved from a common ancestor. XML isn't changing. And the constraint to be as backwards compatible with HTML4 as humanly possible places practical limits on what can be done. Neither being absolutely identical to the XML syntax nor being completely different is an option.

At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax. Yet the situation is (nearly) reversed for DOM trees: the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.

Having substantially similar syntaxes leads to confusion in some edge cases (e.g., <p/>) but also has some advantages. Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5. Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.
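
The <p/> edge case can be made concrete with a small sketch. The XML half uses Python's standard ElementTree parser. The HTML half is only a rough emulation: Python's html.parser is not an HTML5-conformant parser, so the hypothetical `SlashIgnoringParser` class below overrides its handling of <p/> to imitate the recovery in which the trailing slash on a non-void element is ignored. Void elements are not modelled.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOURCE = "<div><p/>hello</div>"

# XML rules: <p/> is a complete, empty element, so the text follows it.
root = ET.fromstring(SOURCE)
p = root[0]
xml_result = (p.text, p.tail)


class SlashIgnoringParser(HTMLParser):
    """Rough emulation: treat <p/> as a plain <p> start tag."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.text_parents = []   # (innermost open tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Ignore the trailing slash: the element stays open.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the named element and anything still open inside it.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.open_elements[-1] if self.open_elements else None
        self.text_parents.append((parent, data))


emulated = SlashIgnoringParser()
emulated.feed(SOURCE)
emulated.close()

print(xml_result)             # (None, 'hello')
print(emulated.text_parents)  # [('p', 'hello')]
```

Under XML rules the text is a sibling that follows the empty p element; under the emulated HTML recovery it ends up inside the p element. The same bytes therefore yield two different trees.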

Known differences

Note: As the current WHATWG document is a draft, this section will need to track a moving target.

Syntax:

  • xmlns is required in XHTML5, and not allowed in HTML5
  • doctype is required in HTML5, optional in XHTML5
  • empty element syntax is a parse error for non-void elements in HTML5. Furthermore, the prescribed error recovery for this parse error differs from the XML parsing rules.
  • XHTML5 is an XML 1.0 vocabulary. The valid set of Unicode characters is limited in XML 1.0, more so than in HTML4/5.
  • XML processors are, by design, unforgiving of parse errors.
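
The last two points can be observed with any XML 1.0 parser. The sketch below uses Python's standard ElementTree module. The helper name `is_well_formed` is hypothetical, and the form feed character (U+000C) was chosen as one example of a character that XML 1.0 does not permit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if an XML parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>fine</p>"))          # True
print(is_well_formed("<p>unclosed"))          # False
print(is_well_formed("<p><b>crossed</p></b>"))  # False
print(is_well_formed("<p>form\x0cfeed</p>"))  # False
```

The parser offers no recovery in any of the failing cases: it stops at the first error instead of producing a tree.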

Semantics:

  • element names are case insensitive in HTML5, but case sensitive in XHTML5 - this affects both CSS and JavaScript
  • Foreign vocabularies, like SVG, have no HTML5 serialization.
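
The case-sensitivity difference can be illustrated with Python's standard library. This is only a sketch: html.parser is a lenient HTML tokenizer, not an HTML5-conformant parser, but it does lowercase tag names, which is enough to show the contrast with an XML parser. The `TagCollector` class and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SOURCE = "<root><Item></Item><item></item></root>"


class TagCollector(HTMLParser):
    """Record the name of every start tag as the HTML tokenizer reports it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# XML: names are case sensitive, so the two children stay distinct.
xml_tags = [child.tag for child in ET.fromstring(SOURCE)]

# HTML tokenizer: names are folded to lowercase, so they collapse.
collector = TagCollector()
collector.feed(SOURCE)
collector.close()
html_tags = collector.tags[1:]  # skip the <root> wrapper

print(xml_tags)   # ['Item', 'item']
print(html_tags)  # ['item', 'item']
```

A CSS selector or script lookup written against one spelling therefore matches differently depending on which parser built the tree.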

Potential Strategies

Note: these strategies are not necessarily mutually exclusive.

  • Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver.
  • Identify the xmlns attribute on the html element as the definitive marker for XHTML. Additionally, pick one of the following:
    • Have it affect the error recovery rules for edge cases like empty non-void elements, and case sensitivity.
    • Have the xmlns attribute be the *one* non-recoverable error defined by HTML5.
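
The xmlns-as-marker strategy might be sketched as a check on the first start tag of a document. This is only an illustration: the names `RootInspector` and `has_xhtml_marker` are hypothetical, and requiring the attribute value to equal the XHTML namespace URI (rather than merely being present) is an assumption not stated in the strategy itself.

```python
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"


class RootInspector(HTMLParser):
    """Remember the first start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.root = None  # (tag name, attribute dict) once seen

    def handle_starttag(self, tag, attrs):
        if self.root is None:
            self.root = (tag, dict(attrs))


def has_xhtml_marker(text):
    """Return True if the first element is <html> carrying the XHTML xmlns."""
    inspector = RootInspector()
    inspector.feed(text)
    inspector.close()
    if inspector.root is None:
        return False
    tag, attrs = inspector.root
    # Assumption: the value must match the XHTML namespace exactly.
    return tag == "html" and attrs.get("xmlns") == XHTML_NS


xhtml_doc = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p/></body></html>'
html_doc = "<!DOCTYPE html><html><body><p>text</p></body></html>"

print(has_xhtml_marker(xhtml_doc))  # True
print(has_xhtml_marker(html_doc))   # False
```

A consumer following either sub-option would branch on this result: switching its error recovery rules in the first case, or refusing the document outright in the second.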