HTML vs. XHTML
An often-repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3, and that the proper way to tell the two apart is via MIME types.
There are only two problems with that. XHTML is not as different from HTML as RDF/XML is from N3. And MIME types can't be relied on. Let's take each in turn.
- Both N3 and RDF/XML are used to express sets of RDF triples. They are equally capable: every triple store can be dumped into either format. The analogy here is the DOM. It is not currently the case that every DOM tree can be dumped equally capably into either format.
- N3 and RDF/XML are not the same, nor do they even look similar. They are different from top to bottom. Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.
- People have consistently proven that they can't be trusted to configure and set MIME types correctly. Most aren't even aware that MIME types exist. The default setup with Apache is to not allow overrides. One popular use case is documentation that is served via file:/// URIs directly from your hard disk, where no Content-Type header is sent at all.
- HTTP as specified indicates that the Content-Type header is authoritative - it trumps the XML prolog. HTTP as practiced treats the MIME type as a hint. Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.
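To make the "authoritative" reading concrete, here is a minimal Python sketch of a consumer that picks a parser strictly from the declared media type and never looks at the content. The function names `media_type` and `choose_parser` are hypothetical and exist only for this illustration. Only the two media types at issue are handled, and a missing header (as with file:/// URIs) is reported as unknown.

```python
def media_type(content_type_header):
    """Return the lowercased media type without parameters, or None if absent."""
    if not content_type_header:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return content_type_header.split(";", 1)[0].strip().lower()


def choose_parser(content_type_header):
    """Pick a parser from the declared type alone (no sniffing of the body)."""
    declared = media_type(content_type_header)
    if declared == "text/html":
        return "html"
    if declared == "application/xhtml+xml":
        return "xml"
    # No header, or a type this sketch does not cover.
    return "unknown"


for header in ("text/html; charset=utf-8", "Application/XHTML+XML", None):
    print(repr(header), "->", choose_parser(header))
```

A consumer that treats the header as a hint would fall back to inspecting the content in the "unknown" case, and possibly in the other cases too, which is where the two readings diverge.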
In an ideal world:
- the syntax of XML and HTML would be either completely identical or completely different.
- the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.
- Content-Type would either always be respected, or always be ignored.
- there would either be a fool-proof way to "sniff" whether given content was HTML or XHTML, or there would be no difference between XHTML and HTML in terms of syntax, and the range of DOM trees that could validly be serialized would also be identical.
Obviously, the current situation is less than ideal. XML and HTML evolved from a common ancestor. XML isn't changing. And the constraint to be as backwards compatible with HTML4 as humanly possible places practical limits on what can be done. Neither being absolutely identical with the XML syntax nor being completely different is an option.
At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax. Yet the situation is (nearly) reversed for DOM trees: the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.
Having substantially similar syntaxes leads to confusion in some edge cases but also has some advantages. Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5. Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.
Note: As the current WHATWG document is a draft, this section will need to track a moving target.
- A namespace declaration is required in XHTML5, and not allowed in HTML5.
- A <!DOCTYPE> is required in HTML5, optional in XHTML5.
- Empty element syntax is a parse error for non-void elements in HTML5. Furthermore, the prescribed error recovery for this parse error differs from the XML parsing rules.
- XHTML5 is an XML 1.0 vocabulary. The valid set of Unicode characters is limited in XML 1.0, more so than in HTML4/5.
- XML processors are, by design, unforgiving of parse errors.
- You can omit end tags for a number of elements in HTML, but not XML.
- You can use the <![CDATA[ ... ]]> syntax in XML, but it means something else in HTML.
- You can use processing instructions (PIs) in XML, but not in HTML.
- You can omit quotes on attribute values in HTML, but not in XML.
- If you forget to escape "&" or "<" characters in HTML, the error handling is different than in XML.
- You can include non-HTML elements and attributes in XML, but not in HTML.
- <noscript> works in HTML, not XML.
- <iframe> fallback content is parsed as text in HTML, but as markup in XML.
- Comments with "--" in them work in HTML, but fail in XML.
- The DTD internal subset works in XML, but is ignored (or worse) in HTML.
- HTML syntax is case-insensitive, XML syntax is case-sensitive.
- DOM APIs are case-sensitive in XML, case-insensitive in HTML.
- CSS is case-sensitive in XML, case-insensitive in HTML.
- You can use namespace prefixes in XML, not in HTML.
- The contents of <script> and <style> elements in XML are parsed differently than in HTML.
- document.write() works in HTML but not in XML.
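A few of the differences above (unquoted attribute values, omitted end tags, case sensitivity, and the unforgiving nature of XML processors) can be observed with Python's standard library. This is only a sketch: `html.parser.HTMLParser` is a lenient tokenizer and does not implement the HTML5 parsing algorithm, so it illustrates the contrast in strictness rather than HTML5's exact error recovery. The names `StartTagCollector`, `html_start_tags`, `is_well_formed_xml`, and the sample strings are hypothetical and introduced only for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record each start tag and its attributes as the lenient parser sees them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        self.start_tags.append((tag, dict(attrs)))


def html_start_tags(markup):
    """Return the (tag, attributes) pairs found by the lenient HTML tokenizer."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup, False on any parse error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


samples = {
    "unquoted attribute": "<p class=intro>text</p>",
    "omitted end tag": "<ul><li>one<li>two</ul>",
    "mixed-case tags": "<P>text</p>",
    "quoted and closed": '<p class="intro">text</p>',
}

for label, markup in samples.items():
    print(label, is_well_formed_xml(markup), html_start_tags(markup))
```

Only the last sample is accepted by the XML parser, while the HTML tokenizer reports start tags for all four.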
Note: these strategies are not necessarily mutually exclusive.
- Develop better tools and actively work to integrate them into products like WordPress and Dreamweaver. (We're doing this already. -Hixie)
- Identify the xmlns attribute on the html element as the definitive marker for XHTML (meaning what, exactly? If this _doesn't_ mean "and therefore use XML parsing rules for those documents", then what does it mean for a document to be XHTML? If it _does_ mean that, then this would cause about 13% of the Web to start showing error pages on new UAs, which would of course cause those UAs to start ignoring the spec since they can't ship a new version that drops a tenth of the Web on the floor -Hixie). Additionally, pick one of the following:
- Have it affect the error recovery rules for edge cases like empty non-void elements, and case sensitivity. (Unfortunately this would break about 15% of existing Web pages, causing them to render differently in new UAs compared to old UAs, and therefore causing those UAs to ignore the spec and not do this -Hixie)
- Have the xmlns attribute be the *one* non-recoverable error defined by HTML5. (I don't understand what that means -Hixie)
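For reference, detecting the proposed marker is the easy part. The sketch below, in Python, checks whether the first start tag in a document is an html element carrying the XHTML namespace in its xmlns attribute. The names `RootInspector` and `has_xhtml_marker` are hypothetical, and the lenient `html.parser` tokenizer is an assumption made for illustration, not something the proposal prescribes. The sketch says nothing about what a UA should do once the marker is found, which is the question raised in the comments above.

```python
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"


class RootInspector(HTMLParser):
    """Remember only the first start tag encountered and its attributes."""

    def __init__(self):
        super().__init__()
        self.root = None

    def handle_starttag(self, tag, attrs):
        if self.root is None:
            self.root = (tag, dict(attrs))


def has_xhtml_marker(markup):
    """Return True if the first element is <html> with xmlns set to the XHTML namespace."""
    inspector = RootInspector()
    inspector.feed(markup)
    inspector.close()
    if inspector.root is None:
        return False
    tag, attrs = inspector.root
    return tag == "html" and attrs.get("xmlns") == XHTML_NS


with_marker = '<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head></head></html>'
without_marker = "<!DOCTYPE html><html><head></head></html>"

print(has_xhtml_marker(with_marker), has_xhtml_marker(without_marker))
```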