HTML vs. XHTML
Differences Between HTML and XHTML
Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.
- Note: As the current WHATWG document is a draft, this section will need to track a moving target.
There is a community who find it valuable to be able to serve HTML5 documents which are also valid XML documents. They may, for example, use XML tools to generate the document, and they and others may process the document using XML tools. These documents are served as text/html.
This language is sometimes called "polyglot". It is the overlap language of documents which are both HTML5 documents and XML documents. Guidelines are listed below for how one can construct such a polyglot document which will work in either environment. Besides following the well-formedness rules of XML, there are some other restrictions to which one must adhere (for the sake of text/html documents).
This wiki web page is an example of such a document. You can parse it with an XML parser or an HTML parser.
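As a rough illustration of parsing one document both ways, the sketch below feeds a small sample to Python's standard-library XML parser and to its lenient HTML tokenizer, then compares the element names each one reports. The sample markup and the helper names (`POLYGLOT`, `tags_via_html`, `tags_via_xml`) are assumptions made for this example, and `html.parser` is not a full HTML5 parser, so this only approximates what a browser does.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative sample only (an assumption, not taken from the guidelines):
# lower-case names, quoted attributes, explicit namespace, empty DOCTYPE.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n'
    '<head><meta charset="utf-8" /><title>Polyglot sample</title></head>\n'
    '<body><p>Hello<br />world</p><script></script></body>\n'
    '</html>\n'
)


class StartTagCollector(HTMLParser):
    """Records every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_via_html(markup):
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def tags_via_xml(markup):
    # ET.fromstring raises ParseError if the markup is not well-formed XML.
    root = ET.fromstring(markup)
    # Strip the "{namespace}" prefix ElementTree puts on qualified names.
    return [el.tag.rsplit("}", 1)[-1] for el in root.iter()]


html_tags = tags_via_html(POLYGLOT)
xml_tags = tags_via_xml(POLYGLOT)
print(html_tags)
print(xml_tags)
```

If the two lists differ, or the XML parse raises, the sample has drifted outside the overlap language.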
|Feature||HTML Requirement||XHTML Requirement||Notes|
|Mime Type||Must use
||Must use an XML MIME type, such as
||It is the MIME type that determines what type of document you are using. Any document, including a document authored with the intention of being XHTML, served as |
Note that XHTML 1.0 previously defined that documents adhering to the compatibility guidelines were allowed to be served as
text/html, but HTML 5 now defines that such documents are HTML, not XHTML.
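The rule that the MIME type, not the author's intent, selects the parser can be sketched as a small dispatch function. The function name `parser_for` and the specific XML MIME types listed are assumptions for illustration; a real user agent applies more detailed rules.

```python
def parser_for(content_type: str) -> str:
    """Return which parser family a Content-Type value would select.

    Hypothetical helper: "html" for text/html, "xml" for common XML
    MIME types (including any "+xml" suffix), otherwise "other".
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"
    if mime.endswith("+xml"):
        return "xml"
    return "other"


print(parser_for("text/html; charset=utf-8"))
print(parser_for("application/xhtml+xml"))
```

Under this sketch, a document written as XHTML but served as `text/html` is still routed to the HTML parser.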
Syntax and Parsing
XHTML uses XML parsing requirements. HTML uses its own parsing rules, which are defined to match much more closely the way browsers actually handle HTML today. The following table describes the differences between how each is parsed.
The column on "Guidance for XHTML-HTML compatibility" lists ways in which a document can be crafted to work in either XHTML or HTML. The item will be bolded if it is a requirement for XHTML-compliant code to be changed, since XHTML will otherwise usually work as HTML, at least if its full features are constrained.
|Feature||HTML Requirement||XHTML Requirement||Notes||Guidance for XHTML-HTML compatibility|
|Parsing Modes||Three parsing modes are defined: no quirks mode, quirks mode and limited quirks mode. The mode is only ever changed from the default by the HTML parser, based on the presence, absence, or value of the DOCTYPE string.||XML parsing rules are used. There is only one mode.||The parsing modes in HTML also have an effect upon script and stylesheet processing. XHTML is considered to be in no quirks mode for these purposes.||Use an explicit |
|Error Handling||HTML does not have a well-formedness constraint; no errors are fatal. Graceful error handling and recovery procedures are thoroughly defined.||Well-formedness errors are fatal.||Ensure there are no well-formedness errors.|
|Character Encoding (including XML Declaration,
||The XML declaration is forbidden (it is treated as a bogus comment, and that style of comment is deprecated), but the
If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
|The XML declaration may be used to specify the character encoding, while
The default character encoding for XHTML is, according to XML rules,
|If you need to include XML 1.1-only markup, if you do not wish to convert the encoding of the document to UTF-8 or UTF-16 (since use of other encodings also requires a declaration), or if you wish to define an external SYSTEM DTD in the DOCTYPE but use standalone=yes (redundant?), you must use an XML Declaration for XHTML, but this may not be allowable in the future in HTML. For future compatibility, it would be best to avoid XML 1.1-only markup, convert to UTF-8 or UTF-16 (probably UTF-8 which could allow use of a |
|Namespaced elements||Elements and attributes for known vocabularies (HTML, SVG and MathML) are implicitly assigned to appropriate namespaces, according to the rules specified in the parsing algorithm. Elements in the HTML, SVG, or MathML namespaces may have an
||The HTML namespace must be declared for HTML elements according to the rules defined by the Namespaces in XML specification. Namespaces must be explicitly declared. The
||Declare HTML namespaces (or other namespaces) explicitly. Do not depend on the behavior of foreign namespaced elements in an HTML setting; if you need to include these, you will probably wish to set this foreign markup via CSS to |
|Namespaced attributes on HTML elements||Attributes of the form
||Do not use namespaced attributes on HTML elements. Do not depend on the behavior of foreign attributes in an HTML setting.|
|Namespace attributes on foreign elements||
Elements in the SVG namespace may have an
Elements in the MathML namespace may have an
Foreign elements may also have an
When parsed by an HTML parser, the
|The SVG and MathML namespaces must be declared for SVG and MathML elements, respectively, according to the rules defined by Namespaces in XML. The
|XLink attributes||Foreign elements may use the attributes
||XLink attributes may be specified on foreign elements using any prefix, subject to the conformance rules defined by Namespaces in XML. The XLink namespace must be declared according to the conformance rules defined by Namespaces in XML if XLink attributes are used within the document.||Do not use XLink attributes on HTML elements, and do not depend on them on foreign elements, as they will not work as such in HTML. If they are used, ensure they have the appropriate XLink namespace defined.|
Foreign elements may use the attributes
HTML elements may use the
|Any element, including HTML elements, may use the attributes
||Though they can be used on foreign elements, do not use |
|Attributes||Names are not case sensitive. Attribute minimization is allowed (i.e. omitting the equals sign and the value)||Names are case sensitive (and lower case). Attribute minimization is not allowed.||Use lower case attribute names. Do not minimize attributes.|
|Attribute values||White space characters are not normalized. Unquoted attribute values are allowed. Fixed or default attribute values ...?||White space characters are normalized to single spaces (unless attribute is of CDATA type?). Unquoted attribute values are not allowed. Default attribute values could conceivably be defined with a DTD.||Create whitespace in attribute values which is already normalized (converted to single spaces). Always quote attribute values. Do not rely on defining default or fixed attribute values in a DTD (unless it matches HTML behavior).|
|Space characters||The space characters are defined as:
||The space characters are defined as:
||The difference is the inclusion of Form Feed. Form feed characters are discouraged in XML 1.1.||Do not use the form feed character.|
A DOCTYPE is a mostly useless, but required, header. The DOCTYPE is used during parsing to determine the parsing mode. The keywords "
Conforming HTML documents are required to use
When using the obsolete but conforming DOCTYPEs based on the HTML 4.0 and 4.01 Strict DTDs, the system identifier is optional. The obsolete but conforming DOCTYPEs based on XHTML 1.0 Strict and XHTML 1.1 may also be specified.
Use of an internal subset is forbidden. The system identifier is never de-referenced by HTML implementations.
The DOCTYPE is optional. XML rules for case sensitivity apply (everything is case sensitive).
Either of the DOCTYPEs defined in HTML5 may be used, or any other custom DOCTYPE. If the public identifier is specified, the system identifier must also be specified. The obsolete status of the obsolete permitted DOCTYPEs defined for HTML does not apply to XHTML. Any DOCTYPE may be used, subject to the conformance rules defined by XML.
Use of an internal subset is permitted according to the requirements of XML. Some validating XML processors may dereference the system identifier, if used, but most browsers use non-validating processors.
|Use the empty DOCTYPE with no SYSTEM or PUBLIC identifiers and no use of an internal subset.|
|Element names||Element names are case insensitive.||Element names are case sensitive and lower-case.||Only use lower-case element names (as with attributes).|
|Void vs. Non-void Elements||Void elements only have a start tag; end tags must not be specified for void elements, and it is impossible for them to contain any content. A trailing slash may optionally be inserted at the end of the element's tag, immediately before the closing greater-than sign. For non-void elements (e.g., <script>), the trailing slash is a parsing error (ignored and thus treated as unclosed).||Void elements may use either the empty-element tag syntax (EmptyElemTag) or use a start tag immediately followed by an end tag, with no content in between. While it is possible for the element to contain content, this is non-conforming.||For void elements (e.g., <br />), do not include content or use a closing tag; only use a self-closing element with closing slash at the end (with a space preceding it for the sake of older browsers). For non-void elements, i.e., where content can exist (e.g., <script>), always use an explicit closing tag (not a self-closing tag) even if there is no content.|
|Unexpected end tags||Unexpected end tags (in HTML, an unexpected
||Unexpected end tags are well-formedness errors.||Do not add end tags unless there is an explicit and properly nested open tag before it.|
|End tag with attributes||?||An end tag with attributes is not allowed.||Do not use end tags with attributes.|
|Raw text elements|
For some elements, the start and/or end tags are optional and are implied by certain specified conditions. For example, the end tag for the
Omitting the end tag for other elements is a parse error and various error recovery procedures are applied appropriately.
|End tags must be explicitly included for all elements, except empty elements using the EmptyElemTag syntax.||Always use end tags (or self-closing tags for void elements).|
|Comment syntax||Comments must start with the four character sequence "
||The content of comments must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a hyphen. Violating this is a well-formedness error.||Only use comments of the "|
|Processing Instructions||HTML does not allow processing instructions and deprecates the bogus comments which appear in their form, whether in the form
||XHTML allows the use of XML processing instructions which are only closed by "?>".||Avoid ">" inside processing instructions (as these will close the "instruction" (comment) prematurely) (or one must strip out processing instructions entirely). Processing instructions might need to be avoided entirely in case HTML may in future disallow them completely.|
||Ensure sequence "|
|Unescaped Special Characters||
Unescaped ampersands (U+0026 AMPERSAND -
Unescaped less than signs (U+003C LESS-THAN SIGN -
|Unescaped ampersands and less-than signs may not appear within CharData or AttValue (basically, the normal text content of elements and attribute values.) Violation of this constraint is a well-formedness error.||Always escape ampersands and less-than signs in text content and attribute values. See CDATA for need to escape sequence "|
|Entity References||In HTML, all entity references are predefined and do not require a DTD.||There is no formal DTD for XHTML5, but one could provide an external DTD (if not an internal subset?) for use with one's entity-checking (or validating) parser. Be aware, however, that browsers do not universally use external entity-checking (or validating) parsers and may not read the external DTD. (Some still have bugs in that they mistakenly create a well-formedness error out of such missing entities instead of showing them as missing, making them clickable, or using an entity-checking or validating parser.)||Do not use entity references in XHTML (except for the 5 predefined entities: |
|Character data||The valid set of Unicode characters in XML 1.0 is more limited than that in HTML (we need to specify this here).|
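Several rows of the table above come down to one question: does the markup survive an XML parser? The sketch below runs a handful of HTML-tolerated patterns through Python's standard-library XML parser to show that each is a fatal well-formedness error there. The case labels, sample strings, and helper names (`is_well_formed`, `html_start_tags`) are assumptions for illustration, and `html.parser` is only a lenient tokenizer standing in for a real HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed(markup):
    """True if the XML parser accepts the markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def html_start_tags(markup):
    """Start tags reported by the lenient HTML tokenizer (never fatal)."""
    tags = []

    class Collector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            tags.append(tag)

    collector = Collector()
    collector.feed(markup)
    collector.close()
    return tags


# Hypothetical samples, one per table row being illustrated.
CASES = {
    "unclosed element": "<div><p>text</div>",
    "minimized attribute": "<input disabled />",
    "unquoted attribute value": "<p class=note>text</p>",
    "unescaped ampersand": "<p>fish & chips</p>",
    "double hyphen in comment": "<p><!-- a -- b -->text</p>",
    "mixed-case tags": "<P>text</p>",
    "named entity without DTD": "<p>a&nbsp;b</p>",
}
SAFE = '<p class="note">fish &amp; chips<br />a&#160;b</p>'

results = {label: is_well_formed(markup) for label, markup in CASES.items()}
for label, ok in results.items():
    print(f"{label}: well-formed XML = {ok}")
print("polyglot-safe sample:", is_well_formed(SAFE))
```

The HTML tokenizer accepts every one of these inputs without raising, and lower-cases `<P>` to `p`, which mirrors the case-insensitivity of element names in HTML.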
- In HTML, the
style elements are parsed as
CDATA elements. (Note: the definition of
CDATA differs from that in XML). In XML, they're parsed as normal elements (which means that things that look like comments are treated as real comments, and things that look like start tags actually are start tags).
- In HTML, the
textarea elements are parsed as
RCDATA elements. (Note: The definition of
RCDATA differs from that in SGML and there is no
- In HTML, if scripting is enabled, the
noscript element is parsed as a
CDATA element. If scripting is disabled, it's parsed as a normal element. In XHTML, the element is always parsed as a normal element, and can't really be used to stop content from being present when script is disabled.
- In HTML, the
noframes elements are parsed as
CDATA elements. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
- In HTML, tags for certain elements, which appear out of context, are ignored. This includes
- In XHTML,
table elements may contain child
tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
- The plaintext element has a special parsing requirement in HTML. (It is, however, forbidden.)
- In HTML, a line feed that immediately follows a
textarea start tag is ignored.
- Many other special handling rules for edge cases and error conditions, not all of which are listed here, apply in HTML. (such as?)
- The following are void elements in HTML (see void elements in table): In head (
meta), in body (
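The different treatment of script content described in the list above can be seen with a small experiment. In the sketch below, Python's `html.parser` (which treats `script` and `style` content as raw text) reports the markup-like string inside a script as plain data, while the XML parser turns it into a real child element. The sample string and the `Recorder` class are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical script body containing something that looks like a tag.
SCRIPT = '<script>var s = "<b>bold</b>";</script>'


class Recorder(HTMLParser):
    """Records start tags and text as the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


recorder = Recorder()
recorder.feed(SCRIPT)
recorder.close()
html_start_tags = recorder.start_tags
html_script_text = "".join(recorder.text)

# The XML parser has no raw-text mode: <b> becomes a child of <script>.
xml_root = ET.fromstring(SCRIPT)
xml_children = [child.tag for child in xml_root]

print(html_start_tags, repr(html_script_text))
print(xml_children, repr(xml_root.text))
```

For polyglot documents this is one reason to keep markup-like text out of inline scripts and styles, or to escape it in a way both parsers read identically.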
HTML Elements with Optional Tags
|Element||Start Tag||End Tag|
- document.write() and document.writeln() cannot be used in XHTML; they can in HTML.
- In XHTML, the use of the
innerHTML property requires that the string be a well-formed fragment of XML.
- DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace)
- Element.tagName and Node.nodeName return the value in uppercase in HTML but lower-case in XHTML (Node.localName is consistent now, as of HTML5).
- Document.createElement() is case insensitive (the canonical form is lowercase).
- Element.setAttributeNode() will change the attribute name to lowercase.
- Element.setAttribute() is case insensitive (the canonical form is lowercase).
- Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
- Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name will be lowercased before the rename takes place.
- In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
- In XHTML, browsers lack interoperability in this area. In Firefox and Safari, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element.
- XPath expressions targeted at pre-HTML5 browsers need to use the XHTML namespace for XHTML and null for HTML. (HTML5 browsers would use the XHTML namespace even in HTML.)
- Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
- CSS requires special handling of the body element in HTML for painting backgrounds on the canvas; this special handling does not apply to XHTML.
- For polyglot documents, use lower-case element selectors and style the html and body elements appropriately (?).
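The XML side of the DOM differences above can be approximated outside a browser. The sketch below uses Python's `xml.dom.minidom`, a generic XML DOM rather than a browser DOM, to show case-sensitive tag lookup and the null namespace produced by `createElement()`. The HTML-side behaviour (case-insensitive lookup, upper-case `tagName`) cannot be reproduced with this library, so treat this as a partial illustration; the sample document is an assumption.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical XHTML sample, parsed with XML rules.
doc = minidom.parseString(
    '<html xmlns="%s"><body><p>one</p><p>two</p></body></html>' % XHTML_NS
)

# Tag lookup is case sensitive under XML rules.
lower = doc.getElementsByTagName("p")
upper = doc.getElementsByTagName("P")

# createElement() keeps the name's case and assigns no namespace;
# createElementNS() is needed to place the element in the XHTML namespace.
plain = doc.createElement("DIV")
namespaced = doc.createElementNS(XHTML_NS, "div")

print(len(lower), len(upper))
print(plain.tagName, plain.namespaceURI)
print(namespaced.tagName, namespaced.namespaceURI)
```

Scripts meant to run in both serialisations would therefore tend to use lower-case names throughout and avoid relying on the namespace that `createElement()` assigns.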
Differences Between HTML4 and HTML5
Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs
This section might belong on a separate page.
- TODO (need to talk about the changes to the DOM API that HTML5 is making, compared with DOM2 and DOM3)