HTML vs. XHTML

From WHATWG Wiki
Revision as of 17:46, 25 June 2010 by Brettz9 (Syntax and Parsing: cannot find anywhere official that xml:id is explicitly allowed, so assuming it is not, since prefixed namespaces and other xml: attributes are not, besides xml:lang from XHTML1)

Differences Between HTML and XHTML

This page is currently being revised. Some information is incomplete or missing.

Please note that the information in here is based upon the current spec for (X)HTML5. Some of the issues technically do not apply to previous versions of HTML.

Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.

Note: As the current WHATWG document is a draft, this page will need to track a moving target.

Overlap Language

There is a community who find it valuable to be able to serve HTML5 documents which are also valid XML documents. They may, for example, use XML tools to generate the document, and they and others may process the document using XML tools. These documents are served as text/html.

This language is sometimes called "polyglot". It is the overlap language of documents which are both HTML5 documents and XML documents.

This wiki web page is an example of such a document. You can parse it with an XML parser or an HTML parser.
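
As a rough illustration of the overlap language, the sketch below feeds one small document to both an XML parser and an HTML tokenizer from Python's standard library and compares the element names each one sees. The sample markup and the names POLYGLOT, TagCollector, xml_tags and html_tags are made up for this example. Note that html.parser is only a lenient tokenizer, not an implementation of the HTML5 parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: markup chosen to be acceptable to both kinds of parser.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Demo</title></head>'
    '<body><p>Hello<br/>world</p></body></html>'
)


class TagCollector(HTMLParser):
    """Records the name of every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def xml_tags(text):
    """Element names in document order, as seen by an XML parser."""
    root = ET.fromstring(text)
    # ElementTree spells namespaced names as "{namespace}local"; keep the local part.
    return [element.tag.split('}', 1)[-1] for element in root.iter()]


def html_tags(text):
    """Element names in document order, as seen by the HTML tokenizer."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Both functions report the same sequence of element names for this sample, which is the property the overlap language is meant to guarantee.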


MIME Types

Feature HTML Requirement XHTML Requirement Notes
MIME Type
HTML: Must use text/html.
XHTML: Must use an XML MIME type, such as application/xml or application/xhtml+xml.
Notes: It is the MIME type that determines what type of document you are using. Any document, including a document authored with the intention of being XHTML, served as text/html is technically an HTML document.

Note that XHTML 1.0 previously defined that documents adhering to the compatibility guidelines were allowed to be served as text/html, but HTML 5 now defines that such documents are HTML, not XHTML.
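
The rule that the MIME type alone decides the document type can be sketched as a small dispatch function. This is only an illustration: the function name document_kind and the returned labels are invented here, and the "+xml" suffix check is a simplifying assumption about what counts as an XML MIME type.

```python
def document_kind(content_type: str) -> str:
    """Classify a Content-Type header value as 'html', 'xml' or 'other'."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(';', 1)[0].strip().lower()
    if mime == 'text/html':
        # Served as text/html: an HTML document, whatever the author intended.
        return 'html'
    if mime in ('application/xml', 'text/xml') or mime.endswith('+xml'):
        # Assumption: any "+xml" type (including application/xhtml+xml) is XML.
        return 'xml'
    return 'other'
```

For example, a document written as XHTML 1.0 but served with "text/html; charset=utf-8" is classified as 'html' by this sketch, matching the note above.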

Syntax and Parsing

XHTML uses XML parsing requirements. HTML uses its own parsing rules, which are defined to match much more closely the way browsers actually handle HTML today. The following table describes the differences in how each is parsed.

Feature HTML Requirement XHTML Requirement Notes
Parsing Modes
HTML: Three parsing modes are defined: no quirks mode, quirks mode and limited quirks mode. The mode is only ever changed from the default by the HTML parser, based on the presence, absence, or value of the DOCTYPE string.
XHTML: XML parsing rules are used. There is only one mode.
Notes: The parsing modes in HTML also have an effect upon script and stylesheet processing. XHTML is considered to be in no quirks mode for these purposes.
Error Handling
HTML: HTML does not have a well-formedness constraint, and no errors are fatal. Graceful error handling and recovery procedures are thoroughly defined.
XHTML: Well-formedness errors are fatal.
Namespaces
HTML: Elements and attributes for known vocabularies (HTML, SVG and MathML) are implicitly assigned to appropriate namespaces, according to the rules specified in the parsing algorithm.
XHTML: The rules defined in the Namespaces in XML specification apply. Namespaces must be explicitly declared.
Namespace attributes on HTML elements
HTML: Elements in the HTML namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/1999/xhtml". The attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the attribute ends up in no namespace.

Attributes of the form xmlns:prefix may not be used on HTML elements.

XHTML: The HTML namespace must be declared for HTML elements according to the rules defined by Namespaces in XML. The xmlns and xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.
Namespace attributes on foreign elements

HTML: Elements in the SVG namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/2000/svg". The attribute is optional because the namespace is implied during parsing.

Elements in the MathML namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/1998/Math/MathML". The attribute is optional because the namespace is implied during parsing.

Foreign elements may also have an xmlns:xlink attribute specified, if, and only if, it has the exact value "http://www.w3.org/1999/xlink". This attribute is optional, even if XLink attributes are used, because the namespace for XLink attributes is implied during parsing.

When parsed by an HTML parser, the xmlns and xmlns:xlink attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.

XHTML: The SVG and MathML namespaces must be declared for SVG and MathML elements, respectively, according to the rules defined by Namespaces in XML. The xmlns and xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns/" namespace.
XLink attributes
HTML: Foreign elements may use the attributes xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title and xlink:type. These attributes are placed in the "http://www.w3.org/1999/xlink" namespace. The prefix used must be "xlink".
XHTML: XLink attributes may be specified on foreign elements using any prefix, subject to the conformance rules defined by Namespaces in XML. The XLink namespace must be declared according to the conformance rules defined by Namespaces in XML if XLink attributes are used within the document.
XML attributes

HTML: Foreign elements may use the attributes xml:lang, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace" namespace. The prefix used must be "xml".

HTML elements may use the xml:lang attribute. The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing (unlike "lang", which does). HTML elements must not use the xml:base, xml:space, or xml:id attributes.

XHTML: Any element, including HTML elements, may use the attributes xml:lang, xml:id, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace" namespace. The prefix used must be "xml".
Space characters
HTML: The space characters are defined as:
  • U+0009 CHARACTER TABULATION
  • U+000A LINE FEED
  • U+000C FORM FEED
  • U+000D CARRIAGE RETURN
  • U+0020 SPACE
XHTML: The space characters are defined as:
  • U+0009 CHARACTER TABULATION
  • U+000A LINE FEED
  • U+000D CARRIAGE RETURN
  • U+0020 SPACE
Notes: The difference is the inclusion of Form Feed.
The DOCTYPE

HTML: A DOCTYPE is a mostly useless, but required, header. The DOCTYPE is used during parsing to determine the parsing mode. The keywords "DOCTYPE", "PUBLIC" and "SYSTEM", and the name "html" are treated case insensitively. The system identifier "about:legacy-compat" (and the public and system identifiers for previous versions of HTML) are case sensitive.

Conforming HTML documents are required to use <!DOCTYPE html> (case insensitively) or the legacy-compat version <!DOCTYPE html SYSTEM "about:legacy-compat">.

When using the obsolete but conforming DOCTYPEs based on the HTML 4.0 and 4.01 Strict DTDs, the system identifier is optional. The obsolete but conforming DOCTYPEs based on XHTML 1.0 Strict and XHTML 1.1 may also be specified.

Use of an internal subset is forbidden. The system identifier is never dereferenced by HTML implementations.

XHTML: The DOCTYPE is optional. XML rules for case sensitivity apply (everything is case sensitive).

Either of the DOCTYPEs defined in HTML5 may be used, or any other custom DOCTYPE. If the public identifier is specified, the system identifier must also be specified. The obsolete status of the obsolete permitted DOCTYPEs defined for HTML does not apply to XHTML. Any DOCTYPE may be used, subject to the conformance rules defined by XML.

Use of an internal subset is permitted according to the requirements of XML. Some validating XML processors may dereference the system identifier, if used, but most browsers use non-validating processors.

Void Elements
HTML: Void elements only have a start tag; end tags must not be specified for void elements, and it is impossible for them to contain any content. A trailing slash may optionally be inserted at the end of the element's tag, immediately before the closing greater-than sign.
XHTML: Void elements may use either the empty-element tag syntax (EmptyElemTag) or use a start tag immediately followed by an end tag, with no content in between. While it is possible for the element to contain content, this is non-conforming.
Raw text elements
RCDATA elements
Foreign elements
Normal elements
Optional tags

HTML: For some elements, the start and/or end tags are optional and are implied by certain specified conditions. For example, the end tag for the p element is implied by a subsequent p element.

Omitting the end tag for other elements is a parse error and various error recovery procedures are applied appropriately.

XHTML: End tags must be explicitly included for all elements, except empty elements using the EmptyElemTag syntax.
Unescaped Special Characters

HTML: Unescaped ampersands (U+0026 AMPERSAND - &, instead of &amp;) are permitted within the content of normal elements, RCDATA elements, foreign elements and attribute values where they are not considered to be ambiguous ampersands, and within Raw text elements.

Unescaped less-than signs (U+003C LESS-THAN SIGN - <, instead of &lt;) are permitted in Raw text elements, RCDATA elements and attribute values, excluding the unquoted attribute value syntax.

XHTML: Unescaped ampersands and less-than signs may not appear within CharData or AttValue (basically, the normal text content of elements and attribute values). Violation of this constraint is a well-formedness error.
Comment syntax
HTML: Comments must start with the four character sequence "<!--" and must be ended by the three character sequence "-->". The content of comments must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Violating these constraints is a parse error and various error recovery procedures are applied appropriately.
XHTML: The content of comments must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a hyphen. Violating this is a well-formedness error.
CDATA sections
HTML: <![CDATA[...]]> is a bogus comment. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is just regular character data.
XHTML: <![CDATA[...]]> is a CDATA section. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is a well-formedness error.
Processing Instructions
HTML: HTML does not allow processing instructions and deprecates the bogus comments which appear in their form, whether in the form <?foo ...> (without a closing '?') or <?foo ...?>.
XHTML: XHTML allows the use of XML processing instructions, which are only closed by "?>".
Notes: To keep an XHTML document compatible when it is moved to HTML, either avoid ">" inside processing instructions (it would close the "instruction", which HTML treats as a comment, prematurely) or strip out processing instructions entirely.
Character References
Entity References
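
The difference in error handling described in the table can be sketched with Python's standard library: the XML parser stops at the first well-formedness error, while the HTML tokenizer keeps going. The sample markup and the names BROKEN, TextCollector, is_well_formed_xml and recovered_text are invented for this illustration, and html.parser is a lenient stand-in rather than a conforming HTML5 parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical input: unclosed p elements and an unescaped ampersand.
BROKEN = '<p>one<p>two & three'


class TextCollector(HTMLParser):
    """Gathers the character data that the HTML tokenizer manages to recover."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def is_well_formed_xml(markup):
    """True if an XML parser accepts the markup; any error is fatal."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def recovered_text(markup):
    """Text content recovered by the lenient HTML tokenizer."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return ''.join(collector.chunks)
```

Under these assumptions, is_well_formed_xml(BROKEN) is false, while recovered_text(BROKEN) still yields the text of both paragraphs.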

Content to move into table

THIS PAGE IS IN THE PROCESS OF BEING REVISED

  • HTML Parse Errors with special handling:
    • End tags with attributes.
    • Unexpected end tags (in HTML, an unexpected </br> or </p> can cause the start tag to be implied before it).
  • In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
  • In HTML, the script and style elements are parsed as CDATA elements. (Note: the definition of CDATA differs from that in XML). In XML, they're parsed as normal elements (which means that things that look like comments are treated as real comments, and things that look like start tags actually are start tags).
  • In HTML, the title and textarea elements are parsed as RCDATA elements. (Note: The definition of RCDATA differs from that in SGML and there is no RCDATA in XML).
  • In HTML, if scripting is enabled, the noscript element is parsed as a CDATA element. If scripting is disabled, it's parsed as a normal element. In XHTML, the element is always parsed as a normal element, and can't really be used to stop content from being present when script is disabled.
  • In HTML, the iframe, noembed and noframes elements are parsed as CDATA elements. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
  • White space characters in attribute values are normalized to spaces in XHTML.
  • In HTML, elements with optional tags are implied in certain conditions.
  • In HTML, tags for certain elements, which appear out of context, are ignored. This includes caption, col, colgroup, frame, frameset, head, option, optgroup, tbody, td, tfoot, th, thead, tr.
  • The plaintext element has a special parsing requirement in HTML. (It is, however, forbidden.)
  • In HTML, a line feed that immediately follows a pre, listing or textarea start tag is ignored.
  • Many other edge cases and error conditions, not all of which are listed here, receive special handling in HTML.
  • In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
  • In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both (see HTML Elements with Optional Tags).
  • In XHTML, empty elements may use either the empty element syntax (<br/>) or have an end tag immediately follow the start tag (<br></br>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and can be omitted. End tags for void elements are forbidden. The void elements are:
    • base, link, meta, hr, br, img, embed, param, area, col and input
  • HTML allows attribute minimisation (i.e. omitting the equals sign and the value); XHTML does not.
  • HTML allows the use of unquoted attribute values; XHTML does not.
  • In HTML, all entity references are predefined and do not require a DTD. Because there is no DTD for XHTML5, entity references cannot be used in XHTML, except for the five entities predefined by XML: &amp;, &lt;, &gt;, &quot; and &apos;.
    • You can provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
  • The set of valid Unicode characters in XML 1.0 is more limited than that in HTML.
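
A few of the points in the list above (case-insensitive names, attribute minimisation, unquoted attribute values and undefined entity references) can be checked with a short sketch. The names AttrCollector, html_attributes and is_well_formed_xml are hypothetical, and html.parser again stands in for a real HTML parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Maps each start tag name to the attributes the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.seen = {}

    def handle_starttag(self, tag, attrs):
        self.seen[tag] = dict(attrs)


def html_attributes(markup):
    """Tag names and attributes as reported by the lenient HTML tokenizer."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.seen


def is_well_formed_xml(markup):
    """True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

With this sketch, html_attributes('<INPUT DISABLED TYPE=checkbox>') reports a lowercase input tag whose disabled attribute has no value, whereas the XML parser rejects the minimised and unquoted forms as well as an undefined reference such as &nbsp;.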

HTML Elements with Optional Tags

Element Start Tag End Tag
html optional optional
head optional optional
body optional optional
li required optional
dt required optional
dd required optional
p required optional
colgroup optional optional
thead required optional
tbody optional optional
tfoot required optional
tr required optional
th required optional
td required optional
rt required optional
rp required optional
optgroup required optional
option required optional
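
How an omitted end tag can be implied by a following start tag is sketched below. This is a toy model, not the HTML5 tree construction algorithm: the IMPLIED_END table covers only a handful of elements, and the token format and the function name add_implied_end_tags are assumptions made for the example.

```python
# Simplified assumption: for each element, the start tags that imply its end tag.
IMPLIED_END = {
    "p": {"p"},
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
}


def add_implied_end_tags(tokens):
    """Serialise (kind, value) tokens, writing out end tags that were omitted.

    kind is "start", "end" or "text"; value is a tag name or a text run.
    """
    out = []
    open_elements = []
    for kind, value in tokens:
        if kind == "start":
            # Close any open element whose end tag this start tag implies.
            while open_elements and value in IMPLIED_END.get(open_elements[-1], ()):
                out.append(f"</{open_elements.pop()}>")
            open_elements.append(value)
            out.append(f"<{value}>")
        elif kind == "end":
            # Close everything up to and including the named element;
            # an end tag with no matching open element is ignored here.
            if value in open_elements:
                while open_elements:
                    name = open_elements.pop()
                    out.append(f"</{name}>")
                    if name == value:
                        break
        else:
            out.append(value)
    # At end of input, close whatever is still open.
    while open_elements:
        out.append(f"</{open_elements.pop()}>")
    return "".join(out)
```

Feeding it the tokens for <ul><li>one<li>two</ul> produces markup with both li end tags written out, which is the form XHTML would require.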

Markup

  • The namespace declaration (xmlns attribute) is required in XHTML. The xmlns attribute is also allowed to appear on any element in HTML on the condition that it has the value "http://www.w3.org/1999/xhtml".
    • <html xmlns="http://www.w3.org/1999/xhtml">
    • In HTML, the xmlns attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the attribute ends up in the null namespace.
    • In XML (with an XML Namespaces-aware parser), an xmlns attribute is part of the namespace declaration mechanism, and an element cannot actually have an xmlns attribute in the null namespace. In DOM implementations, the attribute ends up in the "http://www.w3.org/2000/xmlns/" namespace.
  • XHTML allows non-XHTML elements and attributes (in different namespaces) to be used; HTML does not.
  • In HTML, the noscript element may be used. In XHTML, it is forbidden.
  • In XHTML, table elements may contain child tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
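
The different roles of the xmlns attribute can be seen by parsing the same start tag both ways. In the sketch below, the names MARKUP, RootAttrs, xml_view and html_view are hypothetical; ElementTree plays the part of a namespace-aware XML parser and html.parser that of an HTML tokenizer.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical minimal document carrying only the namespace declaration.
MARKUP = f'<html xmlns="{XHTML_NS}"></html>'


class RootAttrs(HTMLParser):
    """Remembers the attribute list of the first start tag."""

    def __init__(self):
        super().__init__()
        self.root_attrs = None

    def handle_starttag(self, tag, attrs):
        if self.root_attrs is None:
            self.root_attrs = attrs


def xml_view(markup):
    """(qualified tag, attributes) as seen by a namespace-aware XML parser."""
    root = ET.fromstring(markup)
    return root.tag, dict(root.attrib)


def html_view(markup):
    """Attribute list of the root start tag as seen by the HTML tokenizer."""
    parser = RootAttrs()
    parser.feed(markup)
    parser.close()
    return parser.root_attrs
```

In the XML view the declaration is consumed by the namespace mechanism, so the element name is qualified and no xmlns attribute is exposed; in the HTML view xmlns is reported as an ordinary attribute.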

Character Encoding

  • In XHTML, the XML declaration may be used to specify the character encoding. In HTML, the XML declaration is forbidden.
  • In HTML, the meta element with a charset attribute may be used instead. In XHTML, it is forbidden unless it specifies 'UTF-8' (case insensitively), and it is ignored if included.
  • The default character encoding for XHTML is, according to XML rules, UTF-8 or UTF-16. If the encoding is unspecified in HTML, it should be determined through implementation-specific heuristics or fall back to a default value (Note: this section of the spec is not yet finished).
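
The two ways of declaring an encoding can be sketched as follows. The names xml_text, MetaCharsetFinder and html_declared_charset are invented for this example, and the ASCII pre-scan used for the HTML case is a deliberate simplification, not the encoding sniffing algorithm from the spec.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def xml_text(raw: bytes) -> str:
    """Root element text; the XML parser honours the XML declaration's
    encoding and otherwise applies the XML default."""
    return ET.fromstring(raw).text


class MetaCharsetFinder(HTMLParser):
    """Records the value of the first <meta charset=...> encountered."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            self.charset = dict(attrs).get("charset")


def html_declared_charset(raw: bytes):
    """Encoding declared by a meta element, or None if there is none."""
    finder = MetaCharsetFinder()
    # Simplifying assumption: the markup before the declaration is ASCII-compatible.
    finder.feed(raw.decode("ascii", errors="replace"))
    finder.close()
    return finder.charset
```

A byte string that starts with <?xml version="1.0" encoding="ISO-8859-1"?> is decoded as Latin-1 by xml_text, while the same bytes without a declaration would be read as UTF-8.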

Scripts

  • document.write() and document.writeln() cannot be used in XHTML; they can in HTML.
  • In XHTML, the use of the innerHTML property requires that the string be a well-formed fragment of XML.
  • DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace)
    • Element.tagName and Node.nodeName return the value in uppercase.
    • Document.createElement() is case insensitive (the canonical form is lowercase).
    • Element.setAttributeNode() will change the attribute name to lowercase.
    • Element.setAttribute() is case insensitive (the canonical form is lowercase).
    • Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
    • Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name will be lowercased before the rename takes place.
  • In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
    • In XHTML, browsers lack interoperability in this area. In Firefox and Safari, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element.
  • XPath expressions targeted at pre-HTML5 browsers need to use the XHTML namespace for XHTML and null for HTML. (HTML5 browsers would use the XHTML namespace even in HTML.)
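
The XML side of these DOM differences can be sketched with Python's xml.dom.minidom, which is an XML DOM and therefore case sensitive. It says nothing about how a browser's HTML DOM behaves, and the document built here (a root element called "root" with two children) and the function name build_document are purely illustrative.

```python
from xml.dom.minidom import getDOMImplementation

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_document():
    """Create an XML document holding one un-namespaced and one namespaced element."""
    doc = getDOMImplementation().createDocument(None, "root", None)
    # createElement() in an XML DOM: case is preserved and the namespace is null.
    plain = doc.createElement("DIV")
    # createElementNS() is needed to place an element in the XHTML namespace.
    namespaced = doc.createElementNS(XHTML_NS, "div")
    doc.documentElement.appendChild(plain)
    doc.documentElement.appendChild(namespaced)
    return doc, plain, namespaced
```

In this XML DOM, getElementsByTagName("DIV") and getElementsByTagName("div") return different elements, and only the element made with createElementNS() has a non-null namespaceURI.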

Stylesheets

  • Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
  • CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which does not apply to XHTML.
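
The case-sensitivity rule for selectors can be reduced to a one-line comparison. The function below is a hypothetical sketch for type selectors only; it ignores namespaces, attribute selectors and everything else a real selector engine has to handle.

```python
def type_selector_matches(selector: str, element_name: str, html_document: bool) -> bool:
    """Compare a type selector with an element name.

    Assumption: names are compared ASCII case-insensitively in HTML
    documents and exactly in XHTML (XML) documents.
    """
    if html_document:
        return selector.lower() == element_name.lower()
    return selector == element_name
```

Under this model a rule written as "DIV { ... }" applies to a div element in an HTML document but not in an XHTML one.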

Differences Between HTML4 and HTML5

See HTML5 differences from HTML4.

Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs

This section might belong on a separate page.

  • TODO (need to talk about the changes to the DOM API that HTML5 is making, compared with DOM2 and DOM3)
