HTML vs. XHTML: Difference between revisions

From WHATWG Wiki
== Differences Between HTML and XHTML ==


<p style="border: 1px dashed lightgray; background-color: #FFEEEE; padding: .5em 1em;"><strong>This page is currently being revised. Some information is incomplete or missing.</strong></p>
 
<p style="border: 1px dashed lightgray; background-color: #FFF8E4; padding: .5em 1em;">Please note that the information in here is based upon the current spec for (X)HTML5.  Some of the issues technically do not apply to previous versions of HTML.</p>


Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.


:'''Note''': As the current WHATWG document is a draft, this section will need to track a moving target. Differences marked @@@ are differences that could theoretically be changed without affecting backwards compatibility.
The document at http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html provides a similar analysis.
 
=== Overlap Language ===
 
There is a community who find it valuable to be able to serve HTML5 documents which are also valid XML documents. They may, for example, use XML tools to generate the document, and they and others may process the document using XML tools.  These documents are served as text/html.
 
This language is sometimes called "polyglot". It is the overlap language of documents which are both HTML5 documents and XML documents. Guidelines are listed below for how one can construct such a polyglot document which will work in either environment. Besides following the well-formedness rules of XML, there are some other restrictions to which one must adhere (for the sake of text/html documents).
 
This wiki web page is an example of such a document. You can parse it with an XML parser or an HTML parser.


=== MIME Types ===


* XHTML must be served with an XML MIME type, such as <code>application/xml</code> or <code>application/xhtml+xml</code>.
* HTML must be served as <code>text/html</code>.
{| class="wikitable" border="1"
|-
!  Feature
!  HTML Requirement
!  XHTML Requirement
!  Notes
|-
|  MIME Type
|  Must use <code>text/html</code>.
|  Must use an XML MIME type, such as <code>application/xml</code> or <code>application/xhtml+xml</code>.
|  It is the MIME type (which may or may not be determined by file extension) that determines what type of document you are using.  Any document served as <code>text/html</code>, including a document authored with the intention of being XHTML, is technically an HTML document.
|}
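
As a minimal sketch of the rule in the table above, the following Python function classifies a document purely from its MIME type. The function name <code>document_kind</code> and the <code>"other"</code> fallback are illustrative assumptions, not something defined by the spec; the treatment of <code>+xml</code> suffixes as XML MIME types follows common practice.

```python
def document_kind(content_type: str) -> str:
    """Classify a document as 'html', 'xml' or 'other' from a Content-Type value.

    Hypothetical helper for illustration only: the markup itself is never
    inspected, because the MIME type alone decides which parser applies.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # application/xml, text/xml and any "+xml" suffixed type are XML MIME types.
    if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
        return "xml"
    return "other"


if __name__ == "__main__":
    for value in ("text/html; charset=utf-8", "application/xhtml+xml", "image/png"):
        print(value, "->", document_kind(value))
```

Under this sketch, a file written as XHTML but sent with <code>text/html</code> is classified as HTML, which matches the note in the table.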


It is the MIME type that determines what type of document you are using. If you attempt to send XHTML as <code>text/html</code>, you are actually just using HTML, possibly with syntax errors.
Note that XHTML 1.0 previously defined that documents adhering to the compatibility guidelines were allowed to be served as <code>text/html</code>, but HTML 5 now defines that such documents are HTML, not XHTML.


Technically, according to the spec, XHTML 1.0 is allowed to be served as <code>text/html</code>.  But, due to the above reason, such a document is considered to be an HTML document, not an XHTML document.
=== Syntax and Parsing ===


=== Parsing ===
XHTML uses XML parsing requirements. HTML uses its own which are defined much more closely to the way browsers actually handle HTML today.  The following table describes the differences between how each is parsed.


XHTML uses XML parsing requirements. HTML uses its own which are defined much more closely to the way browsers actually handle HTML today.
The column on "Guidance for XHTML-HTML compatibility" lists ways in which a document can be crafted to work in either XHTML or HTML. The item will be bolded if it is a requirement for XHTML-compliant code to be changed, since XHTML will otherwise usually work as HTML, at least if its full features are constrained.
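
The difference in error handling can be illustrated with Python's standard library. This is only a rough sketch: <code>xml.etree.ElementTree</code> is a real XML parser, but <code>html.parser.HTMLParser</code> is a lenient tokenizer and does not implement the HTML5 parsing algorithm, so it stands in here merely as an example of a parser that keeps going on input an XML parser rejects. The names <code>TagCollector</code>, <code>xml_accepts</code> and <code>html_start_tags</code> are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup with an unescaped ampersand and a <br> that is never closed.
MARKUP = "<p>Fish & chips<br></p>"


class TagCollector(HTMLParser):
    """Collects the names of the start tags seen by the lenient parser."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def xml_accepts(markup: str) -> bool:
    """Return True if the markup is well-formed XML, False on a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def html_start_tags(markup: str) -> list:
    """Return the start tags a lenient HTML tokenizer recovers from the markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print("XML parser accepts it:", xml_accepts(MARKUP))
    print("HTML tokenizer sees:", html_start_tags(MARKUP))
```

Escaping the ampersand and closing the <code>br</code> element (<code>&lt;p&gt;Fish &amp;amp; chips&lt;br/&gt;&lt;/p&gt;</code>) makes the same content acceptable to the XML parser.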


* In XHTML, well-formedness errors are fatal. In HTML, error handling rules are much more graceful. Well-formedness errors, which are also syntax errors in HTML, include the following:
{| class="wikitable" border="1"
** Unencoded ampersands (<code>&amp;</code> instead of <code>&amp;amp;</code>), and less than signs (<code>&lt;</code> instead of <code>&amp;lt;</code>) (This does not apply to <code>CDATA</code>).
|-
** Comments containing extra pairs of hyphens or ending with a hyphen. e.g.
!  Feature
*** <code>&lt;!--<var> syntax -- error </var>--&gt;</code> or
!  HTML Requirement
*** <code>&lt;!--<var> syntax error -</var>--&gt;</code>.
!  XHTML Requirement
** Mismatched end tags (does not apply to elements with optional tags)
!  Notes
** Unclosed tags.
! Guidance for XHTML-HTML compatibility
** Unexpected characters occurring in or before attribute names.
|-
** Unexpected occurrence of EOF.
!Parsing Modes
** Unexpected characters before the DOCTYPE name.
|Three parsing modes are defined: ''no quirks mode'', ''quirks mode'' and ''limited quirks mode''. The mode is only ever changed from the default by the HTML parser, based on the presence, absence, or value of the DOCTYPE string.
** Missing DOCTYPE name.
|XML parsing rules are used. There is only one mode.
** A <code>PUBLIC</code> identifier in a <code>DOCTYPE</code> without a <code>SYSTEM</code> identifier (Note: including either of these is a syntax error in HTML5; but, in XML only the <code>SYSTEM</code> identifier is allowed to occur on its own).
|The parsing modes in HTML also have an effect upon script and stylesheet processing. XHTML is considered to be in ''no quirks mode'' for these purposes.
** End tags with attributes.
| '''Use an explicit <code>&lt;!DOCTYPE html&gt;</code> (case insensitively) or legacy-compat version <code>&lt;!DOCTYPE html SYSTEM "about:legacy-compat"&gt;</code> for the sake of HTML and thus trigger no quirks parsing.'''
** Unexpected end tags (in HTML, an unexpected <code>&lt;/br></code> or <code>&lt;/p></code> can cause the start tag to be implied before it).
|-
* The internal subset is permitted in XML, but meaningless (and forbidden) in HTML.
!Error Handling
** In some cases, an internal subset in HTML would end up being partly rendered inline.
|HTML does not have a well-formedness constraint; no errors are fatal. Graceful error handling and recovery procedures are thoroughly defined.
* The sequence of characters &quot;<code>]]&gt;</code>&quot; when it does not mark the end of a <code>CDATA</code> section is a well-formedness error in XHTML, but valid in HTML.
|Well-formedness errors are fatal
* In XHTML: <code>&lt;![CDATA[...]]&gt;</code> is a <code>CDATA</code> section. In HTML, it's a bogus comment.
* In XHTML, <code>&lt;?foo ...?&gt;</code> is a processing instruction. In HTML, it's a bogus comment.
| Ensure there are no well-formedness errors.
* In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
|-
* In HTML, the <code>script</code> and <code>style</code> elements are parsed as <code>CDATA</code>. (Note: the definition of <code>CDATA</code> differs from that in XML). In XML, they're parsed as normal elements (which means that comments are treated as <em>real</em> comments, and things that look like start tags actually are start tags).
! Character Encoding (including XML Declaration, <code>meta</code>)
* In HTML, the <code>title</code> and <code>textarea</code> elements are parsed as <code>RCDATA</code>. (Note: The definition of <code>RCDATA</code> differs from that in SGML and there is no <code>RCDATA</code> in XML).
| The XML declaration is forbidden (treated as a bogus comment, but such style of comments are deprecated), but the <code>meta</code> element with a <code>charset</code> attribute may be used instead.
* In HTML, if scripting is enabled, the <code>noscript</code> element is parsed as <code>CDATA</code>. If scripting is disabled, it's parsed as <code>PCDATA</code>. In XHTML, the element has no effect, and can't really be used to stop content from being present when script is disabled.
If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
* In HTML, the <code>iframe</code>, <code>noembed</code> and <code>noframes</code> elements are parsed as <code>CDATA</code>. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
| The XML declaration may be used to [http://wiki.whatwg.org/wiki/FAQ#How_do_I_specify_the_character_encoding.3F specify the character encoding], while <code>meta</code> is only allowed as case-insensitive "UTF-8" (and is ignored if included).
* White space characters in attribute values are [http://www.w3.org/TR/REC-xml/#AVNormalize normalized] to spaces in XHTML.
The default character encoding for XHTML is, according to XML rules, <code>UTF-8</code> or <code>UTF-16</code>.
* Elements with optional tags are implied in certain conditions.
|
* In HTML, <code>base</code>, <code>link</code>, <code>meta</code>, <code>style</code> and <code>title</code> elements with tags occurring in the body are moved into the head. In XHTML, they stay where they were specified.
| '''If you need to include XML 1.1-only markup, if you do not wish to convert the encoding of the document to UTF-8 or UTF-16 (since use of other encodings also requires a declaration), or if you wish to define an external SYSTEM DTD in the DOCTYPE but use standalone=yes (redundant?), you must use an XML Declaration for XHTML, but this may not be allowable in the future in HTML. For future compatibility, it would be best to avoid XML 1.1-only markup, convert to UTF-8 or UTF-16 (probably UTF-8 which could allow use of a <code>meta</code> tag), and avoid use of a SYSTEM DTD (rendering the standalone=yes unnecessary), respectively. Do not use a <code>meta</code> tag, unless it is UTF-8 (and included in the first 512 bytes of the document), in which case it is probably a good idea to include it for the sake of HTML (as <nowiki><meta charset="UTF-8"></nowiki>) in case you cannot specify such in a content header.'''
* In HTML, tags for certain elements, which appear out of context, are ignored. This includes <code>caption</code>, <code>col</code>, <code>colgroup</code>, <code>frame</code>, <code>frameset</code>, <code>head</code>, <code>option</code>, <code>optgroup</code>, <code>tbody</code>, <code>td</code>, <code>tfoot</code>, <code>th</code>, <code>thead</code>, <code>tr</code>.
|-
* The <code>plaintext</code> element has a special parsing requirement in HTML (it is, however, forbidden).
!Namespaced elements
* <em>Many other special cases of edge-case and error handling, not all of which are listed here, occur in HTML.</em>
|Elements and attributes for known vocabularies (HTML, SVG and MathML) are implicitly assigned to appropriate namespaces, according to the rules specified in the parsing algorithm. Elements in the HTML, SVG, or MathML namespaces may have an <code>xmlns</code> attribute explicitly specified, if, and only if, it has the exact value <code>"http://www.w3.org/1999/xhtml"</code> (see [http://wiki.whatwg.org/wiki/FAQ#What_is_the_namespace_declaration.3F namespace declaration]).  The attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the xmlns attribute itself ends up in no namespace. Foreign elements are also not treated as being in another namespace and will have no effect except for displaying by default as inline elements (and be aware that self-closing elements cannot be used as such since unrecognized elements will be treated as though they are non-void; thus one cannot, for example, type <code><caesura /></code> in HTML or it will be treated as though there is no immediate closing tag). Namespaced prefixes are not allowed on HTML elements; a prefixed xmlns attribute cannot be used even if it is defined in the XHTML namespace.
| The XHTML namespace must be declared for HTML elements according to the rules defined by the ''[http://www.w3.org/TR/REC-xml-names/ Namespaces in XML]'' specification. Namespaces must be explicitly declared. The <code>xmlns</code> attribute ends up in the <code>"http://www.w3.org/2000/xmlns"</code> namespace. Foreign elements can be used independently of HTML elements, as long as they are assigned to their own namespace.
|
| Declare HTML namespaces (or other namespaces) explicitly and do not prefix XHTML elements. '''Do not depend on the behavior of foreign namespaced elements in an HTML setting; if you need to include these, you will probably wish to set this foreign markup via CSS to <code>display:none</code>. You should explicitly close (not self-close) all empty elements defined in a non-XHTML namespace, since otherwise when used in HTML, HTML will treat them as though they have not been closed.'''
|-
!Namespaced attributes on HTML elements
| Attributes of the form <code>xmlns:<var>prefix</var></code> may not be used on HTML elements.
| The <code>xmlns:<var>prefix</var></code> attributes end up in the <code>"http://www.w3.org/2000/xmlns"</code> namespace.
|
| '''Do not use namespaced attributes on HTML elements. Do not depend on the behavior of foreign attributes in an HTML setting.'''
|-
!Namespace attributes on foreign elements
|
Elements in the SVG namespace may have an <code>xmlns</code> attribute specified, if, and only if, it has the exact value <code>"http://www.w3.org/2000/svg"</code>. The attribute is optional because the namespace is implied during parsing.


=== Syntax ===
Elements in the MathML namespace may have an <code>xmlns</code> attribute specified, if, and only if, it has the exact value <code>"http://www.w3.org/1998/Math/MathML"</code>.  The attribute is optional because the namespace is implied during parsing.


* In HTML, [http://blog.whatwg.org/faq/#doctype the <code>doctype</code> is required]. In XHTML, it is optional.
Foreign elements may also have an <code>xmlns:xlink</code> attribute specified, if, and only if, it has the exact value <code>"http://www.w3.org/1999/xlink"</code>. This attribute is optional, even if XLink attributes are used, because the namespace for XLink attributes is implied during parsing.
* In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
* In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both:
** <code>html</code> (both)
** <code>head</code> (both)
** <code>body</code> (both)
** <code>li</code> (end tag)
** <code>dt</code> (end tag)
** <code>dd</code> (end tag)
** <code>p</code> (end tag)
** <code>colgroup</code> (both)
** <code>thead</code> (end tag)
** <code>tbody</code> (both)
** <code>tfoot</code> (end tag)
** <code>tr</code> (end tag)
** <code>td</code> (end tag)
** <code>th</code> (end tag)
* In XHTML, empty elements may use either the empty element syntax (<code>&lt;br/&gt;</code>) or have an end tag immediately follow the start tag (<code>&lt;br&gt;&lt;/br&gt;</code>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and can be omitted. End tags for void elements are forbidden.
** <code>base</code>, <code>link</code>, <code>meta</code>, <code>hr</code>, <code>br</code>, <code>img</code>, <code>embed</code>, <code>param</code>, <code>area</code>, <code>col</code> and <code>input</code>
** Note: the following are treated as void elements for the purposes of the parsing requirements, but, as they are obsolete and non-standard, the trailing slash is not permitted: <code>basefont</code>, <code>bgsound</code>, <code>spacer</code>, <code>wbr</code> (although, since these elements are not permitted anyway, it doesn't make much difference).
* HTML allows attribute minimisation (i.e. omitting the value), XHTML does not.
* HTML allows the use of unquoted attribute values, XHTML does not.
* XHTML allows the use of <code>CDATA</code> sections, HTML does not.
* XHTML allows the use of processing instructions, HTML does not.
* In HTML, all entity references are predefined and do not require a DTD. But because there is no DTD for XHTML5, entity references cannot be used in XHTML (excluding the 5 predefined entities: <code>&amp;amp;</code>, <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;apos;</code>).
** You may provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
* The valid set of Unicode characters in XML 1.0 is more restricted than that in HTML.
* Namespace prefixes are permitted in XHTML. They are forbidden in HTML.  
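
Several of the syntax differences listed above can be checked with a short Python sketch. As an assumption for illustration, <code>xml.etree.ElementTree</code> stands in for an XML (XHTML) parser and <code>html.parser.HTMLParser</code> for a lenient HTML tokenizer; the latter is not an HTML5-conforming parser. The names <code>AttributeCollector</code>, <code>html_attributes</code>, <code>is_well_formed_xml</code> and <code>SAMPLES</code> are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Records the attributes of every start tag the lenient tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when minimised.
        self.attributes.update(attrs)


def html_attributes(markup):
    """Return the attributes recovered from the markup by the HTML tokenizer."""
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.attributes


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup as a document."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Illustrative snippets, one per syntax rule from the list above.
SAMPLES = {
    "minimised and unquoted attributes": "<input type=checkbox checked>",
    "quoted attributes, empty element syntax": '<input type="checkbox" checked="checked"/>',
    "named entity reference without a DTD": "<p>&nbsp;</p>",
    "numeric character reference": "<p>&#160;</p>",
    "tag names differing only in case": "<P>text</p>",
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        print(label, "-> well-formed XML:", is_well_formed_xml(markup))
```

The HTML tokenizer still reports both attributes of the first sample, with <code>None</code> as the value of the minimised <code>checked</code> attribute, while the XML parser rejects that sample outright.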


=== Markup ===
When parsed by an HTML parser, the <code>xmlns</code> and <code>xmlns:xlink</code> attributes end up in the <code>"http://www.w3.org/2000/xmlns"</code> namespace.
|The SVG and MathML namespaces must be declared for SVG and MathML elements, respectively, according to the rules defined by ''Namespaces in XML''.  The <code>xmlns</code> and <code>xmlns:<var>prefix</var></code> attributes end up in the <code>"http://www.w3.org/2000/xmlns"</code> namespace.
|
|
|-
!XLink attributes
|Foreign elements may use the attributes <code>xlink:actuate</code>, <code>xlink:arcrole</code>, <code>xlink:href</code>, <code>xlink:role</code>, <code>xlink:show</code>, <code>xlink:title</code> and <code>xlink:type</code>.  These attributes are placed in the <code>"http://www.w3.org/1999/xlink"</code> namespace.  The prefix used must be "<code>xlink</code>".
|XLink attributes may be specified on foreign elements using any prefix, subject to the conformance rules defined by ''Namespaces in XML''.  The XLink namespace must be declared according to the conformance rules defined by ''Namespaces in XML'' if XLink attributes are used within the document.
|
| '''Do not use XLink attributes on HTML elements, and do not depend on them on foreign elements, as they will not work as such in HTML.''' If they are used, ensure they have the appropriate XLink namespace defined.
|-
!XML attributes
|
Foreign elements may use the attributes <code>xml:lang</code>, <code>xml:id</code>, <code>xml:base</code> and <code>xml:space</code>.  These attributes are placed in the <code>"http://www.w3.org/XML/1998/namespace"</code> namespace.  The prefix used must be "<code>xml</code>".


* The [http://blog.whatwg.org/faq/#namespace-decl namespace declaration] (<code>xmlns</code> attribute) is required in XHTML. The xmlns attribute is also allowed to appear on the <code>html</code> element in HTML on the condition that it has the value <code><nowiki>"http://www.w3.org/1999/xhtml"</nowiki></code>.
HTML elements may use the <code>xml:lang</code> attribute. The attribute in no namespace with no prefix and with the literal localname "<code>xml:lang</code>" has no effect on language processing (unlike "<code>lang</code>", which does). HTML elements must not use the <code>xml:base</code>, <code>xml:space</code>, or <code>xml:id</code> attributes.
** <code><nowiki>&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;</nowiki></code>
| Any element, including HTML elements, may use the attributes <code>xml:lang</code>, <code>xml:id</code>, <code>xml:base</code> and <code>xml:space</code>.  These attributes are placed in the <code>"http://www.w3.org/XML/1998/namespace"</code> namespace.  The prefix used must be "<code>xml</code>".
** In HTML, the xmlns attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier.  When parsed by an HTML parser, the attribute ends up in the null namespace.
|
** In XML (with an [http://www.w3.org/TR/xml-names/ XML Namespaces]-aware parser), an xmlns attribute is part of the namespace declaration mechanism, and an element cannot actually have an xmlns attribute in the null namespace. In DOM implementations, the attribute ends up in the "<code><nowiki>http://www.w3.org/2000/xmlns/</nowiki></code>" namespace.
| '''Though they can be used on foreign elements, do not use <code>xml:base</code>, <code>xml:id</code>, or <code>xml:space</code> on HTML elements; use both xml:lang and lang attributes whenever one is to be needed on HTML elements.'''
* XHTML allows non XHTML elements and attributes (in different namespaces) to be used, HTML does not.
|-
* XHTML uses the <code>xml:lang</code> attribute, HTML uses <code>lang</code> instead.
!Attributes
* XML ID introduces <code>xml:id</code>, which could be used in XHTML. In HTML it has no effect.
| Names are not case sensitive. Attribute minimization is allowed (i.e. omitting the equals sign and the value).
* In HTML, the <code>noscript</code> element may be used. In XHTML, it is forbidden.
| Names are case sensitive (and lower case). Attribute minimization is not allowed.  
* HTML uses the <code>base</code> element, XHTML uses <code>xml:base</code> instead.  
|
* In XHTML, <code>p</code> elements may contain structured inline level elements including <code>blockquote</code>, <code>dl</code>, <code>menu</code>, <code>ol</code>, <code>ul</code>, <code>pre</code> and <code>table</code>. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
| Use lower case attribute names. Do not minimize attributes. Non-namespaced attributes not belonging to HTML will be included in the DOM tree and accessible to script and stylesheets, but it is discouraged to use these due to the potential for future naming conflicts; <code>data-</code> attributes can be used instead, or if in an XML-only environment, namespaced attributes.
* In XHTML, <code>table</code> elements may contain child <code>tr</code> elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
|-
!Attribute values
| White space characters are not normalized. Unquoted attribute values are allowed. Fixed or default attribute values ...?
| White space characters are [http://www.w3.org/TR/REC-xml/#AVNormalize normalized] to single spaces (unless attribute is of CDATA type?). Unquoted attribute values are not allowed. Default attribute values could conceivably be defined with a DTD.
|
| Write whitespace in attribute values in its already-normalized form (single spaces only). Always quote attribute values. '''Do not rely on defining default or fixed attribute values (or elements with exclusively element content) in a DTD (unless it matches HTML behavior).'''
|-
!Space characters
|The space characters are defined as:
* U+0009 CHARACTER TABULATION
* U+000A LINE FEED
* U+000C FORM FEED
* U+000D CARRIAGE RETURN
* U+0020 SPACE
|The space characters are defined as:
* U+0009 CHARACTER TABULATION
* U+000A LINE FEED
* U+000D CARRIAGE RETURN
* U+0020 SPACE
|The difference is the inclusion of Form Feed. Form feed characters are discouraged in XML 1.1.
| Do not use the form feed character.
|-
!  The DOCTYPE
|
A DOCTYPE is a mostly useless, but required, header. The DOCTYPE is used during parsing to determine the parsing mode.  The keywords "<code>DOCTYPE</code>", "<code>PUBLIC</code>" and "<code>SYSTEM</code>", and the name "<code>html</code>" are treated case insensitively. The system identifier <code>"about:legacy-compat"</code> (and the public and system identifiers for previous versions of HTML) are case sensitive.


=== Character Encoding ===
Conforming HTML documents are required to use <code>&lt;!DOCTYPE html&gt;</code> (case insensitively) or the legacy-compat version <code>&lt;!DOCTYPE html SYSTEM "about:legacy-compat"&gt;</code>.


* In XHTML, the XML declaration may be used to [http://blog.whatwg.org/faq/#charset specify the character encoding]. In HTML, the XML declaration is forbidden.
When using the obsolete but conforming DOCTYPEs based on the HTML 4.0 and 4.01 Strict DTDs, the system identifier is optional. The obsolete but conforming DOCTYPEs based on XHTML 1.0 Strict and XHTML 1.1 may also be specified.
* In HTML, the <code>meta</code> element may be used instead. The <code>http-equiv</code> attribute on the <code>meta</code> element is forbidden in XHTML and is ignored if included.
* The default character encoding for XHTML is, according to XML rules, <code>UTF-8</code> or <code>UTF-16</code>. If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
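
The encoding rules above can be sketched roughly in Python. This is a deliberately simplified illustration, not the spec's encoding-detection algorithm: it ignores byte order marks, HTTP headers and UTF-16, reads only an XML declaration for XML documents and only a <code>meta charset</code> attribute for HTML documents, and the names <code>XML_DECLARATION</code>, <code>META_CHARSET</code> and <code>declared_encoding</code> are hypothetical.

```python
import re
from typing import Optional

# Encoding pseudo-attribute of an XML declaration at the very start of the input.
XML_DECLARATION = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)
# <meta charset="..."> as used in HTML (quotes optional, names case insensitive).
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def declared_encoding(head: bytes, kind: str) -> Optional[str]:
    """Return the encoding declared in the markup, lower-cased.

    kind is "xml" (XHTML) or "html". For XML, a missing declaration falls back
    to UTF-8 (this sketch does not attempt UTF-16 detection). For HTML, None
    means "unspecified", where a real browser would apply its own heuristics.
    """
    if kind == "xml":
        match = XML_DECLARATION.search(head)
        return match.group(1).decode("ascii").lower() if match else "utf-8"
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>', "xml"))
    print(declared_encoding(b'<!DOCTYPE html><meta charset="UTF-8">', "html"))
```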


=== Scripts ===
Use of an internal subset is forbidden.  The system identifier is never de-referenced by HTML implementations.
|
The DOCTYPE is optional.  XML rules for case sensitivity apply (everything is case sensitive).


* <code>document.write()</code> and <code>document.writeln()</code> cannot be used in XHTML, they can in HTML.
Either of the DOCTYPEs defined in HTML5 may be used, or any other custom DOCTYPE.  If the public identifier is specified, the system identifier must also be specified.  The obsolete status of the ''obsolete permitted DOCTYPEs'' defined for HTML does not apply to XHTML.  Any DOCTYPE may be used, subject to the conformance rules defined by XML.
* In XHTML, the use of the <code>innerHTML</code> property requires that the string be a well-formed fragment of XML.
* DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace.)
** Element.tagName, Node.nodeName, and Node.localName return the value in uppercase.
** Document.createElement() is case insensitive (the canonical form is lowercase).
** Element.setAttributeNode() will change the attribute name to lowercase.
** Element.setAttribute()  is case insensitive (the canonical form is lowercase).
** Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
** Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name must be lowercased before the rename takes place.
* In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
** In XHTML, browsers lack interoperability in this area. In Firefox, the namespace is dependent upon the MIME type.  In Opera, it's dependent upon the root element and in Safari, it's always null.
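
The case-handling rules in the list above can be modelled with a small Python sketch. The two helper functions are hypothetical models of the described behaviour, not browser APIs; only <code>xml.dom.minidom</code> is a real library, used here to show that an XML DOM preserves the case it was given.

```python
from xml.dom import minidom

HTML_NAMESPACE = "http://www.w3.org/1999/xhtml"
SVG_NAMESPACE = "http://www.w3.org/2000/svg"


def reported_tag_name(local_name, namespace, in_html_document):
    """Model of Element.tagName as described above (hypothetical helper).

    In an HTML document, elements in the HTML namespace report their name in
    uppercase; everything else reports the name exactly as stored.
    """
    if in_html_document and namespace == HTML_NAMESPACE:
        return local_name.upper()
    return local_name


def created_element_name(requested_name, in_html_document):
    """Model of the name Document.createElement() ends up using.

    In HTML the call is case insensitive and the canonical form is lowercase;
    in XML the requested name is used unchanged.
    """
    return requested_name.lower() if in_html_document else requested_name


def xml_dom_tag_name(markup):
    """Parse markup with a real XML DOM and return the root element's tagName."""
    return minidom.parseString(markup).documentElement.tagName


if __name__ == "__main__":
    print(reported_tag_name("div", HTML_NAMESPACE, True))
    print(reported_tag_name("linearGradient", SVG_NAMESPACE, True))
    print(created_element_name("DIV", True), created_element_name("DIV", False))
    print(xml_dom_tag_name("<Div/>"))
```

Scripts intended to run against both serialisations generally avoid depending on the case of <code>tagName</code>, for instance by lower-casing it before comparison.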


=== Stylesheets ===
Use of an internal subset is permitted according to the requirements of XML.  Some validating XML processors may dereference the system identifier, if used, but most browsers use non-validating processors.
|
| '''Use the empty DOCTYPE with no SYSTEM or PUBLIC identifiers and no internal subset.'''
|-
! Element names
| Element names are case insensitive.
| Element names are case sensitive and lower-case.
|
| Only use lower-case element names (as with attributes).
|-
!  Void vs. Non-void Elements
|  Void elements only have a start tag; end tags must not be specified for void elements, and it is impossible for them to contain any content.  A trailing slash may optionally be inserted at the end of the element's tag, immediately before the closing greater-than sign. For non-void elements (e.g., <nowiki><script></nowiki>), the trailing slash is a parsing error (ignored and thus treated as unclosed).
|  Void elements may use either the empty-element tag syntax (''EmptyElemTag'') or use a start tag immediately followed by an end tag, with no content in between.  While it is possible for the element to contain content, this is non-conforming.
|
| '''For void elements (e.g., <nowiki><br /></nowiki>), do not include content or use a closing tag; only use a self-closing element with closing slash at the end (with a space preceding it for the sake of older browsers). For non-void elements, i.e., where content can exist (e.g., <nowiki><script></nowiki>), always use an explicit closing tag (not a self-closing tag) even if there is no content.'''
|-
! Unexpected end tags
| Unexpected end tags (in HTML, an unexpected <code>&lt;/br></code> or <code>&lt;/p></code> can cause the start tag to be implied before it).
| Unexpected end tags are well-formedness errors.
|
| Do not add end tags unless there is an explicit and properly nested open tag before it.
|-
! End tag with attributes
| ?
| An end tag with attributes is not allowed.
|
| Do not use end tags with attributes.
|-
!  Raw text elements
|
|
|
|
|-
!  RCDATA elements
|
|
|
|
|-
!  Foreign elements
|
|
|
|
|-
!  Normal elements
|
|
|
|
|-
!  Optional tags
|
For [[#HTML_Elements_with_Optional_Tags|some elements]], the start and/or end tags are optional and are implied by certain specified conditions.  For example, the end tag for the <code>p</code> element is implied by a subsequent <code>p</code> element.


* Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
Omitting the end tag for other elements is a parse error and various error recovery procedures are applied appropriately.
* CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which do not apply to XHTML.
|  End tags must be explicitly included for all elements, except empty elements using the ''EmptyElemTag'' syntax.
| Always use end tags (or self-closing tags for void elements).
|-
!  Comment syntax
|  Comments must start with the four character sequence "<code>&lt;!--</code>" and must be ended by the three character sequence "<code>--></code>" (bogus comments such as those beginning with "<?" are deprecated).  The content of comments must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character.  Violating these constraints is a parse error and various error recovery procedures are applied appropriately.
|  The content of comments must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a hyphen. Violating this is a well-formedness error.
|
| Only use comments of the "<code>&lt;!--...--></code>" variety. Do not use two consecutive U+002D HYPHEN-MINUS (-) characters in comment content or end with such a hyphen (especially for the sake of XML). '''Do not begin comments with a single U+003E GREATER-THAN SIGN ('>') character, nor with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character.'''
|-
!Processing Instructions
| HTML does not allow processing instructions and deprecates the bogus comments which appear in their form, whether in the form <code>&lt;?foo ...&gt;</code> (without a closing '?') or <code>&lt;?foo ...?&gt;</code>.
| XHTML allows the use of XML processing instructions which are only closed by "?>".
|
| '''Avoid ">" inside processing instructions, since an HTML parser treats the "instruction" as a bogus comment and the first ">" closes it prematurely; otherwise, strip out processing instructions entirely. Processing instructions might need to be avoided entirely in case HTML may in future disallow them completely.'''
|-
!CDATA sections
| <code>&lt;![CDATA[...]]&gt;</code> is a bogus comment. The sequence of characters &quot;<code>]]&gt;</code>&quot; in content when it does not mark the end of a <code>CDATA</code> section is just regular character data. An exception is made for foreign content such as SVG or MathML.
| <code>&lt;![CDATA[...]]&gt;</code> is a <code>CDATA</code> section. The sequence of characters &quot;<code>]]&gt;</code>&quot; in content when it does not mark the end of a <code>CDATA</code> section is a well-formedness error.
|
| Ensure that the sequence &quot;<code>]]&gt;</code>&quot; in content is escaped (escaping is not necessary in attribute values). '''Do not use CDATA sections (except possibly for script and style elements, as described under element-specific behavior below, or for SVG/MathML).'''
|-
!  Unescaped Special Characters
|
Unescaped ampersands (U+0026 AMPERSAND - <code>&amp;</code>, instead of <code>&amp;amp;</code>) are permitted within the content of ''normal elements'', ''RCDATA elements'', ''foreign elements'' and ''attribute values'' where they are not considered to be ''ambiguous ampersands'', and within ''Raw text elements''.


== Differences Between HTML 4.01 and HTML 5 ==
Unescaped less than signs (U+003C LESS-THAN SIGN - <code>&lt;</code>, instead of <code>&amp;lt;</code>) are permitted in ''Raw text elements'', ''RCDATA elements'' and ''attribute values'', excluding the ''unquoted attribute value syntax''.
|  Unescaped ampersands and less-than signs may not appear within ''CharData'' or ''AttValue'' (basically, the normal text content of elements and attribute values.)  Violation of this constraint is a well-formedness error.
| Always escape ampersands and less-than signs in text content and attribute values. See CDATA for need to escape sequence "<code>]]&gt;</code>" in text content.
|-
!Character References
| The 'x' in a hexadecimal character reference can be upper-case.
| The 'x' in a hexadecimal character reference cannot be upper-case.
|
| Only use the lower-case 'x' for hexadecimal character references.
|-
!Entity References
| In HTML, all entity references are predefined and do not require a DTD.
| There is no formal DTD for XHTML5, but one could provide an external DTD (if not an internal subset?) for use with one's entity-checking (or validating) parser. Be aware, however, that browsers do not universally use external entity-checking (or validating) parsers and may not read the external DTD. (Some still have bugs in that they mistakenly create a well-formedness error out of such missing entities instead of showing them as missing, making them clickable, or using an entity-checking or validating parser.)
|
| Do not use entity references in XHTML (except for the 5 predefined entities: <code>&amp;amp;</code>, <code>&amp;lt;</code>, <code>&amp;gt;</code>, <code>&amp;quot;</code> and <code>&amp;apos;</code>); use the equivalent Unicode or numeric character reference sequence instead.
|-
! Character data
| All Unicode characters are allowed except U+0000, non-characters, and control characters other than space characters.
| XML 1.0 only allows the following Unicode characters: <code>#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]</code>


=== MIME Type ===
XML 1.1 allows all Unicode (including all in 1.0) except for U+0000, U+FFFE, and U+FFFF (i.e., it allows <code>[#x1-#xFFFD], [#x10000-#x10FFFF]</code>)


Both HTML 4.01 and HTML 5 use <code>text/html</code>.
Both XML 1.0 and 1.1 discourage control-characters and non-characters:


<code>Content-Type: text/html; charset=UTF-8</code>
Discouraged in XML 1.0 only: <code>[#xFDE0-#xFDEF]</code> (spec typo?)


=== Parsing HTML ===
Discouraged in XML 1.1 only (these are not allowed at all in 1.0): <code>[#x1-#x8], [#xB-#xC], [#xE-#x1F]</code>


Discouraged in XML 1.0-1.1: <code>[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF]</code>
|
| Use <code>#x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]</code> while avoiding <code>[#xFDE0-#xFDEF] (?), [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF]</code>
|}
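
The XML 1.0 character ranges listed in the table above can be checked mechanically. The following is a minimal JavaScript sketch under stated assumptions: the function names <code>isXml10Char</code> and <code>isXml10Text</code> are hypothetical, and the check covers only the allowed ranges, not the additional discouraged ranges.

```javascript
// Hypothetical helper (not part of any specification): returns true if the
// code point falls in the XML 1.0 ranges listed in the table above:
// #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF].
function isXml10Char(codePoint) {
  return (
    codePoint === 0x9 ||
    codePoint === 0xA ||
    codePoint === 0xD ||
    (codePoint >= 0x20 && codePoint <= 0xD7FF) ||
    (codePoint >= 0xE000 && codePoint <= 0xFFFD) ||
    (codePoint >= 0x10000 && codePoint <= 0x10FFFF)
  );
}

// Hypothetical helper: true only if every code point in the string is allowed.
function isXml10Text(text) {
  // for...of iterates by code point rather than by UTF-16 code unit.
  for (const ch of text) {
    if (!isXml10Char(ch.codePointAt(0))) return false;
  }
  return true;
}
```

A polyglot authoring tool could run such a check before serialization, since a form feed (U+000C), for example, is a space character in HTML but is not allowed at all in XML 1.0.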


* HTML 2.0 to HTML 4.01 were formally based on SGML, but browsers did not implement SGML parsers. See [http://www.w3.org/TR/html4/conform.html#h-4.2 4.2 SGML] and [http://www.w3.org/TR/html4/appendix/notes.html#h-B.3 B.3 SGML implementation notes] in HTML 4.01. This is a non-normative section of the HTML 4.01 specification, and it already distinguishes between HTML user agents and SGML user agents.
====Element-specific parsing====
* HTML 5 defines its own parsing requirements based on the way browsers actually handle HTML.


=== Syntax ===
<em>HTML also applies special handling to many other edge cases and error conditions, not all of which are listed here.</em> (such as?)


* TODO
{| class="wikitable" border="1"
|-
!  Element(s)
!  HTML Requirement
!  XHTML Requirement
!  Notes
! Guidance for XHTML-HTML compatibility
|-
! <code>script</code> and <code>style</code>
| In HTML, these are parsed as <code>CDATA</code> elements. (Note: the definition of <code>CDATA</code> differs from that in XML).
| In XML, they're parsed as normal elements (which means that things that look like comments are treated as <em>real</em> comments, and things that look like start tags actually are start tags).
|
| '''The following code with escaping can ensure script and style elements will work in both XHTML and HTML, including older browsers.'''


=== Markup ===
In both cases, XML ignores the first comment and then uses the CDATA section to avoid the need for escaping special characters < and & within the rest of the contents (with subsequent JavaScript comments added within to ensure the HTML-oriented code is ignored by JavaScript).


* [http://www.w3.org/TR/html4/index/elements.html List of HTML 4.01 elements]
In HTML, older browsers might display the content unless it is inside a comment, so comments are used to hide it from them (while modern HTML browsers will run code inside the comments). The subsequent JavaScript comment is added to negate the text added for the sake of XHTML.
* [http://www.w3.org/TR/html4/index/attributes.html List of HTML 4.01 attributes]
* [http://simon.html5.org/html5-elements HTML5 Elements and Attributes]


==== Obsolete Attributes ====
The &lt;style> element requires /* */ comments, since CSS does not support single-line comments.


Some attributes that were defined in HTML4 are not included in HTML5. Here's a current list (subject to change, see the spec):
    '''&lt;script type="text/javascript">&lt;!--//-->&lt;![CDATA[//>&lt;!--
        ...
    //-->&lt;!]]>&lt;/script>


* html@version
* head@profile
* a@rev, link@rev
* a@target, area@target, base@target, form@target (is mentioned in WF2...), link@target
* a@charset, link@charset, script@charset
* table@summary
* td@headers, th@headers
* td@axis, th@axis
* param@valuetype
* object@standby
* meta@scheme
* object@archive


In addition, HTML5 has none of the presentational attributes that were in HTML4 (including those on &lt;table>). Any attributes defined on ''elements'' that are not in HTML5 are (obviously) also not in HTML5.
    &lt;style type="text/css">&lt;!--/*-->&lt;![CDATA[/*>&lt;!--*/
        ...
    /*]]>*/-->&lt;/style>'''


==== Obsolete Elements ====
If not concerned about much older browsers (from which one is hiding the HTML) one can use the simpler:


The following elements were present in HTML4 but are not defined in HTML5:
    &lt;script>//&lt;![CDATA[
   
    //]]>&lt;/script>


* acronym (use <abbr> instead)
    &lt;style>/*&lt;![CDATA[*/
* applet (use <object> instead)
   
* basefont
    /*]]>*/&lt;/style>
* big
* center
* dir
* font
* frame
* frameset
* isindex
* noframes
* noscript (only in XHTML)
* s
* strike
* tt
* u


=== Character Encoding ===
Also note that the sequence "]]>" is not allowed within a CDATA section, so it cannot be used in true XHTML-embedded JavaScript without escaping.
|-
! <code>title</code> and <code>textarea</code>
| In HTML, these elements are parsed as <code>RCDATA</code> elements. (Note: The definition of <code>RCDATA</code> differs from that in SGML).
| There is no <code>RCDATA</code> in XML
|
| Use &amp;amp; and &amp;lt; escape forms (and "]]&amp;gt;" if the sequence "]]>" is required) within these elements even though HTML does not require them (CDATA sections apparently cannot be added here in a polyglot-supportive fashion).
|-
! <code>noscript</code>
| In HTML, if scripting is enabled, this element is parsed as a <code>CDATA</code> element. If scripting is disabled, it's parsed as a normal element.
| In XHTML, the element is always parsed as a normal element, and can't really be used to stop content from being present when script is disabled.
|
| Add content to the page which should be shown when JavaScript is disabled and use JavaScript to hide these elements when the page has loaded (DOMContentLoaded can be used for modern browsers).
|-
! <code>iframe</code>, <code>noembed</code> and <code>noframes</code>
| In HTML, these elements are parsed as <code>CDATA</code> elements.
| In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
|
| '''Do not add content within these elements (or hide them on page load/DOMContentLoaded by JavaScript).'''
|-
! <code>caption</code>, <code>col</code>, <code>colgroup</code>, <code>frame</code>, <code>frameset</code>, <code>head</code>, <code>option</code>, <code>optgroup</code>, <code>tbody</code>, <code>td</code>, <code>tfoot</code>, <code>th</code>, <code>thead</code>, <code>tr</code> when appearing out of context
| In HTML, the tags for these elements, when appearing out of context, are ignored. (How so?)
|
|
| '''Do not use these elements out of context. In the case of &lt;tr> directly inside a &lt;table>, one may use an explicit tbody to avoid potential confusion.'''
|-
! <code>plaintext</code>
| This element has a special parsing requirement in HTML. (It is, however, forbidden.)
|
|
| '''Do not use plaintext.'''
|-
! <code>pre</code>, <code>listing</code> or <code>textarea</code>
| In HTML, a line feed that immediately follows the start tag of any of these elements is ignored.
| In XML, it is treated as other content.
|
| '''Add any line break before the element begins using HTML or CSS.'''
|-
!  In head (<code>base</code>, <code>link</code>, <code>meta</code>), in body (<code>area</code>,<code>br</code>, <code>col</code>, <code>embed</code>, <code>hr</code>, <code>img</code>, <code>input</code>, <code>param</code>, and now also <code>link</code> and <code>meta</code>)
| These elements are void elements in HTML.
| In XHTML, these may use explicit closing tags as well as self-closing ones (just as non-void elements can).
|
| '''Do not use an explicit closing tag for these void elements to avoid double-inclusion when shown in HTML (and avoid self-closing tags on non-void elements which can sometimes accept content (such as &lt;script>)).'''
|}
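
The escaping guidance for <code>title</code> and <code>textarea</code> in the table above can be sketched as a small JavaScript helper. This is only an illustration under stated assumptions: the function name <code>escapePolyglotText</code> is hypothetical, and it handles element text content only, not attribute values.

```javascript
// Hypothetical helper: escapes text so it can be placed inside elements such
// as title or textarea in a document meant to parse as both HTML and XML.
function escapePolyglotText(text) {
  return text
    .replace(/&/g, "&amp;")        // ampersands first, so later escapes are not double-escaped
    .replace(/</g, "&lt;")         // less-than signs
    .replace(/\]\]>/g, "]]&gt;");  // the "]]>" sequence, which XML forbids in content
}
```

Ampersands are replaced first so that the <code>&amp;</code> characters introduced by the later replacements are not escaped a second time.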


The character encoding can be declared using the meta element, but the syntax of the meta element has changed.  In HTML 4.01 and earlier, the meta element was:
==== HTML Elements with Optional Tags ====


&lt;meta http-equiv="Content-Type" content="text/html; charset=UTF-8"&gt;
'''For polyglot texts, always use both the start tag and the end tag (unless the element is a void element, in which case a self-closing tag must be used).'''


In HTML5, the syntax was simplified to remove the unnecessary markup, yet still remain compatible with the encoding detection implemented in most existing browsers.
{| class="wikitable" border="1"
|-
! Element
! Start Tag
! End Tag
|-
!html
|optional
|optional
|-
!head
|optional
|optional
|-
!body
|optional
|optional
|-
!li
|required
|optional
|-
!dt
|required
|optional
|-
!dd
|required
|optional
|-
!p
|required
|optional
|-
!colgroup
|optional
|optional
|-
!thead
|required
|optional
|-
!tbody
|optional
|optional
|-
!tfoot
|required
|optional
|-
!tr
|required
|optional
|-
!th
|required
|optional
|-
!td
|required
|optional
|-
!rt
|required
|optional
|-
!rp
|required
|optional
|-
!optgroup
|required
|optional
|-
!option
|required
|optional
|}


&lt;meta charset="UTF-8"&gt;
=== Scripts ===


==== HTML 4 Algorithm ====
{| class="wikitable" border="1"
|-
!  Feature
!  HTML Requirement
!  XHTML Requirement
!  Notes
! Guidance for XHTML-HTML compatibility
|-
! <code>document.write()</code> and <code>document.writeln()</code>
| Available in HTML.
| These cannot be used in XHTML.
|
| Use DOM methods to replace or add content dynamically.
|-
! <code>innerHTML</code> property
| Any HTML can be used.
| The use of this property requires that the string be a well-formed fragment of XML.
|
| Ensure one sets <code>innerHTML</code> to well-formed fragments.
|-
!
|
|
|
|
|-
! DOM APIs and case sensitivity
| Some DOM APIs are case insensitive in HTML (which are sensitive?). (This does not apply to elements which are not in the HTML namespace.)
| DOM APIs are case sensitive in XHTML
|
| Use lower-case elements, attributes, and attribute values (or as appropriate with SVG camel-cased elements and attributes (and the "definitionURL" attribute should use proper casing when used in MathML)).
|-
! Element.tagName and Node.nodeName properties
| These properties return the value in uppercase in HTML. (Node.localName is consistent now, as of HTML5.)
| These properties return the value in lower-case in XHTML.
|
| For older browsers, compare after converting to lower case.
|-
! Document.createElement()
| Case insensitive
|
|
| Use the canonical form, lowercase, for polyglot documents.
|-
! Element.setAttributeNode()
| Changes the attribute name to lowercase.
|
|
| Do not expect to use upper-case attribute names.
|-
! Element.setAttribute()
| Case insensitive
|
|
| Use the canonical form, lowercase, for polyglot documents.
|-
! Document.getElementsByTagName() and Element.getElementsByTagName()
| Case insensitive in HTML
|
|
| Use the canonical form, lowercase, for polyglot documents.
|-
! Document.renameNode()
| If the new namespace is the HTML namespace, then the new qualified name will be lowercased before the rename takes place.
|
|
| Do not expect to keep upper-case attribute names for HTML-namespaced elements after a rename.
|-
!
|
|
|
|
|-
! Document.createElement() and namespaces
| In HTML, this will create an element in the HTML namespace.
| In XML (including true XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
| In XHTML, browsers lack interoperability in this area.  In Firefox and Safari, the namespace is dependent upon the MIME type.  In Opera, it's dependent upon the root element.
| '''If operating within a browser which supports it, use Document.createElementNS to avoid the ambiguity.'''
|-
! XPath expressions
| In pre-HTML5 browsers, the XHTML namespace must be used for XHTML and null for HTML. (HTML5 browsers would use the XHTML namespace even in HTML.)
| In XHTML, all XPath will require a namespace unless the elements genuinely have no namespace.
|
| Detect whether the browser is pre-HTML5 and omit namespaces in XPath expressions if so (otherwise, use a namespace).
|}
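
The guidance in the table above on case sensitivity and on <code>Document.createElementNS</code> can be sketched in JavaScript as follows. This is a minimal sketch, not a definitive implementation: the helper names <code>hasTagName</code> and <code>createHtmlElement</code> are hypothetical, and the code assumes a document object that supports <code>createElementNS</code>.

```javascript
// The XHTML namespace URI, as given elsewhere on this page.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: compares tag names after converting both to lower
// case, so the result is the same whether tagName is reported in upper case
// (HTML) or lower case (XHTML).
function hasTagName(element, name) {
  return element.tagName.toLowerCase() === name.toLowerCase();
}

// Hypothetical helper: creates an element explicitly in the XHTML namespace,
// avoiding the createElement() ambiguity described in the table above.
function createHtmlElement(doc, name) {
  return doc.createElementNS(XHTML_NS, name.toLowerCase());
}
```

In a browser one might call <code>createHtmlElement(document, "div")</code> and later test the result with <code>hasTagName(el, "div")</code>.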


Source [http://www.w3.org/TR/html4/charset.html#h-5.2.2 5.2.2 Specifying the character encoding], HTML 4.01 Specification.
=== Stylesheets ===
 
# An HTTP "charset" parameter in a "Content-Type" field.
# A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
# The charset attribute set on an element that designates an external resource.


==== HTML 5 Algorithm ====
{| class="wikitable" border="1"
|-
!  Feature
!  HTML Requirement
!  XHTML Requirement
!  Notes
! Guidance for XHTML-HTML compatibility
|-
! CSS Selectors
| Match case insensitively in HTML.
| Match case sensitively in XHTML
|
| For polyglot documents, use lower-case selectors or as appropriate (e.g., for SVG CamelCased items).
|-
! Styling of html/body elements
| CSS requires special handling of the body element in HTML for painting backgrounds on the canvas
| XHTML does not require special handling.
|
| Style the html and body elements appropriately (?).
|}


The exact algorithm that browsers must follow in order to [http://www.whatwg.org/specs/web-apps/current-work/#the-input0 determine the character encoding] is specified in HTML 5.  The basic algorithm works as follows:
== Differences Between HTML4 and HTML5 ==


# If the transport layer specifies an encoding, use that, and abort these steps. (e.g. The HTTP Content-Type header).
See [http://dev.w3.org/html5/html4-differences/ HTML5 differences from HTML4].
# Read the first 512 bytes of the file, or at least as much as possible if less than that.
# If the file starts with a UTF-8, UTF-16 or UTF-32 BOM, then use that and abort these steps.
# Otherwise use the special algorithm to search the first 512 bytes for a meta element that declares the encoding.  The algorithm is relatively lenient in what it will detect, though since it doesn't use the normal parsing algorithm, there are some restrictions.
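
The first three steps of the algorithm above can be sketched in JavaScript. This is a rough illustration, not the specification's algorithm: the function names <code>sniffBom</code> and <code>determineEncoding</code> and the returned labels are assumptions, the meta-scanning step is not implemented, and testing UTF-32 before UTF-16 is a choice of this sketch (a UTF-32 little-endian BOM begins with the same two bytes as a UTF-16 little-endian BOM).

```javascript
// Hypothetical helper: returns an encoding label if the byte sequence starts
// with a UTF-8, UTF-16 or UTF-32 byte order mark, otherwise null.
function sniffBom(bytes) {
  const startsWith = (prefix) => prefix.every((b, i) => bytes[i] === b);
  if (startsWith([0xEF, 0xBB, 0xBF])) return "UTF-8";
  // UTF-32 is tested before UTF-16 because FF FE 00 00 also begins with FF FE.
  if (startsWith([0xFF, 0xFE, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xFE, 0xFF])) return "UTF-32BE";
  if (startsWith([0xFF, 0xFE])) return "UTF-16LE";
  if (startsWith([0xFE, 0xFF])) return "UTF-16BE";
  return null;
}

// Hypothetical helper covering steps 1 to 3 only. A null result means the
// meta-scanning step (step 4, not implemented here) would be needed.
function determineEncoding(transportCharset, bytes) {
  if (transportCharset) return transportCharset; // step 1: transport layer wins
  const head = bytes.slice(0, 512);              // step 2: at most the first 512 bytes
  return sniffBom(head);                         // step 3: byte order mark
}
```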


== Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs ==


* [http://meiert.com/de/publications/translations/whatwg.org/html-vs-xhtml/ German translation: "HTML 5 und XHTML 5 im Vergleich (WHATWG)"]
* [http://dancewithnet.com/2007/10/28/differences-between-html-and-xhtml/ Chinese translation: "HTML和XHTML的不同"]

Latest revision as of 12:41, 21 August 2011

Differences Between HTML and XHTML

This page is currently being revised. Some information is incomplete or missing.

Please note that the information in here is based upon the current spec for (X)HTML5. Some of the issues technically do not apply to previous versions of HTML.

Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.

Note: As the current WHATWG document is a draft, this section will need to track to a moving target.

The document at http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html provides a similar analysis.

Overlap Language

There is a community who find it valuable to be able to serve HTML5 documents which are also valid XML documents. They may, for example, use XML tools to generate the document, and they and others may process the document using XML tools. These documents are served as text/html.

This language is sometimes called "polyglot". It is the overlap language of documents which are both HTML5 documents and XML documents. Guidelines are listed below for how one can construct such a polyglot document which will work in either environment. Besides following the well-formedness rules of XML, there are some other restrictions to which one must adhere (for the sake of text/html documents).

This wiki web page is an example of such a document. You can parse it with an XML parser or an HTML parser.

MIME Types

Feature HTML Requirement XHTML Requirement Notes
Mime Type Must use text/html. Must use an XML MIME type, such as application/xml or application/xhtml+xml. It is the MIME type (which may or may not be determined by file extension) that determines what type of document you are using. Any document served as text/html, including a document authored with the intention of being XHTML, is technically an HTML document.

Note that XHTML 1.0 previously defined that documents adhering to the compatibility guidelines were allowed to be served as text/html, but HTML 5 now defines that such documents are HTML, not XHTML.

Syntax and Parsing

XHTML uses XML parsing requirements. HTML uses its own which are defined much more closely to the way browsers actually handle HTML today. The following table describes the differences between how each is parsed.

The column on "Guidance for XHTML-HTML compatibility" lists ways in which a document can be crafted to work in either XHTML or HTML. The item will be bolded if it is a requirement for XHTML-compliant code to be changed, since XHTML will otherwise usually work as HTML, at least if its full features are constrained.

Feature HTML Requirement XHTML Requirement Notes Guidance for XHTML-HTML compatibility
Parsing Modes Three parsing modes are defined: no quirks mode, quirks mode and limited quirks mode. The mode is only ever changed from the default by the HTML parser, based on the presence, absence, or value of the DOCTYPE string, respectively. XML parsing rules are used. There is only one mode. The parsing modes in HTML also have an effect upon script and stylesheet processing. XHTML is considered to be in no quirks mode for these purposes. Use an explicit <!DOCTYPE html> (case insensitively) or legacy-compat version <!DOCTYPE html SYSTEM "about:legacy-compat"> for the sake of HTML and thus trigger no quirks parsing.
Error Handling HTML does not have a well-formedness constraint, no errors are fatal. Graceful error handling and recovery procedures are thoroughly defined. Well-formedness errors are fatal Ensure there are no well-formedness errors.
Character Encoding (including XML Declaration, meta) The XML declaration is forbidden (treated as a bogus comment, but such style of comments are deprecated), but the meta element with a charset attribute may be used instead.

If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).

The XML declaration may be used to specify the character encoding, while meta is only allowed as case-insensitive "UTF-8" (and is ignored if included).

The default character encoding for XHTML is, according to XML rules, UTF-8 or UTF-16.

If you need to include XML 1.1-only markup, if you do not wish to convert the encoding of the document to UTF-8 or UTF-16 (since use of other encodings also requires a declaration), or if you wish to define an external SYSTEM DTD in the DOCTYPE but use standalone=yes (redundant?), you must use an XML Declaration for XHTML, but this may not be allowable in the future in HTML. For future compatibility, it would be best to avoid XML 1.1-only markup, convert to UTF-8 or UTF-16 (probably UTF-8 which could allow use of a meta tag), and avoid use of a SYSTEM DTD (rendering the standalone=yes unnecessary), respectively. Do not use a meta tag, unless it is UTF-8 (and included in the first 512 bytes of the document), in which case it is probably a good idea to include it for the sake of HTML (as <meta charset="UTF-8">) in case you cannot specify such in a content header.
Namespaced elements Elements and attributes for known vocabularies (HTML, SVG and MathML) are implicitly assigned to appropriate namespaces, according to the rules specified in the parsing algorithm. Elements in the HTML, SVG, or MathML namespaces may have an xmlns attribute explicitly specified, if, and only if, it has the exact value "http://www.w3.org/1999/xhtml" (see namespace declaration). The attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the xmlns attribute itself ends up in no namespace. Foreign elements are also not treated as being in another namespace and will have no effect except for displaying by default as inline elements (and be aware that self-closing elements cannot be used as such since unrecognized elements will be treated as though they are non-void; thus one cannot, for example, type <caesura /> in HTML or it will be treated as though there is no immediate closing tag). Namespaced prefixes are not allowed on HTML elements; a prefixed xmlns attribute cannot be used even if it is defined in the XHTML namespace. The XHTML namespace must be declared for HTML elements according to the rules defined by the Namespaces in XML specification. Namespaces must be explicitly declared. The xmlns attribute ends up in the "http://www.w3.org/2000/xmlns" namespace. Foreign elements can be used independently of HTML elements, as long as they are assigned to their own namespace. Declare HTML namespaces (or other namespaces) explicitly and do not prefix XHTML elements. Do not depend on the behavior of foreign namespaced elements in an HTML setting; if you need to include these, you will probably wish to set this foreign markup via CSS to display:none. You should explicitly close (not self-close) all empty elements defined in a non-XHTML namespace, since otherwise when used in HTML, HTML will treat them as though they have not been closed.
Namespaced attributes on HTML elements Attributes of the form xmlns:prefix may not be used on HTML elements. The xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns" namespace. Do not use namespaced attributes on HTML elements. Do not depend on the behavior of foreign attributes in an HTML setting.
Namespace attributes on foreign elements

Elements in the SVG namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/2000/svg". The attribute is optional because the namespace is implied during parsing.

Elements in the MathML namespace may have an xmlns attribute specified, if, and only if, it has the exact value "http://www.w3.org/1998/Math/MathML". The attribute is optional because the namespace is implied during parsing.

Foreign elements may also have an xmlns:xlink attribute specified, if, and only if, it has the exact value "http://www.w3.org/1999/xlink". This attribute is optional, even if XLink attributes are used, because the namespaces for XLink attributes is implied during parsing.

When parsed by an HTML parser, the xmlns and xmlns:xlink attributes end up in the "http://www.w3.org/2000/xmlns" namespace.

The SVG and MathML namespaces must be declared for SVG and MathML elements, respectively, according to the rules defined by Namespaces in XML. The xmlns and xmlns:prefix attributes end up in the "http://www.w3.org/2000/xmlns" namespace.
XLink attributes Foreign elements may use the attributes xlink:actuate, xlink:arcrole, xlink:href, xlink:role, xlink:show, xlink:title and xlink:type. These attributes are placed in the "http://www.w3.org/1999/xlink" namespace. The prefix used must be "xlink". XLink attributes may be specified on foreign elements using any prefix, subject to the conformance rules defined by Namespaces in XML. The XLink namespace must be declared according to the conformance rules defined by Namespaces in XML if XLink attributes are used within the document. Do not use XLink attributes on HTML elements and do not depend on them on foreign elements, as they will not work as such in HTML. If being used, ensure they have the appropriate XLink namespace defined.
XML attributes

Foreign elements may use the attributes xml:lang, xml:id, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace". The prefix used must be "xml".

HTML elements may use the xml:lang attribute. The attribute in no namespace with no prefix and with the literal localname "xml:lang" has no effect on language processing (unlike "lang", which does). HTML elements must not use the xml:base, xml:space, or xml:id attributes.

Any element, including HTML elements, may use the attributes xml:lang, xml:id, xml:base and xml:space. These attributes are placed in the "http://www.w3.org/XML/1998/namespace". The prefix used must be "xml". Though they can be used on foreign elements, do not use xml:base, xml:id, or xml:space on HTML elements; use both xml:lang and lang attributes whenever one is to be needed on HTML elements.
Attributes Names are not case sensitive. Attribute minimization is allowed (i.e. omitting the equals sign and the value). Names are case sensitive (and lower case). Attribute minimization is not allowed. Use lower case attribute names. Do not minimize attributes. Non-namespaced attributes not belonging to HTML will be included in the DOM tree and accessible to script and stylesheets, but it is discouraged to use these due to the potential for future naming conflicts; data- attributes can be used instead, or if in an XML-only environment, namespaced attributes.
Attribute values White space characters are not normalized. Unquoted attribute values are allowed. Fixed or default attribute values ...? White space characters are normalized to single spaces (unless attribute is of CDATA type?). Unquoted attribute values are not allowed. Default attribute values could conceivably be defined with a DTD. Create whitespace in attribute values which is already normalized (converted to single spaces). Always quote attribute values. Do not rely on defining default or fixed attribute values (or elements with exclusively element content) in a DTD (unless it matches HTML behavior).
Space characters The space characters are defined as:
  • U+0009 CHARACTER TABULATION
  • U+000A LINE FEED
  • U+000C FORM FEED
  • U+000D CARRIAGE RETURN
  • U+0020 SPACE
The space characters are defined as:
  • U+0009 CHARACTER TABULATION
  • U+000A LINE FEED
  • U+000D CARRIAGE RETURN
  • U+0020 SPACE
The difference is the inclusion of Form Feed. Form feed characters are discouraged in XML 1.1. Do not use the form feed character.
The DOCTYPE

A DOCTYPE is a mostly useless, but required, header. The DOCTYPE is used during parsing to determine the parsing mode. The keywords "DOCTYPE", "PUBLIC" and "SYSTEM", and the name "html" are treated case insensitively. The system identifier "about:legacy-compat" (and the public and system identifiers for previous versions of HTML) are case sensitive.

Conforming HTML documents are required to use <!DOCTYPE html> (case insensitively) or the legacy-compat version <!DOCTYPE html SYSTEM "about:legacy-compat">.

When using the obsolete but conforming DOCTYPEs based on the HTML 4.0 and 4.01 Strict DTDs, the system identifier is optional. The obsolete but conforming DOCTYPEs based on XHTML 1.0 Strict and XHTML 1.1 may also be specified.

Use of an internal subset is forbidden. The system identifier is never de-referenced by HTML implementations.

The DOCTYPE is optional. XML rules for case sensitivity apply (everything is case sensitive).

Either of the DOCTYPEs defined in HTML5 may be used, or any other custom DOCTYPE. If the public identifier is specified, the system identifier must also be specified. The obsolete status of the obsolete permitted DOCTYPEs defined for HTML does not apply to XHTML. Any DOCTYPE may be used, subject to the conformance rules defined by XML.

Use of an internal subset is permitted according to the requirements of XML. Some validating XML processors may dereference the system identifier, if used, but most browsers use non-validating processors.

Use the empty DOCTYPE with no SYSTEM or PUBLIC identifiers and no use of an internal subset.
Element names Element names are case insensitive. Element names are case sensitive and lower-case. Only use lower-case element names (as with attributes).
Void vs. Non-void Elements Void elements only have a start tag; end tags must not be specified for void elements, and it is impossible for them to contain any content. A trailing slash may optionally be inserted at the end of the element's tag, immediately before the closing greater-than sign. For non-void elements (e.g., <script>), the trailing slash is a parsing error (ignored and thus treated as unclosed). Void elements may use either the empty-element tag syntax (EmptyElemTag) or use a start tag immediately followed by an end tag, with no content in between. While it is possible for the element to contain content, this is non-conforming. For void elements (e.g., <br />), do not include content or use a closing tag; only use a self-closing element with closing slash at the end (with a space preceding it for the sake of older browsers). For non-void elements, i.e., where content can exist (e.g., <script>), always use an explicit closing tag (not a self-closing tag) even if there is no content.
Unexpected end tags Unexpected end tags are parse errors (in HTML, an unexpected </br> or </p> can cause the start tag to be implied before it). Unexpected end tags are well-formedness errors. Do not add an end tag unless there is an explicit and properly nested start tag before it.
End tag with attributes ? An end tag with attributes is not allowed. Do not use end tags with attributes.
Raw text elements
RCDATA elements
Foreign elements
Normal elements
Optional tags

For some elements, the start and/or end tags are optional and are implied by certain specified conditions. For example, the end tag for the p element is implied by a subsequent p element.

Omitting the end tag for other elements is a parse error and various error recovery procedures are applied appropriately.

End tags must be explicitly included for all elements, except empty elements using the EmptyElemTag syntax. Always use end tags (or self-closing tags for void elements).
Comment syntax Comments must start with the four character sequence "<!--" and must be ended by the three character sequence "-->" (bogus comments such as those beginning with "<?" are deprecated). The content of comments must not start with a single U+003E GREATER-THAN SIGN ('>') character, nor start with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character, nor contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a U+002D HYPHEN-MINUS (-) character. Violating these constraints is a parse error and various error recovery procedures are applied appropriately. The content of comments must not contain two consecutive U+002D HYPHEN-MINUS (-) characters, nor end with a hyphen. Violating this is a well-formedness error. Only use comments of the "<!--...-->" variety. Do not use two consecutive U+002D HYPHEN-MINUS (-) characters in comment content or end with such a hyphen (especially for the sake of XML). Do not begin comments with a single U+003E GREATER-THAN SIGN ('>') character, nor with a U+002D HYPHEN-MINUS (-) character followed by a U+003E GREATER-THAN SIGN ('>') character.
Processing Instructions HTML does not allow processing instructions and deprecates the bogus comments which appear in their form, whether in the form <?foo ...> (without a closing '?') or <?foo ...?>. XHTML allows the use of XML processing instructions, which are closed only by "?>". Avoid ">" inside processing instructions, as HTML will close the "instruction" (a bogus comment) prematurely at that character; otherwise, strip out processing instructions entirely. Processing instructions might need to be avoided entirely in case HTML disallows them completely in future.
CDATA sections <![CDATA[...]]> is a bogus comment. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is just regular character data. An exception is made for foreign content such as SVG or MathML. <![CDATA[...]]> is a CDATA section. The sequence of characters "]]>" in content when it does not mark the end of a CDATA section is a well-formedness error. Ensure the sequence "]]>" in content is escaped (it is not necessary to escape it in attribute values). Do not use CDATA sections (except possibly for script and style tags, as described under element-specific behavior below, or for SVG/MathML).
Unescaped Special Characters

Unescaped ampersands (U+0026 AMPERSAND - &, instead of &amp;) are permitted within the content of normal elements, RCDATA elements, foreign elements and attribute values where they are not considered to be ambiguous ampersands, and within Raw text elements.

Unescaped less than signs (U+003C LESS-THAN SIGN - <, instead of &lt;) are permitted in Raw text elements, RCDATA elements and attribute values, excluding the unquoted attribute value syntax.

Unescaped ampersands and less-than signs may not appear within CharData or AttValue (basically, the normal text content of elements and attribute values). Violation of this constraint is a well-formedness error. Always escape ampersands and less-than signs in text content and attribute values. See CDATA sections for the need to escape the sequence "]]>" in text content.
Character References The 'x' in a hexadecimal character reference can be upper-case. The 'x' in a hexadecimal character reference cannot be upper-case. Only use the lower-case 'x' for hexadecimal character references.
Entity References In HTML, all entity references are predefined and do not require a DTD. There is no formal DTD for XHTML5, but one could provide an external DTD (if not an internal subset?) for use with one's entity-checking (or validating) parser. Be aware, however, that browsers do not universally use external entity-checking (or validating) parsers and may not read the external DTD. (Some still have bugs in that they mistakenly create a well-formedness error out of such missing entities instead of showing them as missing, making them clickable, or using an entity-checking or validating parser.) Do not use entity references in XHTML (except for the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;); use the equivalent Unicode or numeric character reference sequence instead.
Character data All Unicode characters are allowed except U+0000, non-characters, and control characters (other than space characters). XML 1.0 allows only the following Unicode characters: #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF]

XML 1.1 allows all Unicode (including all in 1.0) except for U+0000, U+FFFE, and U+FFFF (i.e., it allows [#x1-#xFFFD], [#x10000-#x10FFFF])

Both XML 1.0 and 1.1 discourage control-characters and non-characters:

Discouraged in XML 1.0 only: [#xFDE0-#xFDEF] (spec typo?)

Discouraged in XML 1.1 only (these are not allowed at all in 1.0): [#x1-#x8], [#xB-#xC], [#xE-#x1F]

Discouraged in XML 1.0-1.1: [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF]

Use #x9, #xA, #xD, [#x20-#xD7FF], [#xE000-#xFFFD], [#x10000-#x10FFFF] while avoiding [#xFDE0-#xFDEF] (?), [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF]
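
The character ranges in the row above can be checked mechanically. The following is a minimal sketch, assuming the XML 1.0 ranges exactly as listed in this table; the helper names (classifyCodePoint, findProblemCodePoints) are hypothetical and not part of any specification or library. The range [#xFDE0-#xFDEF], which the table marks as uncertain, is treated as discouraged here.

```javascript
// Ranges allowed by XML 1.0, copied from the table above.
const XML10_ALLOWED = [
  [0x9, 0x9], [0xA, 0xA], [0xD, 0xD],
  [0x20, 0xD7FF], [0xE000, 0xFFFD], [0x10000, 0x10FFFF],
];

// Discouraged ranges below U+10000, copied from the table above.
// [0xFDE0, 0xFDEF] is the range the table marks with "(?)".
const DISCOURAGED_BMP = [
  [0x7F, 0x84], [0x86, 0x9F], [0xFDD0, 0xFDDF], [0xFDE0, 0xFDEF],
];

function inRanges(cp, ranges) {
  return ranges.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// The last two code points of each supplementary plane
// (U+1FFFE-U+1FFFF through U+10FFFE-U+10FFFF) are discouraged.
function isPlaneEndNonCharacter(cp) {
  return cp >= 0x10000 && (cp & 0xFFFE) === 0xFFFE;
}

// Hypothetical helper: returns "forbidden", "discouraged", or "ok".
function classifyCodePoint(cp) {
  if (!inRanges(cp, XML10_ALLOWED)) return "forbidden";
  if (inRanges(cp, DISCOURAGED_BMP) || isPlaneEndNonCharacter(cp)) {
    return "discouraged";
  }
  return "ok";
}

// Hypothetical helper: lists every code point in `text` that is not "ok".
function findProblemCodePoints(text) {
  const problems = [];
  for (const ch of text) { // for...of iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0);
    const status = classifyCodePoint(cp);
    if (status !== "ok") problems.push({ codePoint: cp, status });
  }
  return problems;
}
```

Running findProblemCodePoints over a document's text before serialization would flag characters that are safe in HTML but forbidden or discouraged in XML 1.0.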

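The escaping guidance in the table above (ampersands, less-than signs, the sequence "]]>", and the lower-case 'x' in hexadecimal character references) might be applied with helpers along these lines. This is only a sketch: the function names are hypothetical, and escapeAttribute assumes attribute values are written in double quotes.

```javascript
// Escape text content so that it is safe in both HTML and XHTML.
// The ampersand must be replaced first so later escapes are not re-escaped.
function escapeText(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/\]\]>/g, "]]&gt;"); // "]]>" is a well-formedness error in XML content
}

// Escape an attribute value, assuming it will be wrapped in double quotes.
function escapeAttribute(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Rewrite hexadecimal character references that use an upper-case 'X',
// which HTML accepts but XML does not. The hex digits are left unchanged.
function normalizeHexReferences(markup) {
  return markup.replace(/&#X([0-9A-Fa-f]+);/g, "&#x$1;");
}
```

For example, escapeText("a < b && c ]]> d") yields "a &lt; b &amp;&amp; c ]]&gt; d".
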
Element-specific parsing

HTML also applies special handling to many other edge cases and error conditions, not all of which are listed here. (such as?)

Element(s) HTML Requirement XHTML Requirement Notes Guidance for XHTML-HTML compatibility
script and style In HTML, these are parsed as CDATA elements. (Note: the definition of CDATA differs from that in XML). In XML, they're parsed as normal elements (which means that things that look like comments are treated as real comments, and things that look like start tags actually are start tags). The following code with escaping can ensure script and style elements will work in both XHTML and HTML, including older browsers.

In both cases, XML ignores the first comment and then uses the CDATA section to avoid the need for escaping special characters < and & within the rest of the contents (with subsequent JavaScript comments added within to ensure the HTML-oriented code is ignored by JavaScript).

In HTML, older browsers might display the content without the content being within a comment, so comments are used to hide this from them (while modern HTML browsers will run code inside the comments). The subsequent JavaScript comment is added to negate the text added for the sake of XHTML.

The <style> element requires /* */ comments since CSS does not support single-line comments.

   <script type="text/javascript"><!--//--><![CDATA[//><!--
       ...
   //--><!]]></script>


   <style type="text/css"><!--/*--><![CDATA[/*><!--*/
       ...
   /*]]>*/--></style>

If not concerned about much older browsers (from which one is hiding the HTML) one can use the simpler:

   <script>//<![CDATA[
   
   //]]></script>
   <style>/*<![CDATA[*/
   
   /*]]>*/</style>

Also note that the sequence "]]>" is not allowed within a CDATA section, so it cannot be used in true XHTML-embedded JavaScript without escaping.

title and textarea In HTML, these elements are parsed as RCDATA elements. (Note: The definition of RCDATA differs from that in SGML). There is no RCDATA in XML. Use &amp; and &lt; escape forms (and "]]&gt;" if the sequence "]]>" is required) within these elements even though HTML does not require them (CDATA sections apparently cannot be added here in a polyglot-supportive fashion).
noscript In HTML, if scripting is enabled, this element is parsed as a CDATA element. If scripting is disabled, it's parsed as a normal element. In XHTML, the element is always parsed as a normal element, and can't really be used to stop content from being present when script is disabled. Add content to the page which should be shown when JavaScript is disabled and use JavaScript to hide these elements when the page has loaded (DOMContentLoaded can be used for modern browsers).
iframe, noembed and noframes In HTML, these elements are parsed as CDATA elements. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used. Do not add content within these elements (or hide them on page load/DOMContentLoaded by JavaScript).
caption, col, colgroup, frame, frameset, head, option, optgroup, tbody, td, tfoot, th, thead, tr when appearing out of context In HTML, the tags for these elements, when appearing out of context, are ignored. (How so?) Do not use these elements out of context. In the case of <tr> directly inside a <table>, one may use an explicit tbody to avoid potential confusion.
plaintext This element has a special parsing requirement in HTML. (It is, however, forbidden.) Do not use plaintext.
pre, listing or textarea In HTML, a line feed that immediately follows the start tag of any of these elements is ignored. In XML, it is treated as other content. Add any line break before the element begins using HTML or CSS.
In head (base, link, meta), in body (area, br, col, embed, hr, img, input, param, and now also link and meta) These elements are void elements in HTML. In XHTML, these may use explicit closing tags as well as self-closing ones (just as non-void elements can). Do not use an explicit closing tag for these void elements to avoid double-inclusion when shown in HTML (and avoid self-closing tags on non-void elements which can sometimes accept content (such as <script>)).
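
The void versus non-void guidance can be sketched as a small serialization helper. This is a minimal illustration, not a full serializer: the function name emptyElementMarkup is hypothetical, the list of void elements is the one given in the table above, and the sketch assumes HTML-namespace elements only (SVG and MathML names must keep their own casing).

```javascript
// Void elements as listed in the table above.
const VOID_ELEMENTS = new Set([
  "base", "link", "meta", "area", "br", "col",
  "embed", "hr", "img", "input", "param",
]);

// Escape an attribute value for use inside double quotes.
function escapeAttributeValue(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: serialize an element that has no content in a form
// that both HTML and XML parsers read the same way.
function emptyElementMarkup(name, attributes = {}) {
  const tag = name.toLowerCase(); // HTML-namespace element names are lower-case
  const attrs = Object.entries(attributes)
    .map(([key, value]) => ` ${key}="${escapeAttributeValue(value)}"`)
    .join("");
  return VOID_ELEMENTS.has(tag)
    ? `<${tag}${attrs} />`        // void: self-closing, with a space before the slash
    : `<${tag}${attrs}></${tag}>`; // non-void: always an explicit end tag
}
```

With this sketch, emptyElementMarkup("BR") produces "<br />", while emptyElementMarkup("script", { src: "a.js" }) produces a start tag followed immediately by an explicit end tag.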

HTML Elements with Optional Tags

For polyglot texts, always use both the start tag and the end tag (unless it is a void element, in which case a self-closing tag must be used).

Element Start Tag End Tag
html optional optional
head optional optional
body optional optional
li required optional
dt required optional
dd required optional
p required optional
colgroup optional optional
thead required optional
tbody optional optional
tfoot required optional
tr required optional
th required optional
td required optional
rt required optional
rp required optional
optgroup required optional
option required optional

Scripts

Feature HTML Requirement XHTML Requirement Notes Guidance for XHTML-HTML compatibility
document.write() and document.writeln() Available in HTML. These cannot be used in XHTML. Use DOM methods to replace or add content dynamically.
innerHTML property Any HTML can be used. The use of this property requires that the string be a well-formed fragment of XML. Ensure one sets innerHTML to well-formed fragments.
DOM APIs and case sensitivity Some DOM APIs are case insensitive in HTML (which are sensitive?). (This does not apply to elements which are not in the HTML namespace.) DOM APIs are case sensitive in XHTML. Use lower-case elements, attributes, and attribute values (or as appropriate with SVG camel-cased elements and attributes (and the "definitionURL" attribute should use proper casing when used in MathML)).
Element.tagName and Node.nodeName properties These properties return the value in uppercase in HTML. (Node.localName is consistent now, as of HTML5.) These properties return the value in its original case in XHTML (lower-case for elements written in lower case). For older browsers, compare after converting to lower case.
Document.createElement() Case insensitive Use the canonical form, lowercase, for polyglot documents.
Element.setAttributeNode() Changes the attribute name to lowercase. Do not expect to use upper-case attribute names.
Element.setAttribute() Case insensitive Use the canonical form, lowercase, for polyglot documents.
Document.getElementsByTagName() and Element.getElementsByTagName() Case insensitive in HTML Use the canonical form, lowercase, for polyglot documents.
Document.renameNode() If the new namespace is the HTML namespace, then the new qualified name will be lowercased before the rename takes place. Do not expect to keep upper-case attribute names for HTML-namespaced elements after a rename.
Document.createElement() and namespaces In HTML, this will create an element in the HTML namespace. In XML (including true XHTML), the namespace is defined by both DOM2 and DOM3 to be null. In XHTML, browsers lack interoperability in this area. In Firefox and Safari, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element. If operating within a browser which supports it, use Document.createElementNS to avoid the ambiguity.
XPath expressions In pre-HTML5 browsers, the XHTML namespace must be used for XHTML and null for HTML. (HTML5 browsers would use the XHTML namespace even in HTML.) In XHTML, all XPath will require a namespace unless the elements genuinely have no namespace. Detect whether the browser is pre-HTML5 and omit namespaces in XPath expressions if so (otherwise, use a namespace).
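
Two of the recommendations in the table above (comparing element names after lower-casing, and preferring Document.createElementNS where it is available) might look like the following. This is a sketch under stated assumptions: hasTagName and createHtmlElement are hypothetical helper names, and the document object is passed in as a parameter so that the helpers make no assumption about the environment.

```javascript
// The XHTML namespace URI.
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Compare a node's name case-insensitively, since nodeName is upper-case
// for HTML elements in HTML documents but keeps its source case in XHTML.
function hasTagName(node, expected) {
  return String(node.nodeName).toLowerCase() === expected.toLowerCase();
}

// Create an HTML-namespace element, using createElementNS when the
// document supports it to avoid the namespace ambiguity of createElement.
function createHtmlElement(doc, name) {
  const tag = name.toLowerCase(); // canonical lower-case form
  return typeof doc.createElementNS === "function"
    ? doc.createElementNS(XHTML_NS, tag)
    : doc.createElement(tag);
}
```

In a browser, createHtmlElement(document, "DIV") would be expected to return a div element in the XHTML namespace whether the page was parsed as HTML or as XML.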

Stylesheets

Feature HTML Requirement XHTML Requirement Notes Guidance for XHTML-HTML compatibility
CSS Selectors Match case insensitively in HTML. Match case sensitively in XHTML. For polyglot documents, use lower-case selectors or as appropriate (e.g., for SVG CamelCased items).
Styling of html/body elements CSS requires special handling of the body element in HTML for painting backgrounds on the canvas. XHTML does not require special handling. Style the html and body elements appropriately (?).

Differences Between HTML4 and HTML5

See HTML5 differences from HTML4.

Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs

This section might belong on a separate page.

  • TODO (need to talk about the changes to the DOM API that HTML5 is making, compared with DOM2 and DOM3)

Translations