HTML vs. XHTML
Differences Between HTML and XHTML
Please note that the information in here is based upon the current spec for (X)HTML5. Some of the issues technically do not apply to previous versions of HTML.
Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.
Note: As the current WHATWG document is a draft, this section will need to track a moving target. Differences marked @@@ are differences that could theoretically be changed without affecting backwards compatibility.
MIME Types
- XHTML must be served with an XML MIME type, such as application/xml or application/xhtml+xml.
- HTML must be served as text/html.
It is the MIME type that determines what type of document you are using. If you attempt to send XHTML as text/html, you are actually just using HTML, possibly with syntax errors.
Technically, according to the spec, XHTML 1.0 is allowed to be served as text/html. But, for the reason given above, such a document is considered to be an HTML document, not an XHTML document.
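As a rough illustration of the rule that the MIME type decides the document type, the sketch below classifies a Content-Type header value. The function name document_kind and the labels it returns are hypothetical, and it recognises only the three MIME types named in this section, which is narrower than what browsers actually accept.

```python
# Hypothetical helper: decide which parser family a response would get,
# based only on the MIME types named in this section.
XML_TYPES = {"application/xml", "application/xhtml+xml"}


def document_kind(content_type):
    """Return "xml", "html" or "unknown" for a Content-Type header value."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in XML_TYPES:
        return "xml"
    if mime == "text/html":
        return "html"
    return "unknown"
```

Note that XHTML-looking markup passed in with text/html is still classified as "html" here, which mirrors the point made above.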
Parsing
XHTML uses XML parsing requirements. HTML uses its own parsing requirements, which are defined much more closely to the way browsers actually handle HTML today.
- In XHTML, well-formedness errors are fatal. In HTML, error handling rules are much more graceful. Well-formedness errors, which are also syntax errors in HTML, include the following:
- Unencoded ampersands (& instead of &amp;) and less-than signs (< instead of &lt;). (This does not apply to CDATA.)
- Comments containing extra pairs of hyphens or ending with a hyphen, e.g. <!-- syntax -- error --> or <!-- syntax error --->.
- Mismatched end tags (does not apply to elements with optional tags).
- Unclosed tags.
- Unexpected characters occurring in or before attribute names.
- Unexpected occurrence of EOF.
- Unexpected characters before the DOCTYPE name.
- Missing DOCTYPE name.
- A PUBLIC identifier in a DOCTYPE without a SYSTEM identifier (Note: including either of these is a syntax error in HTML5; but in XML, only the SYSTEM identifier is allowed to occur on its own).
- End tags with attributes.
- Unexpected end tags (in HTML, an unexpected </br> or </p> can cause the start tag to be implied before it).
- The internal subset is permitted in XML, but meaningless (and forbidden) in HTML.
- In some cases, an internal subset in HTML would end up being partly rendered inline.
- The sequence of characters "]]>" when it does not mark the end of a CDATA section is a well-formedness error in XHTML, but valid in HTML.
- In XHTML, <![CDATA[...]]> is a CDATA section. In HTML, it's a bogus comment.
- In XHTML, <?foo ...?> is a processing instruction. In HTML, it's a bogus comment.
- In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
- In HTML, the script and style elements are parsed as CDATA. (Note: the definition of CDATA differs from that in XML.) In XML, they're parsed as normal elements (which means that comments are treated as real comments, and things that look like start tags actually are start tags).
- In HTML, the title and textarea elements are parsed as RCDATA. (Note: the definition of RCDATA differs from that in SGML, and there is no RCDATA in XML.)
- In HTML, if scripting is enabled, the noscript element is parsed as CDATA. If scripting is disabled, it's parsed as PCDATA. In XHTML, the element has no effect, and can't really be used to stop content from being present when script is disabled.
- In HTML, the iframe, noembed and noframes elements are parsed as CDATA. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
- White space characters in attribute values are normalized to spaces in XHTML.
- In HTML, elements with optional tags are implied in certain conditions.
- In HTML, base, link, meta, style and title elements with tags occurring in the body are moved into the head. In XHTML, they stay where they were specified.
- In HTML, tags for certain elements, which appear out of context, are ignored. This includes caption, col, colgroup, frame, frameset, head, option, optgroup, tbody, td, tfoot, th, thead and tr.
- The plaintext element has a special parsing requirement in HTML (it is, however, forbidden).
- Many other special cases of edge-case and error handling, not all of which are listed here, occur in HTML.
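The contrast between fatal and graceful error handling at the top of this list can be sketched with Python's standard library. This is only an approximation: xml.etree.ElementTree stands in for an XML parser, and html.parser is a lenient tokenizer, not an implementation of the HTML5 parsing algorithm. The names SAMPLE, TagCollector, xml_accepts and html_start_tags are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Unencoded ampersand and an unclosed <br>: two well-formedness errors.
SAMPLE = "<p>Fish & chips<br></p>"


class TagCollector(HTMLParser):
    """Collects the names of start tags seen by the lenient HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def xml_accepts(markup):
    """Return True if the XML parser accepts the markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def html_start_tags(markup):
    """Return the start tags the HTML tokenizer reports, without ever failing."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags
```

With these helpers, xml_accepts(SAMPLE) reports a failure, while html_start_tags(SAMPLE) still yields the p and br start tags.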
Syntax
- In HTML, the doctype is required. In XHTML, it is optional.
- In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
- In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both: html (both), head (both), body (both), li (end tag), dt (end tag), dd (end tag), p (end tag), colgroup (both), thead (end tag), tbody (both), tfoot (end tag), tr (end tag), td (end tag), th (end tag).
- In XHTML, empty elements may use either the empty element syntax (<br/>) or have an end tag immediately follow the start tag (<br></br>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and can be omitted. End tags for void elements are forbidden. The void elements are base, link, meta, hr, br, img, embed, param, area, col and input.
- Note: the following are treated as void elements for the purposes of the parsing requirements, but, as they are obsolete and non-standard, the trailing slash is not permitted: basefont, bgsound, spacer, wbr (although, since these elements are not permitted anyway, it doesn't make much difference).
- HTML allows attribute minimisation (i.e. omitting the value), XHTML does not.
- HTML allows the use of unquoted attribute values, XHTML does not.
- XHTML allows the use of CDATA sections, HTML does not.
- XHTML allows the use of processing instructions, HTML does not.
- In HTML, all entity references are predefined and do not require a DTD. But because there is no DTD for XHTML5, entity references cannot be used in XHTML (excluding the 5 predefined entities: &amp;, &lt;, &gt;, &quot; and &apos;).
- You may provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
- The valid set of Unicode characters in XML 1.0 is more restricted than that in HTML.
- Namespace prefixes are permitted in XHTML. They are forbidden in HTML.
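Attribute minimisation, unquoted attribute values and case sensitivity, as listed above, can be observed with a small sketch. It assumes Python's html.parser as a stand-in for an HTML parser (it is a lenient tokenizer, not a conforming HTML5 parser) and xml.etree.ElementTree as the XML parser. StartTagRecorder, html_start_tags and xml_root are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTagRecorder(HTMLParser):
    """Records (tag, attributes) for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def html_start_tags(markup):
    """Tokenize markup leniently and return the recorded start tags."""
    recorder = StartTagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.seen


def xml_root(markup):
    """Return (tag, attributes) of the root element, or None if not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        return None
    return (root.tag, dict(root.attrib))
```

Feeding <INPUT type=checkbox CHECKED> to html_start_tags gives lowercased names and a valueless checked attribute, whereas xml_root rejects the unquoted, minimised form and preserves case in well-formed input such as <Input Type="checkbox"/>.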
Markup
- The namespace declaration (xmlns attribute) is required in XHTML. The xmlns attribute is also allowed to appear on the html element in HTML on the condition that it has the value "http://www.w3.org/1999/xhtml", as in <html xmlns="http://www.w3.org/1999/xhtml">.
- In HTML, the xmlns attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the attribute ends up in the null namespace.
- In XML (with an XML Namespaces-aware parser), an xmlns attribute is part of the namespace declaration mechanism, and an element cannot actually have an xmlns attribute in the null namespace. In DOM implementations, the attribute ends up in the "http://www.w3.org/2000/xmlns/" namespace.
- XHTML allows non-XHTML elements and attributes (in different namespaces) to be used, HTML does not.
- XHTML uses the xml:lang attribute, HTML uses lang instead.
- XML ID introduces xml:id, which could be used in XHTML. In HTML it has no effect.
- In HTML, the noscript element may be used. In XHTML, it is forbidden.
- HTML uses the base element, XHTML uses xml:base instead.
- In XHTML, p elements may contain structured inline level elements including blockquote, dl, menu, ol, ul, pre and table. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
- In XHTML, table elements may contain child tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
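The different treatment of the xmlns attribute described at the start of this list can be sketched as follows. The example assumes Python's namespace-aware xml.etree.ElementTree for the XML side and html.parser for the HTML side; the latter has no notion of namespaces at all, so it only shows xmlns being handled as an ordinary attribute. RootAttributes, xml_view and html_view are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"
MARKUP = '<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>'


class RootAttributes(HTMLParser):
    """Remembers the attributes of the first start tag only."""

    def __init__(self):
        super().__init__()
        self.root_attrs = None

    def handle_starttag(self, tag, attrs):
        if self.root_attrs is None:
            self.root_attrs = dict(attrs)


# XML side: xmlns is consumed as a namespace declaration, so it qualifies
# the element names and does not show up among the attributes.
root = ET.fromstring(MARKUP)
xml_view = (root.tag, dict(root.attrib), root[0].tag)

# HTML side: xmlns is reported like any other attribute.
html_parser = RootAttributes()
html_parser.feed(MARKUP)
html_parser.close()
html_view = html_parser.root_attrs
```

Here xml_view carries the namespace inside the element names (ElementTree writes them as {namespace}name), while html_view simply contains an xmlns key.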
Character Encoding
- In XHTML, the XML declaration may be used to specify the character encoding. In HTML, the XML declaration is forbidden.
- In HTML, the meta element may be used instead. The http-equiv attribute on the meta element is forbidden in XHTML and is ignored if included.
- The default character encoding for XHTML is, according to XML rules, UTF-8 or UTF-16. If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
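The XML side of these rules can be tried out with a short sketch, assuming Python's xml.etree.ElementTree as the XML parser. The helper xml_text and the three sample variables are hypothetical; the byte \xe9 is "é" in ISO-8859-1, and \xc3\xa9 is the same character in UTF-8.

```python
import xml.etree.ElementTree as ET


def xml_text(data):
    """Return the root element's text, or None if the bytes are not well-formed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


# The XML declaration names the encoding, so the ISO-8859-1 byte is decoded.
declared = xml_text(b'<?xml version="1.0" encoding="ISO-8859-1"?><p>\xe9</p>')

# No declaration and no BOM: the bytes are read as UTF-8.
default_utf8 = xml_text(b"<p>\xc3\xa9</p>")

# The same ISO-8859-1 byte without a declaration is invalid UTF-8,
# which is a fatal error for an XML parser.
undeclared_latin1 = xml_text(b"<p>\xe9</p>")
```

The first two calls both yield "é", and the third yields None because parsing fails.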
Scripts
- document.write() and document.writeln() cannot be used in XHTML; they can in HTML.
- In XHTML, the use of the innerHTML property requires that the string be a well-formed fragment of XML.
- DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace.)
- Element.tagName, Node.nodeName, and Node.localName return the value in uppercase.
- Document.createElement() is case insensitive (the canonical form is lowercase).
- Element.setAttributeNode() will change the attribute name to lowercase.
- Element.setAttribute() is case insensitive (the canonical form is lowercase).
- Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
- Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name must be lowercased before the rename takes place.
- In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
- In XHTML, browsers lack interoperability in this area. In Firefox, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element and in Safari, it's always null.
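Only the XML half of this contrast can be shown without a browser. The sketch below uses Python's xml.dom.minidom, a generic XML DOM that is not an HTML DOM and makes no claim about browser behaviour, to show case-sensitive names and the null namespace produced by createElement(). The variable names are hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString("<root><Item/><item/></root>")

# createElement keeps the name exactly as given; no case folding happens.
created = doc.createElement("DIV")
created_name = created.tagName

# The element is created without a namespace (None in Python, null in the DOM).
created_namespace = created.namespaceURI

# Lookups by tag name are case sensitive: "item" matches one child,
# "ITEM" matches nothing.
lower_matches = doc.getElementsByTagName("item")
upper_matches = doc.getElementsByTagName("ITEM")
```

In an HTML document, by contrast, the list above says the corresponding calls would fold case.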
Stylesheets
- Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
- CSS requires special handling of the body element in HTML for painting backgrounds on the canvas; this does not apply to XHTML.
Differences Between HTML 4.01 and HTML 5
MIME Type
Both HTML 4.01 and HTML 5 use text/html.
Content-Type: text/html; charset=UTF-8
Parsing HTML
- HTML 2.0 to HTML 4.01 were formally based on SGML, but browsers did not implement SGML parsers. See 4.2 SGML and B.3 SGML implementation notes, HTML 4.01. The implementation notes are a non-normative part of the HTML 4.01 specification, and they already distinguish between HTML user agents and SGML user agents.
- HTML 5 defines its own parsing requirements based on the way browsers actually handle HTML.
Syntax
- TODO
Markup
Obsolete Attributes
Some attributes that were defined in HTML4 are not included in HTML5. Here's a current list (subject to change, see the spec):
- html@version
- head@profile
- a@rev, link@rev
- a@target, area@target, base@target, form@target (is mentioned in WF2...), link@target
- a@charset, link@charset, script@charset
- table@summary
- td@headers, th@headers
- td@axis, th@axis
- param@valuetype
- object@standby
- meta@scheme
- object@archive
In addition, HTML5 has none of the presentational attributes that were in HTML4 (including those on <table>). Any attributes defined on elements that are not in HTML5 are (obviously) also not in HTML5.
Obsolete Elements
The following elements were present in HTML4 but are not defined in HTML5:
- acronym (use <abbr> instead)
- applet (use <object> instead)
- basefont
- big
- center
- dir
- font
- frame
- frameset
- isindex
- noframes
- noscript (only in XHTML)
- s
- strike
- tt
- u
Character Encoding
HTML 4 Algorithm
Source: 5.2.2 Specifying the character encoding, HTML 4.01 Specification. Conforming user agents must observe the following priorities, from highest to lowest:
- An HTTP "charset" parameter in a "Content-Type" field.
- A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
- The charset attribute set on an element that designates an external resource.
HTML 5 Algorithm
The exact algorithm that browsers must follow in order to determine the character encoding is specified in HTML 5. The basic algorithm works as follows:
- If the transport layer specifies an encoding, use that, and abort these steps. (e.g. The HTTP Content-Type header).
- Read the first 512 bytes of the file, or at least as much as possible if less than that.
- If the file starts with a UTF-8, UTF-16 or UTF-32 BOM, then use that and abort these steps.
- Otherwise use the special algorithm to search the first 512 bytes for a meta element that declares the encoding. The algorithm is relatively lenient in what it will detect, though since it doesn't use the normal parsing algorithm, there are some restrictions.
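The basic steps above can be sketched roughly as follows. This is a simplification under stated assumptions, not the algorithm from the spec: the meta search is a plain regular expression rather than the dedicated prescan algorithm, and the names sniff_encoding, PRESCAN_LIMIT and transport_encoding are hypothetical.

```python
import codecs
import re

PRESCAN_LIMIT = 512  # number of leading bytes examined, as described above

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified stand-in for the meta prescan: find charset=... inside a meta tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_encoding(data, transport_encoding=None):
    """Return an encoding label for the document bytes, or None if undetermined."""
    # Step 1: an encoding from the transport layer (e.g. HTTP Content-Type) wins.
    if transport_encoding:
        return transport_encoding
    # Step 2: look only at the first 512 bytes (or fewer if the file is shorter).
    head = data[:PRESCAN_LIMIT]
    # Step 3: a byte order mark decides the encoding.
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # Step 4: otherwise search the same bytes for a meta declaration.
    match = _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Returning None leaves any fallback or heuristic to the caller, since the steps listed here do not say what happens when nothing is found.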
Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs
This section might belong on a separate page.
- TODO (need to talk about the changes to the DOM API that HTML5 is making, compared with DOM2 and DOM3)
Translations
- German translation: In progress (Jens Meiert)