WHATWG Wiki - User contributions [en]

HTML vs. XHTML

2007-08-31T06:21:45Z

Usome: /* Translations */

== Differences Between HTML and XHTML ==

'''Please note that the information in here is based upon the current spec for (X)HTML5. Some of the issues technically do not apply to previous versions of HTML.'''

Although HTML and XHTML appear to have similarities in their syntax, they are significantly different in many ways.

'''Note''': As the current WHATWG document is a draft, this section will need to track to a moving target.
Differences marked @@@ are differences that could theoretically be changed without affecting
backwards compatibility.

=== MIME Types ===

* XHTML must be served with an XML MIME type, such as <code>application/xml</code> or <code>application/xhtml+xml</code>.
* HTML must be served as <code>text/html</code>.

It is the MIME type that determines what type of document you are using. If you attempt to send XHTML as <code>text/html</code>, you are actually just using HTML, possibly with syntax errors.

Technically, according to the spec, XHTML 1.0 is allowed to be served as <code>text/html</code>. But, due to the above reason, such a document is considered to be an HTML document, not an XHTML document.
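The rule above can be sketched in a few lines of Python. This is only an illustration: the function name <code>parser_for</code> and the exact set of XML MIME types checked are assumptions made for the example, not something taken from the spec.

```python
# Illustrative sketch: the MIME type, not the markup, selects the parser.
XML_MIME_TYPES = {"application/xml", "text/xml", "application/xhtml+xml"}


def parser_for(content_type: str) -> str:
    """Return "xml" or "html" for a Content-Type header value (hypothetical helper)."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in XML_MIME_TYPES or mime.endswith("+xml"):
        return "xml"
    if mime == "text/html":
        return "html"
    raise ValueError(f"not an HTML or XML MIME type: {mime!r}")
```

Under this sketch, a document written with XHTML syntax but sent as <code>text/html</code> is still handed to the HTML parser.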

=== Parsing ===

XHTML uses XML parsing requirements. HTML uses its own parsing requirements, which are defined much more closely to the way browsers actually handle HTML today.

* In XHTML, well-formedness errors are fatal. In HTML, error handling rules are much more graceful. Well-formedness errors, which are also syntax errors in HTML, include the following:
** Unencoded ampersands (<code>&</code> instead of <code>&amp;</code>), and less than signs (<code><</code> instead of <code>&lt;</code>) (This does not apply to <code>CDATA</code>). (Note: in HTML, an unencoded ampersand is allowed in some cases.)
** Comments containing extra pairs of hyphens or ending with a hyphen, e.g.
*** <code><!-- a -- b --></code> or
*** <code><!-- a ---></code>.
** Mismatched end tags (does not apply to elements with optional tags)
** Unclosed tags.
** Unexpected characters occurring in or before attribute names.
** Unexpected occurrence of EOF.
** Unexpected characters before the DOCTYPE name.
** Missing DOCTYPE name.
** A <code>PUBLIC</code> identifier in a <code>DOCTYPE</code> without a <code>SYSTEM</code> identifier (Note: including either of these is a syntax error in HTML5; but, in XML only the <code>SYSTEM</code> identifier is allowed to occur on its own).
** End tags with attributes.
** Unexpected end tags (in HTML, an unexpected <code></p></code> or <code></br></code> can cause the start tag to be implied before it).
* The internal subset is permitted in XML, but meaningless (and forbidden) in HTML.
** In some cases, an internal subset in HTML would end up being partly rendered inline.
* The sequence of characters "<code>]]></code>" in content when it does not mark the end of a <code>CDATA</code> section is a well-formedness error in XHTML, but valid in HTML.
* In XHTML: <code><![CDATA[...]]></code> is a <code>CDATA</code> section. In HTML, it's a bogus comment.
* In XHTML, <code><?foo ...?></code> is a processing instruction. In HTML, it's a bogus comment.
* In HTML, the trailing slash used for the empty element syntax is a parse error for non-void elements (see below), but is ignored in all cases.
* In HTML, the <code>script</code> and <code>style</code> elements are parsed as <code>CDATA</code>. (Note: the definition of <code>CDATA</code> differs from that in XML). In XML, they're parsed as normal elements (which means that comments are treated as real comments, and things that look like start tags actually are start tags).
* In HTML, the <code>title</code> and <code>textarea</code> elements are parsed as <code>RCDATA</code>. (Note: The definition of <code>RCDATA</code> differs from that in SGML and there is no <code>RCDATA</code> in XML).
* In HTML, if scripting is enabled, the <code>noscript</code> element is parsed as <code>CDATA</code>. If scripting is disabled, it's parsed as <code>PCDATA</code>. In XHTML, the element has no effect, and can't really be used to stop content from being present when script is disabled.
* In HTML, the <code>iframe</code>, <code>noembed</code> and <code>noframes</code> elements are parsed as <code>CDATA</code>. In XHTML, they are parsed as normal elements, and therefore do not stop content from being used.
* White space characters in attribute values are [http://www.w3.org/TR/REC-xml/#AVNormalize normalized] to spaces in XHTML.
* In HTML, elements with optional tags are implied in certain conditions.
* In HTML, <code>title</code> elements with tags occurring in the body are moved into the head. In XHTML, they stay where they were specified.
* In HTML, tags for certain elements, which appear out of context, are ignored. This includes <code>caption</code>, <code>col</code>, <code>colgroup</code>, <code>frame</code>, <code>frameset</code>, <code>head</code>, <code>option</code>, <code>optgroup</code>, <code>tbody</code>, <code>td</code>, <code>tfoot</code>, <code>th</code>, <code>thead</code>, <code>tr</code>.
* The <code>plaintext</code> element has a special parsing requirement in HTML. (It is, however, forbidden.)
* Many other special handling of edge cases and error conditions, not all of which are listed here, occur in HTML.
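The difference in error handling can be seen with Python's standard library. This is a rough sketch only: <code>html.parser</code> is a lenient parser but it does not implement the HTML5 parsing algorithm, and the helper names below are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

MARKUP = "<p>fish & chips</p>"  # unencoded ampersand


class TextCollector(HTMLParser):
    """Collects character data; stands in for a forgiving HTML parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_xml(markup):
    # An XML parser must stop at the first well-formedness error.
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return f"fatal: {exc}"


def parse_as_html(markup):
    # The lenient parser recovers and still yields the text.
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)
```

Here <code>parse_as_xml(MARKUP)</code> reports a fatal error, while <code>parse_as_html(MARKUP)</code> returns the text with the ampersand intact.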

=== Syntax ===

* In HTML, [http://blog.whatwg.org/faq/#doctype the <code>doctype</code> is required]. In XHTML, it is optional.
* In XHTML, tag names and attribute names are case sensitive. In HTML, they are case insensitive.
* In XHTML, non-empty elements require both a start and an end tag. In HTML, certain elements allow the omission of either or both:
** <code>html</code> (both)
** <code>head</code> (both)
** <code>body</code> (both)
** <code>li</code> (end tag)
** <code>dt</code> (end tag)
** <code>dd</code> (end tag)
** <code>p</code> (end tag)
** <code>colgroup</code> (both)
** <code>thead</code> (end tag)
** <code>tbody</code> (both)
** <code>tfoot</code> (end tag)
** <code>tr</code> (end tag)
** <code>td</code> (end tag)
** <code>th</code> (end tag)
* In XHTML, empty elements may use either the empty element syntax (<code><br/></code>) or have an end tag immediately follow the start tag (<code><br></br></code>). In HTML, the empty element syntax (trailing slash) is allowed on void elements, but forbidden on other elements. However, it serves no purpose whatsoever and can be omitted. End tags for void elements are forbidden. The void elements are:
** <code>base</code>, <code>link</code>, <code>meta</code>, <code>hr</code>, <code>br</code>, <code>img</code>, <code>embed</code>, <code>param</code>, <code>area</code>, <code>col</code> and <code>input</code>
** Note: the following are treated as void elements for the purposes of the parsing requirements, but, as they are obsolete and non-standard, the trailing slash is not permitted: <code>basefont</code>, <code>bgsound</code>, <code>spacer</code>, <code>wbr</code> (although, since these elements are not permitted anyway, it doesn't make much difference).
* HTML allows attribute minimisation (i.e. omitting the value), XHTML does not.
* HTML allows the use of unquoted attribute values, XHTML does not.
* XHTML allows the use of <code>CDATA</code> sections, HTML does not.
* XHTML allows the use of processing instructions, HTML does not.
* In HTML, all entity references are predefined and do not require a DTD. But because there is no DTD for XHTML5, entity references cannot be used in XHTML (excluding the 5 predefined entities: <code>&amp;</code>, <code>&lt;</code>, <code>&gt;</code>, <code>&quot;</code> and <code>&apos;</code>).
** You may provide your own DTD for use with your own validating parser, but be aware that browsers do not use validating parsers and will not read the DTD.
* The valid set of Unicode characters in XML 1.0 is limited beyond that in HTML.
* Namespace prefixes are permitted in XHTML. They are forbidden in HTML.
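A few of the syntax differences listed above (case sensitivity, attribute minimisation, unquoted attribute values) can be tried out with Python's standard library. This is a sketch under the assumption that <code>html.parser</code> is an acceptable stand-in for an HTML parser; it is not HTML5-conformant, and the helper names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class StartTagRecorder(HTMLParser):
    """Records (tag, attributes) for each start tag seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def html_start_tags(markup):
    # Tag and attribute names are reported in lowercase, whatever the source case.
    recorder = StartTagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.start_tags


def xml_accepts(markup):
    # True only if the markup is well-formed XML.
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

For example, <code>html_start_tags("<INPUT TYPE=checkbox CHECKED>")</code> succeeds and reports lowercase names, whereas <code>xml_accepts</code> rejects the same string because the attribute values are unquoted and minimised.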

=== Markup ===

* The [http://blog.whatwg.org/faq/#namespace-decl namespace declaration] (<code>xmlns</code> attribute) is required in XHTML. The xmlns attribute is also allowed to appear on the <code>html</code> element in HTML on the condition that it has the value <code><nowiki>"http://www.w3.org/1999/xhtml"</nowiki></code>.
** <code><nowiki><html xmlns="http://www.w3.org/1999/xhtml"></nowiki></code>
** In HTML, the xmlns attribute has absolutely no effect. It is basically a talisman. It is allowed merely to make migration to and from XHTML mildly easier. When parsed by an HTML parser, the attribute ends up in the null namespace.
** In XML (with an [http://www.w3.org/TR/xml-names/ XML Namespaces]-aware parser), an xmlns attribute is part of the namespace declaration mechanism, and an element cannot actually have an xmlns attribute in the null namespace. In DOM implementations, the attribute ends up in the "<code><nowiki>http://www.w3.org/2000/xmlns/</nowiki></code>" namespace.
* XHTML allows non XHTML elements and attributes (in different namespaces) to be used, HTML does not.
* XHTML uses the <code>xml:lang</code> attribute, HTML uses <code>lang</code> instead.
* XML ID introduces <code>xml:id</code>, which could be used in XHTML. In HTML it has no effect.
* In HTML, the <code>noscript</code> element may be used. In XHTML, it is forbidden.
* HTML uses the <code>base</code> element, XHTML uses <code>xml:base</code> instead.
* In XHTML, <code>p</code> elements may contain structured inline level elements including <code>blockquote</code>, <code>dl</code>, <code>menu</code>, <code>ol</code>, <code>ul</code>, <code>pre</code> and <code>table</code>. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).
* In XHTML, <code>table</code> elements may contain child <code>tr</code> elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).

=== Character Encoding ===

* In XHTML, the XML declaration may be used to [http://blog.whatwg.org/faq/#charset specify the character encoding]. In HTML, the XML declaration is forbidden.
* In HTML, the <code>meta</code> element with a <code>charset</code> attribute may be used instead. It is forbidden in XHTML and is ignored if included.
* The default character encoding for XHTML is, according to XML rules, <code>UTF-8</code> or <code>UTF-16</code>. If the encoding is unspecified in HTML, it should be determined through implementation specific heuristics or fallback to a default value (Note: this section of the spec is not yet finished).
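As a rough illustration of the two declaration mechanisms, the sketch below looks for an encoding in the XML declaration (XHTML) or in a <code>meta charset</code> attribute (HTML). It is deliberately simplified: the regular expressions, the 1024-byte scan window, the choice of <code>UTF-8</code> as the XHTML default (ignoring byte order marks and UTF-16) and the function name are all assumptions for the example, not the algorithm from the spec.

```python
import re

# Simplified patterns; real parsers handle many more variations.
XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']')
META_CHARSET = re.compile(rb'<meta\s+charset=["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_encoding(data: bytes, is_xhtml: bool):
    """Return the declared encoding, or a default (hypothetical helper)."""
    if is_xhtml:
        # Only the XML declaration counts; a meta charset is ignored.
        match = XML_DECL.match(data)
        return match.group(1).decode("ascii") if match else "UTF-8"
    # HTML: look for <meta charset=...> near the start of the document.
    match = META_CHARSET.search(data[:1024])
    # None means "fall back to heuristics or a default", as described above.
    return match.group(1).decode("ascii") if match else None
```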

=== Scripts ===

* <code>document.write()</code> and <code>document.writeln()</code> cannot be used in XHTML, they can in HTML.
* In XHTML, the use of the <code>innerHTML</code> property requires that the string be a well-formed fragment of XML.
* DOM APIs are case sensitive in XHTML and some are case insensitive in HTML. (This does not apply to elements which are not in the HTML namespace)
** Element.tagName, Node.nodeName, and Node.localName return the value in uppercase.
** Document.createElement() is case insensitive (the canonical form is lowercase).
** Element.setAttributeNode() will change the attribute name to lowercase.
** Element.setAttribute() is case insensitive (the canonical form is lowercase).
** Document.getElementsByTagName() and Element.getElementsByTagName() are case insensitive.
** Document.renameNode(). If the new namespace is the HTML namespace, then the new qualified name must be lowercased before the rename takes place.
* In HTML, Document.createElement() will create an element in the HTML namespace. In XML (including XHTML), the namespace is defined by both DOM2 and DOM3 to be null.
** In XHTML, browsers lack interoperability in this area. In Firefox, the namespace is dependent upon the MIME type. In Opera, it's dependent upon the root element and in Safari, it's always null.
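The XML side of the <code>createElement()</code> behaviour can be observed with Python's <code>xml.dom.minidom</code>, which is a plain XML DOM and knows nothing about HTML. This is only an analogy to what a browser does for XML documents; the function name <code>make_elements</code> is made up for the example.

```python
from xml.dom.minidom import getDOMImplementation

XHTML_NS = "http://www.w3.org/1999/xhtml"


def make_elements():
    """Create one element without and one with an explicit namespace."""
    doc = getDOMImplementation().createDocument(XHTML_NS, "html", None)
    # createElement() in an XML DOM: null namespace, case preserved.
    plain = doc.createElement("DIV")
    # createElementNS() is needed to get an element in the XHTML namespace.
    namespaced = doc.createElementNS(XHTML_NS, "div")
    return plain, namespaced
```

In this XML DOM, <code>plain.namespaceURI</code> is <code>None</code> and <code>plain.tagName</code> keeps its original case, in contrast to the HTML behaviour listed above.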

=== Stylesheets ===

* Selectors, as used in CSS, match case sensitively in XHTML, but case insensitively in HTML.
* CSS requires special handling of the body element in HTML for painting backgrounds on the canvas, which does not apply to XHTML.

== Differences Between HTML 4.01 and HTML 5 ==

See [[Differences from HTML4]].

== Differences Between DOM Level 2.0, 3.0 and the HTML 5 DOM APIs ==

'''This section might belong on a separate page.'''

* TODO (need to talk about the changes to the DOM API that HTML5 is making, compared with DOM2 and DOM3)

== Translations ==

* [http://meiert.com/de/publications/translations/whatwg.org/html-vs-xhtml/ German translation: "HTML 5 und XHTML 5 im Vergleich (WHATWG)"]

Link Hashes

2007-08-31T06:20:40Z

Usome: /* Content-MD5 HTTP Header */

Many download sites, especially for software downloads, give hashes or digests for the files they distribute so that users can check the validity of the files once they've downloaded them. The process for verifying the hash, however, isn't straightforward.

== Problem Description ==
A lot of software download pages already give you MD5 or SHA-1 digest values to check the validity of the downloaded file. Checking the file ensures that the downloaded file is the same as the one the author of the page wanted to give you. Corrupted or tampered files can be detected that way.

The problem is that there is no way to automate that verification process. To automate this process, a browser would need to extract the hash associated with the link on the original page.

=== Current Usage ===
Some links to software download pages featuring hashes:
* Apple: [http://www.apple.com/support/downloads/securityupdate20060061039client.html Security Update 2006-006]
* [http://www.php.net/downloads.php PHP Downloads]
* Apache: [http://httpd.apache.org/download.cgi HTTP Server]

Other examples can be found on the [http://microformats.org/wiki/hash-examples#Who_offers_MD5.2FSHA-1_checksums_with_software hash examples] page on the Microformats wiki.

=== Benefits ===
Easier discoverability of tampered files which could come from a mirror server being hacked.

== Proposed Solutions ==

=== hash attribute ===
A hash attribute could contain an MD5 checksum of the target file. If the hash of the downloaded file does not match the one from the link, the file is deleted or quarantined and the user is alerted of a potential security risk.

<pre>
<a href="..." hash="b3187253c1667fac7d20bb762ad53967">
</pre>

==== Processing Model ====
When the link is clicked, the browser keeps the hash in memory to compare it with the hash computed from the downloaded file. Once the file is downloaded, the computed hash is compared against the expected hash.
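The comparison step could look roughly like the following Python sketch. The function names are hypothetical, and MD5 is used only because it is the digest shown in the example above.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_matches(path, expected_hash):
    """Compare the computed digest with the value kept from the link."""
    return file_md5(path) == expected_hash.strip().lower()
```

A browser would call something like <code>download_matches</code> once the download completes and, on a mismatch, delete or quarantine the file and alert the user.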

:"To be completed: what to do about non-download links, like links to other pages, when they have a hash?"

==== Limitations ====
:''Cases not covered by this solution in relation to the problem description; other problems with this solution, if any.''

==== Implementation ====
The software industry as a whole is more and more concerned about the security implications of the Internet. Security has become another feature of the browser. Something that increases security with minor impact on the user experience will probably be welcome.

A browser could display the following message in case of a hash mismatch:

:''File "image.iso" is different from the file linked on page "My Software CD Images". It is possible that this file has been tampered with and it'd be advisable to not open it. Do you wish to delete the file?'' [Delete File] [Keep in Quarantine]

==== Adoption ====
Distributors that already give hashes for their users to verify the files are very likely to add this extra attribute if it simplifies the security checks for their users. The fact that the digests are already available on these pages means that the author of the page is already concerned about the security of the transferred file.

=== Hash Microformat ===
The hash microformat provides a way to associate hash values with links:

<pre>

<a rel="bookmark" href="...">Download OpenOffice.org
e0d123e5f316bef78bfdf5a008837577
</a>

</pre>

The microformat is better described on the [http://microformats.org/wiki/hash-examples hash-examples] page.

==== Processing Model ====
When a link is clicked, the browser checks whether it corresponds to the microformat (''details to be added''). If it does, the hash value is extracted and, once the file is downloaded, the computed hash for the file is compared against the expected hash. Browsers should keep the initial hash value across redirections, if any. This only applies to files downloaded to the disk.

:"Could the syntax be extended so that fragment identifiers could cohabit with fingerprints?"

==== Limitations ====
:''Cases not covered by this solution in relation to the problem description; other problems with this solution, if any.''

==== Implementation ====
The software industry as a whole is more and more concerned about the security implications of the Internet. Security has become another feature of the browser. Something that increases security with minor impact on the user experience will probably be welcome.

A browser could display the following message in case of a hash mismatch:

:''File "image.iso" is different from the file linked on page "My Software CD Images". It is possible that this file has been tampered with and it'd be advisable to not open it. Do you wish to delete the file?'' [Delete File] [Keep in Quarantine]

==== Adoption ====
Distributors that already give hashes for their users to verify the files are very likely to add this extra markup if it simplifies the security checks for their users. The fact that the digests are already available on these pages means that the author of the page is already concerned about the security of the transferred file.

The microformat markup is heavier than it needs to be. It also forces page authors to put the hash visibly inside the link, or to apply specific stylesheets to hide it in visual browsers.

=== Link Fingerprint ===
Append a digest for the file in the fragment identifier of the URL. The browser can then check the validity of the file when it downloads it.

<pre>
http://example.com/file#!md5!b3187253c1667fac7d20bb762ad53967
</pre>

The [http://www.gerv.net/security/link-fingerprints/ Link Fingerprints article] by Gervase Markham gives more details.

==== Processing Model ====
When the link is clicked, the browser checks whether the URL contains a hash. If it does, once the file is downloaded the computed hash is compared against the expected hash. Browsers should keep the initial hash value across redirections, if any. This only applies to files downloaded to the disk.
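Extracting the fingerprint from the URL could be sketched as follows. The fragment layout <code>!md5!&lt;digest&gt;</code> is taken from the example above; the function name and the decision to leave ordinary fragments untouched are assumptions made for the illustration.

```python
from urllib.parse import urldefrag


def split_fingerprint(url):
    """Return (url_without_fingerprint, algorithm, digest) for a link."""
    base, fragment = urldefrag(url)
    parts = fragment.split("!")
    # A fingerprint fragment looks like "!md5!<hex digest>".
    if len(parts) == 3 and parts[0] == "" and parts[1] and parts[2]:
        return base, parts[1].lower(), parts[2].lower()
    # Any other fragment is an ordinary fragment identifier: no fingerprint.
    return url, None, None
```

The returned digest would then be kept across redirections and compared with the digest computed from the downloaded file.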

:"Could the syntax be extended so that fragment identifiers could cohabit with fingerprints?"

==== Limitations ====
Works only for downloaded files; fragment identifiers are used in other ways for regular pages and PDF files opened in the browser with a plugin.

==== Implementation ====
The software industry as a whole is more and more concerned about the security implications of the Internet. Security has become another feature of the browser. Something that increases security with minor impact on the user experience will probably be welcome.

A browser could display the following message in case of a hash mismatch:

:''File "image.iso" is different from the file linked on page "My Software CD Images". It is possible that this file has been tampered with and it'd be advisable to not open it. Do you wish to delete the file?'' [Delete File] [Keep in Quarantine]

==== Adoption ====
Distributors that already give hashes for their users to verify the files are very likely to add the fingerprint to their links if it simplifies the security checks for their users. The fact that the digests are already available on these pages means that the author of the page is already concerned about the security of the transferred file.

=== Content-MD5 HTTP Header ===
It has been suggested to use the [http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.15 Content-MD5] HTTP header. However, a tampered file on a hacked server is very likely to get its digest updated accordingly.
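For reference, the <code>Content-MD5</code> value is the base64 encoding of the binary MD5 digest of the body, not the hexadecimal form used in the other proposals. A minimal sketch of computing and checking it (function names are hypothetical):

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the Content-MD5 header value for a message body."""
    return base64.b64encode(hashlib.md5(body).digest()).decode("ascii")


def body_matches_header(body: bytes, header_value: str) -> bool:
    """Check a received body against the Content-MD5 header it came with."""
    return content_md5(body) == header_value.strip()
```

Because the header is produced by the same server that delivers the file, this check only detects corruption in transit, which is the weakness noted above.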

== Mailing List References ==
* [http://listserver.dreamhost.com/pipermail/whatwg-whatwg.org/2006-November/007833.html Re: hash attribute] -- Tom Pike, Wed Nov 8 05:21:22 PST 2006
* [http://listserver.dreamhost.com/pipermail/whatwg-whatwg.org/2006-November/007903.html Re: hash attribute] -- Ian Hickson, Wed Nov 8 08:28:19 PST 2006
* [http://listserver.dreamhost.com/pipermail/whatwg-whatwg.org/2006-November/007857.html Re: hash attribute] -- Gervase Markham, Thu Nov 9 09:23:32 PST 2006
* [http://listserver.dreamhost.com/pipermail/whatwg-whatwg.org/2006-November/007903.html Re: hash attribute] -- Michel Fortin, Tue Nov 14 08:53:43 PST 2006

[[Category:Feature Request]]

HTML5Lib

2007-08-31T06:18:31Z

Usome: /* Testcases */

[http://code.google.com/p/html5lib/ HTML5Lib] is a project to create both Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular, a tokenizer, a parser, and a serializer. It is '''not''' an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

== SVN ==
Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you're not going to redo work someone else has done, ask on the [http://groups.google.com/group/html5lib-discuss mailing list]. For questions that could benefit from quick turnaround, talk to people on #whatwg.

== General ==

In comments "XXX" indicates something that has yet to be done: something might be wrong, might not have been written yet, or other things in that general direction.

In comments "AT" indicates that the comment documents an alternate implementation technique or strategy.

== HTMLTokenizer ==

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.
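The usage pattern (construct with a stream, then iterate to get tokens as dicts) can be illustrated with a toy class. This is '''not''' the real HTMLTokenizer: the class name, the regular expression, the token keys and the use of a plain string instead of an HTMLInputStream are all assumptions made for the sketch.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in showing the iterate-for-tokens interface."""

    _pattern = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")

    def __init__(self, stream):
        # The real tokenizer takes an HTMLInputStream; a str is used here.
        self.stream = stream

    def __iter__(self):
        for match in self._pattern.finditer(self.stream):
            closing, name, text = match.groups()
            if text is not None:
                yield {"type": "Characters", "data": text}
            elif closing:
                yield {"type": "EndTag", "name": name.lower()}
            else:
                yield {"type": "StartTag", "name": name.lower()}
```

A consumer then simply writes <code>for token in ToyTokenizer("&lt;p>Hi&lt;/p>"): ...</code> and inspects each dict.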

=== Interface ===

The parser needs to change the self.contentModelFlag attribute which affects how certain states are handled.

=== Issues ===
* Use of if statements in the states may be suboptimal (but we should time this)

== HTMLParser ==

=== Profiling on web-apps.htm ===

I did some profiling on web-apps.htm which is a rather large document. Based on that I already changed a number of things which speed us up a bit. Below are some things to consider for future revisions:

* utils.MethodDispatcher is invoked way too often. By pre-declaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm so that number could be higher. Not sure how to put them somewhere else though. First thing I tried was HTMLParser but references get all messed up then...
: We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes so the change should be painless.
:: That's an interesting idea. How would that work? [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
::: I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. [[User:Annevk|Annevk]] 11:37, 26 December 2006 (UTC)

* 713194 calls to __contains__ in sets.py makes us slow. Takes about 1.0x CPU seconds.
: I've just switched to the built-in set type. Hopefully this will help a bit. [[User:Jgraham|Jgraham]] 00:30, 25 December 2006 (UTC)
:: It did. (Not surprisingly when 700,000 method calls are gone...) [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* 440382 calls to char in tokenizer.py is the runner up with 0.8x CPU seconds.
: This is now the largest time consumer. [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* dataState in tokenizer.py with 0.7 CPU seconds is next.
: This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. [[User:Annevk|Annevk]]

* __iter__ in tokenizer.py with 0.59x CPU seconds...

* Creation of all node objects in web-apps takes .57x CPU seconds.

* etc.
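The suggestion in the first item, keeping a single instance of each insertion mode instead of constructing one on every mode switch, could be sketched like this. The class and method names (<code>ToyParser</code>, <code>switch_mode</code>) are hypothetical and do not reflect the actual html5lib code; the sketch also assumes, as the discussion hopes, that the mode classes keep no per-switch state.

```python
class InsertionMode:
    """Base class for the toy insertion modes; counts constructions."""

    instances_created = 0

    def __init__(self, parser):
        InsertionMode.instances_created += 1
        self.parser = parser


class InBody(InsertionMode):
    pass


class InTable(InsertionMode):
    pass


class ToyParser:
    def __init__(self):
        self._modes = {}  # one cached instance per mode class
        self.mode = None

    def switch_mode(self, mode_class):
        # Construct the mode on first use only; reuse it afterwards.
        if mode_class not in self._modes:
            self._modes[mode_class] = mode_class(self)
        self.mode = self._modes[mode_class]
        return self.mode
```

With this arrangement the constructor cost is paid once per mode rather than once per switch.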

== Testcases ==
Testcases are under the /tests directory. They require [http://cheeseshop.python.org/pypi/simplejson simplejson]. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the [http://groups.google.com/group/html5lib-discuss mailing list] and a concrete plan. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at [[Parser_tests]].

[[Category:Implementations]]

Main Page

2007-08-24T10:18:48Z

Usome: /* Communicating with the community */

Welcome to the WHATWG Wiki!

You can be a part of our community, making proposals for the next version of HTML5. This wiki is made available for you for drafting proposals, for writing essays, for keeping track of HTML-related issues, and so forth. Anyone can create an account and contribute content.

Before you begin, you may wish to read our [[WHATWG Wiki:Contribution Guidelines|contribution guidelines]].

==Purpose==
The purpose of the WHATWG Wiki is to create a place for WHATWG contributors to post and compile their own proposals and ideas regarding WHATWG specifications. The specifications themselves will not be available for editing via this wiki. However, ideas you post here may find their way into current and future WHATWG specifications.

== Main sections and Quick links ==
* [[Implementations]]
* [[What you can do]]
* [[Differences from HTML4|HTML5 differences from HTML4]]
* [[HTML vs. XHTML]]
* [[HTML5 Presentations]]
* [[Feature Proposals]]

==WHATWG Specifications==
* [[HTML 5]]
* [[Web Forms 2.0]]
* [[Web Controls 1.0]]

==Communicating with the community==
The WHATWG community has several channels of communication:
* [http://www.whatwg.org/mailing-list Mailing lists]
* [http://blog.whatwg.org/ The blog]
* [http://wiki.whatwg.org/ This wiki]
* [http://www.usome.com IRC]
* [http://forums.whatwg.org/ The forum]