https://wiki.whatwg.org/api.php?action=feedcontributions&user=Philip+Taylor&feedformat=atomWHATWG Wiki - User contributions [en]2024-03-29T05:57:57ZUser contributionsMediaWiki 1.39.3https://wiki.whatwg.org/index.php?title=Testsuite&diff=3516Testsuite2009-01-27T16:11:12Z<p>Philip Taylor: Suggest a non-requirement</p>
<hr />
<div>Existing tests URI: http://dev.w3.org/html5/tests/<br />
<br />
== Requirements ==<br />
<br />
* Each test needs a "reviewed" marker of some sort<br />
* It must be easy to find tests where the spec has changed under them<br />
* The barrier to contribution must be as low as possible<br />
* Testcases should have somewhat stable URIs<br />
* If a test can be written in JavaScript, it should be, so that engines can be tested more efficiently (i.e. automatically).<br />
<br />
== Non-requirements ==<br />
<br />
* There does not need to be a single consistent test harness for the whole of HTML5. (When sections can be tested in isolation, each section should use a test harness that is suited to that section's testing requirements. E.g. there is little value in fitting canvas tests and parser tests into the same framework, and it may add a lot of complexity.)<br />
<br />
== Existing tests ==<br />
* [http://samples.msdn.microsoft.com/ietestcenter/ IE tests]<br />
* [http://philip.html5.org/tests/canvas/suite/tests/ Philip's canvas tests]<br />
* [http://code.google.com/p/html5lib/source/browse/trunk/testdata/ html5lib tests]</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Validator.nu_Servlet_Overview&diff=3377Validator.nu Servlet Overview2008-10-03T10:58:06Z<p>Philip Taylor: /* <code>nu.validator.servletfilter.InboundSizeLimitFilter</code. */ Fixed markup error</p>
<hr />
<div>==The <code>Main</code> Class==<br />
<br />
Validator.nu has its own <code>main()</code> method in a class called <code>nu.validator.servlet.Main</code>. This makes debugging and isolated deployment an order of magnitude easier than doing XML situps to make an application server load the right bits.<br />
<br />
The <code>main()</code> method does the following:<br />
<br />
# Initializes log4j<br />
# Instantiates <code>VerifierServletTransaction</code> to trigger its static initializer early.<br />
# Instantiates Jetty.<br />
# Sets up an HTTP or AJP13 connector.<br />
# Builds a servlet [[#The_Filters|filter]] chain.<br />
# Adds the servlet to the server.<br />
# Starts the server.<br />
<br />
If you want to run the servlet in a larger application server, the only mandatory step you need to take care of before the servlet loads is initializing log4j. The [[#The_Filters|filter]] chain is optional (but without it some non-core features do not work; see below).<br />
<br />
==The Servlet==<br />
<br />
Validator.nu is encapsulated in one servlet: <code>nu.validator.servlet.VerifierServlet</code>. This servlet handles the generic facet, the HTML5 facet and the parsetree facet; it performs URI dispatching and decides which controller class to instantiate.<br />
<br />
Servlets are by default required to be re-entrant, so for programming convenience the servlet instantiates a controller object whose lifetime is limited to one HTTP request.<br />
<br />
==The Filters==<br />
<br />
Some non-core features are implemented as servlet filters: inbound and outbound gzip compression, support for HTML form-based file uploads and textarea-based input, and limiting the size of the input data before decompression and before form POST decoding.<br />
<br />
The filters, from outer (closer to the container) to inner (closer to the servlet), are:<br />
<br />
===<code>org.mortbay.servlet.GzipFilter</code>===<br />
<br />
Implements response compression.<br />
<br />
===<code>nu.validator.servletfilter.InboundSizeLimitFilter</code>===<br />
<br />
This filter throws a <code>nu.validator.io.StreamBoundException</code> if the request entity body is too large. This filter throttles the input for <code>nu.validator.servletfilter.InboundGzipFilter</code> and <code>nu.validator.servlet.MultipartFormDataFilter</code>. If those filters are not in use and the servlet container makes sure that POSTed content is really limited by <code>Content-Length</code> if present, this one isn’t needed, either.<br />
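<br />
The ordering matters because the limit has to count bytes as they arrive, before any decompression. The following is a minimal Python sketch of that idea, not Validator.nu's Java code; <code>BoundedReader</code>, <code>StreamBoundError</code> and <code>read_request_body</code> are hypothetical names invented for illustration.<br />
<br />
```python
import gzip
import io


class StreamBoundError(Exception):
    """Hypothetical stand-in for nu.validator.io.StreamBoundException."""


class BoundedReader:
    """Wraps a binary stream and refuses to yield more than `limit` bytes."""

    def __init__(self, stream, limit):
        self._stream = stream
        self._limit = limit
        self._seen = 0

    def read(self, size=-1):
        chunk = self._stream.read(size)
        self._seen += len(chunk)
        if self._seen > self._limit:
            raise StreamBoundError(f"body exceeds {self._limit} bytes")
        return chunk


def read_request_body(raw_stream, limit, gzipped, chunk_size=8192):
    """Read a request body, enforcing the limit on the bytes on the wire."""
    bounded = BoundedReader(raw_stream, limit)
    chunks = []
    while True:
        chunk = bounded.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    data = b"".join(chunks)
    # Decompression happens only after the size check has passed.
    return gzip.decompress(data) if gzipped else data
```
<br />
In this sketch a small compressed body passes through unchanged, while a body longer than the limit raises the error before anything is decompressed.<br />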
<br />
===<code>nu.validator.servletfilter.InboundGzipFilter</code>===<br />
<br />
Implements request decompression.<br />
<br />
===<code>nu.validator.servlet.MultipartFormDataFilter</code>===<br />
<br />
Implements support for HTML form-based file upload and textarea input by exposing these to the servlet as if the document were POSTed straight as the entity body.<br />
<br />
==The Controllers==<br />
<br />
===<code>VerifierServletTransaction</code>===<br />
<br />
The bulk of the Validator.nu UI controller and random glue that holds it all together is in <code>nu.validator.servlet.VerifierServletTransaction</code>. This is probably the ugliest class in Validator.nu; UI-related code tends to be uglier than back end code and the class has grown organically over time.<br />
<br />
Most of the initialization of Validator.nu is performed in the static initializer of this class. The default <code>Main</code> triggers early initialization by instantiating this class once before starting the HTTP server.<br />
<br />
===<code>Html5ConformanceCheckerTransaction</code>===<br />
<br />
This is a subclass of <code>VerifierServletTransaction</code> that tweaks the overall behavior just enough to implement the HTML5 facet of Validator.nu.<br />
<br />
===<code>ParseTreePrinter</code>===<br />
<br />
This is the controller for parsetree.validator.nu.<br />
<br />
<br />
[[Category:Validator.nu Documentation]]</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Talk:HTML_vs._XHTML&diff=2875Talk:HTML vs. XHTML2008-03-21T18:57:57Z<p>Philip Taylor: revert spam</p>
<hr />
<div>An often repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3. And that the proper way to tell the two apart is via MIME types.<br />
<br />
There are only two problems with that. XHTML is not as different from HTML as RDF/XML is from N3. And MIME types can't be relied on. Let's take each in turn.<br />
<br />
=== Syntax ===<br />
<br />
* Both N3 and RDF/XML are used to express sets of RDF triples. They are equally capable: every triple store can be dumped into either format. The analogy here is the DOM. It is not currently the case that every DOM tree can be dumped equally capably into either format.<br />
* N3 and RDF/XML are not the same, nor do they even look similar. They are different from top to bottom. Not only are no N3 documents valid RDF/XML, there are no individual triples that can be expressed the same way in both formats.<br />
** You need to explain how RDF/N3 is relevant! --[[User:Lachlan Hunt|Lachlan Hunt]] 04:43, 4 December 2006 (UTC)<br />
*** The top of this page starts with "An often repeated assertion is that XHTML is as different from HTML as RDF/XML is from N3". Need I provide references? -[[User:Rubys|Rubys]] 14:08, 4 December 2006 (UTC)<br />
<br />
=== Mime Types ===<br />
<br />
* People have consistently proven that they can't be trusted to configure and set MIME types correctly. Most aren't even aware that MIME types exist. The default setup with Apache is to not allow overrides. One popular use case is for documentation that is served via <code>file:///</code> URIs directly from your hard disk.<br />
** <code>file:///</code> URIs use an OS or browser specific mechanism to determine the MIME. On Windows, for instance (for IE), the file extension is mapped to a MIME type via a key in the registry. --[[User:Lachlan Hunt|Lachlan Hunt]] 11:16, 4 December 2006 (UTC)<br />
*** and as such, can rarely be depended upon. In addition to file extensions, content sniffing is also a common strategy. -[[User:Rubys|Rubys]] 14:12, 4 December 2006 (UTC)<br />
* HTTP as specified indicates that the <code>Content-Type</code> header is authoritative - it trumps the XML prolog. HTTP as practiced treats the MIME type as a hint. Whether it be feeds or WMV files, users have an expectation as to what happens when they click on these links, and are unhappy when the browser lets them down.<br />
** For compatibility, those issues with several file formats do, unfortunately, have to be retained. However, breaking Content-Type in that way for <code>text/html</code> to somehow allow the content to be treated as XML instead is not an option. --[[User:Lachlan Hunt|Lachlan Hunt]] 11:16, 4 December 2006 (UTC)<br />
<br />
It isn't clear to me (Hixie), however, how the fact that authors can't set the MIME type properly is supposed to be something we can ever solve from the point of view of the syntax of HTML. The full XML syntax isn't compatible with HTML parsers, and the full HTML syntax isn't compatible with XML parsers. The common subset is a tiny language that doesn't support widely used features like <style> or scripting. We can't parse text/html files as anything but HTML. The parser used for content sent with XML MIME types is out of scope for the WHATWG specs (it would be up to the XML guys). It isn't that we WANT the MIME type to be the only way to distinguish the two. It's that the MIME type IS the only way. It's a statement of fact, not desire. [[User:Hixie|Hixie]] 18:24, 4 December 2006 (UTC)<br />
* [http://planet.intertwingly.net/ my planet] is served as <code>application/xhtml+xml</code> to Firefox and <code>text/html</code> to IE. It seems to be capable of doing both scripting and style in both modes. -[[User:Rubys|Rubys]]<br />
<br />
=== Ideals ===<br />
<br />
In an ideal world:<br />
* the syntax of XML and HTML would be either completely identical or completely different.<br />
** The syntax of HTML and XHTML are completely different. The fact that they look similar on the surface is irrelevant. (see above). --[[User:Lachlan Hunt|Lachlan Hunt]] 04:43, 4 December 2006 (UTC)<br />
*** Completely? I'd say that they are as different as en-us and en-au. :-) -[[User:Rubys|Rubys]]<br />
* the set of DOM trees that could be serialized as XHTML and HTML would either be completely identical or completely different.<br />
** This is not possible without breaking backwards compatibility. These incompatibilities have existed between HTML and XHTML for a long time, and that hasn't stopped people serialising their XHTML as HTML up until now (for all practical purposes, serving XHTML as text/html is equivalent to reserialising). --[[User:Lachlan Hunt|Lachlan Hunt]] 11:16, 4 December 2006 (UTC)<br />
* <code>Content-Type</code> would either always be respected, or always be ignored.<br />
** <code>Content-Type</code> is always respected for HTML and XHTML MIME types. It's not for some others, but that's a different issue --[[User:Lachlan Hunt|Lachlan Hunt]] 11:16, 4 December 2006 (UTC)<br />
*** Always? Try serving your feed as text/html to FireFox 2.0. -[[User:Rubys|Rubys]]<br />
*** Try serving your feed as text/html to *any browser* with feed support. [[User:Sayrer|Sayrer]]<br />
* there would either be a fool-proof way to "sniff" whether a given piece of content was HTML or XHTML; or there would be no difference between XHTML and HTML in terms of syntax, and the range of DOM trees that could validly be serialized would also be identical.<br />
** There is a foolproof way... the MIME type. :-) -Hixie<br />
<br />
=== Analysis ===<br />
<br />
Obviously, the current situation is less than ideal. XML and HTML evolved from a common ancestor. XML isn't changing. And the constraint to be as backwards compatible with HTML4 as humanly possible places practical limits on what can be done. Neither being absolutely identical with the XML syntax nor being completely different are options.<br />
<br />
At the present time, the HTML5 syntax is a (near) superset of the XHTML syntax. Yet the situation is (nearly) reversed for DOM trees: the set of DOM trees that can be serialized into XHTML is larger than the set of DOM trees that can be serialized into HTML5.<br />
<br />
Having substantially similar syntaxes leads to confusion in some edge cases (e.g., <code><p/></code>) but also has some advantages. Similar syntaxes would make things easier for people who have become disillusioned with XHTML and wish to migrate to HTML5. Conversely, similar syntaxes would make incremental migration from HTML5 to XHTML5 easier for those who wish to take advantage of the greater set of DOM trees that can be represented in that syntax.<br />
<br />
=== Potential Strategies ===<br />
<br />
'''Note''': these strategies are not necessarily mutually-exclusive.<br />
<br />
* Develop better tools and actively work to integrate them into products like WordPress and DreamWeaver. (We're doing this already. -Hixie)<br />
* The definition of HTML5 understandably and correctly puts a higher weight on HTML4 compatibility than XHTML migration. But as a migration aid, identify some unlikely/invalid combination (example: use of the HTML5 DOCTYPE combined with <code>xmlns</code> attribute on the <code>html</code> element combined with the use of a non-xml MIME type) and adjust some (as of yet undefined) set of the HTML5 parsing rules.<br />
* Document these differences, either in the spec itself (as a non-normative appendix?) and/or by having a conformance checker flag these differences. Variations:<br />
** Ensure that each of these differences triggers a [http://www.whatwg.org/specs/web-apps/current-work/#parse parse error] or equivalent in HTML5; this does not (necessarily) involve changing the recovery action or the way the document is ultimately parsed.<br />
** Instead of bothering people who may not care about these differences, identify some unlikely combination (such as the DOCTYPE/xmlns/MIME combination above) and have it trigger a '''pedantic mode''' which enables these additional checks.<br />
<br />
=== table inside p? ===<br />
<br />
Do you really mean the following?<br />
<br />
<blockquote><br />
In XHTML, p elements may contain structured inline level elements including blockquote, dl, menu, ol, ul, pre and table<br />
</blockquote><br />
<br />
In what respect are blockquote, dl, menu, ol, ul, and table “inline,” and how are they allowed inside p?<br />
– [[User:Joeclark|Joeclark]] 05:07, 5 December 2006 (UTC)<br />
<br />
Yes, in XHTML5, as opposed to HTML5, the content model for <code>p</code> elements has been modified to allow [http://www.whatwg.org/specs/web-apps/current-work/#structured structured inline-level] elements. However, it's not allowed in HTML5 because of backwards compatibility constraints. The problem is that the end tag for the <code>p</code> element will be implied by the presence of those elements, so it's technically impossible to do, except through DOM manipulation.<br />
<br />
The term structured inline-level elements just refers to elements that are usually thought of as being block level, but may be used in inline-level contexts.<br />
<br />
Because XHTML isn't constrained by the same compatibility constraints as HTML, this now allows structures like the following:<br />
<br />
<code>&lt;p&gt;Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<br />
<br />
&lt;table&gt;<br />
&lt;tr&gt;<br />
&lt;th&gt;Cras est neque&lt;/th&gt;<br />
&lt;th&gt;Posuere id, lacinia eu&lt;/th&gt;<br />
&lt;/tr&gt;<br />
&lt;tr&gt;<br />
&lt;td&gt;Morbi eu neque.&lt;/td&gt;<br />
&lt;td&gt;Vivamus malesuada arcu &lt;/td&gt;<br />
&lt;/tr&gt;<br />
&lt;tr&gt;<br />
&lt;td&gt;luctus et ultrices&lt;/td&gt;<br />
&lt;td&gt;posuere cubilia&lt;/td&gt;<br />
&lt;/tr&gt;<br />
&lt;/table&gt; <br />
<br />
Nam id odio vitae enim tempor tincidunt. Sed orci. Nulla facilisi.&lt;/p&gt;</code><br />
<br />
All of those elements listed are defined to be allowed where structured inline-level content is allowed. This is a change from HTML 4.01 and XHTML 1.0, and is similar to the model proposed in XHTML 2.0. -- [[User:Lachlan Hunt|Lachlan Hunt]] 13:39, 5 December 2006 (UTC)<br />
<br />
=== table cannot have tr child? ===<br />
<br />
Do you really mean the following:<br />
<br />
<blockquote><br />
In XHTML, table elements may contain child tr elements. In the HTML serialisation, due to backwards compatibility constraints, this is not possible (though it may be done through DOM manipulation).<br />
</blockquote><br />
<br />
So, in HTML, is this not possible?<br />
<br />
<pre><br />
<br />
<table><br />
<tr><br />
<td></td><br />
<td></td><br />
</tr><br />
</table><br />
<br />
</pre></div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=SVG_and_canvas&diff=2771SVG and canvas2007-12-02T14:04:59Z<p>Philip Taylor: Added various SVG/canvas notes</p>
<hr />
<div>Numerous people have suggested that SVG and <canvas> are competing technologies. They are not. SVG provides retained mode graphics and <canvas> provides immediate mode graphics. A lot of use cases can be addressed by either, but some are easier to do with one of them and some can only be done with one of them, and other cases are better when using both.<br />
<br />
For any specific case, it is not always obvious which approach is best. This page lists a number of differences which may be relevant when designing a solution.<br />
<br />
===Advantages of SVG===<br />
<br />
; Editable static images<br />
: SVG can be saved and loaded by standard vector editing programs. It is not possible to load <canvas> scripts, since they are not written declaratively.<br />
; Accessibility<br />
: SVG can store additional non-visual information inside its normal structure ─ e.g. titles can be displayed as tooltips to give users more information when they move their mouse over an object, or can be read by speech synthesisers. <canvas> supports only a single alternative for the whole element, and authors must explicitly construct an accessible alternative in a language such as HTML.<br />
; High-quality printing<br />
: <canvas> renders onto a fixed-resolution bitmap, usually matching the screen's resolution. If the output is moved to a higher-resolution device (such as a printer), the <canvas> bitmap will appear pixellated, whereas SVG can automatically render a new high-resolution image.<br />
; Interaction<br />
: SVG can automatically detect interaction, such as clicking on an object. <canvas> can't, and authors must write their own code to convert mouse coordinates into the appropriate action.<br />
; Mixing markup<br />
: SVG can embed content from other namespaces, e.g. to include an XHTML fragment inside an image. <canvas> can't.<br />
; Text<br />
: There is no standard way to draw text in <canvas>. Alternative approaches are possible (e.g. overlaying HTML elements containing text, or drawing each character as a bitmap image) but are harder to use and less flexible. SVG does support embedded text.<br />
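<br />
To illustrate the extra work described under "Interaction": a canvas author has to keep their own record of what was drawn and test the pointer position against it. The sketch below shows only the logic, written in Python for illustration (in a page this would be script); the <code>Rect</code> type and <code>hit_test</code> function are hypothetical names, and only axis-aligned rectangles are handled.<br />
<br />
```python
from dataclasses import dataclass


@dataclass
class Rect:
    """An axis-aligned rectangle that the application has drawn."""
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px, py):
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def hit_test(shapes, px, py):
    """Return the topmost shape under (px, py), or None.

    `shapes` is in drawing order, so later entries were painted on top
    and must be checked first.
    """
    for shape in reversed(shapes):
        if shape.contains(px, py):
            return shape
    return None
```
<br />
With SVG, the equivalent lookup is done by the implementation, which dispatches the event to the element under the pointer.<br />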
<br />
===Advantages of <canvas>===<br />
<br />
; Script-based scene graph<br />
: If the scene is already stored as objects in script (e.g. to run a simple physics simulation on it, or to load it dynamically over a network) and does not exactly match the rendering hierarchy, it is easier to traverse that structure and render it using <canvas>, than to maintain a parallel SVG scene graph.<br />
; Programmatic generation of images<br />
: <canvas> is designed for creating images dynamically in scripts. SVG focuses on pre-computed image documents, and is more complex and slower to generate dynamically.<br />
; Drawing pixels<br />
: <canvas> works with pixels, and exposes them through functions like getImageData/putImageData. SVG works only with vectors, and pixel drawing has to be emulated inefficiently with tiny rectangles.<br />
; Constant performance<br />
: The memory usage of <canvas> is constant, whereas SVG uses more memory as you add more shapes. The time taken to draw a single shape onto a <canvas> is independent of what you have drawn before, whereas SVG becomes slower as the number of visible objects increases.<br />
<br />
==Combining SVG and <canvas>==<br />
<br />
There are cases where a combination is useful: e.g. a <canvas>-based game might load sprites from SVG images generated by a vector art program, to benefit from scalability and reduced download size compared to a PNG image; or a paint program might write its user interface in SVG, with an embedded <canvas> for the user to draw onto.<br />
<br />
* Some implementations (Opera 9.5?) support SVG in <img>s, which can be used in the <canvas> drawImage method.<br />
* Some implementations (future version of Opera?) support an extension of drawImage, where the argument is an SVGSVGElement instead of an HTMLImageElement.<br />
* Some implementations (Opera 9.5? Firefox 3?) support <foreignObject> in SVG, which can contain XHTML that uses <canvas>.</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Main_Page&diff=2477Main Page2007-08-23T01:26:50Z<p>Philip Taylor: </p>
<hr />
<div>Welcome to the WHATWG Wiki!<br />
<br />
You can be a part of our community, making proposals for the next version of HTML5. This wiki is made available for you for drafting proposals, for writing essays, for keeping track of HTML-related issues, and so forth. Anyone can create an account and contribute content.<br />
<br />
Before you begin, you may wish to read our [[WHATWG Wiki:Contribution Guidelines|contribution guidelines]].<br />
<br />
==Purpose==<br />
The purpose of the WHATWG Wiki is to create a place for WHATWG contributors to post and compile their own proposals and ideas regarding WHATWG specifications. The specifications themselves will not be available for editing via this wiki. However, ideas you post here may find their way into current and future WHATWG specifications.<br />
<br />
== Main sections and Quick links ==<br />
* [[Implementations]]<br />
* [[What you can do]]<br />
* [[Differences from HTML4|HTML5 differences from HTML4]]<br />
* [[HTML vs. XHTML]]<br />
* [[HTML5 Presentations]]<br />
* [[Feature Proposals]]<br />
<br />
==WHATWG Specifications==<br />
* [[HTML 5]]<br />
* [[Web Forms 2.0]]<br />
* [[Web Controls 1.0]]<br />
<br />
==Communicating with the community==<br />
The WHATWG community has several channels of communication:<br />
* [http://www.whatwg.org/mailing-list Mailing lists]<br />
* [http://blog.whatwg.org/ The blog]<br />
* [http://wiki.whatwg.org/ This wiki]<br />
* [[IRC]]<br />
* [http://forums.whatwg.org/ The forum]</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Parser_tests&diff=2451Parser tests2007-08-17T21:52:15Z<p>Philip Taylor: /* Tokenizer Tests */</p>
<hr />
<div>=Parser Tests=<br />
<br />
This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.<br />
<br />
==Tokenizer Tests==<br />
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.<br />
<br />
===Basic Structure===<br />
<br />
{"tests": [<br />
<br />
{"description":"Test description",<br />
"input":"input_string",<br />
"output":[expected_output_tokens],<br />
"contentModelFlags":[content_model_flags],<br />
"lastStartTag":last_start_tag,<br />
"ignoreErrorOrder":ignore_error_order}<br />
]}<br />
<br />
<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.<br />
<br />
<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.<br />
<br />
<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:<br />
<br />
["DOCTYPE", name, public_id, system_id, correctness]<br />
["StartTag", name, {attributes}]<br />
["EndTag", name]<br />
["Comment", data]<br />
["Character", data]<br />
"ParseError"<br />
<br />
<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> (correct) or <tt>false</tt> (incorrect).<br />
<br />
<tt>content_model_flags</tt> is a list of strings from the set:<br />
PCDATA<br />
RCDATA<br />
CDATA<br />
PLAINTEXT<br />
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.<br />
<br />
<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".<br />
<br />
<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.<br />
<br />
<br />
Multiple tests per file are allowed simply by adding more objects to the "tests" list.<br />
<br />
All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.<br />
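<br />
The rules above can be sketched as a small harness helper. This is an illustration in Python, not the code of any existing implementation: the function names are hypothetical, the tokenizer itself is assumed to exist elsewhere and to return tokens in the list format described above, and the handling of <tt>ignoreErrorOrder</tt> (compare the number of <tt>ParseError</tt> tokens and the remaining tokens separately) is one possible reading of the description.<br />
<br />
```python
import json


def iter_cases(text):
    """Yield one case per (test, content model flag), filling in defaults."""
    for test in json.loads(text)["tests"]:
        for flag in test.get("contentModelFlags", ["PCDATA"]):
            yield {
                "description": test["description"],
                "input": test["input"],
                "output": test["output"],
                "contentModelFlag": flag,
                # None stands for "no start tag token has ever been emitted".
                "lastStartTag": test.get("lastStartTag"),
                "ignoreErrorOrder": test.get("ignoreErrorOrder", False),
            }


def is_character(token):
    return isinstance(token, list) and token[0] == "Character"


def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token."""
    merged = []
    for token in tokens:
        if is_character(token) and merged and is_character(merged[-1]):
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            merged.append(token)
    return merged


def tokens_match(expected, actual, ignore_error_order=False):
    """Compare expected and actual token lists under the rules above."""
    if not ignore_error_order:
        return coalesce_characters(expected) == coalesce_characters(actual)

    def split(tokens):
        # Assumption: with the order ignored, only the error count matters.
        errors = sum(1 for t in tokens if t == "ParseError")
        others = coalesce_characters([t for t in tokens if t != "ParseError"])
        return errors, others

    return split(expected) == split(actual)
```
<br />
A harness would run the tokenizer once for each case yielded by <tt>iter_cases</tt> and pass its output to <tt>tokens_match</tt> together with the case's <tt>ignoreErrorOrder</tt> value.<br />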
<br />
=== Open Issues ===<br />
* Is the format too verbose?<br />
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?<br />
<br />
==Tree Construction Tests==<br />
<br />
Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:<br />
<br />
<pre>[TEST]LF<br />
LF<br />
[TEST]LF<br />
LF<br />
[TEST]LF</pre><br />
<br />
Where [TEST] is the following format:<br />
<br />
Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed. Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, the only thing that matters is that there be the right number of parse errors. Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node. Element nodes must be represented by a "<" then the tag name then ">", and all the attributes must be given, sorted lexicographically by UTF-16 code unit, on subsequent nodes, as if they were children of the element node. Attribute nodes must have the attribute name, then an "=" sign, then the attribute value in double quotes ("). Text nodes must be the string, in double quotes. Newlines aren't escaped. Comments must be "<" then "!-- " then the data then " -->". DOCTYPEs must be "<!DOCTYPE " then the name then ">".<br />
<br />
For example:<br />
<pre><br />
#data<br />
<p>One<p>Two<br />
#errors<br />
3: Missing document type declaration<br />
#document<br />
| <html><br />
|   <head><br />
|   <body><br />
|     <p><br />
|       "One"<br />
|     <p><br />
|       "Two"<br />
</pre><br />
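<br />
Reading such a file can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the loader used by any particular implementation: <code>parse_tree_tests</code> is a hypothetical name, and the sketch assumes that no line of test data consists of exactly "#data".<br />
<br />
```python
def _finish(raw):
    """Turn the collected lines of one test into its final form."""
    document = list(raw["document"])
    # Blank lines after the tree dump are separators, not part of the dump.
    while document and document[-1] == "":
        document.pop()
    return {
        # Joining with "\n" drops the final newline of the test data.
        "data": "\n".join(raw["data"]),
        # Only the number of parse errors matters, not their text.
        "error_count": len([line for line in raw["errors"] if line]),
        "document": "\n".join(document),
    }


def parse_tree_tests(text):
    """Split a tree construction test file into a list of tests."""
    tests = []
    current = None
    section = None
    for line in text.split("\n"):
        if line == "#data":
            if current is not None:
                tests.append(_finish(current))
            current = {"data": [], "errors": [], "document": []}
            section = "data"
        elif current is None:
            continue  # ignore anything before the first "#data"
        elif line == "#errors" and section == "data":
            section = "errors"
        elif line == "#document" and section == "errors":
            section = "document"
        else:
            current[section].append(line)
    if current is not None:
        tests.append(_finish(current))
    return tests
```
<br />
A harness would then parse each test's <code>data</code>, serialize the resulting DOM in the format described above, and compare it with <code>document</code> and the number of reported parse errors with <code>error_count</code>.<br />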
<br />
Tests can be found here: http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Parser_tests&diff=2450Parser tests2007-08-17T21:49:53Z<p>Philip Taylor: /* Tokenizer Tests */</p>
<hr />
<div>=Parser Tests=<br />
<br />
This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.<br />
<br />
==Tokenizer Tests==<br />
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.<br />
<br />
===Basic Structure===<br />
<br />
{"tests": [<br />
<br />
{"description":"Test description",<br />
"input":"input_string",<br />
"output":[expected_output_tokens],<br />
"contentModelFlags":[content_model_flags],<br />
"lastStartTag":last_start_tag,<br />
"ignoreErrorOrder":ignore_error_order}<br />
]}<br />
<br />
<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.<br />
<br />
<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.<br />
<br />
<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:<br />
<br />
["DOCTYPE", name, public_id, system_id, correctness]<br />
["StartTag", name, {attributes}]<br />
["EndTag", name]<br />
["Comment", data]<br />
["Character", data]<br />
"ParseError"<br />
<br />
<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> (correct) or <tt>false</tt> (incorrect).<br />
<br />
<tt>content_model_flags</tt> is a list of strings from the set:<br />
PCDATA<br />
RCDATA<br />
CDATA<br />
PLAINTEXT<br />
The test applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.<br />
<br />
<tt>last_start_tag</tt> is a lowercased string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".<br />
<br />
<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.<br />
<br />
<br />
Multiple tests per file are allowed simply by adding more objects to the "tests" list.<br />
<br />
All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.<br />
<br />
=== Open Issues ===<br />
* Is the format too verbose?<br />
* Do we want to allow the starting content model flag of the tokenizer to be specified (e.g. through a "contentModel" field in the test objects)?<br />
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?<br />
<br />
==Tree Construction Tests==<br />
<br />
Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:<br />
<br />
<pre>[TEST]LF<br />
LF<br />
[TEST]LF<br />
LF<br />
[TEST]LF</pre><br />
<br />
Where [TEST] is the following format:<br />
<br />
Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed. Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, the only thing that matters is that there be the right number of parse errors. Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node. Element nodes must be represented by a "<" then the tag name then ">", and all the attributes must be given, sorted lexicographically by UTF-16 code unit, on subsequent nodes, as if they were children of the element node. Attribute nodes must have the attribute name, then an "=" sign, then the attribute value in double quotes ("). Text nodes must be the string, in double quotes. Newlines aren't escaped. Comments must be "<" then "!-- " then the data then " -->". DOCTYPEs must be "<!DOCTYPE " then the name then ">".<br />
<br />
For example:<br />
<pre><br />
#data<br />
<p>One<p>Two<br />
#errors<br />
3: Missing document type declaration<br />
#document<br />
| <html><br />
|   <head><br />
|   <body><br />
|     <p><br />
|       "One"<br />
|     <p><br />
|       "Two"<br />
</pre><br />
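<br />
The section markers described above can be read back with a few lines of code. The following is a minimal Python sketch, not part of any existing harness: the function name <tt>parse_test</tt> and the returned dictionary keys are hypothetical, and the input is assumed to be a single test with the blank separator lines between tests already removed.<br />

```python
def parse_test(text):
    """Split one tree-construction test into its three sections.

    Assumes `text` holds exactly one test, without the blank lines
    that separate tests in a file.
    """
    lines = text.split("\n")
    if lines[0] != "#data":
        raise ValueError('a test must begin with "#data"')
    # list.index raises ValueError if a marker line is missing.
    errors_at = lines.index("#errors", 1)
    document_at = lines.index("#document", errors_at + 1)
    return {
        # Joining with "\n" leaves no newline after the last data line.
        "data": "\n".join(lines[1:errors_at]),
        # Only the number of error lines is significant.
        "errors": lines[errors_at + 1:document_at],
        "document": "\n".join(lines[document_at + 1:]),
    }
```

A harness would then compare <tt>len(test["errors"])</tt> with the number of parse errors reported, and <tt>test["document"]</tt> with its own dump of the parsed tree.<br />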
<br />
Tests can be found here: http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/</div>Philip Taylorhttps://wiki.whatwg.org/index.php?title=Parser_tests&diff=2449Parser tests2007-08-17T21:36:46Z<p>Philip Taylor: /* Tokenizer Tests */ modernising description</p>
<hr />
<div>=Parser Tests=<br />
<br />
This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.<br />
<br />
==Tokenizer Tests==<br />
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.<br />
<br />
===Basic Structure===<br />
<br />
{"tests": [<br />
<br />
{"description":"Test description",<br />
"contentModelFlags":[content_model_flags],<br />
"lastStartTag":last_start_tag,<br />
"input":"input_string",<br />
"output":[expected_output_tokens],<br />
"ignoreErrorOrder":ignore_order}<br />
]}<br />
<br />
<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.<br />
<br />
<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:<br />
<br />
["DOCTYPE", name, public_id, system_id, correctness]<br />
["StartTag", name, {attributes}]<br />
["EndTag", name]<br />
["Comment", data]<br />
["Character", data]<br />
"ParseError"<br />
<br />
<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> (correct) or <tt>false</tt> (incorrect).<br />
<br />
Multiple tests per file are allowed simply by adding more objects to the "tests" list.<br />
<br />
All adjacent character tokens are coalesced into a single ["Character", data] token.<br />
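<br />
A harness whose tokenizer emits one token per character needs to merge those tokens before comparing against the expected output. The following is a rough Python sketch of that step; the function name <tt>coalesce_characters</tt> is hypothetical, and treating a "ParseError" entry as a boundary between character runs is an assumption, since the format description does not say either way.<br />

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into single tokens."""
    merged = []
    for token in tokens:
        is_char = isinstance(token, list) and token[0] == "Character"
        prev = merged[-1] if merged else None
        prev_is_char = isinstance(prev, list) and prev[0] == "Character"
        if is_char and prev_is_char:
            # Extend the previous character token instead of adding a new one.
            merged[-1] = ["Character", prev[1] + token[1]]
        else:
            # Copy list tokens so the caller's input is not modified later.
            merged.append(list(token) if isinstance(token, list) else token)
    return merged
```

The result can then be compared directly with the "output" list of a test object loaded with a JSON parser.<br />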
<br />
=== Open Issues ===<br />
* Is the format too verbose?<br />
* Do we want to allow the starting content model flag of the tokenizer to be specified (e.g. through a "contentModel" field in the test objects)?<br />
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?<br />
<br />
==Tree Construction Tests==<br />
<br />
Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:<br />
<br />
<pre>[TEST]LF<br />
LF<br />
[TEST]LF<br />
LF<br />
[TEST]LF</pre><br />
<br />
Where [TEST] has the following format:<br />
<br />
Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.<br />
<br />
Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines does not matter; the only thing that matters is that there be the right number of parse errors.<br />
<br />
Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.<br />
<br />
* Element nodes must be represented by a "<" then the tag name then ">", and all the attributes must be given, sorted lexicographically by UTF-16 code unit, on subsequent lines, as if they were children of the element node.<br />
* Attribute nodes must have the attribute name, then an "=" sign, then the attribute value in double quotes (").<br />
* Text nodes must be the string, in double quotes. Newlines aren't escaped.<br />
* Comments must be "<" then "!-- " then the data then " -->".<br />
* DOCTYPEs must be "<!DOCTYPE " then the name then ">".<br />
<br />
For example:<br />
<pre><br />
#data<br />
<p>One<p>Two<br />
#errors<br />
3: Missing document type declaration<br />
#document<br />
| <html><br />
|   <head><br />
|   <body><br />
|     <p><br />
|       "One"<br />
|     <p><br />
|       "Two"<br />
</pre><br />
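<br />
To produce a "#document" dump for comparison, an implementation has to serialize its own tree using the rules above. The following Python sketch shows one way this might look. The node representation (dictionaries with "type", "name", "data", "attrs" and "children" keys) and the function name <tt>dump_tree</tt> are purely illustrative assumptions; a real harness would walk its own DOM objects instead.<br />

```python
def dump_tree(node, depth=0):
    """Return the dump lines for `node` and its descendants.

    `node` is a hypothetical dict-based node; `depth` is the number of
    ancestors the node has below the root document node.
    """
    indent = "| " + "  " * depth
    kind = node["type"]
    if kind == "text":
        return [indent + '"' + node["data"] + '"']
    if kind == "comment":
        return [indent + "<!-- " + node["data"] + " -->"]
    if kind == "doctype":
        return [indent + "<!DOCTYPE " + node["name"] + ">"]
    lines = [indent + "<" + node["name"] + ">"]
    attrs = node.get("attrs", {})
    # Comparing big-endian UTF-16 bytes orders names by UTF-16 code unit.
    for name in sorted(attrs, key=lambda s: s.encode("utf-16-be")):
        lines.append("| " + "  " * (depth + 1) + name + '="' + attrs[name] + '"')
    for child in node.get("children", []):
        lines.extend(dump_tree(child, depth + 1))
    return lines
```

Joining the returned lines with newlines gives a string that can be compared with the text following "#document" in a test.<br />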
<br />
Tests can be found here: http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/</div>Philip Taylor