Parser tests: Difference between revisions

m (Try to make the parser tests page a bit more readable, and make spaces more obvious using <tt>.)
(The documentation now lives in the repo, so this can die.)
 
(16 intermediate revisions by 5 users not shown)
=Parser Tests=
[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies. The parser tests live in the <code>tokenizer</code> and <code>tree-construction</code> directories, both of which contain README files describing the test format.
 
This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
 
==Tokenizer Tests==
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.
 
===Basic Structure===
 
{"tests": [
{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order}
]}
 
<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.
 
<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
 
<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced first (leftmost) in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:
 
["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"
 
<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.
 
When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.
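
A minimal sketch of how a test harness might read the self-closing flag from a <tt>StartTag</tt> array under the rule above. The function name <tt>is_self_closing</tt> is hypothetical and not part of the test suite.

```python
def is_self_closing(token):
    """Return True if a StartTag token array carries the self-closing flag.

    Hypothetical helper: a three-entry array means the flag is not set,
    a fourth entry of true means it is set.
    """
    if not isinstance(token, list) or not token or token[0] != "StartTag":
        raise ValueError("not a StartTag token")
    return len(token) == 4 and token[3] is True
```

Treating the missing fourth entry as "not set" keeps older three-entry tests valid without rewriting them.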
 
<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
CDATA
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.
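
As an illustration of the default described above, a harness could expand each test into one tokenizer run per content model flag. This is only a sketch: <tt>expand_runs</tt> and the shape of its return value are assumptions, not part of the test format.

```python
import json

DEFAULT_CONTENT_MODEL_FLAGS = ["PCDATA"]


def expand_runs(test):
    """Return one (flag, input, last_start_tag) tuple per content model flag.

    Hypothetical helper: the tuple layout is an assumption made for this
    sketch. A missing "contentModelFlags" falls back to ["PCDATA"], and a
    missing "lastStartTag" is reported as None.
    """
    flags = test.get("contentModelFlags", DEFAULT_CONTENT_MODEL_FLAGS)
    return [(flag, test["input"], test.get("lastStartTag")) for flag in flags]


def load_tests(json_text):
    """Parse the text of a tokenizer test file and return its list of tests."""
    return json.loads(json_text)["tests"]
```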
 
<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".
 
<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.
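
One possible reading of <tt>ignoreErrorOrder</tt>, sketched below, is to compare the non-error tokens in order and the <tt>ParseError</tt> entries only by count. The function <tt>tokens_match</tt> is hypothetical, and the choice to still require equal error counts is an assumption of this sketch.

```python
def tokens_match(expected, actual, ignore_error_order=False):
    """Compare two token lists, optionally ignoring where ParseErrors occur.

    Hypothetical helper. With ignore_error_order set, the position of
    "ParseError" entries is ignored, but their number must still agree
    (an assumption of this sketch).
    """
    if not ignore_error_order:
        return expected == actual

    def split(tokens):
        others = [t for t in tokens if t != "ParseError"]
        return others, len(tokens) - len(others)

    return split(expected) == split(actual)
```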
 
 
Multiple tests per file are allowed simply by adding more objects to the "tests" list.
 
All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.
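
A tokenizer that emits one character token per character would need to merge them before comparing against the expected output. The following is a sketch of that step; <tt>coalesce_characters</tt> is a hypothetical name, and it treats a <tt>ParseError</tt> between two character tokens as breaking adjacency, which is an assumption.

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token.

    Hypothetical helper: any other token, including "ParseError", ends
    the current run (an assumption of this sketch).
    """
    result = []
    for token in tokens:
        is_char = isinstance(token, list) and token[0] == "Character"
        prev_is_char = (
            bool(result)
            and isinstance(result[-1], list)
            and result[-1][0] == "Character"
        )
        if is_char and prev_is_char:
            result[-1] = ["Character", result[-1][1] + token[1]]
        else:
            # Copy list tokens so the caller's input is never mutated.
            result.append(list(token) if isinstance(token, list) else token)
    return result
```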
 
=== Open Issues ===
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?
 
==Tree Construction Tests==
 
Each file containing tree construction tests consists of any number of tests separated by two newlines (LF), with a single newline before the end of the file. For instance:
 
<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>
 
where [TEST] has the following format:
 
Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.
 
Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.
 
Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.
 
Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by a "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
 
The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".
 
The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ".
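
The dump rules above can be sketched as a small serializer. Everything about the node model here is an assumption made for illustration: the <tt>Element</tt> class, the short namespace keys (<tt>"html"</tt>, <tt>"svg"</tt>, <tt>"mathml"</tt>, <tt>"xlink"</tt>, <tt>"xml"</tt>, <tt>"xmlns"</tt>), and the use of plain strings for text nodes. Comments and DOCTYPEs are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical minimal element node used only for this sketch."""
    name: str
    namespace: str = "html"
    # Maps (attribute namespace or None, local name) to the value.
    attributes: dict = field(default_factory=dict)
    # Children are Element instances or plain strings (text nodes).
    children: list = field(default_factory=list)


TAG_DESIGNATOR = {"html": "", "svg": "svg ", "mathml": "math "}
ATTR_DESIGNATOR = {None: "", "xlink": "xlink ", "xml": "xml ", "xmlns": "xmlns "}


def dump(node, depth=0):
    """Return the dump lines for a node and its descendants."""
    indent = "| " + "  " * depth
    if isinstance(node, str):
        return [f'{indent}"{node}"']
    lines = [f"{indent}<{TAG_DESIGNATOR[node.namespace]}{node.name}>"]
    attrs = [
        (ATTR_DESIGNATOR[ns] + local, value)
        for (ns, local), value in node.attributes.items()
    ]
    # Comparing big-endian UTF-16 bytes orders strings by UTF-16 code unit.
    for name, value in sorted(attrs, key=lambda a: a[0].encode("utf-16-be")):
        lines.append(f'{indent}  {name}="{value}"')
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines
```

Attributes are written one level deeper than their element, which is how the "as if they were children" rule is interpreted here.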
 
If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".
 
For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>
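
A single test in this layout, such as the example above, could be split into its sections along the following lines. This is a sketch only: <tt>parse_test</tt> and the dictionary keys it returns are hypothetical, and it assumes it is handed the text of exactly one test.

```python
def parse_test(text):
    """Split one tree construction test into its sections.

    Hypothetical helper: returns a dict with the input data, the number
    of expected parse errors, the fragment context (or None) and the
    expected tree dump as a list of lines.
    """
    if text.endswith("\n"):
        text = text[:-1]
    lines = text.split("\n")
    if lines[0] != "#data":
        raise ValueError('test must begin with "#data"')

    # Data runs up to the first "#errors" line; joining the lines drops
    # the final newline, as the format requires.
    errors_at = lines.index("#errors")
    data = "\n".join(lines[1:errors_at])

    rest = lines[errors_at + 1:]
    document_at = rest.index("#document")
    context = None
    error_lines = rest[:document_at]
    if "#document-fragment" in error_lines:
        fragment_at = rest.index("#document-fragment")
        context = rest[fragment_at + 1]
        error_lines = rest[:fragment_at]

    return {
        "data": data,
        "errors": len(error_lines),  # only the count matters
        "context": context,
        "document": rest[document_at + 1:],
    }
```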
 
Tests can be found here: http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/
 
=== Open Issues ===
* Should we relax the order constraint?

Latest revision as of 16:56, 24 October 2013
