Revision as of 23:10, 11 December 2006

Parser Tests

This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.

Tokenizer Tests

The test format is JSON. This has the advantage that the syntax allows backward-compatible extensions to the tests, and the disadvantage that it is relatively verbose.

Basic Structure

{"tests":
[

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens]}

]
}

input_string is a string literal containing the input to pass to the tokenizer. expected_output_tokens is a list of tokens in the order the tokenizer produces them, so the first token emitted is the first (leftmost) entry in the list. The list must match the complete list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, error?]
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"
"AtheistParseError"

Multiple tests per file are allowed simply by adding more objects to the "tests" list.
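
As an illustration of how a harness might consume this format, here is a minimal Python sketch. The names run_tokenizer_tests, stub_tokenize and SAMPLE are hypothetical and not part of the format; the stub stands in for a real tokenizer, and the expected tokens in SAMPLE are made up for the example rather than taken from an actual test file.

```python
import json

# A hypothetical test file following the structure described above.
SAMPLE = """
{"tests": [
  {"description": "Simple paragraph",
   "input": "<p>Hi</p>",
   "output": [["StartTag", "p", {}], ["Character", "Hi"], ["EndTag", "p"]]}
]}
"""


def run_tokenizer_tests(test_json, tokenize):
    """Run every test in a tokenizer test file.

    `tokenize` is an assumed callable taking the input string and
    returning the emitted tokens as JSON-compatible values (lists,
    dicts, and the bare strings "ParseError"/"AtheistParseError").
    Returns a list of (description, passed) pairs.
    """
    suite = json.loads(test_json)
    results = []
    for test in suite["tests"]:
        actual = tokenize(test["input"])
        # The expected list must match the complete token list exactly.
        results.append((test["description"], actual == test["output"]))
    return results


def stub_tokenize(text):
    """Stand-in for a real tokenizer; it only knows the sample input."""
    canned = {
        "<p>Hi</p>": [["StartTag", "p", {}], ["Character", "Hi"], ["EndTag", "p"]],
    }
    return canned.get(text, ["ParseError"])


if __name__ == "__main__":
    for description, passed in run_tokenizer_tests(SAMPLE, stub_tokenize):
        print(("PASS" if passed else "FAIL") + ": " + description)
```

Because the comparison is a plain equality check on the decoded JSON, a harness written this way ignores any extra fields added to the test objects later, which is the backward-compatibility property mentioned above.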

Open Issues

Is the format too verbose?
Do we want to allow the starting content model flag of the tokenizer to be specified (e.g. through a "contentModel" field in the test objects)?
Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?
Do we want the "AtheistParseError"?

Tree Construction Tests

There can be multiple tests per file. Each test must begin with a line that says "#data". All subsequent lines until "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed. Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter; the only thing that matters is that there is the right number of parse errors.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node. Element nodes must be represented by a "<" then the tag name then ">", and all the attributes must be given, in alphabetical order, on subsequent lines, as if they were children of the element node. Attribute nodes must have the attribute name, then an "=" sign, then the attribute value in double quotes ("). Text nodes must be the string, in double quotes. Newlines aren't escaped. Comments must be "<!-- " then the data then " -->". DOCTYPEs must be "<!DOCTYPE " then the name then ">".

For example:

#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
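
The format above can be read and produced with very little code. The following Python sketch is one possible approach, not a reference implementation. The function names and the tuple-based tree representation are assumptions made for the example, and comment nodes are left out of the serializer for brevity.

```python
SAMPLE_FILE = """
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
"""


def parse_tree_tests(text):
    """Split a test file into dicts with data, error_count and document."""
    tests = []
    current = None
    section = None
    for line in text.split("\n"):
        if line == "#data":
            current = {"data": [], "errors": [], "document": []}
            tests.append(current)
            section = "data"
        elif line in ("#errors", "#document") and current is not None:
            section = line[1:]
        elif current is not None:
            current[section].append(line)
    return [
        {
            # Joining the lines drops the final newline of the test data.
            "data": "\n".join(t["data"]),
            # Only the number of error lines matters, not their content.
            "error_count": len([e for e in t["errors"] if e]),
            "document": "\n".join(t["document"]).rstrip("\n"),
        }
        for t in tests
    ]


def dump_tree(node, depth=0):
    """Serialize a node into the "| " dump format, one string per line.

    Assumed node shapes (hypothetical, for this sketch only):
      ("element", name, attrs_dict, children), ("text", data),
      ("doctype", name)
    """
    indent = "| " + "  " * depth
    kind = node[0]
    if kind == "text":
        return [indent + '"' + node[1] + '"']
    if kind == "doctype":
        return [indent + "<!DOCTYPE " + node[1] + ">"]
    _, name, attrs, children = node
    lines = [indent + "<" + name + ">"]
    # Attributes come first, in alphabetical order, indented like children.
    for key in sorted(attrs):
        lines.append("| " + "  " * (depth + 1) + key + '="' + attrs[key] + '"')
    for child in children:
        lines.extend(dump_tree(child, depth + 1))
    return lines


if __name__ == "__main__":
    test = parse_tree_tests(SAMPLE_FILE)[0]
    tree = ("element", "html", {}, [
        ("element", "head", {}, []),
        ("element", "body", {}, [
            ("element", "p", {}, [("text", "One")]),
            ("element", "p", {}, [("text", "Two")]),
        ]),
    ])
    print("\n".join(dump_tree(tree)) == test["document"])
```

A harness built along these lines would feed test["data"] to the parser under test, count the parse errors it reports, and compare both the count and the serialized tree against the expected values.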
