WHATWG Wiki - User contributions [en]

Parser tests

2013-08-31T19:27:51Z

Jmdyck: /* Tokenizer Tests */ Add section for xmlViolation tests

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"initialStates":[initial_states],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>test.input</tt> is a string containing the characters to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

If <tt>test.doubleEscaped</tt> is present and <tt>true</tt>, then <tt>test.input</tt> is not quite as described above.
Instead, it must first be subjected to another round of unescaping (i.e., in addition to any unescaping involved in the JSON import), and the result of ''that'' represents the characters of the input stream.
Currently, the only unescaping required by this option is to convert each sequence of the form \uHHHH (where H is a hex digit) into the corresponding Unicode code point.
(Note that this option also affects the interpretation of <tt>test.output</tt>.)
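
As an illustration only, the extra unescaping round could be sketched in Python as follows. The helper name <tt>unescape_double_escaped</tt> is hypothetical and not part of the suite; the sketch assumes that \uHHHH sequences are the only escapes to handle, as stated above.

```python
import json
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_double_escaped(text: str) -> str:
    """Replace each literal \\uHHHH sequence with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# JSON decoding removes the first layer of escaping; the helper removes the second.
raw = json.loads(r'"\\u0041\\uD800"')
decoded = unescape_double_escaped(raw)
```

Doing the second round by hand, rather than through a JSON decoder, lets a lone surrogate such as U+D800 reach the tokenizer as-is.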

<tt>test.initialStates</tt> is a list of strings, each being the name of a tokenizer state.
The test should be run once for each string, using it to set the tokenizer's initial state for that run.
If <tt>test.initialStates</tt> is omitted, it defaults to <tt>["data state"]</tt>.

<tt>test.lastStartTag</tt> is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this tokenizer", referenced in the spec's definition of '''appropriate end tag token'''. If it is omitted, it is treated as if "no start tag has been emitted from this tokenizer".
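
A minimal sketch of how a harness might apply these defaults, assuming the test has already been loaded from JSON into a Python dict. The function name <tt>expand_runs</tt> and the sample test are hypothetical.

```python
from typing import Iterator, Optional, Tuple


def expand_runs(test: dict) -> Iterator[Tuple[str, str, Optional[str]]]:
    """Yield (input, initial_state, last_start_tag) for each run of a test."""
    # Omitted initialStates defaults to a single run in the data state.
    states = test.get("initialStates", ["data state"])
    # None stands for "no start tag has been emitted from this tokenizer".
    last_start_tag = test.get("lastStartTag")
    for state in states:
        yield test["input"], state, last_start_tag


# Hypothetical test object, made up for this illustration.
example = {
    "description": "hypothetical example",
    "input": "</xmp>",
    "output": [["EndTag", "xmp"]],
    "initialStates": ["RCDATA state", "RAWTEXT state"],
    "lastStartTag": "xmp",
}
runs = list(expand_runs(example))
```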

===Test results===

<tt>test.output</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.
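
A tokenizer that emits one character token per character needs to merge them before comparison. The following sketch shows one way to do that, assuming the tokenizer's output has already been converted to the list representation above; <tt>coalesce_characters</tt> is a hypothetical helper name.

```python
def _is_character(token) -> bool:
    """True for ["Character", data] tokens; "ParseError" is a plain string."""
    return isinstance(token, list) and len(token) == 2 and token[0] == "Character"


def coalesce_characters(tokens: list) -> list:
    """Merge runs of adjacent Character tokens into a single token."""
    merged = []
    for token in tokens:
        if _is_character(token) and merged and _is_character(merged[-1]):
            # Extend the previous Character token instead of appending a new one.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            merged.append(token)
    return merged


result = coalesce_characters(
    [["Character", "a"], ["Character", "b"], "ParseError", ["Character", "c"]]
)
```

Note that in this sketch a <tt>ParseError</tt> between two character tokens keeps them separate.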

If <tt>test.doubleEscaped</tt> is present and <tt>true</tt>, then every string within <tt>test.output</tt> must be further unescaped (as described above) before comparing with the tokenizer's output.

<tt>test.ignoreErrorOrder</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.
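
One possible reading of this rule is sketched below: when the flag is set, the non-error tokens are compared in order and the <tt>ParseError</tt> tokens are compared only by count. Comparing the count is an assumption of this sketch, and <tt>tokens_match</tt> is a hypothetical name.

```python
def _split(tokens: list):
    """Separate ParseError markers from all other tokens, keeping order."""
    errors = [t for t in tokens if t == "ParseError"]
    others = [t for t in tokens if t != "ParseError"]
    return others, errors


def tokens_match(expected: list, actual: list, ignore_error_order: bool = False) -> bool:
    """Compare token lists, optionally ignoring where ParseErrors occur."""
    if not ignore_error_order:
        return expected == actual
    return _split(expected) == _split(actual)


expected = ["ParseError", ["Character", "a"]]
actual = [["Character", "a"], "ParseError"]
strict = tokens_match(expected, actual)
relaxed = tokens_match(expected, actual, ignore_error_order=True)
```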

== xmlViolation tests ==

<tt>tokenizer/xmlViolation.test</tt> differs from the above in a couple of ways:
* The name of the single member of the top-level JSON object is "xmlViolationTests" instead of "tests".
* Each test's expected output assumes that the implementation is applying the tweaks given in the spec's "Coercing an HTML DOM into an infoset" section.
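
A loader that copes with both top-level member names might look like the sketch below. The function name <tt>load_tokenizer_tests</tt> is hypothetical, and returning an empty list when neither member is present is a choice made for this sketch.

```python
import json


def load_tokenizer_tests(text: str) -> list:
    """Return the list of test objects from a tokenizer test file's JSON text."""
    data = json.loads(text)
    # Ordinary files use "tests"; xmlViolation.test uses "xmlViolationTests".
    for key in ("tests", "xmlViolationTests"):
        if key in data:
            return data[key]
    return []


ordinary = load_tokenizer_tests('{"tests": [{"description": "d", "input": "", "output": []}]}')
violation = load_tokenizer_tests('{"xmlViolationTests": []}')
```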

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".
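
The element, attribute and text rules above can be sketched as a small serializer. The <tt>Element</tt> class and <tt>dump</tt> function are hypothetical stand-ins for whatever tree an implementation builds; comments, DOCTYPEs and processing instructions are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Element:
    local_name: str
    namespace: str = "html"  # "html", "svg" or "math"
    # Keys are (namespace, local name); namespace is "" for no namespace.
    attributes: Dict[Tuple[str, str], str] = field(default_factory=dict)
    children: List[object] = field(default_factory=list)  # Element or str (text)


def _designator(namespace: str) -> str:
    """Namespace designator: empty for HTML / no namespace, else 'name '."""
    return "" if namespace in ("", "html") else namespace + " "


def dump(node, depth: int = 0) -> List[str]:
    """Return the dump lines for a node and its descendants."""
    indent = "| " + "  " * depth
    if isinstance(node, str):
        return [f'{indent}"{node}"']
    lines = [f"{indent}<{_designator(node.namespace)}{node.local_name}>"]
    names = {_designator(ns) + name: value for (ns, name), value in node.attributes.items()}
    # Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
    for name in sorted(names, key=lambda s: s.encode("utf-16-be", "surrogatepass")):
        lines.append(f'{indent}  {name}="{names[name]}"')
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


svg = Element("svg", "svg", {("xlink", "href"): "#a", ("", "xlink:href"): "#b"})
tree = Element("html", children=[Element("head"), Element("body", children=[svg, "One"])])
lines = dump(tree)
```

In this sketch the attribute in the XLink namespace ("xlink href") sorts before the no-namespace attribute "xlink:href", because a space precedes a colon in code unit order.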

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>
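
Reading this format could be sketched as below. The function <tt>parse_tree_tests</tt> and the sample text are hypothetical (the sample's error lines are placeholders, not real parser output), and the sketch assumes that no line of test data is itself "#data", "#errors", "#document-fragment" or "#document".

```python
from typing import List

_SECTIONS = ("#errors", "#document-fragment", "#document")


def parse_tree_tests(text: str) -> List[dict]:
    """Split a tree construction test file into one dict per test."""
    raw_tests = []
    current = None
    section = None
    for line in text.split("\n"):
        if line == "#data":
            current = {"data": [], "errors": [], "document-fragment": [], "document": []}
            raw_tests.append(current)
            section = "data"
        elif current is not None and line in _SECTIONS:
            section = line[1:]
        elif current is not None:
            current[section].append(line)
    results = []
    for raw in raw_tests:
        document = raw["document"]
        # Drop the blank separator line(s) that follow each test.
        while document and document[-1] == "":
            document.pop()
        results.append({
            "data": "\n".join(raw["data"]),  # the final newline is not part of the data
            "error_count": len(raw["errors"]),
            "context": raw["document-fragment"][0] if raw["document-fragment"] else None,
            "document": "\n".join(document),
        })
    return results


# Hypothetical file content with two tests, made up to exercise the parser.
SAMPLE = (
    "#data\nx\ny\n#errors\nplaceholder error\n#document\n| <html>\n|   <head>\n|   <body>\n"
    '|     "x\ny"\n'
    "\n"
    "#data\nz\n#errors\n#document-fragment\ndiv\n#document\n| \"z\"\n"
)
tests = parse_tree_tests(SAMPLE)
```

Only the number of lines under "#errors" is kept, since their content does not matter.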

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* should relax the order constraint?

Parser tests

2013-08-31T19:13:00Z

Jmdyck: /* Basic Structure */ describe 'doubleEscaped'

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"initialStates":[initial_states],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>test.input</tt> is a string containing the characters to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

If <tt>test.doubleEscaped</tt> is present and <tt>true</tt>, then <tt>test.input</tt> is not quite as described above.
Instead, it must first be subjected to another round of unescaping (i.e., in addition to any unescaping involved in the JSON import), and the result of ''that'' represents the characters of the input stream.
Currently, the only unescaping required by this option is to convert each sequence of the form \uHHHH (where H is a hex digit) into the corresponding Unicode code point.
(Note that this option also affects the interpretation of <tt>test.output</tt>.)

<tt>test.initialStates</tt> is a list of strings, each being the name of a tokenizer state.
The test should be run once for each string, using it to set the tokenizer's initial state for that run.
If <tt>test.initialStates</tt> is omitted, it defaults to <tt>["data state"]</tt>.

<tt>test.lastStartTag</tt> is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this tokenizer", referenced in the spec's definition of '''appropriate end tag token'''. If it is omitted, it is treated as if "no start tag has been emitted from this tokenizer".

===Test results===

<tt>test.output</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

If <tt>test.doubleEscaped</tt> is present and <tt>true</tt>, then every string within <tt>test.output</tt> must be further unescaped (as described above) before comparing with the tokenizer's output.

<tt>test.ignoreErrorOrder</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* should relax the order constraint?

Parser tests

2013-08-31T18:53:24Z

Jmdyck: /* Tokenizer Tests */ I think it's confusing to refer to a component of a test using an identifier that isn't the component's actual name, so eliminate that.

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"initialStates":[initial_states],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>test.input</tt> is a string containing the characters to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>test.initialStates</tt> is a list of strings, each being the name of a tokenizer state.
The test should be run once for each string, using it to set the tokenizer's initial state for that run.
If <tt>test.initialStates</tt> is omitted, it defaults to <tt>["data state"]</tt>.

<tt>test.lastStartTag</tt> is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this tokenizer", referenced in the spec's definition of '''appropriate end tag token'''. If it is omitted, it is treated as if "no start tag has been emitted from this tokenizer".

===Test results===

<tt>test.output</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

<tt>test.ignoreErrorOrder</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* should relax the order constraint?

Parser tests

2013-08-31T18:40:47Z

Jmdyck: /* Test set-up */ make a stronger association between lastStartTag and the spec

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"initialStates":[initial_states],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>initial_states</tt> is a list of strings, each being the name of a tokenizer state.
The test should be run once for each string, using it to set the tokenizer's initial state for that run.
If <tt>initial_states</tt> is omitted, it defaults to <tt>["data state"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag to have been emitted from this tokenizer", referenced in the spec's definition of '''appropriate end tag token'''. If it is omitted, it is treated as if "no start tag has been emitted from this tokenizer".

===Test results===

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. It doesn't matter what those lines are, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-31T18:35:07Z

Jmdyck: /* Basic Structure */ replace 'contentModelFlags' with 'initialStates'

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"initialStates":[initial_states],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>initial_states</tt> is a list of strings, each being the name of a tokenizer state.
The test should be run once for each string, using it to set the tokenizer's initial state for that run.
If <tt>initial_states</tt> is omitted, it defaults to <tt>["data state"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".
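
As a rough illustration of the set-up rules above, a harness might expand each test object into one run per initial state. The function name <tt>expand_runs</tt> and the keys of the returned dictionaries are hypothetical. How a given tokenizer accepts an initial state or a last start tag is implementation-specific and is not shown.

```python
def expand_runs(test):
    """Return one run description per initial state named by the test."""
    # "initialStates" is optional and defaults to the data state.
    states = test.get("initialStates", ["data state"])
    # None stands for "no start tag token has ever been emitted".
    last_start_tag = test.get("lastStartTag")
    return [
        {
            "input": test["input"],
            "initial_state": state,
            "last_start_tag": last_start_tag,
        }
        for state in states
    ]
```

Each returned run would then be tokenized separately and compared against the same expected output.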

===Test results===

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.
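
A minimal sketch of how an implementation might convert its own tokens into these array forms. The function names and parameters are hypothetical. It shows the inverted sense of <tt>correctness</tt> relative to the force-quirks flag and the optional fourth <tt>StartTag</tt> entry.

```python
def doctype_token(name, public_id, system_id, force_quirks):
    # correctness is true exactly when the force-quirks flag is false.
    return ["DOCTYPE", name, public_id, system_id, not force_quirks]


def start_tag_token(name, attributes, self_closing=False):
    token = ["StartTag", name, dict(attributes)]
    # The fourth entry appears only when the self-closing flag is set.
    if self_closing:
        token.append(True)
    return token
```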

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.
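
A tokenizer that emits one character token per character would need to merge them before comparison. The following sketch uses a hypothetical helper name, <tt>coalesce_characters</tt>, and assumes tokens are already in the array form described above.

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into single tokens."""
    result = []
    for token in tokens:
        is_char = isinstance(token, list) and token[0] == "Character"
        prev_is_char = (
            bool(result)
            and isinstance(result[-1], list)
            and result[-1][0] == "Character"
        )
        if is_char and prev_is_char:
            # Build a new list rather than mutating the caller's token.
            result[-1] = ["Character", result[-1][1] + token[1]]
        else:
            result.append(token)
    return result
```

In this sketch a <tt>"ParseError"</tt> entry between two character tokens keeps them separate, since they are then no longer adjacent in the list.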

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.
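
One possible reading of this rule, sketched below with a hypothetical function name: when the flag is set, the non-error tokens must still match in order and the number of <tt>ParseError</tt> entries must still agree, but where the errors fall in the list is not compared.

```python
def tokens_match(expected, actual, ignore_error_order=False):
    """Compare token lists, optionally ignoring where ParseErrors fall."""
    if not ignore_error_order:
        return expected == actual

    def split(tokens):
        # Separate the ordinary tokens from the count of parse errors.
        others = [t for t in tokens if t != "ParseError"]
        return others, len(tokens) - len(others)

    return split(expected) == split(actual)
```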

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>", and all the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>
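
The section layout of a single test can be read with a short parser. The sketch below is an assumption-laden illustration, not the harness used by any implementation: <tt>parse_test</tt> and the keys of its result are hypothetical, and the input is assumed to be one [TEST] block with the separating newlines already stripped.

```python
def parse_test(text):
    """Split one tree-construction test into its sections."""
    lines = text.split("\n")
    if lines[0] != "#data":
        raise ValueError("a test must begin with #data")
    # Everything up to the first "#errors" line is test data; joining the
    # lines again drops the final newline, as the format requires.
    errors_at = lines.index("#errors")
    data = "\n".join(lines[1:errors_at])
    rest = lines[errors_at + 1:]
    document_at = rest.index("#document")
    header = rest[:document_at]
    context = None
    if "#document-fragment" in header:
        fragment_at = header.index("#document-fragment")
        context = header[fragment_at + 1]
        header = header[:fragment_at]
    return {
        "data": data,
        # Only the number of error lines matters, not their content.
        "error_count": len(header),
        "context": context,
        "document": rest[document_at + 1:],
    }
```

A test with a non-<tt>None</tt> <tt>context</tt> would be run through the fragment parsing algorithm. Otherwise the data is parsed as a full document.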

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-31T18:26:40Z

Jmdyck: /* Basic Structure */ fix typo: misplaced right-brace

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.
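
For harnesses that work in terms of tokenizer states rather than content model flags, the flags could be translated as sketched below. The mapping is an assumption made for this illustration; this revision of the page does not define it, and the state names on the right are guesses at the corresponding state for each flag.

```python
# Assumed correspondence between content model flags and tokenizer states.
FLAG_TO_STATE = {
    "PCDATA": "data state",
    "RCDATA": "RCDATA state",
    "RAWTEXT": "RAWTEXT state",
    "PLAINTEXT": "PLAINTEXT state",
}


def initial_states_from_flags(test):
    """Return the assumed initial states for a test using contentModelFlags."""
    # "contentModelFlags" is optional and defaults to ["PCDATA"].
    flags = test.get("contentModelFlags", ["PCDATA"])
    return [FLAG_TO_STATE[flag] for flag in flags]
```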

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

===Test results===

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>", and all the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-31T18:25:35Z

Jmdyck: /* Basic Structure */ Indent the JSON code for readability.

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens]},
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

===Test results===

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>", and all the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-30T22:44:12Z

Jmdyck: /* Basic Structure */ Rearrange paragraphs more logically. Add headers "Test set-up" and "Test results".

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens]},
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
]}

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

===Test set-up===

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

===Test results===

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there *may* be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>", and all the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-28T23:02:32Z

Jmdyck: /* Basic Structure */ Clarify "input_string".

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens]},
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
]}

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.
Specifically, it represents the characters of the '''input stream''', and so implementations are expected to perform the processing described in the spec's '''Preprocessing the input stream''' section before feeding the result to the tokenizer.

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.
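
A minimal sketch of that coalescing step, assuming tokens are held as Python lists in the shapes listed above (the function name <tt>coalesce_characters</tt> is illustrative and not part of the suite):

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token."""
    merged = []
    for token in tokens:
        is_char = isinstance(token, list) and token[0] == "Character"
        prev_is_char = (
            bool(merged)
            and isinstance(merged[-1], list)
            and merged[-1][0] == "Character"
        )
        if is_char and prev_is_char:
            # Extend the previous Character token instead of adding a new one.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            # Copy list tokens so the caller's input is not mutated later.
            merged.append(list(token) if isinstance(token, list) else token)
    return merged
```

A <tt>"ParseError"</tt> entry between two character tokens keeps them separate, because the tokens are then no longer adjacent.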

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there ''may'' be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present, the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by a "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>
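
The file layout described in this section can be read with a small line-based splitter. The following is only a sketch: the function name <tt>parse_tests</tt> and the dictionary keys are assumptions made for illustration, and it does not handle test data that itself contains a line such as "#errors" or "#data".

```python
def parse_tests(text):
    """Split a tree-construction test file into a list of dicts."""
    tests = []
    current = None
    section = None
    for line in text.split("\n"):
        if line == "#data":
            # Every test starts with "#data"; begin a fresh record.
            current = {"data": [], "errors": [],
                       "document-fragment": None, "document": []}
            tests.append(current)
            section = "data"
        elif current is None:
            continue  # ignore anything before the first "#data"
        elif line in ("#errors", "#document-fragment", "#document"):
            section = line[1:]
        elif section == "document-fragment":
            current["document-fragment"] = line  # the context element
        else:
            current[section].append(line)
    for test in tests:
        # Drop the blank separator line(s) that follow the tree dump.
        while test["document"] and test["document"][-1] == "":
            test["document"].pop()
        # Joining with "\n" leaves no trailing newline on the test data.
        test["data"] = "\n".join(test["data"])
        test["document"] = "\n".join(test["document"])
    return tests
```

Only the number of entries in <tt>errors</tt> is meaningful when comparing against a parser's result; the text of each error line is not.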

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-28T22:26:46Z

Jmdyck: Describe + link to the tests repository before saying how this page relates to it.

[https://github.com/html5lib/html5lib-tests html5lib-tests] is a suite of unit tests for use by implementations of the HTML5 parsing spec.
The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.
This page documents the various test formats that are used within the suite.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.
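
One possible reading of this rule is to compare the <tt>ParseError</tt> entries and the remaining tokens as two separate sequences, so that only the position of the errors is ignored. The sketch below assumes that reading; the function name <tt>tokens_match</tt> is illustrative.

```python
def tokens_match(expected, actual, ignore_error_order=False):
    """Compare token lists, optionally ignoring where ParseErrors occur."""
    if not ignore_error_order:
        return expected == actual

    def split(tokens):
        # Separate the "ParseError" markers from all other tokens,
        # keeping the relative order within each group.
        errors = [t for t in tokens if t == "ParseError"]
        others = [t for t in tokens if t != "ParseError"]
        return errors, others

    return split(expected) == split(actual)
```

Under this reading the number of parse errors must still agree, and the non-error tokens must still appear in the same order.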

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there ''may'' be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present, the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by a "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>
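
The dump rules for elements, attributes and text nodes can be sketched as follows. The <tt>Element</tt> class here is a hypothetical stand-in for whatever tree an implementation builds; its <tt>name</tt> and attribute keys are assumed to already be the ''tag name string'' and ''attribute name string'' described above. Comments, DOCTYPEs and processing instructions are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str                                        # e.g. "p" or "svg circle"
    attributes: dict = field(default_factory=dict)   # attribute name string -> value
    children: list = field(default_factory=list)     # Element or str (text node)


def dump(node, depth=0):
    """Return the "| "-prefixed dump lines for a node and its subtree."""
    indent = "| " + "  " * depth
    if isinstance(node, str):
        return [f'{indent}"{node}"']
    lines = [f"{indent}<{node.name}>"]
    attr_indent = "| " + "  " * (depth + 1)
    # Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
    for name in sorted(node.attributes, key=lambda n: n.encode("utf-16-be")):
        lines.append(f'{attr_indent}{name}="{node.attributes[name]}"')
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines
```

Joining the returned lines with newlines gives text that can be compared with the "#document" section of a test.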

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

Parser tests

2013-08-28T22:22:17Z

Jmdyck: Remove an unnecessary level of grouping.

This page documents the unit-test format(s) being used for implementations of the HTML5 parsing spec. The aim is to produce implementation-independent, self-describing tests that can be shared between any groups working on these technologies.

=Tokenizer Tests=
The test format is [http://www.json.org/ JSON]. This has the advantage that the syntax allows backward-compatible extensions to the tests and the disadvantage that it is relatively verbose.

==Basic Structure==

{"tests": [

{"description":"Test description",
"input":"input_string",
"output":[expected_output_tokens],
"contentModelFlags":[content_model_flags],
"lastStartTag":last_start_tag,
"ignoreErrorOrder":ignore_error_order
}
]}

<tt>description</tt>, <tt>input</tt> and <tt>output</tt> are always present. The other values are optional.

<tt>input_string</tt> is a string literal containing the input string to pass to the tokenizer.

<tt>expected_output_tokens</tt> is a list of tokens in the order the tokenizer produces them, with the first token produced leftmost in the list. The list must match the '''complete''' list of tokens that the tokenizer should produce. Valid tokens are:

["DOCTYPE", name, public_id, system_id, correctness]
["StartTag", name, {attributes}'', true'']
["StartTag", name, {attributes}]
["EndTag", name]
["Comment", data]
["Character", data]
"ParseError"

<tt>public_id</tt> and <tt>system_id</tt> are either strings or <tt>null</tt>. <tt>correctness</tt> is either <tt>true</tt> or <tt>false</tt>; <tt>true</tt> corresponds to the force-quirks flag being false, and vice-versa.

When the self-closing flag is set, the <tt>StartTag</tt> array has <tt>true</tt> as its fourth entry. When the flag is not set, the array has only three entries for backwards compatibility.

<tt>content_model_flags</tt> is a list of strings from the set:
PCDATA
RCDATA
RAWTEXT
PLAINTEXT
The test case applies when the tokenizer begins with its content model flag set to any of those values. If <tt>content_model_flags</tt> is omitted, it defaults to <tt>["PCDATA"]</tt>.

<tt>last_start_tag</tt> is a lowercase string that should be used as "the tag name of the last start tag token emitted" in the tokenizer algorithm. If it is omitted, it is treated as if "no start tag token has ever been emitted by this instance of the tokeniser".

<tt>ignore_error_order</tt> is a boolean value indicating that the order of <tt>ParseError</tt> tokens relative to other tokens in the output stream is unimportant, and implementations should ignore such differences between their output and <tt>expected_output_tokens</tt>. (This is used for errors emitted by the input stream preprocessing stage, since it is useful to test that code but it is undefined when the errors occur). If it is omitted, it defaults to <tt>false</tt>.
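
Taken together, the optional fields and their defaults mean that one test object can stand for several runs, one per content model flag. A rough sketch of loading a test file with those defaults applied (the function name <tt>expand_tests</tt> and the shape of the returned dictionaries are assumptions, not part of the format):

```python
import json


def expand_tests(json_text):
    """Return one case per (test, content model flag) pair, with defaults filled in."""
    cases = []
    for test in json.loads(json_text)["tests"]:
        # An omitted contentModelFlags list defaults to ["PCDATA"].
        for flag in test.get("contentModelFlags", ["PCDATA"]):
            cases.append({
                "description": test["description"],
                "input": test["input"],
                "output": test["output"],
                "contentModelFlag": flag,
                # None stands for "no start tag token has ever been emitted".
                "lastStartTag": test.get("lastStartTag"),
                "ignoreErrorOrder": test.get("ignoreErrorOrder", False),
            })
    return cases
```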

Multiple tests per file are allowed simply by adding more objects to the "tests" list.

All adjacent character tokens are coalesced into a single <tt>["Character", data]</tt> token.

== Open Issues ==
* Is the format too verbose?
* Do we want to allow the test to pass if only a subset of the actual tokens emitted matches the expected_output_tokens list?

=Tree Construction Tests=

Each file containing tree construction tests consists of any number of tests separated by two newlines (LF) and a single newline before the end of the file. For instance:

<pre>[TEST]LF
LF
[TEST]LF
LF
[TEST]LF</pre>

Where [TEST] is the following format:

Each test must begin with a string "#data" followed by a newline (LF). All subsequent lines until a line that says "#errors" are the test data and must be passed to the system being tested unchanged, except with the final newline (on the last line) removed.

Then there must be a line that says "#errors". It must be followed by one line per parse error that a conformant checker would return. The content of those lines doesn't matter, although they can't be "#document-fragment", "#document", or empty; the only thing that matters is that there be the right number of parse errors.

Then there ''may'' be a line that says "#document-fragment", which must be followed by a newline (LF), followed by a string of characters that indicates the context element, followed by a newline (LF). If this line is present, the "#data" must be parsed using the HTML fragment parsing algorithm with the context element as context.

Then there must be a line that says "#document", which must be followed by a dump of the tree of the parsed DOM. Each node must be represented by a single line. Each line must start with "| ", followed by two spaces per parent node that the node has before the root document node.
* Element nodes must be represented by a "<tt><</tt>", then the ''tag name string'', then "<tt>></tt>". All the attributes must be given, sorted lexicographically by UTF-16 code unit according to their ''attribute name string'', on subsequent lines, as if they were children of the element node.
* Attribute nodes must have the ''attribute name string'', then an "=" sign, then the attribute value in double quotes (").
* Text nodes must be the string, in double quotes. Newlines aren't escaped.
* Comments must be "<tt><</tt>" then "<tt>!-- </tt>" then the data then "<tt> --></tt>".
* DOCTYPEs must be "<tt><!DOCTYPE </tt>", then the name, then, if either the system id or the public id is non-empty, a space, the public id in double quotes, another space, and the system id in double quotes, and then in any case "<tt>></tt>".
* Processing instructions must be "<tt><?</tt>", then the target, then a space, then the data and then "<tt>></tt>". (The HTML parser cannot emit processing instructions, but scripts can, and the WebVTT to DOM rules can emit them.)

The ''tag name string'' is the local name prefixed by a namespace designator. For the HTML namespace, the namespace designator is the empty string, i.e. there's no prefix. For the SVG namespace, the namespace designator is "svg ". For the MathML namespace, the namespace designator is "math ".

The ''attribute name string'' is the local name prefixed by a namespace designator. For no namespace, the namespace designator is the empty string, i.e. there's no prefix. For the XLink namespace, the namespace designator is "xlink ". For the XML namespace, the namespace designator is "xml ". For the XMLNS namespace, the namespace designator is "xmlns ". Note the difference between "xlink:href" which is an attribute in no namespace with the local name "xlink:href" and "xlink href" which is an attribute in the xlink namespace with the local name "href".

If there is also a "#document-fragment" the bit following "#document" must be a representation of the HTML fragment serialization for the context element given by "#document-fragment".

For example:
<pre>
#data
<p>One<p>Two
#errors
3: Missing document type declaration
#document
| <html>
|   <head>
|   <body>
|     <p>
|       "One"
|     <p>
|       "Two"
</pre>

Tests can be found here: http://code.google.com/p/html5lib/source/browse/#hg%2Ftestdata%2Ftree-construction

== Open Issues ==
* Should the order constraint be relaxed?

HTML5Lib

2013-08-28T22:01:37Z

Jmdyck: /* Testcases */ point to github repository

[https://github.com/html5lib HTML5Lib] is a project to create both Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is '''not''' an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

From December 2006 to March 2013, development took place on [http://code.google.com/p/html5lib/ code.google.com].
Since April 2013, it has been at [https://github.com/html5lib github].

== SVN ==
Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you're not going to redo someone else's work, ask on the [http://groups.google.com/group/html5lib-discuss mailing list]. For questions that could benefit from quick turnaround, talk to people on #whatwg.

== General ==

In comments, "XXX" indicates something that has yet to be done: something might be wrong, might not have been written yet, or similar.

In comments "AT" indicates that the comment documents an alternate implementation technique or strategy.

== HTMLTokenizer ==

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.

=== Interface ===

The parser needs to change the self.contentModelFlag attribute which affects how certain states are handled.

=== Issues ===
* Use of if statements in the states may be suboptimal (but we should time this)

== HTMLParser ==

=== Profiling on web-apps.htm ===

I did some profiling on web-apps.htm which is a rather large document. Based on that I already changed a number of things which speed us up a bit. Below are some things to consider for future revisions:

* utils.MethodDispatcher is invoked way too often. By pre-declaring some of it in InBody I managed to decrease the number of invocations by over 24.000, but InBody.__init__ is invoked about 7000 times for web-apps.htm so that number could be higher. Not sure how to put them somewhere else though. First thing I tried was HTMLParser but references get all messed up then...
: We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes so the change should be painless.
:: That's an interesting idea. How would that work? [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
::: I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. [[User:Annevk|Annevk]] 11:37, 26 December 2006 (UTC)

* 713194 calls to __contains__ in sets.py makes us slow. Takes about 1.0x CPU seconds.
: I've just switched to the built-in sets type. Hopefully this will help a bit. [[User:Jgraham|Jgraham]] 00:30, 25 December 2006 (UTC)
:: It did. (Not surprisingly when 700.000 method calls are gone...) [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* 440382 calls to char in tokenizer.py is the runner up with 0.8x CPU seconds.
: This is now the largest time consumer. [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* dataState in tokenizer.py with 0.7 CPU seconds is next.
: This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. [[User:Annevk|Annevk]]

* __iter__ in tokenizer.py with 0.59x CPU seconds...

* Creation of all node objects in web-apps takes .57x CPU seconds.

* etc.

== Testcases ==
Testcases are in the [https://github.com/html5lib/html5lib-tests html5lib-tests repository]. They require [http://cheeseshop.python.org/pypi/simplejson simplejson]. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the [http://groups.google.com/group/html5lib-discuss mailing list] and a concrete plan. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at [[Parser_tests]].

[[Category:Implementations]]

HTML5Lib

2013-08-28T21:54:51Z

Jmdyck: Update re project move to github.

[https://github.com/html5lib HTML5Lib] is a project to create both Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is '''not''' an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

From December 2006 to March 2013, development took place on [http://code.google.com/p/html5lib/ code.google.com].
Since April 2013, it has been at [https://github.com/html5lib github].

== SVN ==
Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you're not going to redo someone else's work, ask on the [http://groups.google.com/group/html5lib-discuss mailing list]. For questions that could benefit from quick turnaround, talk to people on #whatwg.

== General ==

In comments, "XXX" indicates something that has yet to be done: something might be wrong, might not have been written yet, or similar.

In comments "AT" indicates that the comment documents an alternate implementation technique or strategy.

== HTMLTokenizer ==

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.

=== Interface ===

The parser needs to change the self.contentModelFlag attribute which affects how certain states are handled.

=== Issues ===
* Use of if statements in the states may be suboptimal (but we should time this)

== HTMLParser ==

=== Profiling on web-apps.htm ===

I did some profiling on web-apps.htm which is a rather large document. Based on that I already changed a number of things which speed us up a bit. Below are some things to consider for future revisions:

* utils.MethodDispatcher is invoked way too often. By pre-declaring some of it in InBody I managed to decrease the number of invocations by over 24.000, but InBody.__init__ is invoked about 7000 times for web-apps.htm so that number could be higher. Not sure how to put them somewhere else though. First thing I tried was HTMLParser but references get all messed up then...
: We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes so the change should be painless.
:: That's an interesting idea. How would that work? [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
::: I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. [[User:Annevk|Annevk]] 11:37, 26 December 2006 (UTC)

* 713194 calls to __contains__ in sets.py makes us slow. Takes about 1.0x CPU seconds.
: I've just switched to the built-in sets type. Hopefully this will help a bit. [[User:Jgraham|Jgraham]] 00:30, 25 December 2006 (UTC)
:: It did. (Not surprisingly when 700.000 method calls are gone...) [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* 440382 calls to char in tokenizer.py is the runner up with 0.8x CPU seconds.
: This is now the largest time consumer. [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)

* dataState in tokenizer.py with 0.7 CPU seconds is next.
: This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. [[User:Annevk|Annevk]]

* __iter__ in tokenizer.py with 0.59x CPU seconds...

* Creation of all node objects in web-apps takes .57x CPU seconds.

* etc.

== Testcases ==
Testcases are under the /tests directory. They require [http://cheeseshop.python.org/pypi/simplejson simplejson]. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the [http://groups.google.com/group/html5lib-discuss mailing list] and a concrete plan. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at [[Parser_tests]].

[[Category:Implementations]]