[https://github.com/html5lib HTML5Lib] is a project to create both Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is '''not''' an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.
From December 2006 to March 2013, development took place on [http://code.google.com/p/html5lib/ code.google.com]. Since April 2013, it has been at [https://github.com/html5lib GitHub].
== SVN ==
Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you're not going to redo work that has already been done, ask on the [http://groups.google.com/group/html5lib-discuss mailing list]. For questions that could benefit from quick turnaround, talk to people on #whatwg.
== General ==
In comments, "XXX" indicates something that has yet to be done: something might be wrong, might not yet be written, or other things in that general direction.

In comments, "AT" indicates that the comment documents an alternative implementation technique or strategy.
== HTMLTokenizer ==
The tokenizer is controlled by a single HTMLTokenizer class, stored in tokenizer.py at the moment. You initialize HTMLTokenizer with a stream argument that holds an HTMLInputStream, and you can then iterate over the resulting object to get tokens back.

Currently tokens are objects; they will become dicts.
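As a rough illustration of that interface, here is a self-contained toy with the same shape: a class that is initialized with a stream argument and yields dict-style tokens when iterated. The class name, token fields, and tokenizing rules below are simplified stand-ins, not html5lib's actual code.

```python
import io

class ToyTokenizer:
    """Toy stand-in for HTMLTokenizer: iterable, stream-fed, dict tokens."""

    def __init__(self, stream):
        # In html5lib the stream would be an HTMLInputStream; any object
        # with a read() method works for this sketch.
        self.stream = stream

    def __iter__(self):
        text = self.stream.read()
        i = 0
        while i < len(text):
            if text[i] == "<" and ">" in text[i:]:
                end = text.index(">", i)
                yield {"type": "StartTag", "name": text[i + 1:end]}
                i = end + 1
            else:
                end = text.find("<", i + 1)
                if end == -1:
                    end = len(text)
                yield {"type": "Characters", "data": text[i:end]}
                i = end

tokens = list(ToyTokenizer(io.StringIO("<p>hello")))
# tokens: [{'type': 'StartTag', 'name': 'p'}, {'type': 'Characters', 'data': 'hello'}]
```

The real tokenizer recognizes many more token types (end tags, comments, doctypes, parse errors); only the iteration shape is the point here.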
=== Interface ===
The parser needs to change the self.contentModelFlag attribute, which affects how certain states are handled.
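To illustrate the kind of effect such a flag has: in PCDATA mode a "<" in the data state opens a tag, while in RCDATA or CDATA content (e.g. inside <textarea> or <script>) it is mostly treated as character data. The function below is a made-up sketch of that idea, not html5lib's implementation; only the flag names mirror the spec-era terminology.

```python
# Hedged sketch: how a content-model flag can change what the data
# state does with "<". Not the library's code.
def data_state_action(char, content_model_flag):
    if char == "<" and content_model_flag == "PCDATA":
        return "switch to tagOpenState"
    # In RCDATA/CDATA, "<" is (mostly) just character data; the real
    # tokenizer still watches for the matching close tag.
    return "emit character token"
```

For example, `data_state_action("<", "RCDATA")` keeps emitting character tokens where PCDATA mode would switch states.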
=== Issues ===
* Use of if statements in the states may be suboptimal (but we should time this)
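One way to time that would be to compare an if/elif chain against a dict lookup with timeit. This harness is a generic sketch, not tied to the actual state code, and the handler names in it are invented.

```python
import timeit

# An if/elif chain, as the states currently use.
def dispatch_if(c):
    if c == "<":
        return "tagOpen"
    elif c == "&":
        return "entity"
    else:
        return "character"

# The dict-lookup alternative.
_HANDLERS = {"<": "tagOpen", "&": "entity"}
def dispatch_dict(c):
    return _HANDLERS.get(c, "character")

t_if = timeit.timeit(lambda: dispatch_if("x"), number=50_000)
t_dict = timeit.timeit(lambda: dispatch_dict("x"), number=50_000)
# Relative numbers vary by Python version and input mix; measure
# on representative documents before rewriting the states.
```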
== HTMLParser ==
=== Profiling on web-apps.htm ===
I did some profiling on web-apps.htm, which is a rather large document. Based on that, I already changed a number of things that speed us up a bit. Below are some things to consider for future revisions:
* utils.MethodDispatcher is invoked way too often. By pre-declaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm, so that amount could be higher. I'm not sure how to put them somewhere else though; the first thing I tried was HTMLParser, but references get all messed up then...
: We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes, so the change should be painless.
:: That's an interesting idea. How would that work? [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
::: I got an idea on how it might work, and it worked! There are still about 3863 invocations of utils.MethodDispatcher, but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. [[User:Annevk|Annevk]] 11:37, 26 December 2006 (UTC)
* 713194 calls to __contains__ in sets.py makes us slow. Takes about 1.0x CPU seconds.
: I've just switched to the built-in sets type; hopefully this will help a bit. [[User:Jgraham|Jgraham]] 00:30, 25 December 2006 (UTC)
:: It did. (Not surprisingly, when 700,000 method calls are gone...) [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
* 440382 calls to char in tokenizer.py is the runner-up, with 0.8x CPU seconds.
: This is now the largest time consumer. [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)
* dataState in tokenizer.py, with 0.7 CPU seconds, is next.
: This is now at 0.429 CPU seconds, probably because the tokenizer switched to dicts instead of custom Token objects. [[User:Annevk|Annevk]]
* __iter__ in tokenizer.py with 0.59x CPU seconds...
* Creation of all node objects in web-apps takes .57x CPU seconds.
* etc.
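For reference, the idea behind utils.MethodDispatcher can be sketched as a dict that expands tuple keys, so several tag names share one handler, with a default for everything else. This is a simplified reconstruction, not the library's exact code, and the handler names are invented; building such tables once (like keeping a single instance of each InsertionMode) avoids re-creating them on every mode switch.

```python
class MethodDispatcher(dict):
    """Simplified sketch: a dict with tuple keys expanded and a default."""

    def __init__(self, items):
        expanded = []
        for names, handler in items:
            if isinstance(names, tuple):
                # Register the handler once under each name in the tuple.
                expanded.extend((name, handler) for name in names)
            else:
                expanded.append((names, handler))
        super().__init__(expanded)
        self.default = None

    def __getitem__(self, key):
        return dict.get(self, key, self.default)

# Built once (e.g. at class-definition time), not per InBody instance.
handlers = MethodDispatcher([
    (("b", "i", "em"), "startTagFormatting"),
    ("p", "startTagCloseP"),
])
handlers.default = "startTagOther"
```

After construction, lookups are plain dict gets, which is why the profiling above shows the dispatcher itself costing essentially nothing once it stops being rebuilt.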
== Testcases ==
Testcases are in the [https://github.com/html5lib/html5lib-tests html5lib-tests repository]. They require [http://cheeseshop.python.org/pypi/simplejson simplejson]. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the [http://groups.google.com/group/html5lib-discuss mailing list] and having a concrete plan. Ideally, new features should be accompanied by new unit tests for those features. Documentation of the test format is available at [[Parser_tests]].
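As a sketch of consuming those JSON test files: on modern Pythons the stdlib json module provides simplejson's API, and the "tests"/"input"/"output" keys below follow the tokenizer test format. Treat the exact schema as an assumption here; [[Parser_tests]] documents it authoritatively.

```python
import json  # simplejson's API; bundled as the stdlib json module since Python 2.6

# Inline stand-in for one tokenizer test file; real files live in the
# html5lib-tests repository and may carry a richer schema.
raw = """
{"tests": [
  {"description": "simple start tag",
   "input": "<p>",
   "output": [["StartTag", "p", {}]]}
]}
"""

for test in json.loads(raw)["tests"]:
    expected = test["output"]
    # A real harness would tokenize test["input"] here and compare
    # the resulting token list against expected.
```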
[[Category:Implementations]]
Latest revision as of 22:01, 28 August 2013