HTML5Lib: Difference between revisions

From WHATWG Wiki

Revision as of 21:36, 24 December 2006

HTML5Lib

HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular a tokenizer and parser. It is not an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

SVN

Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you are not going to redo someone else's work, talk to people on #whatwg first.

Design

General

In comments, "XXX" indicates something that has yet to be done: something might be wrong, might not have been written yet, or anything else in that general direction.

In comments, "AT" indicates that the comment documents an alternate implementation technique or strategy.

HTMLTokenizer

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.
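
The iteration interface described above could look roughly like the following. This is only a minimal sketch: ToyTokenizer is a hypothetical stand-in rather than the real HTMLTokenizer, a plain string is assumed in place of an HTMLInputStream, and the dict keys ("type", "name", "data") are guesses at what dict-based tokens might contain.

```python
class ToyTokenizer:
    """Hypothetical stand-in for HTMLTokenizer; not the real implementation."""

    def __init__(self, stream):
        # The real class takes an HTMLInputStream; a plain string is
        # assumed here to keep the sketch self-contained.
        self.stream = stream

    def __iter__(self):
        text, pos = self.stream, 0
        while pos < len(text):
            if text[pos] == "<":
                end = text.find(">", pos)
                if end == -1:
                    # Unterminated tag: emit the remainder as characters.
                    yield {"type": "Characters", "data": text[pos:]}
                    return
                name = text[pos + 1:end]
                if name.startswith("/"):
                    yield {"type": "EndTag", "name": name[1:]}
                else:
                    yield {"type": "StartTag", "name": name, "data": {}}
                pos = end + 1
            else:
                end = text.find("<", pos)
                if end == -1:
                    end = len(text)
                yield {"type": "Characters", "data": text[pos:end]}
                pos = end


# The caller simply iterates over the tokenizer to get tokens back.
tokens = list(ToyTokenizer("<p>Hello</p>"))
for token in tokens:
    print(token)
```

Making the tokenizer iterable means the consumer pulls tokens with an ordinary for loop instead of implementing callback methods.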

Interface

The parser needs to change the self.contentModelFlag attribute on the tokenizer, which affects how certain states are handled.
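
To illustrate how a parser might drive that flag, here is a rough sketch. ToyTokenizer and toy_parse are hypothetical, the flag values are assumed to be plain strings such as "PCDATA" and "RCDATA", and the rule used (outside PCDATA only "</" may open a tag) is a simplification of what the spec requires.

```python
class ToyTokenizer:
    """Hypothetical tokenizer whose behaviour depends on contentModelFlag."""

    def __init__(self, stream):
        self.stream = stream
        # Assumed representation: the flag is a plain string.
        self.contentModelFlag = "PCDATA"

    def __iter__(self):
        text, pos = self.stream, 0
        while pos < len(text):
            end = text.find(">", pos)
            # Outside PCDATA, only "</" may open a tag; a bare "<" is text.
            opens_tag = text[pos] == "<" and end != -1 and (
                self.contentModelFlag == "PCDATA" or text.startswith("</", pos)
            )
            if opens_tag:
                name = text[pos + 1:end]
                kind = "EndTag" if name.startswith("/") else "StartTag"
                yield {"type": kind, "name": name.lstrip("/")}
                pos = end + 1
            else:
                # Emit one character at a time to keep the sketch short.
                yield {"type": "Characters", "data": text[pos]}
                pos += 1


def toy_parse(markup):
    """Hypothetical parser loop that flips the flag between tokens."""
    tokenizer = ToyTokenizer(markup)
    seen = []
    for token in tokenizer:
        seen.append(token)
        if token["type"] == "StartTag" and token["name"] == "textarea":
            tokenizer.contentModelFlag = "RCDATA"
        elif token["type"] == "EndTag":
            tokenizer.contentModelFlag = "PCDATA"
    return seen


tokens = toy_parse("<textarea><b></textarea>")
print([token["type"] for token in tokens])
```

Because the tokenizer is a generator, the flag set by the parser after one token is already in effect when the next token is produced.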

Issues

  • Use of if statements in the states may be suboptimal (but we should time this)
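
One way to time this is with the standard timeit module. The sketch below compares an if/elif chain against a dict lookup on a made-up character classifier; the function names and the state labels ("entity", "tagOpen", and so on) are hypothetical and not taken from tokenizer.py, so the numbers only hint at what the real states would show.

```python
import timeit


def classify_if(char):
    # Chain of comparisons, similar in shape to a tokenizer state method.
    if char == "&":
        return "entity"
    elif char == "<":
        return "tagOpen"
    elif char == "":
        return "eof"
    else:
        return "data"


# Alternative: one hash lookup with a default for ordinary characters.
HANDLERS = {"&": "entity", "<": "tagOpen", "": "eof"}


def classify_dict(char):
    return HANDLERS.get(char, "data")


SAMPLE = "<p>Fish &amp; chips</p>" * 100


def run(classify):
    return [classify(char) for char in SAMPLE]


if_seconds = timeit.timeit(lambda: run(classify_if), number=50)
dict_seconds = timeit.timeit(lambda: run(classify_dict), number=50)
print("if chain: %.4fs, dict lookup: %.4fs" % (if_seconds, dict_seconds))
```

Which variant wins depends on the interpreter version and on how many branches a state has, so it is worth measuring on the real code before changing anything.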

Testcases

Testcases are under the /tests directory. They require simplejson and can optionally be run under the nose unit-test framework. New code should not be checked in if it regresses previously functional unit tests. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at Parser_tests.

HTMLParser

Profiling on web-apps.htm

I did some profiling on web-apps.htm, which is a rather large document. Based on that, I have already changed a number of things, which speeds us up a bit. Below are some things to consider for future revisions:

  • utils.MethodDispatcher is invoked way too often. By predeclaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm, so that number could be higher. I am not sure how to put them somewhere else, though. The first thing I tried was HTMLParser, but the references get all messed up then...
  • 713194 calls to __contains__ in sets.py make us slow. They take about 1.0x CPU seconds.
  • 440382 calls to char in tokenizer.py are the runner-up with 0.8x CPU seconds.
  • dataState in tokenizer.py is next with 0.7 CPU seconds.
  • __iter__ in tokenizer.py with 0.59x CPU seconds...
  • Creation of all node objects in web-apps.htm takes 0.57x CPU seconds.
  • etc.
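
Per-function numbers like the ones above can be collected with the standard library profiler. The sketch below assumes cProfile and pstats (the original measurements may well have used a different profiler), and parse_document is a hypothetical stand-in for running the parser over web-apps.htm.

```python
import cProfile
import io
import pstats


def parse_document(text):
    """Hypothetical stand-in for parsing a large document."""
    whitespace = frozenset(" \t\n")
    # Many cheap membership tests, similar to the __contains__ hot spot.
    return sum(1 for char in text if char not in whitespace)


profiler = cProfile.Profile()
result = profiler.runcall(parse_document, "<p>some markup</p>\n" * 10000)

# Sort by time spent inside each function itself and show the top entries.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("tottime").print_stats(5)
print(report.getvalue())
```

Sorting by "tottime" surfaces functions that are slow in their own right, while sorting by "cumulative" instead shows which callers are responsible for the most time overall.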