Difference between revisions of "HTML5Lib"

From WHATWG Wiki


Revision as of 22:53, 7 December 2006


HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular a tokenizer and a parser. It is not an official WHATWG project, but we plan to use this wiki to document and discuss the library design. The code is available under the open-source MIT license.


Please commit often, with reasonably detailed descriptions of what you changed. To avoid duplicating work, check with people on #whatwg before you start.



In comments, "XXX" marks something that still needs work: it might be wrong, might not have been written yet, or something along those lines.

In comments "AT" indicates that the comment documents an alternate implementation technique or strategy.


The tokenizer is controlled by a single HTMLTokenizer class, which currently lives in tokenizer.py. You initialize the HTMLTokenizer with a parser argument, which it stores as self.parser; in this project the parser is an HTMLParser. The parser has to implement the following methods:

  • processDoctype(name, error)
  • processStartTag(tagname, attributes{})
  • processEndTag(tagname)
  • processComment(data)
  • processCharacter(data)
  • processEOF()
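
As a rough illustration of the interface above, here is a minimal sketch of an object that implements all six methods. The class name RecordingParser and its events list are hypothetical and not part of the library; the sketch also assumes that attributes is passed as a dict and error as a boolean. The calls at the bottom are made by hand to stand in for a tokenizer.

```python
class RecordingParser:
    """Hypothetical parser stub: records every callback as a tuple."""

    def __init__(self):
        self.events = []

    def processDoctype(self, name, error):
        # error is assumed to be a boolean flag here
        self.events.append(("Doctype", name, error))

    def processStartTag(self, tagname, attributes):
        # attributes is assumed to be a dict of attribute name -> value
        self.events.append(("StartTag", tagname, dict(attributes)))

    def processEndTag(self, tagname):
        self.events.append(("EndTag", tagname))

    def processComment(self, data):
        self.events.append(("Comment", data))

    def processCharacter(self, data):
        self.events.append(("Character", data))

    def processEOF(self):
        self.events.append(("EOF",))


# Hand-written calls standing in for what a tokenizer might emit
# for input such as: <!DOCTYPE html><p class="x">a</p>
parser = RecordingParser()
parser.processDoctype("html", False)
parser.processStartTag("p", {"class": "x"})
parser.processCharacter("a")
parser.processEndTag("p")
parser.processEOF()
```

A stub like this can be handy for testing the tokenizer in isolation, since the recorded events can be compared against an expected list.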


To start tokenizing, call the "tokenize" method with a dataStream argument.

The parser can also change the tokenizer's self.contentModelFlag attribute, which affects how certain states are handled.
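
The wiring described above might look roughly like the following sketch. ToyTokenizer is a deliberately tiny stand-in, not the real HTMLTokenizer: it only recognizes bare `<name>` start tags, and it assumes dataStream is a file-like object with a read() method. FlagSwitchingParser and its tokenizer attribute are also hypothetical. The flag values "PCDATA" and "PLAINTEXT" are taken from the spec's content model flag; how the real class represents them is an assumption.

```python
import io


class FlagSwitchingParser:
    """Hypothetical parser that switches the tokenizer to PLAINTEXT."""

    def __init__(self):
        self.tokenizer = None  # set after the tokenizer is created
        self.events = []

    def processStartTag(self, tagname, attributes):
        self.events.append(("StartTag", tagname))
        if tagname == "plaintext":
            # The parser changes the tokenizer's content model flag.
            self.tokenizer.contentModelFlag = "PLAINTEXT"

    def processCharacter(self, data):
        self.events.append(("Character", data))

    def processEOF(self):
        self.events.append(("EOF",))


class ToyTokenizer:
    """Stand-in for HTMLTokenizer (assumption, not the real class)."""

    def __init__(self, parser):
        self.parser = parser
        self.contentModelFlag = "PCDATA"

    def tokenize(self, dataStream):
        text = dataStream.read()  # assumes a file-like dataStream
        pos = 0
        while pos < len(text):
            char = text[pos]
            # "<" only opens a tag while the flag is PCDATA.
            if char == "<" and self.contentModelFlag == "PCDATA":
                end = text.find(">", pos)
                if end != -1:
                    self.parser.processStartTag(text[pos + 1:end], {})
                    pos = end + 1
                    continue
            self.parser.processCharacter(char)
            pos += 1
        self.parser.processEOF()


parser = FlagSwitchingParser()
tokenizer = ToyTokenizer(parser)
parser.tokenizer = tokenizer
tokenizer.tokenize(io.StringIO("a<plaintext><b>"))
```

After the parser sets the flag to "PLAINTEXT", the toy tokenizer reports the remaining `<b>` as plain characters instead of a start tag, which is the kind of state change the flag is meant to control.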


  • Should keep track of line+col number for error reporting
  • Use of if statements in the states may be suboptimal (but we should time this)
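
One possible way to approach the line+col item is a small helper that the tokenizer updates for every character it consumes. This is only a sketch: the PositionTracker class and its method names are hypothetical, and it assumes lines are 1-based, columns count characters consumed on the current line, and "\n" is the only line terminator.

```python
class PositionTracker:
    """Hypothetical helper that tracks the current line and column."""

    def __init__(self):
        self.line = 1  # 1-based line number
        self.col = 0   # characters consumed so far on the current line

    def advance(self, char):
        # Assumes "\n" is the only line terminator.
        if char == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def position(self):
        return (self.line, self.col)


tracker = PositionTracker()
for char in "ab\nc":
    tracker.advance(char)
```

An error report could then include tracker.position() at the point where the problem was detected.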