HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular a tokenizer and parser. It is not an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.
Do not contribute material to this page unless you are happy for it to be released under the terms of the MIT license rather than the usual GFDL.
The tokenizer is controlled by a single Tokenizer class. It emits tokens, which are instances of subclasses of the Token class. At some level it needs to be possible to store information about the token being emitted, because the tree builder sometimes needs to reprocess a token; however, there may be some opportunity to economize here.
getToken - return the next token.
The current tokenizer state is stored in the state attribute. Each state corresponds to a single method on the Tokenizer object. The states attribute maps state names to methods.
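As a rough illustration of the state/states arrangement, here is a minimal sketch. It is not the library's code: the class name SketchTokenizer, the two toy states, the tuple-shaped tokens and the run method are all assumptions made for the example, and it stores the bound method itself in state (storing the state name would work equally well).

```python
class SketchTokenizer:
    """Toy tokenizer (hypothetical): splits text into Characters and Tag tokens."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.tokens = []
        # states maps state names to the methods that implement them.
        self.states = {
            "data": self.dataState,
            "tagName": self.tagNameState,
        }
        # state holds the method for the current state.
        self.state = self.states["data"]

    def dataState(self):
        # Collect characters up to the next "<" (or the end of input).
        end = self.text.find("<", self.pos)
        if end == -1:
            end = len(self.text)
        if end > self.pos:
            self.tokens.append(("Characters", self.text[self.pos:end]))
        self.pos = end + 1  # step over the "<"
        self.state = self.states["tagName"]

    def tagNameState(self):
        # Collect the tag name up to the next ">" (or the end of input).
        end = self.text.find(">", self.pos)
        if end == -1:
            end = len(self.text)
        self.tokens.append(("Tag", self.text[self.pos:end]))
        self.pos = end + 1  # step over the ">"
        self.state = self.states["data"]

    def run(self):
        # Each iteration dispatches to whichever method is the current state.
        while self.pos < len(self.text):
            self.state()
        return self.tokens
```

Calling SketchTokenizer("a<b>c").run() walks through data, tagName and data again. Switching state is just an attribute assignment, so adding a state means adding one method and one entry in states.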
Occasionally, parsing will result in characters being read multiple times from the stream. Therefore a queue is used to buffer characters that have been read from the file but need to be reprocessed (do we need to buffer multiple characters? Will support for a trailing slash change this?).
During processing, any unfinished tokens are held in the token attribute and finished tokens in the tokenQueue.
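The buffering described above might look something like the following sketch. Only getToken, queue, token and tokenQueue come from the notes; the class name BufferedTokenizer, the helpers consumeChar, unconsume and emitCurrentToken, the dict-shaped tokens and the toy Word/Other grammar are assumptions for illustration.

```python
import io
from collections import deque


class BufferedTokenizer:
    """Toy tokenizer (hypothetical): runs of letters become Word tokens,
    every other character becomes a one-character Other token."""

    def __init__(self, stream):
        self.stream = stream       # any object with a read(n) method
        self.queue = deque()       # characters waiting to be reprocessed
        self.token = None          # the unfinished token, if any
        self.tokenQueue = deque()  # finished tokens not yet returned

    def consumeChar(self):
        # Prefer pushed-back characters over fresh input.
        if self.queue:
            return self.queue.popleft()
        return self.stream.read(1)  # "" at end of input

    def unconsume(self, char):
        # Put a character back so that it is read again next time.
        self.queue.appendleft(char)

    def emitCurrentToken(self):
        self.tokenQueue.append(self.token)
        self.token = None

    def getToken(self):
        """Return the next token, or None when the input is exhausted."""
        while not self.tokenQueue:
            char = self.consumeChar()
            if char == "":
                if self.token is None:
                    return None
                self.emitCurrentToken()
            elif self.token is None:
                if char.isalpha():
                    self.token = {"type": "Word", "data": char}
                else:
                    self.tokenQueue.append({"type": "Other", "data": char})
            elif char.isalpha():
                self.token["data"] += char
            else:
                # The word ended one character ago: push the character
                # back for reprocessing, then emit the finished word.
                self.unconsume(char)
                self.emitCurrentToken()
        return self.tokenQueue.popleft()
```

For example, with tokenizer = BufferedTokenizer(io.StringIO("ab c")), repeated calls to tokenizer.getToken() yield a Word, an Other and a Word token, then None. In this toy grammar a single character of pushback is enough; whether the real tokenizer needs more is the open question raised above.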
- Should keep track of line+col number for error reporting
- Use of if statements in the states may be suboptimal (but we should time this)
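One possible way to track line and column, sketched under the assumption that the tokenizer sees every character exactly once as it is consumed. The PositionTracker class and its method names are hypothetical; characters pushed back for reprocessing would need extra care that this sketch does not attempt.

```python
class PositionTracker:
    """Hypothetical helper: counts lines and columns as characters are consumed."""

    def __init__(self):
        self.line = 1  # 1-based line number
        self.col = 0   # characters consumed so far on the current line

    def advance(self, char):
        # Call once for every character taken from the stream.
        if char == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def position(self):
        return (self.line, self.col)
```

A parse error raised while handling a character could then attach tracker.position() to its message.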
Idea on how it can work together
There's an HTMLParser class you can invoke with an object. What this object is can be decided later: a file object, string, URI, etc. The newly created HTMLParser object then instantiates an HTMLTokenizer, passing itself and the object as arguments. The HTMLTokenizer then calls methods on the parser such as parser.emitStartTagToken(name, ...).
OK I've created a callback branch in svn with work in this direction. I don't know if it is going to work better or not... --Jgraham 00:53, 3 December 2006 (UTC)
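To make the callback idea concrete, here is a minimal sketch of how the two classes could be wired together. It is not the code in the svn branch. HTMLParser, HTMLTokenizer and emitStartTagToken are the names used above; the parse and tokenize methods, the emitCharacterToken callback, the events list and the use of a plain string as the input object are assumptions for illustration.

```python
class HTMLTokenizer:
    """Toy tokenizer (hypothetical) that reports tokens by calling the parser."""

    def __init__(self, parser, source):
        self.parser = parser  # the object that receives the callbacks
        self.source = source  # assumed here to be a plain string

    def tokenize(self):
        text = self.source
        pos = 0
        while pos < len(text):
            if text[pos] == "<":
                # Treat everything up to ">" as a start tag name.
                end = text.find(">", pos)
                if end == -1:
                    end = len(text)
                self.parser.emitStartTagToken(text[pos + 1:end])
                pos = end + 1
            else:
                # Treat everything up to the next "<" as character data.
                end = text.find("<", pos)
                if end == -1:
                    end = len(text)
                self.parser.emitCharacterToken(text[pos:end])
                pos = end


class HTMLParser:
    """Toy parser (hypothetical) that records the callbacks it receives."""

    def __init__(self, source):
        self.events = []
        # The parser hands itself and the input object to the tokenizer.
        self.tokenizer = HTMLTokenizer(self, source)

    def parse(self):
        self.tokenizer.tokenize()
        return self.events

    def emitStartTagToken(self, name):
        self.events.append(("StartTag", name))

    def emitCharacterToken(self, data):
        self.events.append(("Characters", data))
```

With this wiring, HTMLParser("<p>hi").parse() records one start tag callback followed by one character callback. The trade-off against getToken is that no token objects need to be created or queued, but the tokenizer, rather than the parser, then drives the loop.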