Difference between revisions of "HTML5Lib"

From WHATWG Wiki
m (HTML5Lib: removed unnecessary blank lines)
(hopefully it now more clearly indicates how things work)
Line 4: Line 4:

Old revision text replaced by this edit (the new text appears in the revision below):

==Project Page==
[http://code.google.com/p/html5lib/ Google code project page]

==Design Issues==

===Tokenizer===
The tokenizer is controlled by a single Tokenizer class. It emits tokens which are subclasses of the Token class. At some level it needs to be possible to store information about the token being emitted, as the tree builder sometimes needs to reprocess a token; however, there may be some opportunity to economize here.

====Interface====
getToken - return the next token.

====Description====
The current tokenizer state is stored in the state attribute. Each state corresponds to a single method on the Tokenizer object. The states attribute maps state names to methods.

Occasionally, parsing will result in characters being read multiple times from the stream. Therefore a queue is used to buffer characters that have been read from the file but need to be reprocessed (do we need to buffer multiple characters? Will support for a trailing slash change this?).

During processing, any unfinished tokens are held in the token attribute and finished tokens in the tokenQueue.

Line 27: Line 38:

[[Category:Implementations]]
 
==Idea on how it can work together==
 
 
There's an HTMLParser class that you invoke with an object. What this object is can be decided later: a file object, string, URI, etc. The newly created HTMLParser object then instantiates an HTMLTokenizer, passing itself and the object as arguments. The HTMLTokenizer then calls back into the parser, for example parser.emitStartTagToken(name, ...).
 
 
OK I've created a callback branch in svn with work in this direction. I don't know if it is going to work better or not... --[[User:Jgraham|Jgraham]] 00:53, 3 December 2006 (UTC)
 

Revision as of 07:01, 6 December 2006

HTML5Lib

HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular a tokenizer and parser. It is not an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

Project Page

Google code HTML5 Python parser project page

Design

General

In comments, "XXX" indicates something that has yet to be done: it might be wrong, might not have been written yet, or something else in that general direction.

In comments, "AT" indicates that the comment documents an alternate implementation technique or strategy.

HTMLTokenizer

The tokenizer is controlled by a single HTMLTokenizer class, stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a parser argument, which the tokenizer keeps as self.parser. For this project the parser is the HTMLParser. The parser has to implement the following methods in order to work:

  • processDoctype(name, error)
  • processStartTag(tagname, attributes[])
  • processEndTag(tagname, attributes[])
  • processComment(data)
  • processCharacter(data)
  • processEOF()

XXX: Perhaps we can remove the second argument of the processEndTag method.
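
As a rough illustration of the callback interface listed above, here is a minimal sketch of a parser object that just records each call. The class name RecordingParser, the events list and the drive function are hypothetical and not part of html5lib; drive merely stands in for the tokenizer by issuing the calls by hand.

```python
class RecordingParser:
    """Hypothetical parser stub: records every callback as a tuple."""

    def __init__(self):
        self.events = []

    def processDoctype(self, name, error):
        self.events.append(("doctype", name, error))

    def processStartTag(self, tagname, attributes):
        self.events.append(("start", tagname, dict(attributes)))

    def processEndTag(self, tagname, attributes=None):
        # The second argument may be removed (see the XXX note),
        # so it is optional in this sketch.
        self.events.append(("end", tagname))

    def processComment(self, data):
        self.events.append(("comment", data))

    def processCharacter(self, data):
        self.events.append(("char", data))

    def processEOF(self):
        self.events.append(("eof",))


def drive(parser):
    """Stand-in for the tokenizer: the calls for '<!DOCTYPE html><p>hi</p>'."""
    parser.processDoctype("html", False)
    parser.processStartTag("p", [])
    for ch in "hi":
        parser.processCharacter(ch)
    parser.processEndTag("p", [])
    parser.processEOF()


parser = RecordingParser()
drive(parser)
print(parser.events)
```

Any object with these six methods could be handed to the tokenizer, which is what makes a recording stub like this handy for testing the tokenizer on its own.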

Interface

You need to invoke the "tokenize" method with a dataStream argument in order to start.

The parser can also change the self.contentModelFlag attribute which affects how certain states are handled.
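
The interplay between tokenize and contentModelFlag might look roughly like the toy below. This is only a sketch under heavy assumptions: ToyTokenizer and ToyParser are hypothetical names, the flag values "PCDATA" and "CDATA" are placeholders, only bare <tag> and </tag> are recognised, and the tokenizer resets the flag itself, which is a simplification.

```python
import io

PCDATA, CDATA = "PCDATA", "CDATA"  # hypothetical flag values


class ToyTokenizer:
    """Toy only: handles bare <tag> and </tag>, nothing else."""

    def __init__(self, parser):
        self.parser = parser
        self.contentModelFlag = PCDATA

    def tokenize(self, dataStream):
        text = dataStream.read()
        pos = 0
        last_start = ""  # name of the most recent start tag
        while pos < len(text):
            if self.contentModelFlag == CDATA:
                # In CDATA mode only the matching end tag counts as markup.
                end = text.find("</" + last_start + ">", pos)
                if end == -1:
                    end = len(text)
                for ch in text[pos:end]:
                    self.parser.processCharacter(ch)
                pos = end
                self.contentModelFlag = PCDATA  # simplification
                continue
            close = text.find(">", pos)
            if text[pos] == "<" and close != -1:
                name = text[pos + 1:close]
                if name.startswith("/"):
                    self.parser.processEndTag(name[1:], [])
                else:
                    last_start = name
                    self.parser.processStartTag(name, [])
                pos = close + 1
            else:
                self.parser.processCharacter(text[pos])
                pos += 1
        self.parser.processEOF()


class ToyParser:
    def __init__(self):
        self.tokenizer = ToyTokenizer(self)
        self.events = []

    def processStartTag(self, tagname, attributes):
        self.events.append(("start", tagname))
        if tagname == "script":
            # The parser changes how the tokenizer treats what follows.
            self.tokenizer.contentModelFlag = CDATA

    def processEndTag(self, tagname, attributes):
        self.events.append(("end", tagname))

    def processCharacter(self, data):
        self.events.append(("char", data))

    def processEOF(self):
        self.events.append(("eof",))


parser = ToyParser()
parser.tokenizer.tokenize(io.StringIO("<script><b></script>"))
print(parser.events)
```

Because the parser sets the flag while handling the script start tag, the "<b>" inside it reaches the parser as three characters rather than as a start tag.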

Issues

  • Should keep track of line+col number for error reporting
  • Use of if statements in the states may be suboptimal (but we should time this)
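
For the first issue, one possible approach is to wrap the input stream in an object that counts lines and columns as characters are consumed. The sketch below is an assumption about how that could be done, not existing html5lib code; PositionTrackingStream, read_char and position are hypothetical names.

```python
import io


class PositionTrackingStream:
    """Hypothetical wrapper that counts lines and columns while reading."""

    def __init__(self, stream):
        self.stream = stream
        self.line = 1  # 1-based line of the next character to be read
        self.col = 0   # number of characters already read on that line

    def read_char(self):
        ch = self.stream.read(1)
        if ch == "\n":
            self.line += 1
            self.col = 0
        elif ch:
            self.col += 1
        return ch  # empty string at end of input

    def position(self):
        return (self.line, self.col)


stream = PositionTrackingStream(io.StringIO("<p>\n<b"))
while stream.read_char():
    pass
print(stream.position())  # (2, 2)
```

A tokenizer reading through such a wrapper could attach position() to each error it reports. Characters pushed back for reprocessing would need the counters rewound as well, which this sketch does not handle.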