HTML5Lib: Difference between revisions

From WHATWG Wiki
(tokeniser -> tokenizer + adding an issue)
Line 1: Line 1:
=HTML5Lib=
=HTML5Lib=


HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular, a tokeniser and parser. It is '''not''' an offical WHATWG project, however we plan to use this wiki to document and discuss the library design. The code is avaliable under an open-source MIT license.
HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular, a tokenizer and parser. It is '''not''' an offical WHATWG project, however we plan to use this wiki to document and discuss the library design. The code is avaliable under an open-source MIT license.


==Project Page==
==Project Page==
Line 8: Line 8:
==Design Issues==
==Design Issues==


===Tokeniser===
===Tokenizer===


The tokeniser is controlled by a single Tokeniser class. It emits tokens which are subclasses of the Token class. At some level it needs to be possible to store information about the token being emitted as the tree builder sometimes need to reprocess a token, however there may be some opportunity to economize here.
The tokenizer is controlled by a single Tokenizer class. It emits tokens which are subclasses of the Token class. At some level it needs to be possible to store information about the token being emitted as the tree builder sometimes need to reprocess a token, however there may be some opportunity to economize here.


====Interface====
====Interface====
Line 16: Line 16:


====Description====
====Description====
The current tokeniser state is stored in the state attribute. Each state corresponds to a single method on the Tokeniser object. The states attributes maps state names to methods.
The current tokenizer state is stored in the state attribute. Each state corresponds to a single method on the Tokenizer object. The states attributes maps state names to methods.


Occasionally, parsing will result in characters being read multiple times from the stream. Therefore a queue is used to buffer characters that have been read from the file but need to be reprocessed (do we need to buffer multiple characters? Will support for a trailing slash change this?).
Occasionally, parsing will result in characters being read multiple times from the stream. Therefore a queue is used to buffer characters that have been read from the file but need to be reprocessed (do we need to buffer multiple characters? Will support for a trailing slash change this?).
Line 25: Line 25:
* Should keep track of line+col number for error reporting
* Should keep track of line+col number for error reporting
* Use of if statements in the states may be suboptimal (but we should time this)
* Use of if statements in the states may be suboptimal (but we should time this)
* Instead of having Token objects I think we should have HTMLParser be a subclass of HTMLTokenizer that implements the various methods HTMLTokenizer provides. (emitCommentToken, emitStartTagToken...) -- Annevk


[[Category:Implementations]]
[[Category:Implementations]]

Revision as of 19:25, 2 December 2006

HTML5Lib

HTML5Lib is a project to create a Python-based implementation of various parts of the WHATWG spec, in particular a tokenizer and parser. It is not an official WHATWG project, but we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

Project Page

Google code project page

Design Issues

Tokenizer

The tokenizer is controlled by a single Tokenizer class. It emits tokens, which are subclasses of the Token class. At some level it must be possible to store information about the token being emitted, because the tree builder sometimes needs to reprocess a token; however, there may be some opportunity to economize here.
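
As a rough illustration of the Token hierarchy described above, the following sketch shows one possible layout. Only the Token base class is named on this page; the subclass names (CharacterToken, StartTagToken, EndTagToken, CommentToken), their fields, and the describe helper are assumptions made for the example, not the library's actual API.

```python
class Token:
    """Base class for everything the tokenizer emits."""


class CharacterToken(Token):  # hypothetical subclass
    def __init__(self, data):
        self.data = data


class StartTagToken(Token):  # hypothetical subclass
    def __init__(self, name, attributes=None):
        self.name = name
        self.attributes = dict(attributes or {})


class EndTagToken(Token):  # hypothetical subclass
    def __init__(self, name):
        self.name = name


class CommentToken(Token):  # hypothetical subclass
    def __init__(self, data):
        self.data = data


def describe(token):
    """Hypothetical consumer: dispatch on the token's type.

    Because the token is a self-contained object, a tree builder could
    hold on to it and pass it through this kind of dispatch a second time.
    """
    if isinstance(token, StartTagToken):
        return "start:" + token.name
    if isinstance(token, EndTagToken):
        return "end:" + token.name
    if isinstance(token, CommentToken):
        return "comment:" + token.data
    if isinstance(token, CharacterToken):
        return "char:" + token.data
    raise TypeError("unknown token type")
```

Keeping all the information on the token object is what makes reprocessing straightforward; the cost is one object allocation per token, which is presumably where the opportunity to economize lies.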

Interface

getToken - return the next token.

Description

The current tokenizer state is stored in the state attribute. Each state corresponds to a single method on the Tokenizer object. The states attribute maps state names to methods.
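
A minimal sketch of this state-per-method arrangement might look like the following. The state names ("data", "tagOpen"), the method names, the tuple-based output, and the choice to keep the bound method itself in the state attribute are all assumptions made for illustration; the real tokenizer has many more states and emits Token objects.

```python
class Tokenizer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.output = []
        # Maps state names to the methods that implement them.
        self.states = {
            "data": self.dataState,
            "tagOpen": self.tagOpenState,
        }
        # Assumption: the current state is stored as the bound method.
        self.state = self.states["data"]

    def dataState(self):
        char = self.text[self.pos]
        self.pos += 1
        if char == "<":
            # Switch state; the next loop iteration runs tagOpenState.
            self.state = self.states["tagOpen"]
        else:
            self.output.append(("char", char))

    def tagOpenState(self):
        # Simplification: treat everything up to ">" as the tag name.
        end = self.text.find(">", self.pos)
        if end == -1:
            end = len(self.text)
        self.output.append(("tag", self.text[self.pos:end]))
        self.pos = min(end + 1, len(self.text))
        self.state = self.states["data"]

    def run(self):
        # Repeatedly call whichever method is the current state.
        while self.pos < len(self.text):
            self.state()
        return self.output
```

For example, `Tokenizer("a<b>c").run()` walks through the data state, the tag-open state, and back to the data state.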

Occasionally, parsing will result in characters being read multiple times from the stream. Therefore, a queue is used to buffer characters that have been read from the file but need to be reprocessed. (Open questions: do we need to buffer multiple characters? Will support for a trailing slash change this?)
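
The buffering idea can be sketched as below. The class name CharStream and its read and unread methods are hypothetical names chosen for the example, and the sketch assumes that more than one character may need to be pushed back, which is one of the questions left open above.

```python
import io
from collections import deque


class CharStream:
    """Hypothetical wrapper that lets the tokenizer re-read characters."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.queue = deque()  # characters waiting to be reprocessed

    def read(self):
        # Serve buffered characters first, then fall back to the file.
        if self.queue:
            return self.queue.popleft()
        return self.fileobj.read(1)  # returns "" at end of file

    def unread(self, chars):
        # extendleft() inserts items one at a time at the front, which
        # reverses them, so reverse first to preserve the original order.
        self.queue.extendleft(reversed(chars))


stream = CharStream(io.StringIO("<!x"))
first = stream.read()
second = stream.read()
stream.unread(first + second)  # push both characters back
```

If it turns out that only a single character ever needs to be pushed back, the deque could be replaced by a single attribute.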

During processing, any unfinished tokens are held in the token attribute and finished tokens in the tokenQueue.
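
To show how getToken, the token attribute, and the tokenQueue could fit together, here is a deliberately simplified sketch. The attribute and method names getToken, token, and tokenQueue come from this page; the dict-based tokens, the _step helper, and the toy grammar (a tag is anything between "<" and ">") are assumptions for the example only.

```python
from collections import deque


class Tokenizer:
    def __init__(self, text):
        self.chars = deque(text)
        self.token = None          # the unfinished token, if any
        self.tokenQueue = deque()  # finished tokens awaiting getToken

    def _step(self):
        """Consume one character (hypothetical helper)."""
        char = self.chars.popleft()
        if self.token is None:
            if char == "<":
                # Start building a tag token; it is not finished yet.
                self.token = {"type": "StartTag", "name": ""}
            else:
                self.tokenQueue.append({"type": "Character", "data": char})
        elif char == ">":
            # The tag is complete: move it to the queue.
            self.tokenQueue.append(self.token)
            self.token = None
        else:
            self.token["name"] += char

    def getToken(self):
        """Return the next token, or None when the input is exhausted."""
        while not self.tokenQueue and self.chars:
            self._step()
        return self.tokenQueue.popleft() if self.tokenQueue else None
```

A queue is used rather than a single slot because one step of processing can, in principle, finish more than one token before the caller asks for the next.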

Issues

  • Should keep track of line+col number for error reporting
  • Use of if statements in the states may be suboptimal (but we should time this)
  • Instead of having Token objects I think we should have HTMLParser be a subclass of HTMLTokenizer that implements the various methods HTMLTokenizer provides. (emitCommentToken, emitStartTagToken...) -- Annevk
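
The last suggestion could be sketched roughly as follows. The names HTMLTokenizer, HTMLParser, emitCommentToken, and emitStartTagToken are taken from the issue above; the method signatures, the run method, the events list, and the toy scanning logic are assumptions made purely to illustrate the shape of the design, not a proposal for the real implementation.

```python
class HTMLTokenizer:
    def __init__(self, text):
        self.text = text

    # Subclasses override these instead of receiving Token objects.
    def emitStartTagToken(self, name):
        raise NotImplementedError

    def emitCommentToken(self, data):
        raise NotImplementedError

    def run(self):
        # Toy scanner: recognizes only "<!--...-->" and "<name>".
        text = self.text
        pos = 0
        while pos < len(text):
            if text.startswith("<!--", pos):
                end = text.find("-->", pos + 4)
                if end == -1:
                    end = len(text)
                self.emitCommentToken(text[pos + 4:end])
                pos = end + 3
            elif text[pos] == "<":
                end = text.find(">", pos + 1)
                if end == -1:
                    end = len(text)
                self.emitStartTagToken(text[pos + 1:end])
                pos = end + 1
            else:
                pos += 1  # character data is ignored in this sketch


class HTMLParser(HTMLTokenizer):
    def __init__(self, text):
        super().__init__(text)
        self.events = []  # stands in for real tree construction

    def emitStartTagToken(self, name):
        self.events.append(("startTag", name))

    def emitCommentToken(self, data):
        self.events.append(("comment", data))


parser = HTMLParser("<p><!--hi-->")
parser.run()
```

This avoids allocating a Token object per token, but it ties the parser to the tokenizer through inheritance and makes it less obvious how the tree builder would hold on to a token it needs to reprocess.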