<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.whatwg.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Fantasai</id>
	<title>WHATWG Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.whatwg.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Fantasai"/>
	<link rel="alternate" type="text/html" href="https://wiki.whatwg.org/wiki/Special:Contributions/Fantasai"/>
	<updated>2026-04-05T03:53:03Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.3</generator>
	<entry>
		<id>https://wiki.whatwg.org/index.php?title=HTML5Lib&amp;diff=2436</id>
		<title>HTML5Lib</title>
		<link rel="alternate" type="text/html" href="https://wiki.whatwg.org/index.php?title=HTML5Lib&amp;diff=2436"/>
		<updated>2007-08-15T16:12:28Z</updated>

		<summary type="html">&lt;p&gt;Fantasai: Undo revision 2420 by Special:Contributions/QvzNdc (User talk:QvzNdc)&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= HTML5Lib =&lt;br /&gt;
&lt;br /&gt;
[http://code.google.com/p/html5lib/ HTML5Lib] is a project to create both Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is &#039;&#039;&#039;not&#039;&#039;&#039; an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.&lt;br /&gt;
&lt;br /&gt;
== SVN ==&lt;br /&gt;
Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you&#039;re not going to redo work someone else has already done, ask on the [http://groups.google.com/group/html5lib-discuss mailing list]. For questions that could benefit from quick turnaround, talk to people on #whatwg.&lt;br /&gt;
&lt;br /&gt;
== General ==&lt;br /&gt;
&lt;br /&gt;
In comments &amp;quot;XXX&amp;quot; indicates something that has yet to be done: the code might be wrong, might not have been written yet, or might need other work along those lines.&lt;br /&gt;
&lt;br /&gt;
In comments &amp;quot;AT&amp;quot; indicates that the comment documents an alternate implementation technique or strategy.&lt;br /&gt;
&lt;br /&gt;
== HTMLTokenizer ==&lt;br /&gt;
&lt;br /&gt;
The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.&lt;br /&gt;
&lt;br /&gt;
Currently tokens are objects; they will become dicts.&lt;br /&gt;
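&lt;br /&gt;
As a rough illustration of that shape (a class constructed with a stream and then iterated to get tokens back), here is a minimal sketch. ToyInputStream and ToyTokenizer are hypothetical names made up for this example, not the actual html5lib classes, and the sketch only splits text into words and spaces rather than tokenizing HTML. Representing tokens as dicts with &amp;quot;type&amp;quot; and &amp;quot;data&amp;quot; keys is likewise an assumption about the planned format.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy sketch only: ToyInputStream and ToyTokenizer are hypothetical names,
# not the html5lib classes. Tokens are plain dicts (an assumed format).


class ToyInputStream:
    """Hands out one character at a time; returns None at end of input."""

    def __init__(self, text):
        self.text = text
        self.position = 0

    def char(self):
        if self.position == len(self.text):
            return None
        character = self.text[self.position]
        self.position += 1
        return character


class ToyTokenizer:
    """Iterating over an instance yields dict tokens."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        pending = []
        while True:
            character = self.stream.char()
            if character is None or character.isspace():
                # A word ends at whitespace or at end of input.
                if pending:
                    yield {"type": "Word", "data": "".join(pending)}
                    pending = []
                if character is None:
                    return
                yield {"type": "Space", "data": character}
            else:
                pending.append(character)


# Consume the tokenizer the way a parser would: by iterating over it.
tokens = list(ToyTokenizer(ToyInputStream("hello html5lib")))
```
&lt;br /&gt;
Because the tokenizer is a plain iterable, a consumer can pull tokens one at a time in a for loop instead of collecting them all into a list first.&lt;br /&gt;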
&lt;br /&gt;
=== Interface ===&lt;br /&gt;
&lt;br /&gt;
The parser needs to change the tokenizer&#039;s self.contentModelFlag attribute, which affects how certain states are handled.&lt;br /&gt;
&lt;br /&gt;
=== Issues ===&lt;br /&gt;
* Use of if statements in the states may be suboptimal (but we should time this)&lt;br /&gt;
&lt;br /&gt;
== HTMLParser ==&lt;br /&gt;
&lt;br /&gt;
=== Profiling on web-apps.htm ===&lt;br /&gt;
&lt;br /&gt;
I did some profiling on web-apps.htm, which is a rather large document. Based on that, I have already changed a number of things, which sped us up a bit. Below are some things to consider for future revisions:&lt;br /&gt;
&lt;br /&gt;
* utils.MethodDispatcher is invoked way too often. By predeclaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm, so that number could be higher. Not sure how to put them somewhere else though. The first thing I tried was HTMLParser, but references get all messed up then...&lt;br /&gt;
: We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes, so the change should be painless.&lt;br /&gt;
:: That&#039;s an interesting idea. How would that work? [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)&lt;br /&gt;
::: I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it&#039;s worth it. [[User:Annevk|Annevk]] 11:37, 26 December 2006 (UTC)&lt;br /&gt;
&lt;br /&gt;
* 713194 calls to __contains__ in sets.py make us slow. They take about 1.0x CPU seconds.&lt;br /&gt;
: I&#039;ve just switched to the built-in set type. Hopefully this will help a bit. [[User:Jgraham|Jgraham]] 00:30, 25 December 2006 (UTC)&lt;br /&gt;
:: It did. (Not surprisingly when 700,000 method calls are gone...) [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)&lt;br /&gt;
&lt;br /&gt;
* 440382 calls to char in tokenizer.py are the runner-up with 0.8x CPU seconds.&lt;br /&gt;
: This is now the largest time consumer. [[User:Annevk|Annevk]] 12:49, 25 December 2006 (UTC)&lt;br /&gt;
&lt;br /&gt;
* dataState in tokenizer.py with 0.7 CPU seconds is next.&lt;br /&gt;
: This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. [[User:Annevk|Annevk]]&lt;br /&gt;
&lt;br /&gt;
* __iter__ in tokenizer.py with 0.59x CPU seconds...&lt;br /&gt;
&lt;br /&gt;
* Creation of all node objects in web-apps.htm takes 0.57x CPU seconds.&lt;br /&gt;
&lt;br /&gt;
* etc.&lt;br /&gt;
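&lt;br /&gt;
The idea discussed above of storing a single instance of each insertion mode, rather than creating one on every switch, could look roughly like the sketch below. ToyParser, switch_mode and the stripped-down InBody and InTable classes are hypothetical stand-ins, not the real html5lib code, and the sketch assumes the mode classes keep no per-document state. The created counter exists only to show how many times a constructor ran.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch: ToyParser, InBody and InTable stand in for the real
# classes. It assumes the mode classes keep no per-document state.


class InsertionMode:
    created = 0  # counts constructor calls across all modes

    def __init__(self, parser):
        InsertionMode.created += 1
        self.parser = parser


class InBody(InsertionMode):
    pass


class InTable(InsertionMode):
    pass


class ToyParser:
    def __init__(self):
        # Build each mode once, up front, instead of on every switch.
        self.modes = {
            "inBody": InBody(self),
            "inTable": InTable(self),
        }
        self.mode = self.modes["inBody"]

    def switch_mode(self, name):
        # Reuse the stored instance; no constructor runs here.
        self.mode = self.modes[name]


parser = ToyParser()
for _ in range(7000):
    parser.switch_mode("inTable")
    parser.switch_mode("inBody")
```
&lt;br /&gt;
However many times the mode switches, only the two constructor calls in ToyParser.__init__ ever happen, so any per-instance setup cost is paid once per parser rather than once per switch.&lt;br /&gt;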
&lt;br /&gt;
== Testcases ==&lt;br /&gt;
Testcases are under the /tests directory. They require [http://cheeseshop.python.org/pypi/simplejson simplejson]. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don&#039;t pass should not be checked in without both informing others on the [http://groups.google.com/group/html5lib-discuss mailing list] and having a concrete plan to make them pass. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at [[Parser_tests]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Category:Implementations]]&lt;/div&gt;</summary>
		<author><name>Fantasai</name></author>
	</entry>
</feed>