HTML5Lib

From WHATWG Wiki
Revision as of 01:36, 13 August 2007 by QvzNdc (talk | contribs)


HTML5Lib is a project to create Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is not an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under an open-source MIT license.

SVN

Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you are not redoing work someone else has already done, ask on the mailing list. For questions that could benefit from quick turnaround, talk to people on #whatwg.

General

In comments, "XXX" indicates something that has yet to be done: the code might be wrong, might not have been written yet, or has some other problem in that general direction.

In comments, "AT" indicates that the comment documents an alternate implementation technique or strategy.
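
As an illustration of the two markers, here is a small sketch. The function parseAttributeValue is hypothetical and is not part of the library; only the "XXX" and "AT" comment conventions come from the text above.

```python
def parseAttributeValue(raw):
    """Hypothetical helper: strip matching quotes from an attribute value."""
    # XXX: entity references inside the value are not expanded yet.
    # AT: a regular expression could strip the quotes instead of slicing.
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "\"'":
        return raw[1:-1]
    return raw


quoted = parseAttributeValue('"text/html"')
unquoted = parseAttributeValue("text/html")
print(quoted, unquoted)
```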

HTMLTokenizer

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.
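
The calling pattern described above (construct with a stream, then iterate to get tokens) might look roughly like the sketch below. ToyTokenizer is a hypothetical stand-in, not the real HTMLTokenizer: it takes a plain string instead of an HTMLInputStream, handles only the simplest tags, and the dict keys "type", "name" and "data" are assumptions made for the example.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for HTMLTokenizer: wraps a stream and yields tokens."""

    # Matches either a simple tag without attributes or a run of text.
    _pattern = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)")

    def __init__(self, stream):
        # The real class receives an HTMLInputStream; a str is used here.
        self.stream = stream

    def __iter__(self):
        for match in self._pattern.finditer(self.stream):
            closing, name, text = match.groups()
            if text is not None:
                yield {"type": "Characters", "data": text}
            elif closing:
                yield {"type": "EndTag", "name": name.lower()}
            else:
                yield {"type": "StartTag", "name": name.lower()}


# Iterating over the tokenizer object produces the tokens one by one.
tokens = list(ToyTokenizer("<p>Hello</p>"))
for token in tokens:
    print(token)
```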

Interface

The parser needs to be able to change the tokenizer's self.contentModelFlag attribute, which affects how certain states are handled.
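
A minimal sketch of that interaction follows. Only the attribute name contentModelFlag comes from the text above; the classes, the method names, the flag values, and the element-to-flag mapping are all assumptions chosen for illustration.

```python
class SketchTokenizer:
    """Hypothetical tokenizer exposing a flag that the parser may change."""

    def __init__(self):
        self.contentModelFlag = "PCDATA"

    def startsTag(self, char):
        # In this sketch "<" opens a tag only while the flag is "PCDATA".
        return char == "<" and self.contentModelFlag == "PCDATA"


class SketchParser:
    """Hypothetical parser that adjusts the tokenizer flag on some start tags."""

    # Assumed mapping, for illustration only.
    flagForElement = {"textarea": "RCDATA", "script": "CDATA", "style": "CDATA"}

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def processStartTag(self, name):
        if name in self.flagForElement:
            self.tokenizer.contentModelFlag = self.flagForElement[name]


tokenizer = SketchTokenizer()
parser = SketchParser(tokenizer)
before = tokenizer.startsTag("<")
parser.processStartTag("script")
after = tokenizer.startsTag("<")
print(before, after)
```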

Issues

  • Use of if statements in the states may be suboptimal (but we should time this)

HTMLParser

Profiling on web-apps.htm

I did some profiling on web-apps.htm which is a rather large document. Based on that I already changed a number of things which speed us up a bit. Below are some things to consider for future revisions:

  • utils.MethodDispatcher is invoked way too often. By pre-declaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm, so that number could be higher. Not sure how to put them somewhere else though. The first thing I tried was HTMLParser, but references get all messed up then...
We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes, so the change should be painless.
That's an interesting idea. How would that work? Annevk 12:49, 25 December 2006 (UTC)
I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. Annevk 11:37, 26 December 2006 (UTC)
  • 713194 calls to __contains__ in sets.py make us slow. They take about 1.0x CPU seconds.
I've just switched to the built-in set type. Hopefully this will help a bit. Jgraham 00:30, 25 December 2006 (UTC)
It did. (Not surprisingly when 700,000 method calls are gone...) Annevk 12:49, 25 December 2006 (UTC)
  • 440382 calls to char in tokenizer.py is the runner-up with 0.8x CPU seconds.
This is now the largest time consumer. Annevk 12:49, 25 December 2006 (UTC)
  • dataState in tokenizer.py with 0.7 CPU seconds is next.
This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. Annevk
  • __iter__ in tokenizer.py with 0.59x CPU seconds...
  • Creation of all node objects in web-apps takes .57x CPU seconds.
  • etc.
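
The idea discussed in the first item, keeping a single instance of each insertion mode instead of constructing one on every mode switch, can be sketched as follows. This is not the change that was actually committed: apart from the name InBody, every class, method, and attribute here is hypothetical, and the sketch assumes the mode classes hold no per-document state.

```python
class InsertionMode:
    """Hypothetical base class; counts how often a mode object is built."""

    instancesCreated = 0

    def __init__(self, parser):
        InsertionMode.instancesCreated += 1
        self.parser = parser
        # Stands in for the per-instance dispatch table that is costly to build.
        self.handlers = {"p": self.startTagP}

    def startTagP(self, name):
        return "%s handled <%s>" % (type(self).__name__, name)

    def startTagOther(self, name):
        return "%s ignored <%s>" % (type(self).__name__, name)

    def processStartTag(self, name):
        return self.handlers.get(name, self.startTagOther)(name)


class InHead(InsertionMode):
    pass


class InBody(InsertionMode):
    pass


class SketchParser:
    def __init__(self):
        # Build each mode once and reuse it on every later switch.
        self.modes = {"inHead": InHead(self), "inBody": InBody(self)}
        self.mode = self.modes["inHead"]

    def switchMode(self, name):
        self.mode = self.modes[name]


parser = SketchParser()
for _ in range(1000):
    parser.switchMode("inBody")
    parser.switchMode("inHead")
parser.switchMode("inBody")
result = parser.mode.processStartTag("p")
print(InsertionMode.instancesCreated, result)
```

With this layout the constructor cost is paid once per mode rather than once per switch, which is the effect reported in the thread above.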

Testcases

Testcases are under the /tests directory. They require simplejson. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the mailing list and having a concrete plan for making them pass. Ideally, new features should be accompanied by new unit tests for those features. Documentation of the test format is available at Parser_tests.