HTML5Lib

From WHATWG Wiki
Revision as of 19:51, 20 August 2007 by QvzNdc (talk | contribs)

HTML5Lib is a project to create Python-based and Ruby-based implementations of various parts of the WHATWG spec, in particular a tokenizer, a parser, and a serializer. It is not an official WHATWG project; however, we plan to use this wiki to document and discuss the library design. The code is available under the open-source MIT license.

SVN

Please commit often, with reasonably detailed descriptions of what you did. If you want to make sure you are not redoing work that someone else has already done, ask on the mailing list. For questions that could benefit from a quick turnaround, talk to people on #whatwg.

General

In comments, "XXX" marks something that has yet to be done: the code might be wrong, might not have been written yet, or has some similar problem.

In comments, "AT" indicates that the comment documents an alternate implementation technique or strategy.

HTMLTokenizer

The tokenizer is controlled by a single HTMLTokenizer class stored in tokenizer.py at the moment. You initialize the HTMLTokenizer with a stream argument that holds an HTMLInputStream. You can iterate over the object created to get tokens back.

Currently tokens are objects; they will become dicts.
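
A minimal sketch of the iteration pattern described above. The classes are simplified stand-ins written for illustration, not the code in tokenizer.py: ToyInputStream and ToyTokenizer are hypothetical names, and the dict layout ("type" and "data" keys) and the two token kinds are assumptions.

```python
class ToyInputStream:
    """Hypothetical stand-in for HTMLInputStream: hands out one character at a time."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def char(self):
        # Return the next character, or None at the end of the input.
        if self.pos >= len(self.text):
            return None
        c = self.text[self.pos]
        self.pos += 1
        return c


class ToyTokenizer:
    """Hypothetical stand-in for HTMLTokenizer: built from a stream, iterated for tokens."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        text = []
        while True:
            c = self.stream.char()
            if c is None:
                break
            if c == "<":
                # Flush pending character data before emitting the tag.
                if text:
                    yield {"type": "Characters", "data": "".join(text)}
                    text = []
                name = []
                c = self.stream.char()
                while c is not None and c != ">":
                    name.append(c)
                    c = self.stream.char()
                yield {"type": "Tag", "data": "".join(name)}
            else:
                text.append(c)
        if text:
            yield {"type": "Characters", "data": "".join(text)}


# Initialize with a stream, then iterate to get tokens back.
tokens = list(ToyTokenizer(ToyInputStream("<p>Hello</p>")))
for token in tokens:
    print(token["type"], token["data"])
```

The real tokenizer is a state machine with many more token types; the sketch only shows the calling convention of passing a stream to the constructor and looping over the result.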

Interface

The parser needs to change the self.contentModelFlag attribute, which affects how certain states are handled.
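
The interaction can be sketched as follows. This is an illustrative toy, not the project's code: ToyTokenizer and parse are hypothetical, the flag is reduced to two assumed string values ("PCDATA" and "CDATA"), and only script is treated specially. Because the tokenizer is consumed lazily, a flag change made by the parser after one token takes effect before the next token is produced.

```python
class ToyTokenizer:
    """Hypothetical tokenizer whose handling of "<" depends on contentModelFlag."""

    def __init__(self, text):
        self.text = text
        self.contentModelFlag = "PCDATA"  # assumed default value

    def __iter__(self):
        pos = 0
        n = len(self.text)
        while pos < n:
            if self.contentModelFlag == "CDATA":
                # In CDATA mode only an end tag ("</") terminates character data.
                end = self.text.find("</", pos)
            else:
                end = self.text.find("<", pos)
            if end == -1:
                end = n
            if end > pos:
                yield ("Characters", self.text[pos:end])
                pos = end
                continue
            close = self.text.find(">", pos)
            if close == -1:
                # Unterminated tag: emit the rest as character data and stop.
                yield ("Characters", self.text[pos:])
                break
            yield ("Tag", self.text[pos + 1:close])
            pos = close + 1


def parse(text):
    """Hypothetical parser loop that flips the flag around script content."""
    tokenizer = ToyTokenizer(text)
    seen = []
    for kind, data in tokenizer:
        seen.append((kind, data))
        if kind == "Tag" and data == "script":
            tokenizer.contentModelFlag = "CDATA"
        elif kind == "Tag" and data == "/script":
            tokenizer.contentModelFlag = "PCDATA"
    return seen


result = parse("<script>if (a<b) x()</script><p>")
for item in result:
    print(item)
```

Without the flag change, the "<b" inside the script would be tokenized as the start of a tag.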

Issues

  • Use of if statements in the states may be suboptimal (but we should time this)
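
One way to time this, sketched with the standard timeit module. The two classify functions and the sample input are made up for the comparison and do not come from tokenizer.py; real states do more work per character, so the numbers only hint at the relative cost of the two styles.

```python
import timeit


def classify_if(c):
    # Chain of if statements, similar in spirit to branching inside a state.
    if c == "&":
        return "entity"
    elif c == "<":
        return "tag-open"
    elif c is None:
        return "eof"
    else:
        return "character"


_DISPATCH = {"&": "entity", "<": "tag-open", None: "eof"}


def classify_dict(c):
    # The same decision expressed as a single dictionary lookup.
    return _DISPATCH.get(c, "character")


# Small made-up input; None stands for end of file.
sample = list("<p>a &amp; b</p>") + [None]


def run(classify):
    return [classify(c) for c in sample]


if_time = timeit.timeit(lambda: run(classify_if), number=2000)
dict_time = timeit.timeit(lambda: run(classify_dict), number=2000)
print("if chain: %.4fs, dict lookup: %.4fs" % (if_time, dict_time))
```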

HTMLParser

Profiling on web-apps.htm

I did some profiling on web-apps.htm, which is a rather large document. Based on that, I have already changed a number of things, which sped us up a bit. Below are some things to consider for future revisions:

  • utils.MethodDispatcher is invoked way too often. By predeclaring some of it in InBody I managed to decrease the number of invocations by over 24,000, but InBody.__init__ is invoked about 7000 times for web-apps.htm, so that number could be higher. Not sure how to put them somewhere else though. The first thing I tried was HTMLParser, but references get all messed up then...
We should be able to store a single instance of each InsertionMode rather than creating a new one every time the mode switches. Hopefully we have been disciplined enough not to keep any state in those classes, so the change should be painless.
That's an interesting idea. How would that work? Annevk 12:49, 25 December 2006 (UTC)
I got an idea on how it might work and it worked! Still about 3863 invocations to utils.MethodDispatcher but it takes 0.000 CPU seconds. I suppose we can decrease that amount even more, but I wonder if it's worth it. Annevk 11:37, 26 December 2006 (UTC)
  • 713194 calls to __contains__ in sets.py make us slow. They take about 1.0x CPU seconds.
I've just switched to the built-in set type. Hopefully this will help a bit. Jgraham 00:30, 25 December 2006 (UTC)
It did. (Not surprisingly when 700,000 method calls are gone...) Annevk 12:49, 25 December 2006 (UTC)
  • 440382 calls to char in tokenizer.py is the runner up with 0.8x CPU seconds.
This is now the largest time consumer. Annevk 12:49, 25 December 2006 (UTC)
  • dataState in tokenizer.py with 0.7 CPU seconds is next.
This is now at 0.429 CPU seconds. Probably because the tokenizer switched to dicts instead of custom Token objects. Annevk
  • __iter__ in tokenizer.py with 0.59x CPU seconds...
  • Creation of all node objects in web-apps takes .57x CPU seconds.
  • etc.
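
The single-instance idea from the first item can be sketched like this. The sketch does not show how the change was actually made: Phase, ToyParser, and switch are hypothetical names, the handler table stands in for whatever utils.MethodDispatcher builds, and the approach assumes the mode classes keep no per-document state.

```python
class Phase:
    """Hypothetical stand-in for an insertion-mode class such as InBody."""

    instances_created = 0

    def __init__(self, parser):
        Phase.instances_created += 1
        self.parser = parser
        # The dispatch table is built once per instance, so fewer
        # instances means fewer table constructions.
        self.handlers = {"p": self.start_p}

    def start_p(self, name):
        return "handled <%s>" % name


class InBody(Phase):
    pass


class InTable(Phase):
    pass


class ToyParser:
    def __init__(self):
        # One instance per mode, created up front and reused on every switch.
        self.phases = {"inBody": InBody(self), "inTable": InTable(self)}
        self.phase = self.phases["inBody"]

    def switch(self, name):
        # Switching modes is a dictionary lookup, not a constructor call.
        self.phase = self.phases[name]


parser = ToyParser()
for _ in range(7000):
    parser.switch("inTable")
    parser.switch("inBody")

print(Phase.instances_created)  # prints 2
```

However many times the mode switches, each constructor runs once per parser.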

Testcases

Testcases are under the /tests directory. They require simplejson. New code should not be checked in if it regresses previously functional unit tests. Similarly, new tests that don't pass should not be checked in without both informing others on the mailing list and a concrete plan. Ideally new features should be accompanied by new unit tests for those features. Documentation of the test format is available at Parser_tests.