Technical: June 2004 Archives
I have now added the facility for users to add jobs to their own web pages It's a simple operation of cutting and pasting two lines of HTML into a web page and you will get a feed from the database appearing. The feed can be customised with no knowledge of CSS. For those that know some css then you can customise the feed completely by writing your own CSS.
I have been meaning to throw together some thoughts about Google Page Rank etc for some time and I finally got around to it tonight.
I have been doing some more work on the HTML parser to see if I could improve the speed a bit. I decided to change yyin to read from a file rather than stdin and this has made quite a difference to the speed. It is now faster than HTML::Parser (but its not as functional or tidy)
My next task is to either find a good extendable hashing library for C or call C++ std libs from C. I have never had to do this before so it could be fun. I need to be able to use either C's equivalent to the C++ standard map and non standard hash.
So far I have not had much luck finding a C hash library that fits the bill..
I was writing a custom built parser for the search engine but Mark Fowler asked me why I was not using Yacc and friends. At the time any reference I had found to parsing HTML had said that Yacc was not the correct tool and although possible for strict html etc it does not fit well in the real world were most html is rubbish.
I decided to see what flex could offer instead and I can now safely say that flex is the exact tool that I was looking for, "thankyou Mark". Admitedly I am asking flex to do a lot more than just spit out tokens etc but what it does suits my needs . It gives me the ability to generate arbitrary code dependant upon state while parsing a document which is Exaclty what I need.
I wrote a simple parser using flex++ and then decided to see how performance would compare against the perl module HTML::Parser. Hands down HJTML::Parser is faster and not by any short margin. I have not looked into why this is. I am now going to write one in plain C and avoid C++ to see how much difference we can get.
Well! It took me about 5 minutes to change the flex++ parser back to a C parser and the performace improvement is quite drastic. I imagine this is because of the way flex++ is generating the lex.yy.cc file because C++ is only margianlly slower than C and some would argue it is faster.
I did a bit more playing around with the C version and I think I might be able to make it faster than the HTML::Parser version (of course it won't be half as functional).
Testing was done as follows
file big.html == 10Mb of reasonably formatted HTML
perl HTML::Parser (with XS) == 0.5s
flex == 0.7s
flex++ == 8s
I am no guru at flex so I imagine that it could be a lot quicker once I get to grips with it.