Technical: October 2004 Archives
I am a fan of standards, i.e. XHTML Transitional/Strict etc. To this end I try to keep my own sites reasonably compliant. Sites I do commercially are always 100% compliant, but that's because I insist on it and the clients have placed their trust in me.
Just recently I have had to convert a really bad site to XHTML Transitional and if you had seen the markup you would have realized how big this task was. To go through it by hand would have been an enormous task and quite frankly I would have been unable to do it at the price I quoted without the following tools:
The first tool (Vim) could really be any good text editor, e.g. Emacs, ed, or any of the vi children. I just happen to use Vim, and once you have learned the basics it is a joy to use and makes editing text almost an art.
TT2! The second tool is slightly more specialized and less well known, but it is just as easy to use and deserves a big mention. TT2 (the Template Toolkit) is a templating system. Most people won't really understand, or even need to know, what its advantages are until they have to edit a 10+ page website and someone wants to change a font on some item on all the pages. This could of course be done using server side includes or some other method, but TT not only makes this easy, it also exposes a programmatic API, which makes its functionality and versatility as wide as the programmer's skills. This only scratches the surface of what TT can actually do for you.
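To make the "change a font on all the pages" point concrete, here is a rough sketch of the idea. It is written in Python with the standard library's string.Template, not in TT2 itself, and the page names, the `PAGE` layout and the `render_site` helper are all made up for illustration.

```python
from string import Template

# One shared layout for every page. Only $title, $font and $body vary.
# (Hypothetical layout, not taken from any real site.)
PAGE = Template("""<html>
<head>
<title>$title</title>
<style type="text/css">body { font-family: $font; }</style>
</head>
<body>$body</body>
</html>""")


def render_site(pages, font):
    """Render every page from the single shared layout.

    `pages` maps a page name to a dict with "title" and "body" keys.
    Changing `font` here changes it on every page at once.
    """
    return {
        name: PAGE.substitute(title=page["title"], body=page["body"], font=font)
        for name, page in pages.items()
    }
```

The design point is simply that the font lives in one place, so a site-wide change is a one-line edit followed by a re-render.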
The third tool is Dave Raggett's HTML Tidy. This one tool is what saved me from going stark raving mad this weekend. Visually selecting an area in Vim and then running
'<,'>!tidy -asxhtml -icbq -wrap 100
was what kept me sane. This single command will take ANY HTML fragment and sanitize it for you. It adds a lot of guff that you may not want, but you can remove that and you are left with a sanitized version complete with CSS.
I just wanted the formatting, indenting and validation. I weeded out the CSS and I was left with a nice plain HTML document that I was then able to understand rather than some debauchery of a mess the devil would not have started with.
Using Tidy this way is a great way to get a clear place to start when converting a messy HTML page.
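If you want to do the same thing outside the editor, the filter can be scripted. This is only a sketch in Python, assuming a `tidy` executable is on the PATH; the function names `tidy_command` and `tidy_fragment` are my own, and the flags are exactly the ones from the Vim command above.

```python
import shutil
import subprocess


def tidy_command(wrap=100):
    """Build the argv equivalent of `tidy -asxhtml -icbq -wrap 100`."""
    return ["tidy", "-asxhtml", "-icbq", "-wrap", str(wrap)]


def tidy_fragment(html, wrap=100):
    """Pipe an HTML fragment through tidy and return what it prints.

    Returns None when no tidy executable can be found on PATH.
    """
    if shutil.which("tidy") is None:
        return None
    result = subprocess.run(
        tidy_command(wrap),
        input=html,
        capture_output=True,
        text=True,
        check=False,  # tidy exits non-zero for warnings/errors, which is normal on messy input
    )
    return result.stdout
```

Just as in the editor, the output still contains the extra guff Tidy adds, so expect to weed it out afterwards.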
Last but not least are the W3C's validator pages for both CSS and XHTML. After all the grunt work is over it's time to check the pages, and using the methods above I managed to come in with:
Out of 29 pages:
20 HTML errors
2 CSS errors
This took me about 30 minutes to fix!
I'm fairly lazy when it comes to validating my own site. I mean, who can be arsed making edits and then validating them all every time ;)
I know there are plenty of people who do it but I am not one of them. I normally check to make sure that it looks OK and that's about it. I am not even that concerned about how it displays in Internet Explorer (I have minimal real visitors a month and the rest is blog spam touting Viagra). This is because I use Debian almost exclusively at work and at home, and it is a major pain in the ass to check the Windows side of things.
What I have tried to do is be quite strict with myself when I am making edits to my website. What this has resulted in is:
I checked 18 pages of my website and found 5 errors (all silly) all of which were on one page and caused by character references.
For those that have used the W3C validator this is not bad going at all. I know the purists will still think this is crap and that all HTML/XHTML should validate all the time. I believe this would be great too but unfortunately some of us have a life to lead outside the webosphere.
For those that always mean to get around to validating their websites but never do then my final word on HTML Validation is this:
"If you can't validate religiously, at least edit diligently"
How can I say this? Well, it takes more skill to get it right first time than to correct it after you have been shown your mistakes!
Or at least my perception of it has been tainted by a website I am attempting to maintain that has been bolted together using Dreamweaver. Note I did not use the word "constructed" or "built". I prefer bolted because it's a mess.
Images everywhere. Every time a page was requested, over 40 images were requested from the server. This is mad on what appears to be a plain text website with no adverts. 25% of the images happened to be used as 1 pixel spacers. This is absolute madness!
MAD MAD MAD BLOODY MAD
I have started the process of building the lexicon for my search engine. It's actually surprising how slowly the list of words grows. This is partly due to me being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet) but due to hardware constraints I have limited myself to Perl's
If it doesn't match this it won't go in the lexicon. I know this is a bit harsh, but unfortunately I don't have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
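A strict lexicon of this kind can be sketched roughly as follows. This is Python rather than Perl, and the pattern is purely an assumption for illustration (lowercase ASCII letters only); it is not the pattern the search engine actually uses. `WORD_RE` and `add_to_lexicon` are made-up names.

```python
import re

# ASSUMPTION: a stand-in definition of "a word", lowercase ASCII letters only.
# The real pattern used by the search engine is not shown here.
WORD_RE = re.compile(r"[a-z]+")


def add_to_lexicon(lexicon, text):
    """Add every token of `text` that is strictly a word to `lexicon`.

    `lexicon` maps each word to a numeric id in order of first appearance.
    Tokens that do not match the pattern in full are silently dropped.
    """
    for token in text.lower().split():
        if WORD_RE.fullmatch(token):
            # setdefault leaves existing words alone, so ids stay stable.
            lexicon.setdefault(token, len(lexicon))
    return lexicon
```

With a filter like this, repeated words and anything carrying digits or punctuation add nothing, which is one reason a lexicon grows more slowly than the number of pages spidered.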
I have noticed that a few people came here to find out information on what exactly a DSO Exploit is so I put together the following. If you need more leave me a comment and I will see what I can do.
Most of you are wondering why SpyBot is reporting a DSO Exploit. First, there is a bug in SpyBot at the moment that means it will always report this error. The bug will be fixed in a newer version of SpyBot.
Don't panic, your system may be as clean as a whistle.
What is a DSO Exploit?
DSO stands for Data Source Object. A Data Source Object exploit can be very severe when you consider that your hard drive, or pretty much anything else for that matter, is a data source and can be accessed using a method called data binding. A DSO Exploit is where someone maliciously uses data binding techniques to gain access to material they are not meant to access. This was a bug in some versions of Internet Explorer, Outlook Express etc. Note I said old versions. The new versions no longer have this problem and I suggest you upgrade to these to avoid the bug.
This does not mean you have to go and buy the latest Microsoft software. Microsoft release service packs that come with the patches required to fix this problem, so get the latest service pack for your system, install it, and you will be safe from this particular bug, or at least until some smart arse finds another way to crack it.
To stop SpyBot reporting the error do the following:
Open SpyBot in advanced mode
Select: Ignore Products
On the "All Products" tab scroll to "DSO Exploit" and check it.
I am sick to death of spam! It's a pain in the ass but there does not seem to be much we can do about it.
One thing spammers do is harvest emails from the internet. This is surprisingly easy to do because people want to put their emails online and it's very easy to write a spider. This is further compounded by the requirement of some applications that you put your email online. Movable Type is one, and although the requirement can be turned off, that would mean I get more spam.
One partial solution would be to store emails as a security image. I wrote this utility to create my own email images as PNGs. If other people find it useful then I will extend its functionality. I know it is not unbreakable, as has been proven by those clever clogs at Berkeley who have managed to crack the Gimpy CAPTCHA.
There are a few other methods to do this, but the hardest ones to crack are those that use some form of CAPTCHA. Mine is almost there but has a bit to go. Any suggestions welcome.
Using a secure image is another way to make it harder for spammers to collect emails.
The image above was created by the email obfuscator tool.
The following email was not produced by me but by BobG, one of the guys who left a comment. I have displayed the image here because I don't want to allow images to be displayed in the comments, otherwise we would give the blog spammers another avenue of attack. Anyway, Bob suggested I add the facility to set a color for the background, so I suppose I will have to do this over the next few days.
Ever since I started using Debian I have meant to try and create a Debian package, for various reasons:
a) I am just a curious bugger.
b) I am just a curious bugger.
Today was my chance to have a go and see what exactly it involves, or should I say what it involves to create a very minimal package.
The reason for this is that I have written a Content Management System that uses about 20 Perl modules, some internal and some external, and rather than worry when it comes to the install or an upgrade we have decided to stick the whole thing in Subversion and then wrap each release in a Debian package. We have several sites to run this from, so the more we can automate the better, particularly if I can get the automatic testing sorted. Using Subversion and the Debian packages, the whole system should be relatively low maintenance as far as upgrades are concerned, and this is important. We don't want to scrub ourselves into the upgrade corner and find we have neither the time nor the budget to spend on an upgrade. We want it automated for us, and although it might be a pain in the arse to put in place it will pay dividends when we come to change things later.
We also have a Postgres schema and some config files that need taking care of, but as I found out today this is relatively simple using the Debian packaging tools.
I suppose I should write a simple howto about how I did it, because the Debian New Maintainers' Guide is not really the best tutorial for those wishing to package their own application for internal use. I imagine there are some other tutorials around but I didn't find them.
I have not been able to do any work on the search engine for quite a while due to commitments with maths etc, but I now have some free time so I have restarted spidering.
What I am aiming for is about 100 million pages as a base to start working on. I am probably going to implement the whole thing using Postgres because I do not have the time to write the software required to handle the storage. The pages themselves are stored as flat files on disk; it's the metadata that I will be storing in Postgres. I will let Postgres do all the nitty gritty work so that I can concentrate on the ranking and search algorithms.
I am also looking at just using plain text, i.e. stripping out the HTML completely and relying on the text and not the formatting of the document to rank each document. The reasons for this are:
a) It is much much simpler. I started writing an HTML parser in flex and believe me it's a pain in the ass.
b) Plain text is also where the information is, and it is this that I am interested in. Dealing with the formatting is not something I want to have to do. I intend to store each document raw to disk in case I change my mind later though ;)
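For what it's worth, the "throw the markup away and keep the text" step does not need a hand-written parser. The following is a rough Python sketch using the standard library's html.parser rather than flex; `TextExtractor` and `html_to_text` are names invented here, and skipping script and style content is my own assumption about what counts as "not text".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a document and ignore all markup."""

    SKIP = {"script", "style"}  # assumption: their content is not indexable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of `html` with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining on spaces keeps words in adjacent elements apart; it can also
    # split a word that is only partly wrapped in a tag.
    return " ".join(" ".join(parser.parts).split())
```

Keeping the raw document on disk as well means a smarter extractor can be swapped in later without spidering again.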
There appears to be some misunderstanding surrounding the usage of the robots.txt file.
The following is just a fraction of the stuff I have found while spidering websites.
The "noarchive" statement should be part of a meta tag; it should not be in the robots.txt file. It's not part of the standard.
I believe that "Crawl-delay", or something similar, should be in the standard, but it isn't yet.
It is implemented by a few crawlers but people insist on doing the following
The proper way should be as follows,
I know Yahoo's crawler (Slurp) adheres to the Crawl-delay directive
but here we are endorsing a non-standard method. Whether this is a good or bad thing is left up to the reader to decide. Having been hammered once by MSN's bot, I think there needs to be a delay type option in the robots.txt file.
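As an illustration of how a crawler might read the directive, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content, the 20 second value and the `delay_for` helper are invented for the example. Python's parser happens to understand Crawl-delay even though it is not in the original standard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. Crawl-delay sits inside a User-agent record,
# alongside the Disallow lines for that same agent.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 20
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def delay_for(robots_txt, agent):
    """Return the Crawl-delay (in seconds) that applies to `agent`, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)
```

With this particular parser, a Crawl-delay line that appears before any User-agent line is simply ignored, so where the directive sits in the file matters.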
Then we have the people who think that they need to authorize a spider to spider their website.
The reason for not having an Allow directive is simple. Hardly any of the internet would be indexed, because only a fraction of the websites online actually use the robots.txt file. Implementing an Allow directive would mean that websites are closed for business to the spiders by default. For instance, take the following directive
Is the spider then to assume that only that directory is available to it on the entire website? What can the spider assume about the above directive? To me it reads that only the "index_me" directory is to be indexed. What then is the point of the Disallow directive?
The Disallow directive was chosen because the internet is, for all intents and purposes, a public medium. We all opt in when we put our websites up, and then we opt out of the things we don't want indexed.
My favorites though are the following, the honest mistakes:
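The opt-out model is easy to see with a parser. A minimal Python sketch, again using urllib.robotparser; the bot name, the URLs and the `parse_robots` helper are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# A site with an empty robots.txt has opted out of nothing.
no_rules = parse_robots("")

# A site that opts one directory out and leaves the rest public.
opt_out = parse_robots("User-agent: *\nDisallow: /private/\n")
```

Here `no_rules.can_fetch("MyBot", url)` is true for any URL, while `opt_out` refuses only paths under /private/, which is the "public unless you say otherwise" behavior described above.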
I had heard of Prof. Lessig from general browsing on the internet, so I know he's got some clout with the online community, blogosphere, whatever you want to call it, but I had never really taken the time to find out what he does that seems to cause such a stir. He seems to have an almost religious following in some circles, so I thought that I should go and see just exactly what all the fuss is about.
I had heard of the Creative Commons before so off I went umbrella in hand to University College London's Edward Lewis Theatre and grabbed myself a seat. I immediately recognized him because I had visited his website before I went to the talk for a general nosey.
This is just the way I heard it. I am sure I have probably got some of the ideas and concepts wrong ;)
I loved the way he started his talk, i.e. he took us back to the days when George Eastman was setting about pioneering the camera and how a law passed then enabled the camera business to flourish the way it did. He then described a few things that we take for granted, e.g. cultural remix (first time I had heard this phrase), the act of taking something like a song and putting your own spin on it, or having watched a movie, how we describe it to our friends and embellish it the way we see it. This goes on every day and there are no copyrights on this and there shouldn't be.
He then moved this onto the digital age and casually pointed out that our cultural remix, which we take for granted every day, was now, in part, a digital phenomenon and no longer limited by distance. Kids today are growing up in this digital age and are making friends across the world without even meeting up, so our once limited cultural remix has set new boundaries on a global scale. The way we think, eat and speak and go about our business is now wrapped up online in this huge boiling menagerie of digital stuff. People are expressing themselves in ways we would not have dreamed about a few years ago, i.e. we have a new age cultural remix going on, and this is a good thing. What is not good is that we have the middle men, i.e. the lawyers, trying to stifle this. The lawyers and some corporations are doing this by making vast areas of our new remix illegal, i.e.
"Using DDT to kill a gnat"
(from memory, used by Prof Lessig in the talk, probably slightly misquoted)
was the way Prof Lessig described it, and this is wrong. It was quite clear that Prof Lessig believes in copyright and so do I, but it was also clear that he does not believe in applying it blindly. The normal blunderbuss approach to copyright seems to get his goat and quite rightly so, it's bloody stupid.
Anyway, the talk centered around the Creative Commons license and what it means to us and what we can use it for and why we need it.
At the moment everything written down is copyright to the author or creator of it, regardless of whether they have stuck the big C on it somewhere. This means that everything on the internet is copyright unless stated otherwise. People who want to use something they find on the internet, e.g. a DJ finding a sample from a song, cannot do so unless they have permission from the owner of the copyright, so they have to get lawyers (middlemen) involved to sort out the legal stuff before they can carry on with their mixing. What the Creative Commons enables us to do is release a piece of work and mark it so that people know what they can and cannot do with it without having to get the lawyers involved, i.e. cutting out the middleman. I am all for this, it's a wonderful idea.
Can I prove that it's a wonderful idea? Yes I can. During the talk Prof Lessig played part of a soundtrack that had been released under the Creative Commons license, "My Life" by Colin Mutchler, which was then edited by Cora Beth, and the editing certainly added something to the track. It was brilliant. This is not an isolated incident either.
Anyway, I have just found some of the material from the talks online, so your time would be better spent watching these Flash movies than reading this.
Some people really are missing the point when trying to use RSS to list jobs. I have noticed several sites posting the title of the job and a link but there is absolutely no description. I have seen others posting a few words of the description.
This is absolutely useless because it contains hardly any useful information for the person finding the feed, and this is assuming they find it at all. Personally I won't add a feed to UKlug unless there is a description with some helpful text.
I was using the gojobsite feeds for UKlug but for some reason the feeds seem to have gone out of date. I can only presume that the techs at gojobsite decided that it would not be worth their while to keep them.
I have sent a couple of emails to see if it would be possible to get the feed back, because out of the UK job sites gojobsite seems to have some of the better adverts and I liked using it on UKlug.
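A check along these lines is easy to automate. The sketch below is in Python and is not how UKlug actually does it; it assumes a plain RSS 2.0 feed without XML namespaces, and the `items_with_description` name and the five word threshold are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET


def items_with_description(rss_xml, min_words=5):
    """Return (title, description) pairs for items worth listing.

    An item is kept only if its <description> has at least `min_words`
    words. Items with no description, or only a couple of words, are dropped.
    """
    root = ET.fromstring(rss_xml)
    kept = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        if len(description.split()) >= min_words:
            kept.append((title, description))
    return kept
```

A feed where every item fails this test gives a reader nothing beyond a list of titles and links.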
Unfortunately I have had no reply at all from them. Shame!