Harry: October 2004 Archives
I am a fan of standards, i.e. XHTML Transitional/Strict etc. To this end I do try to make sure that I am keeping my own sites reasonably compliant. Sites I do commercially are always 100% compliant, but that's because I insist on it and the clients have placed their trust in me.
Just recently I have had to convert a really bad site to XHTML Transitional and if you had seen the markup you would have realized how big this task was. To go through it by hand would have been an enormous task and quite frankly I would have been unable to do it at the price I quoted without the following tools:
The first tool (Vim) could really be any good text editor, e.g. Emacs, ed, or any of the vi children. I just happen to use Vim; once you have learned the basics it is a joy to use and makes editing text almost an art.
TT2, the second tool, is slightly more specialized and less well known but just as easy to use, and it deserves a big mention. TT2 (the Template Toolkit) is a templating system. Most people won't really understand or even need to know what its advantages are until they have to edit a 10+ page website and someone wants to change a font on some item on all the pages. This could of course be done using server side includes or some other method, but TT not only makes this easy, it also exposes a programmatic API which makes its functionality and versatility as wide as the programmer's skills. This only scratches the surface of what TT can actually do for you.
The third tool is Dave Raggett's HTML Tidy. This one tool is what saved me from going stark raving mad this weekend. Visually selecting an area in Vim and then
'<,'>!tidy -asxhtml -icbq -wrap 100
was what kept me sane. This single command will take ANY HTML fragment and sanitize it for you. It adds a lot of guff that you may not want, but you can remove that and you are left with a sanitized version complete with CSS.
I just wanted the formatting, indenting and validation. I weeded out the CSS and I was left with a nice plain HTML document that I was then able to understand rather than some debauchery of a mess the devil would not have started with.
Using Tidy this way is a great way to get a clear place to start when converting a messy HTML page.
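The same filter can be run outside the editor. Below is a minimal sketch in Python, assuming the tidy binary is installed and on the PATH. The function names are made up for illustration, and only the flags from the Vim command above are used.

```python
import shutil
import subprocess

# Same flags as the Vim filter command: convert to XHTML, indent,
# clean, bare, quiet, and wrap lines at 100 columns.
TIDY_ARGS = ["-asxhtml", "-icbq", "-wrap", "100"]


def tidy_command(binary="tidy"):
    """Return the argument list for the tidy invocation."""
    return [binary] + TIDY_ARGS


def tidy_fragment(html):
    """Pipe an HTML fragment through tidy and return whatever it prints.

    Tidy exits with status 1 for warnings and 2 for errors, so a
    non-zero exit status is not treated as fatal here.
    """
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy is not installed or not on the PATH")
    result = subprocess.run(
        tidy_command(),
        input=html,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling tidy_fragment("<p>Hello<br>world") should hand back a complete XHTML document, guff included, much as the Vim filter does.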
Last but not least are the W3C's validator pages for both CSS and XHTML. After all the grunt work is over it's time to check the pages, and using the methods above I managed to come in with:
Out of 29 Pages:
20 HTML errors
2 CSS errors
This took me about 30 minutes to fix!
I'm fairly lazy when it comes to validating my own site. I mean, who can be arsed making edits and then validating them all every time ;)
I know there are plenty of people who do it but I am not one of them. I normally check to make sure that it looks OK and that's about it. I am not even that concerned about how it displays in Internet Explorer (I have minimal real visitors a month and the rest is blog spam touting Viagra). This is because I use Debian almost exclusively at work and at home and it is a major pain in the ass to check the Windows side of things.
What I have tried to do is be quite strict with myself when I am making edits to my website. What this has resulted in is:
I checked 18 pages of my website and found 5 errors (all silly) all of which were on one page and caused by character references.
For those that have used the W3C validator this is not bad going at all. I know the purists will still think this is crap and that all HTML/XHTML should validate all the time. I believe this would be great too but unfortunately some of us have a life to lead outside the webosphere.
For those who always mean to get around to validating their websites but never do, my final word on HTML validation is this:
"If you can't validate religiously, at least edit diligently"
How can I say this? Well, it takes more skill to get it right first time than to correct it after you have been shown your mistakes!
I bet you all thought that Star Trek and the Borg were some pipe dream. Well, not any more.
I am sure people are wondering why anyone would want to wear such a contraption, but surely we would have said that about what is now the humble mobile phone earpiece just a few years ago. It's coming:
RESISTANCE IS FUTILE!!!
YOU WILL BE ASSIMILATED!!!
and they are seeing sex in everything. Including Fruity sweetie wrappers
This made me howl. If someone had not pointed it out to me I would never have noticed it, but it would appear that the graduates from St Blasien Jesuit College, near Freiburg, are seeing sex in everything. It sounds to me like they are the perverts if they are able to see two imaginary characters on a sweetie wrapper having sex!
Sometimes some people just go too far!
Or at least my perception of it has been tainted by a website I am attempting to maintain that has been bolted together using Dreamweaver. Note I did not use the word "constructed" or "built". I prefer "bolted" because it's a mess.
Images everywhere. Every time a page was requested, over 40 images were requested from the server. This is mad on what appears to be a plain text website with no adverts. 25% of the images happened to be used as 1 pixel spacers. This is absolute madness!
MAD MAD MAD BLOODY MAD
The subject heading of this entry sounds a bit mad, doesn't it? I mean, who the hell would believe that the position of the moon could possibly affect the outcome of an election? It doesn't, but there are those characters who are basing their election decision on the design of Mr Kerry's and Mr Bush's websites. Now isn't that fscked up? For those that don't believe me, take yourself over to Slashdot and have a look around...
Does this mean we are seeing the entrance of the designer website? I can see it now:
1. Websites by Gucci
2. Menus by Prada
3. Footers by Nike
Or, as a dialogue!
"Ohhh, love your hit counter"
"Yeah! We got Armani in to do it, worth every penny!"
Who the hell could possibly be that shallow?
Wait, we have Hello magazine, Cosmopolitan (feminist trash), FHM and Eurotrash; that answers that question: the brain dead.
As far as I am aware he now prevents foreign users from visiting and viewing his website. I am not making this up. Unless you are on a North American IP range you are forbidden from viewing his website.
This is the most powerful man on the planet, who has more effect on foreign governments and their economies than some of the local governments do, yet if you ain't American (an infidel) you are not allowed to view his website.
The reason for this dumb-ass decision is apparently that his website got cracked a few times. Do they really think that banning mass IP ranges is going to stop a real cracker? Bollix. It's not hard to crack another PC inside their borders and then launch from there.
All this episode has done is make him and his administration look like people who don't care about us foreigners. But then, why should he care now when he hasn't really given a damn before?
I hate getting involved in politics but some things are just too dumb to abstain from commenting on them.
I saw the BlogShares website several months ago, before I had a blog, and wondered what it was all about. I have just noticed that I have an entry on it.
Mad. So what makes my blog worth more money then ;) ?
I have started the process of building the lexicon for my search engine. It's actually surprising how slowly the list of words grows. This is partly due to me being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet) but due to hardware constraints I have limited myself to Perl's
if it doesn't match this it won't go in the lexicon. I know this is a bit harsh but unfortunately I don't have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
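A strict word definition of this kind can be sketched as follows. The sketch is in Python rather than Perl, and the pattern [a-z]+ is purely a stand-in chosen for illustration, not the pattern used for the real lexicon.

```python
import re

# Hypothetical stand-in for the strict word definition; it is not
# the pattern used by the search engine described above.
WORD = re.compile(r"[a-z]+")


def add_to_lexicon(lexicon, text):
    """Add every token that matches WORD in full; return how many were new."""
    added = 0
    for token in text.lower().split():
        if WORD.fullmatch(token) and token not in lexicon:
            lexicon[token] = len(lexicon)  # assign the next word id
            added += 1
    return added


lexicon = {}
new_words = add_to_lexicon(lexicon, "The cat sat on the mat, 42 times (approx.)")
print(new_words, sorted(lexicon))
```

Anything carrying punctuation or digits is simply dropped, and repeated words add nothing, which goes some way to explaining why the list grows slowly.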
I got a text message today from someone and the phone number was displayed as
bob7ware or b0b7ware
Weird. I can only assume it's spam.
I have been receiving some spam from some dodgy company, a text message saying that someone close to me fancies me and to dial this number. Why does our government tolerate this type of nonsense? It's harassment on a global scale, and now that it's working its way onto the mobile phone we are not going to be able to get away from the crap.
I have noticed that a few people came here to find out information on what exactly a DSO Exploit is, so I put together the following. If you need more, leave me a comment and I will see what I can do.
Most of you are wondering why Spybot is reporting a DSO Exploit. First, there is a bug in Spybot at the moment that means it will always report this error. The bug will be fixed in a newer version of Spybot.
Don't panic, your system may be as clean as a whistle.
What is a DSO Exploit?
DSO stands for Data Source Object. A DSO exploit can be very severe when you consider that your hard drive, or pretty much anything else for that matter, is a data source that can be accessed using a method called data binding. A DSO Exploit is where someone maliciously uses data binding techniques to gain access to material they are not meant to access. This was a bug in some older versions of Internet Explorer, Outlook Express etc. Note I said older versions; the new versions no longer have this problem and I suggest you upgrade to these to avoid the bug.
This does not mean you have to go and buy the latest Microsoft software. Microsoft release service packs that come with the patches required to fix this problem, so get the latest service pack for your system and install it and you will be safe from this particular bug, or at least until some smart arse finds another way to crack it.
To stop Spybot reporting the error do the following:
Open Spybot in advanced mode
Select: Ignore Products
On the "All Products" tab scroll to "DSO Exploit" and check it.
I am sick to death of spam! It's a pain in the ass but there does not seem to be much we can do about it.
One thing spammers do is harvest emails from the internet. This is surprisingly easy to do because people want to put their emails online and it's very easy to write a spider. This is further compounded by some applications requiring that you put your email online; Movable Type is one, and although the requirement can be turned off, that would mean I get more spam.
One partial solution would be to store emails as a security image. I wrote this utility to create my own email images as PNGs. If other people find it useful then I will extend its functionality. I know it is not unbreakable, as has been proven by those clever clogs at Berkeley who have managed to crack the Gimpy CAPTCHA.
There are a few other methods to do this, but the hardest ones to crack are those that use some form of CAPTCHA. Mine is almost there but has a bit to go. Any suggestions welcome.
Using a secure image is another way to make it harder for spammers to collect emails.
The image above was created by the email obfuscator tool.
The following email image was not produced by me but by BobG, one of the guys who left a comment. I have displayed the image here because I don't want to allow images to be displayed in the comments, otherwise we would give the blog spammers another avenue of attack. Anyway, Bob suggested I add the facility to set a color for the background, so I suppose I will have to do this over the next few days.
Throwing someone out of a window is not something I often do. In fact I cannot remember the last time I threw someone out of or through a window. I imagine if I were to take part in such an event I would be able to remember it taking place. How do I know this? Well, throwing someone through a window is commonly known as defenestration; how could one forget such a term?
I wonder how many judges have been regaled with accounts of defenestration. I can see it now, some smart arse lawyer:
Well, your honor, my client did not partake in the defenestration itself; he was occupied with the carrying of the 3rd party. The defenestration took place after the 3rd party left my client's hands. My client has no knowledge of what happened to the 3rd party after the launch but can recount a crashing sound at or around the time of the event.
Ever since I started using Debian I have meant to try and create a Debian package for various reasons:
a) I am just a curious bugger.
b) I am just a curious bugger.
Today was my chance to have a go and see what exactly it involves, or should I say what it involves to create a very minimal package.
The reason for this is that I have written a Content Management System that uses about 20 Perl modules, some internal, some external, and rather than worry when it comes to the install or an upgrade we have decided to stick the whole thing in Subversion and then wrap each release in a Debian package. We have several sites to run this from, so the more we can automate the better, particularly if I can get the automatic testing sorted. Using Subversion and the Debian packages, the whole system should be relatively low maintenance as far as upgrades are concerned, and this is important. We don't want to scrub ourselves into the upgrade corner and find we have neither the time nor the budget to spend on an upgrade. We want it automated for us, and although it might be a pain in the arse to put in place it will pay dividends when we come to change things later.
We also have a Postgres schema and some config files that need taking care of, but as I found out today this is relatively simple using the Debian packaging tools.
I suppose I should write a simple howto about how I did it, because the Debian New Maintainers' Guide is not really the best tutorial for those wishing to package their own application for internal use. I imagine there are some other tutorials around but I didn't find them.
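For what it is worth, a very minimal binary package is little more than a directory tree with a DEBIAN/control file in it. The Python sketch below lays one out. The package name, version, maintainer and dependency list are all invented for illustration, and this is not necessarily the route taken with the packaging tools mentioned above.

```python
import os
import tempfile

# All field values are hypothetical placeholders.
CONTROL = """\
Package: example-cms
Version: 0.1-1
Section: web
Priority: optional
Architecture: all
Depends: perl, libdbi-perl
Maintainer: Example Maintainer <maintainer@example.com>
Description: Example content management system
 Perl modules and config files, packaged for internal use.
"""


def make_package_tree(root):
    """Create the skeleton of a binary package under root and return its path."""
    pkg = os.path.join(root, "example-cms_0.1-1")
    os.makedirs(os.path.join(pkg, "DEBIAN"))
    # Files to install go under the tree at their final locations.
    os.makedirs(os.path.join(pkg, "usr", "share", "perl5"))
    with open(os.path.join(pkg, "DEBIAN", "control"), "w") as fh:
        fh.write(CONTROL)
    return pkg


pkg_dir = make_package_tree(tempfile.mkdtemp())
print("Package tree created in", pkg_dir)
```

Running dpkg-deb --build on the resulting directory should then produce a .deb that can be installed with dpkg -i.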
I have not been able to do any work on the search engine for quite a while due to commitments with maths etc., but I now have some free time so I have started spidering again.
What I am aiming for is about 100 million pages as a base to start working on. I am probably going to implement the whole thing using Postgres because I do not have the time to write the software required to handle the storage. The files themselves are stored as flat files on disk; it's the metadata that I will be storing in Postgres. I will let Postgres do all the nitty gritty work so that I can concentrate on the ranking and search algorithms.
I am also looking at just using plain text, i.e. stripping out the HTML completely and relying on the text and not the formatting of the document to rank each document. The reasons for this are:
a) It is much much simpler. I started writing an HTML parser in flex and believe me it's a pain in the ass.
b) Plain text is also where the information is and it is this that I am interested in. Dealing with the formatting is not something I want to have to do. I intend to store each document raw to disk in case I change my mind later though ;)
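Stripping the markup need not involve a hand-written parser. Here is a rough sketch using the parser in Python's standard library (not the flex parser mentioned above). The class name is made up, and skipping the contents of script and style elements is an assumption about what counts as formatting.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a document and drop the tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of html with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


sample = "<html><style>p {color: red}</style><p>Plain <b>text</b> only</p></html>"
print(html_to_text(sample))
```

Joining the pieces with a space keeps words in neighbouring elements apart, at the cost of splitting a word that has a tag in the middle of it. That seems an acceptable trade for indexing.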
There appears to be some misunderstanding surrounding the usage of the robots.txt file.
The following is just a fraction of the stuff I have found while spidering websites.
The "noarchive" statement should be part of a meta tag; it should not be in the robots.txt file. It's not part of the standard.
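For reference, the meta tag form is a robots meta tag with "noarchive" in its content attribute. Below is a small Python sketch of how a spider might check for it. The class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gather the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            for item in content.split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def may_archive(html):
    """Return False if the page asks not to be archived."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" not in parser.directives


page = '<html><head><meta name="robots" content="index, noarchive"></head></html>'
print(may_archive(page))
```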
I believe that the following or something similar, i.e. "Crawl-delay", should be in the standard but it isn't yet.
It is implemented by a few crawlers but people insist on doing the following:
The proper way should be as follows:
I know Yahoo's crawler (Slurp) adheres to the Crawl-delay directive
but here we are endorsing a non-standard method; whether this is a good or bad thing is left up to the reader to decide. I think there needs to be a delay type option in the robots.txt file, having been hammered once by MSN's bot.
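To show how a crawler might honor the directive, here is a short sketch using the robots.txt parser from Python's standard library. The robots.txt content is made up for the example and is not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

slurp_delay = parser.crawl_delay("Slurp")     # delay given for this agent
other_delay = parser.crawl_delay("OtherBot")  # falls back to the "*" record
print(slurp_delay, other_delay)
```

A polite spider would sleep for the returned number of seconds between requests, and fall back to its own default when no delay is given.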
Then we have the people who think that they need to authorize a spider to spider their website.
The reason for not having an Allow directive is simple. Hardly any of the internet would be indexed, because only a fraction of the websites online actually use the robots.txt file. Implementing an Allow directive would mean that websites are closed for business to the spiders unless they say otherwise. For instance, take the following directive
Is the spider then to assume that only that directory is available to it on the entire website? What can the spider assume about the above directive? To me it reads that only the "index_me" directory is to be indexed. What then is the point of the Disallow directive?
The Disallow directive was chosen because the internet is to all intents and purposes a public medium, so we all opt in when we put our websites up and then opt out of the things we don't want indexed.
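The opt-out behavior is easy to see with the parser in Python's standard library. In this sketch the agent name and paths are invented, and the helper function is only for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, path, agent="ExampleSpider"):
    """Return True if the given robots.txt text lets agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# An empty robots.txt opts out of nothing, so everything may be fetched.
print(allowed("", "/anything.html"))

# One Disallow line opts a single directory out and leaves the rest open.
rules = "User-agent: *\nDisallow: /private/\n"
print(allowed(rules, "/private/notes.html"), allowed(rules, "/index.html"))
```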
My favorites though are the following, the honest mistakes:
Is over. Thank god I have that out of the way. I now have 3 months before it all starts again and my scheduling is back at the mercy of the Open University.
I am not too sure how I did in the exam. I missed a couple of small bits out and I think I really botched one of the last ten point questions. I know what I needed to do for it but the notation completely escaped me, bollix.
We will see just how I did in a few months' time. Now I need to figure out what would be the most beneficial thing to do with all this free time.
I just finished
From Here To Infinity
Author: Ian Stewart
This was tough going. It is not really suitable for people without a decent amount of maths, and I have to say that a lot of the topics went clean over my head, and I am meant to have a fair bit of maths beneath my belt, or at least more than most. I would be hard pushed to recommend it because I did not feel as if I got as much from the book as I hoped. It would not put me off reading any of his other books; I just don't think this book was my cup of tea.
I had heard of Prof. Lessig from general browsing on the internet so I know he's got some clout with the online community, blogosphere, whatever you want to call it, but I had never really taken the time to find out what he does that seems to cause such a stir. He seems to have an almost religious following in some circles so I thought that I should go and see just exactly what all the fuss is about.
I had heard of the Creative Commons before so off I went umbrella in hand to University College London's Edward Lewis Theatre and grabbed myself a seat. I immediately recognized him because I had visited his website before I went to the talk for a general nosey.
This is just the way I heard it; I am sure I have probably got some of the ideas and concepts wrong ;)
I loved the way he started his talk, i.e. he took us back to the days when George Eastman was setting about pioneering the camera and how a law passed then enabled the camera business to flourish the way it did. He then described a few things that we take for granted, i.e. cultural remix (first time I had heard this phrase): the act of taking something like a song and putting your own spin on it or, having watched a movie, how we describe it to our friends and embellish it the way we see it. This goes on every day and there are no copyrights on this and there shouldn't be.
He then moved this on to the digital age and casually pointed out that our cultural remix, which we take for granted every day, was now, in part, a digital phenomenon and no longer limited by distance. Kids today are growing up in this digital age and are making friends across the world without even meeting up, so our once limited cultural remix has set new boundaries on a global scale. The way we think, eat and speak and go about our business is now wrapped up online in this huge boiling menagerie of digital stuff. People are expressing themselves in ways we would not have dreamed about a few years ago, i.e. we have a new age cultural remix going on and this is a good thing. What is not good is that we have the middle men, i.e. the lawyers, trying to stifle this. The lawyers and some corporations are doing this by making vast areas of our new remix illegal, i.e.
"Using DDT to kill a gnat"
(from memory, used by Prof Lessig in the talk, probably slightly misquoted)
was the way Prof Lessig described it and this is wrong. It was quite clear that Prof Lessig believes in copyright and so do I, but it was also clear that he does not believe in applying it blindly. The normal blunderbuss approach to copyright seems to get his goat and quite rightly so, it's bloody stupid.
Anyway, the talk centered around the Creative Commons license and what it means to us and what we can use it for and why we need it.
At the moment everything written down is copyright to the author or creator of it regardless of whether they have stuck the big C on it somewhere. This means that everything on the internet is automatically under copyright unless stated otherwise. People who want to use something they find on the internet, e.g. a DJ finding a sample from a song, cannot unless they have permission from the owner of the copyright, so they have to get lawyers (middlemen) involved to sort out the legal stuff before they can carry on with their mixing. What the Creative Commons enables us to do is release a piece of work and mark it so that people know what they can and cannot do with it without having to get the lawyers involved, i.e. cutting out the middleman. I am all for this, it's a wonderful idea.
Can I prove that it's a wonderful idea? Yes I can. During the talk Prof Lessig played part of a soundtrack that had been released under the Creative Commons license, "My Life" by Colin Mutchler, which was then edited by Cora Beth, and the editing certainly added something to the track. It was brilliant. This is not an isolated incident either.
Anyway, I have just found some of the material from the talks online so your time would be better spent watching these Flash movies than reading this.
We broke the 2000 barrier today and we seem to have gained a new member, MarkF.
We seem to have gained a little speed just recently, which is good because we are now getting close to the more active teams, which is where I can see our progress slowing down. What we really need are a few more processors to give us that extra boost, so if you would like to join a friendly bunch of folders then
look no further
I have just watched Raising Arizona for about the 3rd time and it's still as funny as ever. All of the following are brilliant in it...
Nicolas Cage .... H.I. McDunnough
Holly Hunter .... Edwina 'Ed' McDunnough
Trey Wilson .... Nathan Arizona (Huffhines) Sr.
John Goodman .... Gale
William Forsythe .... Evelle
Sam McMurray .... Glen
Frances McDormand .... Dot
and of course.
Randall 'Tex' Cobb .... Leonard Smalls
We now have 19 RSS job feeds in the UKlug database and I want more. Ideally I would like to get as many feeds as possible into the database but without screen scraping this might be difficult. If you know of a feed that I don't already have then please email me the details.
Some people really are missing the point when trying to use RSS to list jobs. I have noticed several sites posting the title of the job and a link but there is absolutely no description. I have seen others posting a few words of the description.
This is absolutely useless because it contains hardly any useful information for the person looking for a job, and this is assuming they find it at all. Personally I won't add a feed to UKlug unless there is a description with some helpful text.
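A check of this kind could look something like the following Python sketch. The feed is made up, and the five word threshold is an arbitrary stand-in for "some helpful text", not the rule UKlug actually applies.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration.
FEED = """\
<rss version="2.0">
  <channel>
    <title>Example jobs</title>
    <item>
      <title>Perl developer</title>
      <link>http://example.com/jobs/1</link>
      <description>Perl and Postgres developer wanted for a small team in London.</description>
    </item>
    <item>
      <title>Sysadmin</title>
      <link>http://example.com/jobs/2</link>
    </item>
  </channel>
</rss>
"""

MIN_WORDS = 5  # arbitrary threshold, an assumption for this sketch


def useful_items(feed_xml, min_words=MIN_WORDS):
    """Return the titles of items whose description has enough words."""
    root = ET.fromstring(feed_xml)
    titles = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        if len(description.split()) >= min_words:
            titles.append(item.findtext("title", default=""))
    return titles


print(useful_items(FEED))
```

A feed whose items all fail the check would simply be left out.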
I was using the gojobsite feeds for UKlug but for some reason the feeds seem to have gone out of date. I can only presume that the techs at gojobsite decided that it would not be worth their while to keep them up.
I have sent a couple of emails to see if it would be possible to get the feed back, because out of the UK job sites gojobsite seems to have some of the better adverts and I liked using it on UKlug.
Unfortunately I have had no reply at all from them. Shame!