Weeding the database 12 Nov 03

You will see that the database has shrunk quite a bit. I have been running out of space, so I decided to do some weeding. What I have done is fix all the URLs that had a fragment part. URLs come in the following format.

scheme://host/path?query#fragment

The fragment part of the URL is not required by us because it only indicates a position within a document. This level of granularity is of no use to us; we are only interested in the document itself. I wrote a simple Perl script in conjunction with a Postgres function to weed these out. During the process I deleted all links that were found by following the original URL with the fragment. This is what has led to the reduction in total links found. If you have a look at the latest robot code you will see that I now cater for this fragment part and strip it off before requesting the document.
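As a rough illustration of the idea, here is a minimal sketch in Python (not the Perl script or the Postgres function mentioned above, and not the actual robot code). The function names strip_fragment and dedupe are made up for this example, and the example.com URLs are placeholders. It strips the fragment with the standard library's urldefrag and then collapses URLs that point at the same document.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#fragment' part (hypothetical helper)."""
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


def dedupe(urls):
    """Strip fragments and keep the first occurrence of each document URL."""
    seen = set()
    unique = []
    for url in urls:
        bare = strip_fragment(url)
        if bare not in seen:
            seen.add(bare)
            unique.append(bare)
    return unique


if __name__ == "__main__":
    links = [
        "http://example.com/doc.html#intro",
        "http://example.com/doc.html#section2",
        "http://example.com/other.html",
    ]
    for link in dedupe(links):
        print(link)
```

Doing the stripping before the request, as the robot now does, means two links that differ only in their fragment are fetched and stored once.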

55.0 Million links found
11.9 Million unique links found

About this Entry

This page contains a single entry by Harry published on November 12, 2003 12:25 AM.