Some numbers on harvested links

links=# select count(*) from links_found;
count
---------
4159023

links=# select count(*) from home_page;
count
--------
851938

These are the new links. Multiple pages can point to the same link, and I did not want to duplicate the work. These links still need to be checked to see whether they are valid; then they will be downloaded and parsed.

links=# select count(*) from home_page where state between 1 and 500;
count
--------
134172

I generate states for pages that do not exist, where permission is denied, and so on. I do this by running another process that requests only the headers of each file before I download the file itself. This saves a lot of time and reduces the overall number of hits on sites. This is a count of all searched links, errors included.
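The header-only check described above could be sketched roughly as follows. This is an illustration in Python using the standard urllib module, not the robot's actual code: the function name head_status, the injectable opener parameter, and the idea of returning the raw HTTP status code are all assumptions made for the example. How a status code maps onto the state column in home_page is not shown here.

```python
import urllib.error
import urllib.request


def head_status(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request and return the HTTP status code.

    Returns None when the server cannot be reached at all.
    The name and return convention are hypothetical.
    """
    # A HEAD request asks for the headers only, so no page body is transferred.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error such as 403 or 404.
        return err.code
    except urllib.error.URLError:
        # DNS failure, refused connection, and similar network errors.
        return None
```

A caller could record the returned code for each link and only fetch the full page when the code indicates success, which is what keeps the number of full downloads down.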

links=# \i sbin/intersect_report.sql
count
-------
58088

This shows how many pages exist and have been successfully
downloaded and parsed.

links=# select count(*) from home_page where state = 500;
count
-------
62379

These are confirmed good links to be searched next.

I have been concentrating on Maths for the past few days, so the robots are taking a back seat.

About this Entry

This page contains a single entry by Harry published on September 20, 2003 11:49 PM.
