Some numbers on harvested links
links=# select count(*) from links_found;
count
---------
4159023
links=# select count(*) from home_page;
count
--------
851938
These are the new links. Several pages can point to the same link, and I did not want to duplicate the work. These pages need to be checked to see whether they are valid links; then they will be downloaded and parsed.
links=# select count(*) from home_page where state between 1 and 500;
count
--------
134172
I assign states to pages that do not exist, where permission is denied, and so on. I do this by running another process that requests only the headers of each file before I download the file itself. Doing it this way saves a lot of time and reduces the overall number of hits on sites. This is a count of searched links, errors and all.
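For anyone curious what a header-only check looks like, here is a minimal sketch in Python using only the standard library. It is not the code my robots run; the function names head_status and is_downloadable are made up for illustration, and the rule that only 2xx responses count as downloadable is an assumption, not the state numbering used in home_page.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Send a HEAD request and return the HTTP status code.

    Returns None when the URL is malformed or the server cannot be
    reached. Only headers are transferred, never the page body.
    """
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx answers (not found, permission denied, ...) land here.
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def is_downloadable(status):
    """Assumed rule: only 2xx responses are worth a full download."""
    return status is not None and 200 <= status < 300
```

A caller would record the result of head_status(url) against the link and only fetch the full page when is_downloadable returns True. Some servers do not answer HEAD requests properly, so a real robot may want to fall back to a normal request in those cases.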
links=# \i sbin/intersect_report.sql
count
-------
58088
This shows how many pages exist and have been successfully
downloaded and parsed.
links=# select count(*) from home_page where state = 500;
count
-------
62379
These are confirmed good links to be searched next.
I have been concentrating on Maths for the past few days, so the robots are taking a back seat.





