October 2003 Archives
The last couple of days have been spent sorting out some of the perl and C++. I have also expanded the stop list quite a bit. The Perl script that I was using to produce the file to build the term document matrix also got a bit of a working over.
I have increased the document list to 15700 which is still relatively small for an internet search engine but it is now a respectable amount of text to search for a small intranet site, like a small law firm. I will gradually increase this as I go along and testing to see what kind of results I get.
I have decided to write up what I have done with example code and put it on another few pages. Hopefully someone will be able to make some use of it.
Please see my:
Vector Space Search Engine
page for more details of what I have done to get this working, I am using the term working in its weakest sense here since I have been unable to test it properly yet.
83.1 Million links found
10.9 Million unique links found
I have spent the last few days trying to get the Vector Space Search engine running. The code is in a bit of a mess at the moment but, it's comming along. All I can say is thank god for the STL, without it I would been in for a hell of a job. I have now managed to create a Sparse Vector Space Matrix from 2397 documents. This needs to be increased before I can really start testing any weighting algorithms.
At the moment this is using 30Mb of memory. This is the max used during the entire process. I did have it running at 256Mb but this was my first round at designing the matrix. I then showed my program a copy of Knuth volume 3, it cowered in fear and its shoesize quickly dropped to a more respectable size. I am pretty sure that I could drop this even further by writing my own data structure without using the STL but I am happy with it at the moment.
I am not entirely happy with the output of the program yet becasue the inner product routine is not producing the correct output but this should be reletively easy to fix. I really need to do a review to make sure that I am not missing any points in my methodology.
I also need to try and compile a better stop list. The one I am using is not particularly good. This is a sure fire way of reducing the RAM footprint.
83.1 Million links found
10.9 Million unique links found
Well I managed to get to play with DOS 5 today to get the PC working, don't ask. I then had trouble trying to get the data off the old drive that was in the original PC because it insisted that 'it' was the operating system and that it was going to boot no matter where on the IDE channels you put it. It also seemed to ignore various jumper settings which was very odd. A quick format /mbr taught it who the boss was and we had no more trouble from the cheap seats.
After not having much success with www.realemail.co.uk due to spurious line disconnections we eventually got online with uklinux which was an ISP I used to use when Jason Clifford ran it. Unfortunately he is no longer concerned with it for various reasons, but he has started www.ukfsn.org which I can highly recommend.
I then downloaded ZoneAlarm and AVG for them. Unfortunately I forgot my Video cable or they could have been using a 19" Iiyama Vision Master instead of the old 14" jurassic valve driven thing they had been using previously.
35.876 Million links found
5.476 Million unique links found
After all the shuffling that I did two days ago to get the 20Gb disk out of the machine, I now need to check all the Postgres files to make sure that I have them in all the correct places and restart the robots to get the count up a little. Moving Postgres File
35.876 Million links found
5.476 Million unique links found
After much hunting for a library that I can use to impliment an LSI search engine I have had little luck. The library that seems to be the job for this is called SVDPACK it is written in Fortran and has been ported to C++. However, it has not been ported to the humble x86 architecture. It looks like I will have to run with writing the Vector Space search engine instead.
I managed to write a C++ routine to get the output of the Perl Term Document parser. This is a very simple parser that splits all the words in the document on whitespace. I know that there are reasons for not doing it like this so if I get time I will come up with a better method later but for now it will do.
My next task now is to take the input of the C++ program and create a Term Document Matrix from that that I can manipulate easily. I need to be able to carry out the following actions and quite a few more.
1. Count all occourances of each word in the entire document set.
2. Calculate mean values for each word.
3. Come up with some method to ran words in the document matrix. This is to avoid the typical abuses that you see where website saturate pages with keywords to try and manipulate the results of a search engine.
67.6 Million links found
8.529 Million unique links found
I have started on the Semantic Search Engine. I have downloaded 250Mb of pages to test with. I then constructed a partial (test) word list from this. The word list has the term frequency occourance and originating doc id. The words have all been stemmed to reduce overhead, I used the Lingua::Stem module for this. I will create a full word list tomorrow if I get time. I also need to find a decent library in C++ becasue I don't fancy writing my own Singular Value Decomposition library (if you know what sort of maths would be involved in doing this you also know that I am not at that level, yet! ;-). I also think Perl may be a bit slow for what I am trying to do although I am always willing to give it a try and see what happens.
67.6 Million links found
8.529 Million unique links found
I just noticed a few sites that offer link checking on a commercial basis. Does anyone know if there is any money in this because it is a surprisingly easy thing to do. In one month I managed to check the links on over 500,000 pages and produce error codes for every link. This is from a very humble machine and across the internet. If I was on an intranet with a 100Mb connection onto the webserver I could trawl an entire site at quiet times in no time at all. Admittedly my tool is not fully automated yet but one place was charging $7500 to do a 50,000 page website. There must be more to it than what I am doing ;-)
Latent Semantic Search Engines
After my exams I am going to play with building a search engine. I want to do this purely in an academic capacity. For those search engine guru's out there please keep the following in mind. I only started reading about this stuff in the last few days so there are probably several gaping holes in my descriptions below. I will correct anything anyone sends to me.
I know someone has already built a vector space engine in Perl and there has been an article wrote about it but I would like look at LSI and how to go about building a serach engine using it. I also know about not re-inventing the wheel but I learn by doing. I will probably use some of the code from the Perl article where I do I will mention this and any changes I make.
Basically:
1. Reduce a document using a stop list. I might go ethical here ;-)
2. Removing all words that have no semantic meaning.
3. Removing words that appear in every document.
4. Removing words that appear in one document only.
The LSI model stores document representations in a matrix called a "Term Document Matrix". The LSI model does not care where the word appears in the docuement. I personally think that this is a good indicator of how relevant multiple words in a search string are to each other. After the TDM has been made, and a result set found further ranking can be given to documents based on their location with respect to each other in the document.
LSI uses a term space that has many thousands of dimension (as many as you have documents). THis term space can be reduced using a mathematical method called Singular Value Decomposition (SVD) to collapse this high dimensional space into more manageable numbers and in the process the documents with semantically similar meanings are are further grouped together.
I really need to read more Knuth to see if I can find some pointers on how best to go about building a structure that can be easily manipulated etc. For development purposes I will probably do most of the spidering and preprocessing using Perl, its striing manipulation is second to none. When it comes to matrix manipulation and some heavy computational bits that I am expectin to see I am not sure what kind of performance hit I might get with using Perl so I will use C++ for that part of it if I can find a library for it, if not I might try a different method.
Anyway, enough idle rambling I am off to revise some Maths.
52.18 Million links found
7.17 Million unique links found
I am in the process of reducing the total size of the database and increasing performance a bit in the process. As it stands the links_found table is holding duplicate copies of the home_page table. This was necessary because I wanted to be able to see what links belong to what web pages at the start for testing purposes. I no longer need to do this. What I am going to change is the format that they are stored in. I am going to make the following changes to the link_found table.
FORM
links_found
(
parent_url_id int4,
found_url varchar(2000)
);
TO
child_links
(
parent_url_id int4,
found_url int4
);
I have written a postgres function to carry out the migration. As you can see beneath the space savings are fantastic. I should also see an improvement in my indexes on this table. I am trying very hard to postpone buying any extra hardware. It will make it a bit more awkward to use the data now but this is not my main priority at the moment. When I have lots of disk space I can the create temporary tables for any manipulations as I require them.
You can see below that I have created a new table called child_links. I have converted all the URL's into url_id's which are "int4" types. I have also created indexes on this table. I can now remove all relations relating to the links_found table.
links# select relname, relfilenode, relpages from pg_class order by relpages limit 10;
relname | relfilenode | relpages
-------------------+-------------+----------
links_found | 188163825 | 644114
links_found_pkey | 246168688 | 588185
lf_found_url_idx | 246168682 | 559585
child_links | 246168690 | 255817
child_links_pkey | 246168692 | 216185
home_page | 188163817 | 118338
parent_url_id_idx | 299992353 | 116508
child_url_id_idx | 299992259 | 116231
home_page_pkey | 246168684 | 103223
home_page_url_key | 246168686 | 100120
hp_url_id_index | 246168683 | 15857
hp_url_id_idx | 301324542 | 15660
(12 rows)
Before I remove any of the relations I wanted to see my actual disk savings.
File system 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3125 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 1466 6551 19% /links/pg_xlog
/dev/hdb1 9628 5523 3616 61% /links/tables
/dev/hdb2 9605 7044 2073 78% /links/temp
/dev/sda5 17364 13812 2684 84% /links/database
So that you can see what I had done to the files and where they where all pointing here is a listing of the links database directory for all the big relations.
-rw------- 1 postgres postgres 8.0k Oct 12 03:54 188163815
lrwxrwxrwx 1 postgres postgres 33 Oct 10 23:17 188163817 -> /links/pg_xlog/postgres/188163817
lrwxrwxrwx 1 postgres postgres 30 Oct 10 22:35 188163825 -> /links/temp/postgres/188163825
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:37 188163825.1 -> /links/temp/postgres/188163825.1
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:38 188163825.2 -> /links/temp/postgres/188163825.2
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:39 188163825.3 -> /links/temp/postgres/188163825.3
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:41 188163825.4 -> /links/temp/postgres/188163825.4
lrwxrwxrwx 1 postgres postgres 32 Oct 10 23:30 246168682 -> /links/tables/postgres/246168682
lrwxrwxrwx 1 postgres postgres 34 Oct 10 23:44 246168682.1 -> /links/tables/postgres/246168682.1
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.2 -> /links/tables/postgres/246168682.2
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.3 -> /links/tables/postgres/246168682.3
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:42 246168682.4 -> /links/tables/postgres/246168682.4
-rw------- 1 postgres postgres 132M Oct 12 03:54 246168683
-rw------- 1 postgres postgres 855M Oct 12 03:55 246168684
-rw------- 1 postgres postgres 871M Oct 12 03:55 246168686
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688
-rw------- 1 postgres postgres 1.0G Oct 11 01:38 246168688.1
-rw------- 1 postgres postgres 1.0G Oct 11 01:54 246168688.2
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688.3
-rw------- 1 postgres postgres 499M Oct 11 01:59 246168688.4
-rw------- 1 postgres postgres 1.0G Oct 11 14:32 246168690
-rw------- 1 postgres postgres 1.0G Oct 12 00:42 246168690.1
-rw------- 1 postgres postgres 52M Oct 12 03:55 246168690.2
-rw------- 1 postgres postgres 1.0G Oct 12 03:52 246168692
-rw------- 1 postgres postgres 750M Oct 12 03:55 246168692.1
-rw------- 1 postgres postgres 1005M Oct 12 03:55 299992259
-rw------- 1 postgres postgres 995M Oct 12 03:55 299992353
-rw------- 1 postgres postgres 130M Oct 12 03:55 301324542
After droping the relations and deleting all trace of them I had the following results. You may notice that the filenames are completely different. I had a major problem on the SCSI disk again. I had to recreate the database using initdb because some of my pg_clog files went missing. To allow me to complete a vacuum I copied the last pg_clog file to the file that was missing. I am pretty sure that this is very dangerous but it allowed me to complete the vacuum on the table. I was hoping to see some more errors but I got none. I completely dropped the database reformatted the hard disk and created a new file system on it. I had originally used the "largefile4" option of mke2fs but I have now left it at default.
]# df -m
File system 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3124 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 2369 5648 30% /links/pg_xlog
/dev/hdb1 9628 2034 7105 23% /links/tables
/dev/hdb2 9605 2007 7110 23% /links/temp
/dev/sda5 17093 7513 8712 47% /links/database
Interesting bits from my base directory
-rw------- 1 postgres postgres 1.0G Oct 12 2003 16992
-rw------- 1 postgres postgres 927M Oct 12 2003 16992.1
-rw------- 1 postgres postgres 121M Oct 12 2003 58021849
-rw------- 1 postgres postgres 783M Oct 12 2003 58021850
-rw------- 1 postgres postgres 752M Oct 12 2003 58021852
-rw------- 1 postgres postgres 1.0G Oct 12 2003 58021854
-rw------- 1 postgres postgres 68M Oct 12 2003 58021854.1
-rw------- 1 postgres postgres 872M Oct 12 2003 58021856
-rw------- 1 postgres postgres 872M Oct 12 2003 58021857
We can see that we have dropped the total size of the database considerably.
links=# select relname, relfilenode, relpages from pg_class order by relpages desc limit 9;
relname | relfilenode | relpages
--------------------------------+-------------+----------
child_links | 16992 | 249791
child_links_pkey | 58021854 | 139723
home_page | 16982 | 116267
parent_url_id_idx | 58021856 | 111577
child_url_id_idx | 58021857 | 111577
home_page_pkey | 58021850 | 100253
home_page_url_key | 58021852 | 96209
hp_url_id_index | 58021849 | 15434
pg_proc_proname_args_nsp_index | 16640 | 125
50 Million links found
6.7 Million unique links found
Another day, another few million links. Since moving some of the data around and recreating the database there is definitely an increase in performace. I vacuum the database regularly because of the amount of updates that take place.
I am off into London on Saturday in search of some more hardware, I never intend to use www.scan.co.uk again. I am going to have a trawl around the computer fairs to see what I can find. I would really like to get a dual chip motherboard, and run a couple of the new Opterons on it. I will have to see what I can afford first then decide on what to do. At the moment it is not really processing power that is limiting me it's the I/O on the system. I have currently got 1Gb of RAM installed which is the most my motherboard can handle. The disks I am using are not really the quickest in the world either so I need to get some decent 80 Conductor cables for the IDE disks. If there was more RAM in the PC and a few more disks to move some of the database files onto the Athlon XP1700 would start to suffer. I have been looking at the MSI and Tyan motherboards with onboard SATA. They are expensive but they would be the perfect choice for what I am doing. I really wish I had more room, I could then build some smaller PC's to run more robots.
I am going to start rethinking the layout of the tables. For instance at the moment I am storing duplicate links in the links_found table that are already in the home_page table. These links vary in size from fairly small to massive so I think that an integer value taken from the home_page(url_id) column would be a more efficeint use of space. I am also think that, because the CPU is being under utilised,I should seperate the downloading of the pages from the parsing. This would mean I could get more efficient use of all the resources currently open to me.
44.8 Million links found
6.44 Million unique links found
After the amount of moving around that I did yesterday it is hard to believe that nothing went wrong. Lots of things went wrong but I will not bore you with them they are trivial to what I did today.
I started moving the "/usr/local/pgsql/data/base/links_database/files" around to make room in various places and managed to corrupt one of them. You can imagine my panic. I quickly put everything back to where it was and started Postgres and checked to see if the file I had corrupted was an index or a table. If it had been a index I could have just dropped it and recreated it but, it was the links_found table. However because I make regular backups of everything this gave me the chance to test them out.
Just recently I made some comments that I was going to use gzip, rather than bzip2, to do my backups, due to the amount of time that bzip2 takes to do anything. Trust my luck the gzip backup that I had taken recently was corrupted giving all sort of end of file errors etc. I need to investigate this because I cannot afford the time, to use bzip2 for my backups.
I used an old backup file from a few days ago. This is how I recreated the database for anyone that is interested.
]# psql template1
template1=# drop database links;
template1=# \q
]# createdb links
]# cat /links_database_01_10_03_15\:12\:57.sql.bz2 | bunzip2 | psql links
It looks like a very old backup from 7 days ago, but since the robots have not been running for a few days during that time so there was little data loss. The whole operation took a couple of hours. I also tried something a bit dangerous, when the database was halfway through re-creating itself I checked to see which data files had been created and had reached the max limit of 1Gb. I then moved the oldest of these to another file system and created a soft link to it. I know that there is probably an easier way do it than this, but because the database is bigger than any of my file systems I needed a quick and dirty method to free file space to avoid running out of room. If anyone knows a better way to do this I would like to know what it is, you know how to contact me.
Just when I thought everything was ok, I got the following error.
links=# select now (), count(*) from links_found;
ERROR: cannot read block 477160 of links_found: Input/output error
links=# select relname, relfilenode, relpages from pg_class order by
relname | relfilenode | relpages
--------------------------------+-------------+----------
links_found | 71509890 | 456987
links_found_pkey | 112850939 | 418056
lf_found_url_idx | 112850933 | 397954
home_page | 71509882 | 90268
home_page_pkey | 112850935 | 77280
home_page_url_key | 112850937 | 74141
hp_url_id_index | 112850934 | 11990
pg_proc_proname_args_nsp_index | 16640 | 125
pg_proc | 1255 | 58
pg_depend | 16598 | 20
I knew this could mean I have a problem on my file system, I was having visions of one of my disks now being completely screwed. I found out what file system it was on using the above filename and then did the following.
[root@harry 71509876]# /etc/init.d/postgres stop
Stopping PostgreSQL: ok
[root@harry 71509876]# cd
[root@harry root]# umount /dev/sda5
[root@harry root]# e2fsck -c /dev/sda5
e2fsck 1.26 (3-Feb-2002)
Checking for bad blocks (read-only test): done
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 766: 206483
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
File /data/base/71509876/71509890.3 (inode #766, mod time Wed Oct 8 13:16:55 2003)
has 1 duplicate block(s), shared with 1 file(s):
[root@harry 71509876]#
We can see straight away that there is a problem on my links_found table again. To fix this I ran e2fsck using the "-f" option and chose the defaults when asked questions. I ran it again to make sure that the defaults where not causing any trouble, and the database is now back in business.
40 Million links found
6 Million unique links found
I had another bad day today. I need to return all the hardware to www.scan.co.uk, which means moving various bits and bobs around and restructuring my file system to make as much space as possible. The following was my file system before I started playing with it.
harry]#df -m
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 961 2777 26% /
/dev/hdc1 30 9 20 30% /boot
/dev/sda1 3937 3006 731 81% /home
none 505 0 504 0% /dev/shm
/dev/hdc7 1969 33 1836 2% /tmp
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 3938 3108 630 84% /opt/links/hdc5
/dev/hdb1 9628 6955 2184 77% /opt/oracle
/dev/hdb2 9605 4884 4233 54% /opt/oracle/oradata/ide12
/dev/sda2 3937 1019 2719 28% /opt/oracle/oradata/scsi02
/dev/sda5 9365 8625 272 97% /opt/oracle/oradata/scsi03
/dev/hde1 156319 33 148470 1% /opt/oracle/oradata/hde1
After a bit of moving around and shifting of file systems I managed to get it into the following order. It is not ideal yet, but it gets the robots started again.
[root@cpc3-lutn1-6-0-cust26 root]# df -m
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 2567 1170 69% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 33 7985 1% /links/pg_xlog
/dev/hdb1 9628 4222 4917 47% /links/tables
/dev/hdb2 9605 752 8365 9% /links/temp
/dev/sda5 17364 10500 5996 64% /links/database
I now have a few extra Gb for the database that I did not have before. Something that was causing me some concern was what to do if I ever had to restore a backup. I had to make sure that I had a single contiguous file system that would contain the whole database. This is the reason I made the SCSI disk into a single file system.
35.876 Million links found
5.476 Million unique links found
I have had an absolute nightmare of a day trying to get various bits of hardware to operate together. After further investigation we quickly came to the conclusion that www.scan.co.uk 's RAM had shagged the PC. I knew that as the current law stands that I have not got a leg to stand on, or at least I would have if I had the money to chase them for the next several months proving that there hardware was to blame.
Before anyone says that I am not really qualified to make such a general statement about their hardware I should probably state that I was "PACE" qualified for 3 years in a row and spent 5 years as a satellite engineer so I know my way around a circuit board better than most.
For reference I have also tried to get the Highpoint Card running better. It was giving me some really slow speeds when using hdparm. So I downloaded the driver from the website, installed it, and for a few minutes it tripled my hdparm speeds then it froze the PC. I tried it twice before giving up.
BOLLOCKS BOLLOCKS BOLLOCKS BOLLOCKS BOLLOCKS BOLLOCKS
Watford Electronics only do 512Mb sticks so I am limiting my upgrade options in the future, but I need the RAM so they got the sale. I put the two 512Mb into the machine and it worked a treat. Now all I need to do is compile a kernel that has support for the Highpoint HPTxxx Controllers and edit the postgres.conf file and shmmax settings so that we can take advantage of the new RAM.
Unfortunately compiling a kernel for this beast is not really that straight forward. I needed to get the latest up2date packages and keys etc from redhat then I needed to download the source rpm and install that. I do not have an old .config file for this machine so I had to use one of the ones provided in the source install and customise it for my requirements. I was a bit surprised when it worked first time. Needless to say I have left out a lot of the details, because it was a very frustrating exercise.
I managed to move the 20Gb disk out of the PC for J's dad. This involved moving several of the Postgres files around and onto disks that are not really ideal places to have them. I will sort this out before I start the robots again in a few days time. So I am off to build a PC . They need the PC because they need to research cycling across Canada on a tandem.
35.876 Million links found
5.476 Million unique links found
I am going to build a PC for J's parents from old bits that I have in storage or should I say bits that I have no room for here. I am taking the 512Mb PC133 from this machine and adding the 1Gb stick that arrived today along with the Maxtor 160 SATA drive and Highpoint Controller 1542-4. I am also giving him one of my 20Gb drives because he's only got one 1.7Gb and a 127Mb drive in his old PC.
Needless to say the hardware that arrived from www.scan.co.uk today was a bit dodgy. The RAM completely trashed one of our PC's and we are now unable to get it running again. I now need to get more RAM from Watford Electronics to replace the piece of crap www.scan.co.uk sent. Unfortunately Watford Electronics are not open until tomorrow so its back to the old config for a little bit more spidering.
35.876 Million links found
5.476 Million unique links found





