Search engine restarted

Well, I have managed to get my PC back in action and now have a decent-sized disk (SATA, 160GB) installed, so I started the robots again. They have been running for a couple of days now, and so far I have collected 100,000 pages and dumped them on disk. This is on a 600k connection, and it's not running all the time. My target for testing is 1 million pages, so I should have these by the end of May; then the robots will be tamed a bit.

Once the pages are downloaded, I need to figure out how I'm going to represent the documents on disk. There are various methods for this, but I intend to emulate an already popular search engine ;-) or at least do it the way they started and figure it out as I go along.

I intend to use C++ to do all the document parsing etc. This choice was made simply because I have not got time to roll my own binary trees etc. or to learn a new library. I am fairly familiar with the STL, so I will work with it.
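To give a feel for what STL-only parsing might look like, here is a minimal sketch of a word splitter. The function name `tokenize` and the rule that anything other than a letter or digit ends a word are assumptions for illustration, not the parser actually used here, and it is written in present-day C++ rather than what a 2004 compiler would accept.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split raw text into lowercase words.
// Any character that is not a letter or digit ends the current word.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            // Lowercase so "Hello" and "hello" count as the same word.
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    // Flush the last word if the text did not end on a separator.
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

Calling `tokenize("Hello, world!")` gives a vector holding `hello` and `world`. Real pages would need markup stripped first, which this sketch does not attempt.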


2 Comments

owen said:

my motto "the faster you can code it the sooner you find the bugs"

Harry said:

How true. I have been writing the word parser using the standard C++ map to hold the unique word list. I then bumped up the file list to 5000+ and noticed a major decrease in performance, which was quickly fixed using ext/hash_map.
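For anyone wanting to try the same thing today, a rough sketch of a hashed unique word list is below. It assumes `std::unordered_map`, the standard container that later took over the role of the GNU `ext/hash_map` extension, and the function name `count_words` is made up for the example. A `std::map` keeps its keys sorted in a balanced tree, so each lookup costs a number of string comparisons that grows with the logarithm of the word count, while a hash table does roughly constant work per lookup on average.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: build the unique word list with an occurrence count per word.
// std::unordered_map stands in for ext/hash_map here (an assumption, not the original code).
std::unordered_map<std::string, std::size_t>
count_words(const std::vector<std::string>& words) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const std::string& w : words) {
        // operator[] inserts the word with a count of 0 the first time it is seen.
        ++counts[w];
    }
    return counts;
}
```

The number of unique words is then just `counts.size()`. Swapping the return type back to `std::map` is a one-line change, which makes it easy to time both on the same file list.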

Onwards and upwards.


About this Entry

This page contains a single entry by Harry published on April 30, 2004 12:07 PM.
