Lexicon

I have started the process of building the lexicon for my search engine. Its actually surprising how slow the list of words increases. This is partly due to me being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet) but due to hardware constraints I have limited myself to Perl's

m/\w/

if it doesn't match this it won't go in the lexicon. I know this is a bit harsh but unfortunately I don't have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.

Add to delicious Digg This Add to My Yahoo! Add to Google Add to StumbleUpon
| | Comments (0)

Leave a comment

About this Entry

This page contains a single entry by Harry published on October 21, 2004 9:05 PM.

New species of Tiger found! Honest! was the previous entry in this blog.

Blog Shares is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Powered by Movable Type 4.01