Search engines

I am still searching for a suitable search engine that can index my local collection of howtos, API docs, tutorials, magazine archives and so on. Here you can read about my experiences.

  1. My test data
    I extracted the Linux Howtos (HTML version) into a directory and made it accessible via Apache. That directory contains about 470 HTML files and uses 34 MB of disk space.
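
    For anyone repeating the comparison with a different collection, the two figures above (file count and disk usage) can be reproduced with a few lines of Python. This is only a minimal sketch: the function name summarize_tree is made up for illustration, and it sums file sizes rather than allocated disk blocks, so the result may differ slightly from what du reports.

```python
import os


def summarize_tree(root):
    """Return (number of HTML files, total size in bytes of all files) below root."""
    html_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Every file counts towards the size, only HTML files towards the count.
            total_bytes += os.path.getsize(path)
            if name.lower().endswith((".html", ".htm")):
                html_count += 1
    return html_count, total_bytes
```

    Calling summarize_tree on the directory that holds the extracted howtos should give numbers comparable to the ones quoted above.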

  2. Lucene
    Lucene is a fairly bare indexing engine: you have to feed it the strings and words to index yourself. That means crawling is your job. I implemented a basic crawler and used an HTML parser to extract links.
    • Index size: 30.5 MB, too large. There was no stopword skipping or any other logic to reduce the amount of data.
    • Indexing speed: a bit slow, with no optimization for speed at all.
    • Finding results: complete, good.
    • Finding speed: good. However, I did not try to index large data trees.
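
    The link extraction step of such a crawler can be sketched as follows. This is not the crawler I used with Lucene, just an illustration in Python of the same idea using the standard library HTML parser; the names LinkExtractor and extract_links are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop
                # "#fragment" parts so each page is queued only once.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html_text, base_url):
    """Return all link targets found in html_text, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

    A crawler would fetch a page, pass its text and URL to extract_links, and queue every returned URL it has not visited yet. Feeding the page text to the indexer is a separate step.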

  3. Nutch
    Nutch is a more complete search application. I tried the tutorial, and crawling was done pretty fast. Some difficulties with user rights arose when using the Tomcat installation on Gentoo. The documentation is vague about how to make the crawled database available to the running Tomcat installation. After I solved that (by linking the directory crawl.test/segments to /opt/tomcat5/segments), the search was operational.
    • Indexing speed: a bit slow, uses an internal delay to reduce system load.
    • Index size: 7 MB, okay.
    • Finding results: incomplete. Some keywords did not show up at all, and some other keywords returned only one page although they exist on two pages. Disappointing.
    • Finding speed: good.
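
    To judge whether a result list is complete, it helps to have a ground truth to compare against. A brute-force scan of the test data can provide one. The following is a minimal sketch, assuming the files are readable as text; files_containing is a made-up name, and because it matches against the raw HTML it will also find keywords that occur only inside tags or attributes.

```python
import os
import re


def files_containing(root, keyword):
    """Return the sorted paths of all HTML files below root that contain keyword."""
    # Whole-word, case-insensitive match; assumes keyword starts and ends
    # with a letter, digit or underscore so that \b behaves as expected.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                if pattern.search(handle.read()):
                    hits.append(path)
    return sorted(hits)
```

    If the scan lists two files for a keyword and the search engine returns only one, the index is incomplete for that keyword.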

  4. Sphider
    The next candidate is a lightweight PHP engine: Sphider. Database setup:
      mysql -u root -p
      create database sphider_db;
      GRANT SELECT,INSERT,UPDATE,DELETE,CREATE,DROP ON sphider_db.* TO sphideruser@localhost IDENTIFIED BY 'sphiderpw';
      quit;
      mysql -u sphideruser -p
    I had to re-enable cookies to log in to the local admin page.
    • Indexing speed: the Sphider PHP script, executed from Apache, took more than 70 minutes to digest the test data. Too slow.
    • Index size: 50 MB inside MySQL. Too large.
    • Finding results: okay. Search was possible during indexing.
    • Finding speed: good for single words. For several words: between 1 and 2 seconds.

  5. TSEP
    TSEP is a PHP search engine that is configurable through a web interface. Installation was easy, but I did not succeed in using it. When I told the index manager form to spider my local test data at "http://scratch.neh/howtos", it did not find any files. I then tried the setting "find files per directory search". With that, TSEP finds files, but the wrong ones: it tried to index the local file system recursively, beginning with /dev/. Forcing HTTP reads showed the same behaviour.
    No luck.

  6. PHPDIG
    todo

  7. swish-e
    todo