Sunday, November 29, 2009

Desktop search engines compared

Intro

I have a large electronic library (over 15,000 books) and I was looking for a way to cope with this mass of information. I didn't like the idea of a special catalog, since it would take a lot of manual work to enter the metadata. Besides, my books are in various formats, from HTML to RTF, to DOC, to PDF, to DjVu. These files lack metadata way too often, so I thought a local indexing service with full-text search might solve my problems. I knew there were more options to choose from than just Google, but I could not find a good, up-to-date comparison, so I had to compare them myself. Even the table in Wikinfo's Comparison of desktop search software contained too many errors, as I found.

My task made some requirements especially important and others irrelevant. I was mainly interested in a wide range of supported file types, in the ability to add new ones (EPUB, fb2, html.zip) and in an extensive query language. All software, except for Google Desktop Search (GDS) and DocFetcher, was installed from the Ubuntu 9.10 repositories.

I have no special preferences regarding the backend: it may be a Xapian- or Lucene-based tool, or even a custom backend. On the other hand, Xapian usually requires more disk space, and there is never too much space on a desktop.

Beagle

http://beagle-project.org

The list of supported file types is quite large. It includes typical office files, source code, LaTeX source, images, audio and video files, RPM and DEB packages, e-mail from Evolution, Thunderbird and Kmail, IM and IRC logs, RSS feeds and many more (see here: http://beagle-project.org/Supported_Filetypes), and you are free to extend it. I could add new file types by editing one file, /etc/beagle/external-filters.xml.

The indexing process can run in two modes: CPU-lenient and CPU-intensive (enabled with the BEAGLE_EXERCISE_THE_DOG environment variable). The search engine is based on Lucene.Net. I have no idea why the developers chose this exotic platform to implement Beagle, but Beagle works, and it works well.

Beagle understands limited (very limited, actually) regexps (*). You can search for phrases, exclude words (-word), use the OR operator, specify dates when the file was created (on, before, after and between!), limit the search to a file type and name the directory that contains the files. Unfortunately, you cannot name a directory tree, that is, a directory under which Beagle should search recursively.

You can even use the metadata of audio and image files, as in the examples from the manual:

artist:Beatles ext:mp3 OR ext:ogg -album:"Abbey Road"
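To make the structure of such a query more visible, here is a minimal Python sketch that splits it into included terms, excluded terms and OR operators. It only illustrates the syntax described above and is not Beagle's own parser; the function name split_query is made up for this example.

```python
import shlex


def split_query(query):
    """Split a Beagle-style query into (kind, prefix, value) tuples.

    A toy illustration of the syntax, not Beagle's actual parser.
    """
    parts = []
    # shlex keeps a quoted phrase together and drops the quotes
    for token in shlex.split(query):
        if token == "OR":
            parts.append(("or", None, None))
            continue
        kind = "include"
        if token.startswith("-"):
            kind = "exclude"
            token = token[1:]
        # "prefix:value" restricts the term to one property
        prefix, sep, value = token.partition(":")
        if sep:
            parts.append((kind, prefix, value))
        else:
            parts.append((kind, None, token))
    return parts


for part in split_query('artist:Beatles ext:mp3 OR ext:ogg -album:"Abbey Road"'):
    print(part)
```

The last tuple printed is the exclusion of the phrase "Abbey Road" in the album property; everything before it is a positive condition or the OR operator.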

You can also search in mail attachments, or search by music genre, mailing list, IM correspondent and much, MUCH more.

Beagle tends to create huge log files in ~/.beagle/Logs.

Beagle has a web interface. It's very easy to start using it, but not so easy to make use of it, since the alleged links to the results are not exactly links.

The Beagle website includes information on the query syntax and on extending Beagle, but finding it is next to impossible unless you use Google.

The query syntax is described here: http://beagle-project.org/Searching_Data

The index for a 45 GB home partition was only about 700 MB.

Google Desktop Search

http://desktop.google.com/linux/

Supports OpenOffice and MS Office files, PDF, HTML, TXT, audio and image files and email from Thunderbird. Strangely enough, it does not index zipped archives.

I could not add new file types, not even plain text with a different extension. I was pretty sure that GDS supports stemming, but not regexps. To my surprise, stemming did not work in GDS. Nor did regexps. It does not even support AND and OR keywords.

Otherwise the query syntax is acceptable. You can name the directory where the file you are looking for is located, or a directory under which the file is supposed to be. You can search for phrases or exclude words.

I used GDS for some years, and it works great as long as you use it the way Google intended. While suitable for an average office cubicle, it was next to useless for my purposes.

The index size was about 1.7 GB for 50 GB of data.

Recoll

http://www.lesbonscomptes.com/recoll/

A large number of file types is supported natively, including plain text, HTML, maildir and mailbox files, OpenOffice, MS Office 2007, Abiword, LyX, Kword and Scribus files and GAIM logs. Many more are supported with external helpers: DOC, XLS, PDF, DjVu, MP3, image files and so on. Feel free to add to the list, it's easy. One file establishes associations between file extensions and MIME types, another one specifies how the data is extracted from a file of a certain MIME type, and the third one defines the applications used to open each MIME type.
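The division of labour between the three files can be pictured as three lookup tables chained together. The Python sketch below is only a model of that idea: the dictionary names, the sample entries and the helper commands in them are assumptions made for illustration, not the contents or the syntax of Recoll's real configuration files.

```python
import os

# 1) extension -> MIME type (sample entries, assumed for illustration)
EXT_TO_MIME = {
    ".pdf": "application/pdf",
    ".rtf": "text/rtf",
    ".fb2": "text/x-fictionbook",  # hypothetical entry for a new format
}

# 2) MIME type -> external command that dumps the text (placeholders)
MIME_TO_EXTRACTOR = {
    "application/pdf": ["pdftotext", "{path}", "-"],
    "text/rtf": ["unrtf", "--text", "{path}"],
    "text/x-fictionbook": ["my-fb2-to-text", "{path}"],  # made-up helper
}

# 3) MIME type -> application used to open a search hit (placeholders)
MIME_TO_VIEWER = {
    "application/pdf": "evince",
    "text/rtf": "oowriter",
    "text/x-fictionbook": "fbreader",
}


def plan_for(path):
    """Return (mime, extractor command, viewer) for a file, or None."""
    ext = os.path.splitext(path)[1].lower()
    mime = EXT_TO_MIME.get(ext)
    if mime is None:
        return None  # this toy model simply gives up on unknown extensions
    command = [arg.format(path=path) for arg in MIME_TO_EXTRACTOR[mime]]
    return mime, command, MIME_TO_VIEWER[mime]


print(plan_for("/library/fiction/adventure/white_fang.fb2"))
```

In this model, adding a new format means adding one entry to each table, which is essentially what editing the three files amounts to.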

Recoll is built around the Xapian engine.

I had the impression that indexing takes much longer with Recoll than with the other tools. When indexing RTF with unrtf, Recoll created a heap of WMF files in my home directory. Recoll has no indexing daemon that would run in the background all the time. Instead, recollindex is to be launched from time to time (with cron, for example).

The manual mentions stemming support, but also points out that it is done the other way round: stems are not stored in the database, as in other indexing engines, but the query is stemmed instead. Unfortunately, my version gave different results when searching for the plural 'notebooks' and the singular 'notebook', so I assume stemming does not work in my installation of Recoll. Recoll understands regexps pretty well, which to a certain degree compensates for the problems with stemming.
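The difference between the two approaches is easier to see in code. In the sketch below the index keeps the words exactly as they appear in the documents, and the query term is stemmed and then matched against every indexed word with the same stem. The stemmer is a deliberately naive one that only strips a trailing "s"; it and all the names here are assumptions for the sake of the example, not Recoll's or Xapian's code.

```python
def naive_stem(word):
    """Toy stemmer: lowercase and strip one trailing 's' (not a real algorithm)."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def build_index(docs):
    """Map each unstemmed word to the set of documents that contain it."""
    index = {}
    for name, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index


def search(index, term):
    """Stem the query term and match every indexed word sharing that stem."""
    stem = naive_stem(term)
    hits = set()
    for word, names in index.items():
        if naive_stem(word) == stem:
            hits |= names
    return hits


docs = {
    "a.txt": "my old notebook",
    "b.txt": "two new notebooks",
}
index = build_index(docs)
# With working query-side stemming both forms find both files.
print(sorted(search(index, "notebook")))
print(sorted(search(index, "notebooks")))
```

Both searches return the same two files here, which is the behaviour I expected from Recoll and did not get.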

Recoll has a rich query language, modeled after the Xesam End User Search language. As with Beagle, you can use the dir: prefix to limit the search path to one directory, but you cannot specify a directory tree. Alas! Other useful prefixes include title, author, ext (for file type), etc.

The search client, recoll, is a GUI program, but with the -t option it runs in text mode. It means that instead of specifying a directory tree, I can just grep the results for a string, like this:

recoll -t -q \"jack london\"|grep /library/fiction/adventure

Note that for the command line client, you have to escape quotation marks to denote a phrase search.
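The same filtering can be done in a script when a plain substring match is too loose. The following Python sketch assumes that each result line printed by recoll -t contains the document location as a file:// URL in square brackets; check this against the output of your own version. The function name and the sample lines are made up for the example.

```python
import re
from urllib.parse import unquote, urlparse

# Assumed format: the hit location appears as [file://...] in the line
URL_IN_BRACKETS = re.compile(r"\[(file://[^\]]+)\]")


def hits_under(lines, root):
    """Return paths from result lines that are located under the tree `root`."""
    root = root.rstrip("/") + "/"
    paths = []
    for line in lines:
        match = URL_IN_BRACKETS.search(line)
        if not match:
            continue  # header or summary line without a file URL
        path = unquote(urlparse(match.group(1)).path)
        if path.startswith(root):
            paths.append(path)
    return paths


# Sample lines in the assumed format (not real output)
sample = [
    "2 results",
    "application/pdf\t[file:///library/fiction/adventure/call_of_the_wild.pdf]\t[The Call of the Wild]\t123456\tbytes\t",
    "text/html\t[file:///library/essays/london.html]\t[Essays]\t2345\tbytes\t",
]
print(hits_under(sample, "/library/fiction/adventure"))
```

Unlike grep, this compares the start of the decoded path, so a hit whose title or path merely contains the string somewhere else is not reported.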

Recoll, unlike some other tools, has a decent user manual, containing information on query syntax and adding support for new file types.

The index size put a damper on my enthusiasm. For a 50 GB home directory it was more than 5 GB.

Strigi

http://strigi.sourceforge.net/

Strigi supports regular expressions. Theoretically, Strigi should support plain text files, PDF, DEB and RPM packages, OpenOffice documents and zipped files. Besides, Strigi was the only program that successfully indexed EPUB files without customization, interpreting them as just plain ZIP-archives with HTML, NCX, etc. inside.

There's little I can say about this program. The daemon kept crashing when I tested it so I could not even finish building the index for my home directory. The client erroneously classified a lot of hits as being "email".

The incomplete (?) index size was about 750 MB.

Tracker

http://projects.gnome.org/tracker

Tracker is part of the GNOME project and it tries to adhere to various useless technologies, like D-Bus. Tracker introduces the concept of file tags, thus overcomplicating the task of file management. I admit that the notion of file tags might be reasonable, but only if it is supported universally, if tags are freely backed up, copied, etc. Now, fortunately, the tags are not obligatory for Tracker.

The full list of supported file types is unavailable, but the website talks about image, audio, video, text files, source code, applications, playlists, IM conversations and so on. No email, bookmarks or contacts as yet, though. The indexing daemon would segfault occasionally and I could not finish indexing.

As a matter of fact, Tracker was designed as a metadata search tool (and its full name is MetaTracker), but the normal use case is just full-text search. Tracker was written to work well even on machines with 128 or 256 MB of RAM. Judging by the slowness of indexing, this statement could be true. I was wrong about Recoll: the slowest indexer was not Recoll but Tracker.

I could not find a good user manual.

DocFetcher

http://docfetcher.sourceforge.net/en/index.html

Supported file types: HTML, plain text, PDF, Microsoft Office (doc, xls, ppt), Microsoft Office 2007 (docx, xlsx, pptx), OpenOffice.org Writer, Calc, Draw and Impress, RTF, AbiWord (abw, abw.gz, zabw), CHM, Visio, SVG.

Written in Java. Fast and CPU-sparing indexing. DocFetcher comes in two flavors: a binary installable package and a "portable" version, which you can run right from your home directory.

DocFetcher supports regular expressions (at least * and ?). Phrase search, AND and OR keywords, search in content or in metadata: author and title fields are supported. Does not index zipped files. It is easy to add new filename extensions that are treated as yet another text file or HTML, but I could not add a new file type which is to be treated in a special way. For me this means that I cannot process custom XML to convert the content to the proper charset. It's a problem.

An interesting query feature is boosting terms: "You can assign custom weights to words, thus increasing or decreasing the level of matching for a particular document if the weighted word occurs in it. This allows you to influence the relevance sorting of the result page. Example: dog^4 cat will bring up the documents with "dog" in it on the top of the result page."
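A toy model shows why the boosted word wins. The Python sketch below scores a document as the sum of weight times number of occurrences over all query words. Real Lucene-style scoring is considerably more involved, so treat this only as an illustration of the idea; the function names are invented for the example.

```python
def parse_boosts(query):
    """Turn 'dog^4 cat' into {'dog': 4.0, 'cat': 1.0}."""
    weights = {}
    for token in query.split():
        word, sep, boost = token.partition("^")
        # A word without an explicit boost gets the neutral weight 1.0
        weights[word.lower()] = float(boost) if sep and boost else 1.0
    return weights


def score(text, weights):
    """Sum of weight * occurrences over all query words (toy scoring)."""
    words = text.lower().split()
    return sum(weight * words.count(word) for word, weight in weights.items())


def rank(docs, query):
    """Return document names sorted by descending toy score."""
    weights = parse_boosts(query)
    return sorted(docs, key=lambda name: score(docs[name], weights), reverse=True)


docs = {
    "cats.txt": "cat cat cat",
    "dogs.txt": "dog and a bone",
}
print(rank(docs, "dog cat"))    # unboosted: cats.txt comes first
print(rank(docs, "dog^4 cat"))  # boosted: dogs.txt comes first
```

Without the boost, three occurrences of "cat" outweigh one "dog"; with dog^4, the single "dog" counts as four and its document moves to the top.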

The manual can be found in the downloaded archive, but it is very brief.

Pinot

http://pinot.berlios.de/

Like Tracker and Strigi, Pinot is built for D-Bus. It uses the same Xapian engine as Recoll, so I could use Pinot's text-mode client to query the database built by the Recoll indexer. Pinot can use other databases, but I was not interested in this option. The crawler takes a huge share of RAM and CPU. It ate up 70% of the RAM on my PC, causing some other programs to crash, so I had to leave it running overnight to complete indexing.

The documentation consists of one Readme file and a couple of web-pages. Quoting these web-pages:

"The following document types are supported internally :

  • plain text
  • HTML
  • XML
  • mbox, including attachments and embedded documents
  • MP3, Ogg Vorbis, FLAC
  • JPEG
  • common archive formats (tar, Z, gz, bzip2, deb)
  • ISO 9660 images

The following document types are supported through external programs :

  • PDF (pdftotext required)
  • RTF (unrtf required)
  • OpenDocument/StarOffice files (unzip required)
  • MS Word (antiword required)
  • PowerPoint (catppt required)
  • Excel (xls2csv required)
  • DVI (catdvi required)
  • DjVu (djvutext required)
  • RPM (rpm required)"

Indeed, new file types are defined in the file external-filters.xml, which is very similar (but not identical, the Pinot developers warn) to the file with the same name used by Beagle.

I have to say that these external programs made indexing of PDF, RTF and other files a difficult task. Indexing a PDF document took up to two minutes.

Conclusion

Recoll and Pinot may be considered good alternatives to Beagle, but the size of the Xapian index database leaves just one choice for me, Beagle.
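The sizes reported in the sections above can be put side by side as a share of the indexed data. The short Python calculation below uses only those figures and assumes 1 GB = 1000 MB. Recoll's index was "more than" 5 GB, so its ratio is a lower bound, and the interrupted Strigi and Tracker runs are left out.

```python
# Index size and data size in MB, as reported in the sections above
# (assumption: 1 GB = 1000 MB).
MEASUREMENTS = {
    "Beagle": (700, 45_000),
    "Google Desktop Search": (1_700, 50_000),
    "Recoll": (5_000, 50_000),  # "more than 5 GB", so a lower bound
}


def index_ratio(index_mb, data_mb):
    """Index size as a fraction of the indexed data."""
    return index_mb / data_mb


for name, (index_mb, data_mb) in MEASUREMENTS.items():
    print(f"{name}: {index_ratio(index_mb, data_mb):.1%}")
```

That works out to roughly 1.6% for Beagle, 3.4% for GDS and at least 10% for Recoll.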
