Lucene Logo

Find It With Lucene

by Cameron Harr


    There are currently many text search engines available on the Web or simply written by your average Joe programmer. However, a  problem exists in that there hasn’t been a robust, extensible, and open source option available, and thus instead of building a better “wheel,” it is constantly reinvented. Not only is this inefficient, but it prevents a more universally accepted project which can progress with the input of brilliant programmers worldwide. But don’t fret too much for there is now a solution: Lucene. From the home page of the Lucene project, “Lucene is a high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.( Stevens 2002 )”

    As mentioned, Lucene is fast, written in Java, and is open-source. Because of these properties, the project can easily and quickly be adopted by developers and integrated into their own projects and applications. Despite it’s speed, Lucene is also very easy to work with. After setting the correct classpaths (which are provided in the documentation) and running a single command (also provided), I had indexed about 20 files in 884 milliseconds. These indices were immediately available for searching, instructions of which are provided and which is equally fast.

    Before we get too far ahead of ourselves, what exactly is Lucene and, in general, a text-based search engine? We spoke in the previous paragraph of indexing and searching; what does that mean? A text-based search engine like Lucene is basically a program that will take a given set of files and parse the contents, sticking the words into an index which can be searched quickly at a later date. With Lucene specifically, there is much more that can be done. As with most  architectures, Lucene uses a hashing table system to store the indexes, providing a speedy lookup. There is also a Java class they have created and utilize called Analyzer which acts as a filter. "An Analyzer is a class (or class instance) that controls how the content of the document is broken into 'terms' (words) during the indexing. For example, an analyzer can break the text 'The ill dogs' into the tokens 'the', 'ill, dogs' (lower case) while another analyzer can break it into 'ill', 'dog' (lower case, plural normalized to singular and the common word 'the' removed'). (Stevens 2002) " A developer is free to create new analyzers, but there are several available for use. For instance the following  is a stopword analyzer written by Joanne Proton:

public class MyAnalyzer  extends Analyzer
{


  /*
   * An array containing some common words that
   * are not usually useful for searching.
   */
  private static final String[] STOP_WORDS =
  {
     "a"       , "and"     , "are"     , "as"      ,
     "at"      , "be"      , "but"     , "by"      ,
     "for"     , "if"      , "in"      , "into"    ,
     "is"      , "it"      , "no"      , "not"     ,
     "of"      , "on"      , "or"      , "s"       ,
     "such"    , "t"       , "that"    , "the"     ,
     "their"   , "then"    , "there"   , "these"   ,
     "they"    , "this"    , "to"      , "was"     ,
     "will"    ,
     "with"
  };

  /*
   * Stop table
   */
  final static private Hashtable stopTable = StopFilter.makeStopTable(STOP_WORDS);

  /*
   * Create a token stream for this analyzer.
   */
  public final TokenStream tokenStream(final Reader reader)
  {
    TokenStream result = new StandardTokenizer(reader);

    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopTable);
    result = new PorterStemFilter(result);

    return result;
  }

    This stopword analyzer has a simple and very useful function: it compiles a list of common or otherwise unwanted words and excludes these from the index. Another nifty analyzer is the   PorterStemmer analyzer. This interesting algorithm will parse the root word out of a given word. For instance, it will match the term  "search" to both "searching" and "searched." As the need arises, a developer can code his/her own analyzer to enhance the searching power of Lucene.

Lucene also supports complex search queries, as any respectable search engine should. The following is the syntax of a query and is taken from the  web page :

 Query       ::=  Clause  ( [ Conjunction ] Clause ) *

 Conjunction ::= 'AND' | 'OR' | '||'

 Clause      ::=  [ Modifier ] [ FieldName ':' ] BasicClause  [ Boost ]

 Modifier    ::= '-' | '+' | '!' | 'NOT'

 BasicClause ::= ( Term | Phrase | | PrefixQuery '(' Query ')'

 PrefixQuery ::= Term '*'

 Phrase      ::= '"' Term * '"'

 Boost       ::= '^' DecimalDigit+ '.' DecimalDigit+

 Term        ::= <a-word-or-token-to-match>

 FieldName   ::= <name-of-an-indexed-field>
                       

Notes:

Examples:

  +happy +(dog or cat) 
  "happy dog"
  (foo OR bar) AND (baz OR boo)
  ((a OR b) AND NOT c) OR d)
  +(apple "steve jobs") -(banana orange pear)
  +title:(dog OR cat) -author:"bob dobbs"
  +(luc*) -(ball)


    Along with this query flexibility, there are other functionalities such as weighting and scoring certain terms. The original author of Lucene, Doug Cutting, coded in an algorithm that will assign scores to hits:

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t
where:   
  score_d   : score for document d   
  sum_t     : sum for all terms t   
  tf_q      : the square root of the frequency of t in the query 
  tf_d      : the square root of the frequency of t in d   
  idf_t     : log(numDocs/docFreq_t+1) + 1.0   
  numDocs   : number of documents in index   
  docFreq_t : number of documents containing t   
  norm_q    : sqrt(sum_t((tf_q*idf_t)^2)) 
  norm_d_t  : square root of number of tokens in d in the same field as t  

    By now you must be thinking, "Wow, this is pretty nice, but nowadays, to be cool, it has to be web-enabled. I just don't have need for a bunch of text in a legacy program." Well, my friend, you must have a little faith that the creators were thinking, because sure enough, they have a spiffy web interface. For instance, check out my instantiation of it  here . This is a simple, customizable interface which will search through a large index of terms. I will now go through the steps I took to set it up to show how easy it was.