I know, in my previous entry I talked about why you should use Symfony plugins instead of reinventing the wheel. But the sfLucene plugin, by design, does not cover our requirements, and Lucene itself is the wheel not to reinvent, so all that was left for me was a small integration. Just a few lines of code.

That took me basically the whole afternoon today, so let's briefly wrap it up:

  • Lucene is an open source search engine.
  • Zend created a PHP implementation called Zend_Search_Lucene (ZSL).
  • There are some good blog entries describing the integration:
    • Dave Dash provided the initial tutorial, based on an older ZSL version.
    • Peter van Garderen uses Dave's tutorial and adds some comments for newer versions.
    • Johannes Schmidt (blog in German) gave me the final hints to get UTF-8 working.

I took the latest ZSL 1.0.1 and just the search files. In 1.0.1 the ones I dropped into my_app/lib are:

Zend/Loader.php
Zend/Exception.php
Zend/Search/*

The next problem was autoloading, as the Zend.php file was no longer there. I wanted a wrapper class anyway, one that encapsulates loading the index and running the finds. So I created my own ZendSearchLucene.class.php in my_app/lib:

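<?php
require_once('Zend/Loader.php');
Zend_Loader::loadClass('Zend_Search_Lucene');
Zend_Loader::loadClass('Zend_Search_Exception');

// Parse queries as UTF-8 and analyze text with the UTF-8 aware analyzer.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding("UTF-8");
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num(),
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num());

class ZendSearchLucene {
  const VERSION = '1.0.1';

  // Directory of the index for one area, e.g. <project>/data/search/blog.
  // This is a method because PHP versions before 5.6 do not accept an
  // expression as the value of a class constant.
  private static function indexPath($area){
    return SF_ROOT_DIR.DIRECTORY_SEPARATOR.'data'.
      DIRECTORY_SEPARATOR.'search'.DIRECTORY_SEPARATOR.$area;
  }

  public static function addOrUpdate($doc,$id,$area){
    $index = null;
    try {
      $index = Zend_Search_Lucene::open(self::indexPath($area));
    } catch (Zend_Search_Lucene_Exception $e) {
      // There is no "create if not exists", so create the index on failure.
      $index = Zend_Search_Lucene::create(self::indexPath($area));
    }
    // Delete any older version of the document, matched by its 'myid' field.
    $term = new Zend_Search_Lucene_Index_Term($id, 'myid');
    $query = new Zend_Search_Lucene_Search_Query_Term($term);
    $hits = $index->find($query);
    foreach ($hits as $hit) {
      // $hit->id is the internal Lucene document number, not our own id.
      $index->delete($hit->id);
    }
    $index->addDocument($doc);
  }

  public static function find($query,$area){
    $index = Zend_Search_Lucene::open(self::indexPath($area));
    // The fields are indexed in lower case, so lower-case the query as well.
    return $index->find(mb_strtolower($query,"UTF-8"));
  }
}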

It looks complicated, but that is basically all I needed to do. You will find ideas from all three tutorials in that code, so let's give them credit:

  • The autoloading at the beginning is my own creation.
  • The UTF-8 solution is very fragile. It never worked as described on other pages. The only thing that made it work was setting both analyzers to UTF-8 while not setting UTF-8 anywhere else (so no UTF-8 parameter in the field creation). Thank you, Johannes, for the ideas. The UTF-8 mess took most of my time today.
  • Most of the addOrUpdate code comes from Dave, but I modified it. As Peter pointed out, the API has changed and now offers static open and create methods. Unfortunately there is no create-if-not-exists option, so I try to open the index and create it when that throws an exception.
  • I also made the index variable per area (let's say a forum and a blog), so that a search stays inside its area.
  • Peter's instructions and some comments helped me resolve the issue with the id column. Giving it its own alias (myid in the code above) was enough, so I wouldn't go so far as to recommend never naming a DB column ID.

And that is already most of what needs to be done. It just takes some time to sort it all out. Final credit goes again to Dave for his talk about Zend_Search_Lucene, which inspired me to use it instead of MySQL LIKE '%xyz%' queries.

I hope this collection and my amendments are of some help :-)

PS: If you are a Java guy, it is very interesting to see how much effort PHP guys put into namespacing. Zend_Search_Lucene is a prime example of that. They even invented their own “classloader”, which maps the underscores to directories :)
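
To make the underscore mapping concrete, here is a minimal sketch of the naming convention. The function classNameToPath is a hypothetical helper of mine for illustration, not Zend_Loader's actual code. It only shows how a class name translates into a file path relative to the include path.

```php
<?php
// Hypothetical helper: illustrates the naming convention only.
// Every underscore in the class name becomes a directory separator.
function classNameToPath($className)
{
    // '/' is used instead of DIRECTORY_SEPARATOR to keep the output
    // identical on every platform; PHP accepts '/' on Windows as well.
    return str_replace('_', '/', $className) . '.php';
}

echo classNameToPath('Zend_Search_Lucene'), "\n"; // Zend/Search/Lucene.php
echo classNameToPath('Zend_Loader'), "\n";        // Zend/Loader.php
```

With the Zend directory dropped into my_app/lib as described above, Zend_Search_Lucene would therefore be expected at my_app/lib/Zend/Search/Lucene.php.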

Update2:
The UTF-8 handling was not working correctly, but now I have it. I had not noticed that calling strtolower on a UTF-8 string can corrupt it. There are some cases where it might work, but to be safe, always use mb_strtolower. My field generation now looks like this:

$titleField = Zend_Search_Lucene_Field::Text('title', mb_strtolower($this->getTitle(),"UTF-8"),"UTF-8");
$titleField->boost = 1.5;
$doc->addField($titleField);

So without this it worked okay on my dev environment but not on the test system (most likely because the PHP locale, and thus the conversion magic, was different). I also updated the code above.
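
Here is a small sketch of the difference, assuming the mbstring extension is installed and the script file itself is saved as UTF-8. The sample title is made up. strtolower works byte by byte, so what it does to the umlauts depends on the PHP version and the locale, while mb_strtolower with an explicit encoding treats them as characters.

```php
<?php
// Assumed sample data, not taken from the real application.
$title = "ÜBER GRÖSSE";

// Multibyte-aware lowercasing with an explicit encoding.
$lower = mb_strtolower($title, 'UTF-8');
echo $lower, "\n"; // über grösse

// The result is still a valid UTF-8 string.
var_dump(mb_check_encoding($lower, 'UTF-8'));

// Byte-wise lowercasing: the multibyte characters are either left in
// upper case or have their bytes mangled, depending on the environment.
$byteWise = strtolower($title);
var_dump($byteWise === $lower);
```

Whatever strtolower does to the umlauts on a given system, the ASCII letters come out the same in both variants, which is probably why the problem is easy to miss with English-only test data.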