#define private public - TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)

TWiki, KinoSearch and Office 2007 documents (20 Jul 2009)

Both at work and on this site, I use TWiki as my wiki engine of choice. TWiki has managed to attract a fair share of plugin and add-on writers, resulting in wonderful tools like an add-on which integrates KinoSearch, a Perl library on top of the Lucene search engine.

This month, I installed the add-on at work. It turns out that in its current state, it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx, i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since these days, most new Office documents tend to be provided in those formats.

The KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor.

Fortunately, the Apache POI project just released a version of their libraries which now also support OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor! Here's the trivial Java driver code:

package de.clausbrod.openxmlextractor;

import java.io.File;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            return;
        }

        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}

Full Java 1.6 binaries are attached; Apache POI license details apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.

With this tool in place, all we need to do is provide a stringifier plugin to the add-on. This is done by adding a file called OpenXML.pm to the lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins directory in the TWiki server installation:

# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details, published at 
# http://www.gnu.org/copyleft/gpl.html

package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
  "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
  my ($self, $file) = @_;
  my $tmp_file = tmpnam();

  my $text;
  my $cmd = 
    "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$file' > $tmp_file";
  if (0 == system($cmd)) {
    $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
  }

  unlink($tmp_file);
  return $text;  # undef signals failure to caller
}
1;

This script assumes that the openxmlextractor.jar helper is located at /www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to reflect your local settings.

I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.

To verify local installation, change into /www/twiki/kinosearch/bin (this is where my TWiki installation is, YMMV) and run the extractor on a test file:

  ./ks_test stringify foobla.docx

And in a final step, enable index generation for Office documents by adding .docx, .pptx and .xlsx to the Main.TWikiPreferences topic:

   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx

Revision: r1.2 - 20 Jul 2009 - 10:13 - ClausBrod

Blog > DefinePrivatePublic20090720TWikiKinoSearch