Both at work and on this site, I use TWiki as my wiki engine of
choice. TWiki has managed to attract a fair share of plugin and add-on writers,
resulting in wonderful tools like an add-on which integrates KinoSearch,
a Perl search library loosely based on the Lucene search engine.
This month, I installed the add-on at work. It turns out that in its current state,
it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx,
i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since
these days, most new Office documents tend to be provided in those formats.
The KinoSearch add-on doesn't try to parse (non-trivial) documents
on its own, but rather relies on external helper programs which extract
indexable text from documents. So the task at hand is to write such
a text extractor.
Fortunately, the Apache POI project just released
a version of their libraries which now also support OpenXML formats, and
with those libraries, it's a piece of cake to build a simple text extractor!
Here's the trivial Java driver code:
package de.clausbrod.openxmlextractor;

import java.io.File;
import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    // Let POI pick the matching extractor for the file type and return its text.
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    // Print the extracted text of every file named on the command line to stdout.
    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            return;
        }
        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}
Full Java 1.6 binaries are attached; Apache POI license details apply.
Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.
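A side note on robustness: the stringifier further down only uses the extractor's output when the command exits with status 0. The driver above already exits with a non-zero status when POI throws, but it returns 0 when no filename is given, and it stops at the first broken file. Here is a minimal sketch of how the loop could report failures via its return value instead. ExitStatusDriver, TextExtractor and run are names I made up for this illustration, and the extractor interface is only a stand-in so that the sketch compiles without the POI jars.

```java
import java.io.File;
import java.io.PrintStream;

class ExitStatusDriver {
    // Hypothetical stand-in for the POI call, so the sketch has no POI dependency.
    interface TextExtractor {
        String extract(File f) throws Exception;
    }

    // Returns the status to exit with: 0 if every file was extracted, 1 otherwise.
    static int run(String[] args, TextExtractor extractor,
                   PrintStream out, PrintStream err) {
        if (args.length == 0) {
            err.println("ERROR: No filename specified.");
            return 1;
        }
        int status = 0;
        for (String filename : args) {
            try {
                out.println(extractor.extract(new File(filename)));
            } catch (Exception e) {
                // Report the failure, remember it, and continue with the next file.
                err.println("ERROR: Cannot extract " + filename + ": " + e);
                status = 1;
            }
        }
        return status;
    }
}
```

In a real main method, this would be called as System.exit(run(args, realExtractor, System.out, System.err)), with realExtractor wrapping ExtractorFactory.createExtractor(f).getText().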
With this tool in place, all we need to do is provide a stringifier plugin to
the add-on. This is done by adding a file called OpenXML.pm to the
lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyPlugins
directory in the TWiki server installation:
# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details, published at
# http://www.gnu.org/copyleft/gpl.html
package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

# Register this stringifier for the three OpenXML MIME types and extensions.
__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
    my ($self, $file) = @_;
    my $tmp_file = tmpnam();
    my $text;

    # Escape single quotes so the file name survives the shell quoting below.
    (my $quoted = $file) =~ s/'/'\\''/g;
    my $cmd =
        "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$quoted' > $tmp_file";

    # Run the extractor; on success, hand its output file back to the generic stringifier.
    if (0 == system($cmd)) {
        $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
    }
    unlink($tmp_file);
    return $text; # undef signals failure to caller
}

1;
This script assumes that the openxmlextractor.jar helper is located at
/www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to
reflect your local settings.
I haven't figured out yet how to correctly deal with encodings in the stringifier
code, so non-ASCII characters might not work as expected.
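One half of the problem can at least be narrowed down on the Java side: System.out encodes text using an encoding derived from the platform settings, which may or may not be what the stringifier expects. The sketch below forces UTF-8 on output. This is an assumption on my part, not a verified fix: I don't know which encoding the add-on assumes when it reads the temporary file, and Utf8Output and utf8 are made-up names.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {
    // Wraps a raw byte stream in a PrintStream that always encodes as UTF-8,
    // independent of the JVM's default encoding.
    static PrintStream utf8(OutputStream raw) throws UnsupportedEncodingException {
        return new PrintStream(raw, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        PrintStream out = utf8(System.out);
        // In the real driver, this string would come from extractor.getText().
        out.println("Gr\u00fc\u00dfe aus dem Wiki");
    }
}
```

In the driver, the only change would be to print through the wrapped stream instead of calling System.out.println directly. The Perl side would then still have to treat the temporary file as UTF-8.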
To verify local installation, change into /www/twiki/kinosearch/bin (this is
where my TWiki installation is, YMMV) and run the extractor on a test file:
./ks_test stringify foobla.docx
And in a final step, enable index generation for Office documents by adding
.docx, .pptx and .xlsx to the KINOSEARCHINDEXEXTENSIONS setting in the
Main.TWikiPreferences topic:
   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx