This month, I installed the add-on at work. It turns out that in its current state, it does not support Office 2007 document types yet, such as .docx, .pptx and .xlsx, i.e. the so-called "Office OpenXML" formats. That's a pity, of course, since these days, most new Office documents tend to be provided in those formats.
The KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor.
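To get a feeling for what such an extractor has to do: a .docx file is a ZIP archive whose main text lives in the entry word/document.xml, inside w:t elements grouped into w:p paragraphs. The following is only a rough sketch using nothing but the JDK. The class name DocxTextSketch and the helper buildSample are made up for illustration, and the sketch ignores headers, footnotes, nested paragraphs, .pptx and .xlsx entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class, not part of the add-on or of Apache POI.
public class DocxTextSketch {
    // Namespace of the WordprocessingML elements (w:p, w:t, ...).
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the paragraph text of word/document.xml, or null if that entry is missing. */
    public static String extractText(InputStream docx) throws Exception {
        ZipInputStream zip = new ZipInputStream(docx);
        try {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return textOf(readAll(zip));
                }
            }
            return null; // no main document part found
        } finally {
            zip.close();
        }
    }

    // Copies the current ZIP entry into memory so the XML parser cannot close the archive.
    private static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Concatenates the w:t runs of each w:p paragraph, one paragraph per line.
    private static String textOf(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < runs.getLength(); j++) {
                out.append(runs.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    /** Builds an in-memory ZIP holding only word/document.xml (not a complete .docx). */
    public static byte[] buildSample(String documentXml) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(buf);
        zip.putNextEntry(new ZipEntry("word/document.xml"));
        zip.write(documentXml.getBytes("UTF-8"));
        zip.closeEntry();
        zip.close();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t>Hello world</w:t></w:r></w:p>"
            + "</w:body></w:document>";
        System.out.print(extractText(new ByteArrayInputStream(buildSample(xml))));
    }
}
```

Doing this properly for all three formats quickly gets tedious, which is a good reason to prefer an existing library over hand-rolled parsing.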
Fortunately, the Apache POI project just released a version of their libraries which now also support OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor! Here's the trivial Java driver code:
package de.clausbrod.openxmlextractor;

import java.io.File;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;

public class Main {
    // Let POI pick a suitable extractor for the file and return its plain text.
    public static String extractOneFile(File f) throws Exception {
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        return extracted;
    }

    public static void main(String[] args) throws Exception {
        if (args.length <= 0) {
            System.err.println("ERROR: No filename specified.");
            System.exit(1); // non-zero status so that callers can detect the failure
        }
        for (String filename : args) {
            File f = new File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}
Full Java 1.6 binaries are attached; Apache POI license details apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.
With this tool in place, all we need to do is provide a stringifier plugin to the add-on. This is done by adding a file called OpenXML.pm to the lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyPlugins directory (matching the package name used in the plugin) in the TWiki server installation:
# For licensing info read LICENSE file in the TWiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details, published at
# http://www.gnu.org/copyleft/gpl.html

package TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;
use base 'TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase';
use File::Temp qw/tmpnam/;

__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", ".xlsx");
__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document", ".docx");
__PACKAGE__->register_handler(
    "application/vnd.openxmlformats-officedocument.presentationml.presentation", ".pptx");

sub stringForFile {
    my ($self, $file) = @_;
    my $tmp_file = tmpnam();
    my $text;

    my $cmd = "java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '$file' > $tmp_file";
    if (0 == system($cmd)) {
        $text = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor($tmp_file);
    }

    unlink($tmp_file);
    return $text;  # undef signals failure to caller
}

1;
This script assumes that the openxmlextractor.jar helper is located at /www/twiki/local/bin/openxmlextractor; you'll have to tweak this path to reflect your local settings.
I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.
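One thing that might be worth trying on the Java side: System.out encodes text using the platform's default charset, so the bytes landing in the temporary file depend on the server's locale. The sketch below forces UTF-8 instead. Whether the stringifier actually expects UTF-8 input is an assumption on my part, and the class name Utf8OutputSketch is made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical illustration class, not part of the attached extractor.
public class Utf8OutputSketch {
    /** Wraps a stream so that printed text is always encoded as UTF-8. */
    public static PrintStream utf8(OutputStream out) throws UnsupportedEncodingException {
        // autoflush = true, explicit charset instead of the platform default
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        PrintStream out = utf8(System.out);
        // Stand-in for the text returned by extractor.getText()
        out.println("Gr\u00fc\u00dfe aus Stuttgart");
    }
}
```

In the driver, this would mean printing the extracted text through such a wrapped stream rather than through System.out directly.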
To verify the local installation, change into /www/twiki/kinosearch/bin (this is where my TWiki installation is, YMMV) and run the extractor on a test file:
./ks_test stringify foobla.docx
And in a final step, enable index generation for Office documents by adding .docx, .pptx and .xlsx to the Main.TWikiPreferences topic:
   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx