<!--
   * Set TOPICTITLE = #define private public - TWiki, <nop>KinoSearch and Office 2007 documents (20 Jul 2009)
-->
<style type="text/css">
pre {background-color:#ffeecc;}
</style>
%STARTINCLUDE%
<a name="20"></a>
---+++ [[DefinePrivatePublic20090720TWikiKinoSearch][TWiki, <nop>KinoSearch and Office 2007 documents]] (20 Jul 2009)

<summary>
Both at work and on this site, I use [[http://twiki.org][TWiki]] as my wiki engine of choice. TWiki has attracted a fair share of plugin and add-on writers, resulting in wonderful tools like <a href="http://twiki.org/cgi-bin/view/Plugins/SearchEngineKinoSearchAddOn">an add-on which integrates <nop>KinoSearch</a>, a Perl search engine library loosely based on <a href="http://lucene.apache.org/">Lucene</a>.
</summary>

This month, I installed the add-on at work. It turns out that in its current state, it does not yet support Office 2007 document types such as =.docx=, =.pptx= and =.xlsx=, i.e. the so-called "Office <nop>OpenXML" formats. That's a pity, since these days most new Office documents tend to be provided in those formats.

The <nop>KinoSearch add-on doesn't try to parse (non-trivial) documents on its own, but rather relies on external helper programs which extract indexable text from documents. So the task at hand is to write such a text extractor. Fortunately, the [[http://poi.apache.org/][Apache POI]] project just released a version of their libraries which now also supports <nop>OpenXML formats, and with those libraries, it's a piece of cake to build a simple text extractor!

Here's the trivial Java driver code:

<pre>
<font color="#a020f0">package</font> de.clausbrod.openxmlextractor;

<font color="#a020f0">import</font> java.io.File;

<font color="#a020f0">import</font> org.apache.poi.POITextExtractor;
<font color="#a020f0">import</font> org.apache.poi.extractor.ExtractorFactory;

<font color="#2e8b57"><b>public</b></font> <font color="#2e8b57"><b>class</b></font> Main {
    <font color="#2e8b57"><b>public</b></font> <font color="#2e8b57"><b>static</b></font> String extractOneFile(File f) <font color="#2e8b57"><b>throws</b></font> Exception {
        <font color="#0000ff">// Let POI pick the extractor which matches the document type.</font>
        POITextExtractor extractor = ExtractorFactory.createExtractor(f);
        String extracted = extractor.getText();
        <font color="#804040"><b>return</b></font> extracted;
    }

    <font color="#2e8b57"><b>public</b></font> <font color="#2e8b57"><b>static</b></font> <font color="#2e8b57"><b>void</b></font> main(String[] args) <font color="#2e8b57"><b>throws</b></font> Exception {
        <font color="#804040"><b>if</b></font> (args.length &lt;= <font color="#ff00ff">0</font>) {
            System.err.println(<font color="#ff00ff">"ERROR: No filename specified."</font>);
            <font color="#804040"><b>return</b></font>;
        }

        <font color="#0000ff">// Print the extracted text of each file to stdout.</font>
        <font color="#804040"><b>for</b></font> (String filename : args) {
            File f = <font color="#804040"><b>new</b></font> File(filename);
            System.out.println(extractOneFile(f));
        }
    }
}
</pre>

Full Java 1.6 binaries are [[%ATTACHURL%/openxmlextractor.zip][attached]]; [[http://poi.apache.org/legal.html][Apache POI license details]] apply. Copy the ZIP archive to your TWiki server and unzip it in a directory of your choice.

With this tool in place, all we need to do is provide a _stringifier plugin_ to the add-on.

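As a side note on the driver: it prints through =System.out=, which on many JVMs encodes text with a platform-dependent charset, and it returns normally when no file name is given. The following is only a sketch of a possible variant, not the code in the attached binaries. It assumes Java 10 or later, and the names =Utf8Driver= and =run= are made up for illustration. The POI call is replaced by a pluggable function so that the sketch compiles without POI on the classpath. Whether forcing UTF-8 output actually helps the indexer with non-ASCII characters is untested.

```java
import java.io.File;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical variant of the driver: UTF-8 output and explicit exit status.
final class Utf8Driver {
    /**
     * Applies the extractor to each file name, prints the text to out,
     * and returns a status code: 0 = ok, 1 = no arguments, 2 = extraction failed.
     */
    static int run(String[] args, Function<File, String> extractor,
                   PrintStream out, PrintStream err) {
        if (args.length == 0) {
            err.println("ERROR: No filename specified.");
            return 1;
        }
        int status = 0;
        for (String filename : args) {
            try {
                out.println(extractor.apply(new File(filename)));
            } catch (RuntimeException e) {
                err.println("ERROR: Cannot extract " + filename + ": " + e.getMessage());
                status = 2;
            }
        }
        out.flush();
        return status;
    }

    public static void main(String[] args) {
        // Force UTF-8 on stdout instead of relying on the platform charset.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        // Placeholder extractor (assumption); the real driver would call POI here.
        Function<File, String> extractor = f -> "text of " + f.getName();
        System.exit(run(args, extractor, out, System.err));
    }
}
```

In a real driver, the placeholder function would delegate to =extractOneFile= and wrap its checked exceptions in a =RuntimeException=. A non-zero exit status matters to any caller which checks the result of the command before using the output.
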
The plugin is a file called =OpenXML.pm=, placed in the =lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifyPlugins= directory of the TWiki server installation:

<pre>
<font color="#0000ff"># For licensing info read LICENSE file in the TWiki root.</font>
<font color="#0000ff"># This program is free software; you can redistribute it and/or</font>
<font color="#0000ff"># modify it under the terms of the GNU General Public License</font>
<font color="#0000ff"># as published by the Free Software Foundation; either version 2</font>
<font color="#0000ff"># of the License, or (at your option) any later version.</font>
<font color="#0000ff">#</font>
<font color="#0000ff"># This program is distributed in the hope that it will be useful,</font>
<font color="#0000ff"># but WITHOUT ANY WARRANTY; without even the implied warranty of</font>
<font color="#0000ff"># MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the</font>
<font color="#0000ff"># GNU General Public License for more details, published at</font>
<font color="#0000ff"># <a href="http://www.gnu.org/copyleft/gpl.html">http://www.gnu.org/copyleft/gpl.html</a></font>

<font color="#804040"><b>package</b></font><font color="#2e8b57"><b> TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyPlugins::OpenXML;</b></font>
<font color="#804040"><b>use base</b></font> <font color="#ff00ff">'</font><font color="#ff00ff">TWiki::Contrib::SearchEngineKinoSearchAddOn::StringifyBase</font><font color="#ff00ff">'</font>;
<font color="#804040"><b>use </b></font>File::Temp <font color="#ff00ff">qw/</font><font color="#ff00ff">tmpnam</font><font color="#ff00ff">/</font>;

__PACKAGE__->register_handler(
  <font color="#ff00ff">"</font><font color="#ff00ff">application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</font><font color="#ff00ff">"</font>, <font color="#ff00ff">"</font><font color="#ff00ff">.xlsx</font><font color="#ff00ff">"</font>);
__PACKAGE__->register_handler(
  <font color="#ff00ff">"</font><font color="#ff00ff">application/vnd.openxmlformats-officedocument.wordprocessingml.document</font><font color="#ff00ff">"</font>, <font color="#ff00ff">"</font><font color="#ff00ff">.docx</font><font color="#ff00ff">"</font>);
__PACKAGE__->register_handler(
  <font color="#ff00ff">"</font><font color="#ff00ff">application/vnd.openxmlformats-officedocument.presentationml.presentation</font><font color="#ff00ff">"</font>, <font color="#ff00ff">"</font><font color="#ff00ff">.pptx</font><font color="#ff00ff">"</font>);

<font color="#804040"><b>sub</b></font><font color="#008080"> </font><font color="#008080">stringForFile</font><font color="#008080"> </font>{
  <font color="#804040"><b>my</b></font> (<font color="#008080">$self</font>, <font color="#008080">$file</font>) = <font color="#008080">@_</font>;
  <font color="#804040"><b>my</b></font> <font color="#008080">$tmp_file</font> = tmpnam();

  <font color="#804040"><b>my</b></font> <font color="#008080">$text</font>;
  <font color="#804040"><b>my</b></font> <font color="#008080">$cmd</font> =
    <font color="#ff00ff">"</font><font color="#ff00ff">java -jar /www/twiki/local/bin/openxmlextractor/openxmlextractor.jar '</font><font color="#008080">$file</font><font color="#ff00ff">' > </font><font color="#008080">$tmp_file</font><font color="#ff00ff">"</font>;
  <font color="#804040"><b>if</b></font> (<font color="#ff00ff">0</font> == <font color="#804040"><b>system</b></font>(<font color="#008080">$cmd</font>)) {
    <font color="#008080">$text</font> = TWiki::Contrib::SearchEngineKinoSearchAddOn::Stringifier->stringFor(<font color="#008080">$tmp_file</font>);
  }

  <font color="#804040"><b>unlink</b></font>(<font color="#008080">$tmp_file</font>);
  <font color="#804040"><b>return</b></font> <font color="#008080">$text</font>;  <font color="#0000ff"># undef signals failure to caller</font>
}

<font color="#ff00ff">1</font>;
</pre>

This script assumes that the =openxmlextractor.jar= helper is located at =/www/twiki/local/bin/openxmlextractor=; you'll have to tweak this path to reflect your local settings. I haven't figured out yet how to correctly deal with encodings in the stringifier code, so non-ASCII characters might not work as expected.

To verify the local installation, change into =/www/twiki/kinosearch/bin= (this is where my TWiki installation lives, YMMV) and run the stringifier on a test file:

<pre>
./ks_test stringify foobla.docx
</pre>

As a final step, enable index generation for Office 2007 documents by adding =.docx=, =.pptx= and =.xlsx= to the =KINOSEARCHINDEXEXTENSIONS= setting in the <nop>Main.TWikiPreferences topic:

<verbatim>
   * KinoSearch settings
      * Set KINOSEARCHINDEXEXTENSIONS = .pdf, .xml, .html, .doc, .xls, .ppt, .docx, .pptx, .xlsx
</verbatim>

---
%STOPINCLUDE%
%COMMENT{type="below" nonotify="on"}%
---
r1.2 - 20 Jul 2009 - 10:13 -
ClausBrod
Copyright © 1999-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.