Text Extractor issue?

Hi,
I have added a large number of files to the repository. Some of them have the ".sxw" extension and were recognized (by sun.net.www.MimeTable and, when that fails, by a table of MIME types in my application) as "application/vnd.sun.xml.writer", so jcr:mimeType was stored with this value. Recently I tried to search for some .sxw documents by content and they were not returned.

Looking at the class OpenOfficeTextExtractor, I saw that only these MIME types are recognized and indexed by the extractor:

"application/vnd.oasis.opendocument.database",
"application/vnd.oasis.opendocument.formula",
"application/vnd.oasis.opendocument.graphics",
"application/vnd.oasis.opendocument.presentation",
"application/vnd.oasis.opendocument.spreadsheet",
"application/vnd.oasis.opendocument.text"

Is that correct? Does this mean my application must force one of the MIME types in this list for extensions that map to a different MIME type? Is the class able to index this kind of OpenOffice format at all? What is the solution for my case?

(I am thinking about updating jcr:mimeType to "application/vnd.oasis.opendocument.text" and rebuilding the indexes. Would this resolve the problem for the moment?

Edit: I added one file in the sxw format with jcr:mimeType forced to "application/vnd.oasis.opendocument.text", but it wasn't indexed, so the class really does not extract text from the older sxw OpenOffice format. Should this be filed in JIRA?)
Re: Text Extractor issue?

Hi!
On Wed, Jul 16, 2008 at 3:14 PM, hsp_ <piccinatto@...> wrote:
> I have added a huge amount of files in the repository, some of them with the
> ".sxw" extension and recognized (by sun.net.www.MimeTable and, if without
> sucess, after by a table of mimetypes in my application) like
> "application/vnd.sun.xml.writer" and the jcr:mimetype was with this value.
> Nowadays, I tried to search some documents .sxw by content and they not
> returned. So, I saw that in the class OpenOfficeTextExtractor that only the
> mimetypes :"application/vnd.oasis.opendocument.database",
> "application/vnd.oasis.opendocument.formula",
> "application/vnd.oasis.opendocument.graphics",
> "application/vnd.oasis.opendocument.presentation",
> "application/vnd.oasis.opendocument.spreadsheet",
> "application/vnd.oasis.opendocument.text"
> would be recognized and indexed by the extractor, is it true?
> This means that my application must force the mimetype for some in this
> list, in the case of extensions that have another mimetype? Is the class
> able to index such kind of openoffice format?
> What the solution for my case?

If documents with the "application/vnd.sun.xml.writer" MIME type can be properly read by the OpenOfficeTextExtractor, we could add that type to the list of supported MIME types for that extractor. Could you test that by patching the OOTExtractor and overriding the old one in your classpath? If that works out, you can submit the patch to JIRA.

> (I am thinking about to update the jcr:mimetype to the
> "application/vnd.oasis.opendocument.text" value and redo the indexes, this
> would resolve the case by the moment?)

Yes, this should work, too.

Regards,
Alex

--
Alexander Klimetschek
alexander.klimetschek@...
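
Before patching the extractor, it may help to check whether a given .sxw file yields any text at all when read the way an OpenOffice-style extractor would read it. The sketch below is not Jackrabbit's code: the class name SxwTextProbe and its methods are made up for illustration. It assumes the file is a ZIP archive with a content.xml entry (as both the older sxw and the newer OpenDocument formats are) and simply collects all character data from that entry, which is cruder than what a real extractor would index.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Jackrabbit.
public class SxwTextProbe {

    /** Returns the character data found in content.xml, or "" if there is no such entry. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"content.xml".equals(entry.getName())) {
                    continue; // skip styles.xml, meta.xml, mimetype, ...
                }
                final StringBuilder text = new StringBuilder();
                SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                parser.parse(zip, new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        text.append(' '); // keep adjacent paragraphs from running together
                    }

                    @Override
                    public InputSource resolveEntity(String publicId, String systemId) {
                        // Older files may reference an external DTD that is not
                        // available locally; hand the parser an empty one instead.
                        return new InputSource(new StringReader(""));
                    }
                });
                return text.toString().trim();
            }
        }
        return "";
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(extractText(in));
        }
    }
}
```

Running it as `java SxwTextProbe some-file.sxw` prints whatever text was found. Empty output for a file that clearly contains text would suggest the problem lies in reading the format itself rather than only in the MIME type list.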