Microsoft Office mimetype (OLE2) is not recognized reliably

Hi,
Great work on libextractor. I like learning with it and figuring things out, and I am starting to learn Python.

One problem I noticed: I am trying to distinguish the different Microsoft Office file formats using the mimetype information provided by libextractor (the files I am investigating have no filename extensions). The problem is that often only a generic value such as "application/vnd.ms-office" is extracted. The result depends on which application was used when the document/spreadsheet/presentation was last saved.

I found that other programs have similar problems with this job:

- In Kubuntu Hardy, the Linux distribution I use, XLS files without a filename extension appear as DOC in Konqueror
- Windows XP cannot do it either (in the file manager)
- I also tried NLNZ Metadata Extractor v3.0, without success
- The file command on the shell also reports the wrong application type

OpenOffice, however, can open all of these formats without a filename extension and imports them correctly (Writer/Calc/Presenter).

I use libextractor 0.5.18a and Python-Extractor 0.5-2. In the ChangeLog I did not find any changes to the OLE2 plugin since version 0.5.18a.

Has anyone encountered the same problem? How could this be solved?

Best regards,
Marc

_______________________________________________
libextractor mailing list
libextractor@...
http://lists.gnu.org/mailman/listinfo/libextractor

Re: Microsoft Office mimetype (OLE2) is not recognized reliably

On Saturday 30 August 2008 03:44:15 pm Marc wrote:
> Hi,
>
> great work the libextractor, I like to learn with it and figure things out,
> starting to learn python.
>
> One problem I noticed:
>
> I try to distinguish file formats of the different Microsoft-Office
> formats using the mimetype information provided by libextractor (I have no
> filename extensions of the files to investigate). The problem is that often
> only a general information e.g. "application/vnd.ms-office" is extracted.
> The result depends on the specific application which has been used at last
> save of the document/spreadsheet/presentation.
>
> I found out that other programs have similar problems to do this job:
> - In the Linux-Distro Kubuntu Hardy that I use - e.g. XLS-files without
>   filename extension appear as DOC in Konqueror
> - Windows XP can't do so either (in filemanager)
> - I also tried NLNZ Metadata Extractor v3.0 without success
> - The file command on the shell gives wrong application type too

Well, AFAIK the reason is that, to a large extent, the document/spreadsheet/presentation formats are pretty much the same, and they all DO have the same mime-type (so it is not incorrect for LE to sometimes report the same mime-type). Internally, LE has one mime-type (vnd.ms-files) which is used if we have no idea what the actual MS application is.

If LE is able to determine the "generator", then the MimeType is chosen to be more specific:

    if (NULL != generator) {
      const char * mimetype = "application/vnd.ms-files";
      if ((0 == strncmp(generator, "Microsoft Word", 14)) ||
          (0 == strncmp(generator, "Microsoft Office Word", 21)))
        mimetype = "application/msword";
      else if ((0 == strncmp(generator, "Microsoft Excel", 15)) ||
               (0 == strncmp(generator, "Microsoft Office Excel", 22)))
        mimetype = "application/vnd.ms-excel";
      else if ((0 == strncmp(generator, "Microsoft PowerPoint", 20)) ||
               (0 == strncmp(generator, "Microsoft Office PowerPoint", 27)))
        mimetype = "application/vnd.ms-powerpoint";
      else if (0 == strncmp(generator, "Microsoft Project", 17))
        mimetype = "application/vnd.ms-project";
      else if (0 == strncmp(generator, "Microsoft Visio", 15))
        mimetype = "application/vnd.visio";
      else if (0 == strncmp(generator, "Microsoft Office", 16))
        mimetype = "application/vnd.ms-office";
      prev = addKeyword(prev, mimetype, EXTRACTOR_MIMETYPE);
    }

One thing you may want to look at is the "generator" you get for your vnd.ms-files. If it is a specific application that is missing from the above list, we could extend our list. I'm not aware of any alternative or better way to determine the mimetype for MS Office applications.

Christian
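
For anyone who wants to experiment with extending the list, the prefix checks quoted above could also be written as a lookup table. The following is only a minimal standalone sketch, not the actual libextractor source: the names generator_map, ole2_generators and mimetype_for_generator are made up for illustration, and the prefixes and MIME types are copied from the snippet in this thread. It returns NULL when no generator is available, mirroring the fact that the quoted code adds no mimetype keyword in that case.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: a generator prefix and the MIME type it implies. */
struct generator_map {
  const char *prefix;
  const char *mimetype;
};

/* Order matters: the generic "Microsoft Office" prefix must stay last,
   otherwise it would shadow "Microsoft Office Word" and friends. */
static const struct generator_map ole2_generators[] = {
  { "Microsoft Word",              "application/msword" },
  { "Microsoft Office Word",       "application/msword" },
  { "Microsoft Excel",             "application/vnd.ms-excel" },
  { "Microsoft Office Excel",      "application/vnd.ms-excel" },
  { "Microsoft PowerPoint",        "application/vnd.ms-powerpoint" },
  { "Microsoft Office PowerPoint", "application/vnd.ms-powerpoint" },
  { "Microsoft Project",           "application/vnd.ms-project" },
  { "Microsoft Visio",             "application/vnd.visio" },
  { "Microsoft Office",            "application/vnd.ms-office" },
};

/* Returns the MIME type for a generator string, the generic
   "application/vnd.ms-files" if no prefix matches, or NULL if
   there is no generator at all. */
const char *mimetype_for_generator(const char *generator)
{
  size_t i;

  if (NULL == generator)
    return NULL;
  for (i = 0; i < sizeof(ole2_generators) / sizeof(ole2_generators[0]); i++) {
    const struct generator_map *e = &ole2_generators[i];
    /* Prefix match, same as the strncmp() calls in the quoted code. */
    if (0 == strncmp(generator, e->prefix, strlen(e->prefix)))
      return e->mimetype;
  }
  return "application/vnd.ms-files";
}
```

With a table like this, supporting a generator that is currently missing would mean adding one row above the final "Microsoft Office" entry.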