[jira] Created: (TIKA-79) Mime type detection from file header appears to be failing.

Mime type detection from file header appears to be failing.
-----------------------------------------------------------

Key: TIKA-79
URL: https://issues.apache.org/jira/browse/TIKA-79
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.1-incubator
Reporter: Keith R. Bennett
Fix For: 0.1-incubator

Unit tests to test the behavior of AutoDetectParser fail when byte header detection is needed. When correct names of resources and MIME types are passed into the Metadata object, the values below show what was found. Note that some of the document types have null for typeFromHeader:

typeFromContentTypeHint = application/vnd.ms-excel
typeFromResourceName = application/vnd.ms-excel
typeFromHeader = null
type = application/vnd.ms-excel

typeFromContentTypeHint = text/html
typeFromResourceName = text/html
typeFromHeader = text/html
type = text/html

typeFromContentTypeHint = application/vnd.oasis.opendocument.text
typeFromResourceName = application/vnd.oasis.opendocument.text
typeFromHeader = application/vnd.oasis.opendocument.text
type = application/vnd.oasis.opendocument.text

typeFromContentTypeHint = application/pdf
typeFromResourceName = application/pdf
typeFromHeader = application/pdf
type = application/pdf

typeFromContentTypeHint = application/vnd.ms-powerpoint
typeFromResourceName = application/vnd.ms-powerpoint
typeFromHeader = null
type = application/vnd.ms-powerpoint

log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.

typeFromContentTypeHint = application/rtf
typeFromResourceName = application/rtf
typeFromHeader = null
type = application/rtf

typeFromContentTypeHint = text/plain
typeFromResourceName = text/plain
typeFromHeader = null
type = text/plain

typeFromContentTypeHint = application/msword
typeFromResourceName = application/msword
typeFromHeader = null
type = application/msword

typeFromContentTypeHint = application/xml
typeFromResourceName = application/xml
typeFromHeader = null
type = application/xml

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
|
|
[jira] Updated: (TIKA-79) Mime type detection from file header appears to be failing.

[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-79:
---------------------------------
Attachment: AutoDetectParser.patch

The attached patch file reorganizes the MIME type determination in AutoDetectParser so that it is easier to print out the types found by the various methods, and so that the logic for choosing the predominant result is confined to a smaller area (assuming I understood the intent correctly, that is). In other words, I found it easier to debug. If you like, I can commit it, minus the print statements.

I also found it helpful to comment out the LOG.info() call in MimeTypes.load(). (Is there a better way to disable it, by setting that logger to some kind of null appender or something like that?)
|
|
[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917 ]

Chris A. Mattmann commented on TIKA-79:
---------------------------------------
Guys:

Why don't we put a utility method in MimeUtils to handle this functionality? The purpose of the utility method would be to try to sense a mime type using all available options (URL resolution, extension ID, mime magic, etc.).

There is currently code in Nutch at:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup

See the private String getContentType(String typeName, String url, byte[] data) method at the bottom of the class to see how Nutch does this sort of failsafe mime resolution. Perhaps we should follow suit in Tika?

Cheers,
Chris
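As a rough illustration of the "try all available options" idea in Chris's comment, here is a minimal, self-contained Java sketch. It is not the Nutch or Tika code: the class name MimeResolver, the helper names, the tiny extension table, and the order of the fallbacks (magic bytes, then resource name, then the supplied type hint) are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not the Tika or Nutch API. */
public class MimeResolver {

    // Tiny illustrative extension table (an assumption, not Tika's registry).
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "html", "text/html",
            "rtf", "application/rtf",
            "txt", "text/plain");

    /**
     * Returns the first non-null answer from: magic bytes, resource name,
     * caller-supplied hint. Falls back to application/octet-stream.
     * The ordering is an assumption chosen for this sketch.
     */
    public static String resolve(String typeHint, String resourceName, byte[] header) {
        String type = fromMagic(header);
        if (type == null) {
            type = fromName(resourceName);
        }
        if (type == null) {
            type = typeHint;
        }
        return type != null ? type : "application/octet-stream";
    }

    /** Checks two well-known ASCII signatures at offset 0. */
    static String fromMagic(byte[] data) {
        if (startsWith(data, "%PDF-")) {
            return "application/pdf";
        }
        if (startsWith(data, "{\\rtf")) {
            return "application/rtf";
        }
        return null;
    }

    /** Looks up the lower-cased file extension, if there is one. */
    static String fromName(String name) {
        if (name == null) {
            return null;
        }
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return null;
        }
        return BY_EXTENSION.get(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    private static boolean startsWith(byte[] data, String ascii) {
        if (data == null || data.length < ascii.length()) {
            return false;
        }
        for (int i = 0; i < ascii.length(); i++) {
            if (data[i] != (byte) ascii.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

With this shape, a null result from header detection (as reported for several types in this issue) degrades to the name-based or hint-based answer instead of failing outright.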
|
|
[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535919 ]

Bertrand Delacretaz commented on TIKA-79:
-----------------------------------------
+1 for a utility method as proposed by Chris, that tries several detection methods.
|
|
Re: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Chris -

If I understand correctly, we already have what we need in MimeUtils:

public String getType(String typeName, String url, byte[] data) { ... }

Jukka, should I modify AutoDetectParser to call this method instead of its own?

However, the bigger question is whether the assessment is correct that header-based detection fails with certain file types. For example, it fails to identify the type of the PowerPoint test document we provide. Do we know which types can and can't be detected? If so, it would be helpful to our users and ourselves to document that information. I could put something together based on my observations, but that would risk being incomplete or incorrect due to different document software versions (e.g. Word).

- Keith
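On the question of which types a header can and cannot identify, the following Java sketch may help frame the discussion. It only encodes a few widely documented file signatures and says nothing about what Tika's own magic rules contain; the class name HeaderSignatures and the returned labels are made up for the example. The point it illustrates is that legacy Word, Excel and PowerPoint files all begin with the same OLE2 container signature, and that plain text has no signature at all, so the first bytes alone cannot separate them.

```java
import java.util.Arrays;

/** Hypothetical illustration of offset-0 file signatures; not Tika code. */
public class HeaderSignatures {

    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt.
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};
    static final byte[] RTF = {'{', '\\', 'r', 't', 'f'};
    // ZIP local file header; OpenDocument files are ZIP containers.
    static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Returns a descriptive label, or null when no signature matches. */
    static String describe(byte[] header) {
        if (hasPrefix(header, PDF)) {
            return "application/pdf";
        }
        if (hasPrefix(header, RTF)) {
            return "application/rtf";
        }
        if (hasPrefix(header, OLE2)) {
            // Word, Excel and PowerPoint cannot be told apart at this point.
            return "OLE2 container";
        }
        if (hasPrefix(header, ZIP)) {
            // Distinguishing OpenDocument from other ZIPs needs a deeper look.
            return "ZIP container";
        }
        return null; // e.g. plain text, which has no magic number
    }

    static boolean hasPrefix(byte[] data, byte[] magic) {
        return data != null
                && data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

A table documenting detectable types would probably need one more column than "yes/no": whether the signature identifies the exact type or only the container format.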
|
|
|
[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535934 ]

Jukka Zitting commented on TIKA-79:
-----------------------------------
+1, makes sense to push the functionality back to the MIME framework and also to target the test cases directly there instead of testing with AutoDetectParser.
|
|
[jira] Assigned: (TIKA-79) Mime type detection from file header appears to be failing.

[ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-79:
-------------------------------------
Assignee: Chris A. Mattmann

> Mime type detection from file header appears to be failing.
> -----------------------------------------------------------
>
> Key: TIKA-79
> URL: https://issues.apache.org/jira/browse/TIKA-79
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 0.1-incubating
> Reporter: Keith R. Bennett
> Assignee: Chris A. Mattmann
> Fix For: 0.2-incubating
>
> Attachments: AutoDetectParser.patch