[jira] Created: (TIKA-140) HTML parser unable to extract text

HTML parser unable to extract text
-----------------------------------

Key: TIKA-140
URL: https://issues.apache.org/jira/browse/TIKA-140
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.2-incubating
Reporter: julien nioche
Fix For: 0.2-incubating

At revision 648732
The file in attachment is not parsed properly by the current HTML parser, which returns an empty string when calling ParseUtils.getStringContent(). Saving the same document as .txt from Firefox gives some text.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
|
|
[jira] Updated: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated TIKA-140:
-------------------------------

Attachment: 1.html
|
|
[jira] Commented: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591616#action_12591616 ]

julien nioche commented on TIKA-140:
------------------------------------

This is actually a regression. The same document was successfully processed with tika-0.1-incubating.jar.
|
|
[jira] Assigned: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting reassigned TIKA-140:
----------------------------------

Assignee: Jukka Zitting
|
|
[jira] Commented: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592314#action_12592314 ]

julien nioche commented on TIKA-140:
------------------------------------

I had a closer look at the problem and found that it is due to the HTML element having attributes defined in a namespace (<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">). Attributes without an explicit namespace work fine (<html lang="en">).
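To illustrate the observation above, here is a minimal sketch of how a namespace-aware SAX parser reports the root element differently for the two variants. It is an illustration only: it uses the JDK's own XML parser on small well-formed snippets rather than Tika's HTML parser, and the class name RootNamespaceDemo and its helper method are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; uses the JDK XML parser, not Tika's HTML parser.
final class RootNamespaceDemo {

    /** Returns the namespace URI that SAX reports for the root element. */
    static String rootNamespace(String markup) {
        final String[] uri = new String[1];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(markup)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String ns, String local,
                                             String qName, Attributes atts) {
                        if (uri[0] == null) {
                            uri[0] = ns; // remember only the first (root) element
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return uri[0];
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\" "
            + "xml:lang=\"en\" lang=\"en\"><body><p>text</p></body></html>";
        String plain = "<html lang=\"en\"><body><p>text</p></body></html>";
        System.out.println("xhtml root namespace: '" + rootNamespace(xhtml) + "'");
        System.out.println("plain root namespace: '" + rootNamespace(plain) + "'");
    }
}
```

A matcher that expects the element name with an empty namespace would accept the second document but not the first, which is consistent with the behaviour described in the comment.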
|
|
[jira] Updated: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated TIKA-140:
-------------------------------

Attachment: anynamespace.diff

Probably not the best way to avoid this issue, but at least it works. This patch modifies the HTML parser so that it defines a namespace "*" which matches any namespace actually found in the documents.
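The wildcard idea described in this patch can be sketched roughly as follows. This is not the content of anynamespace.diff and not Tika's matcher API; ElementNameMatcher and its members are hypothetical names used only to show how a "*" namespace could match any namespace.

```java
// Hypothetical matcher for illustration; not Tika's actual XPath matcher classes.
final class ElementNameMatcher {

    /** Special value meaning "accept whatever namespace the document uses". */
    static final String ANY_NAMESPACE = "*";

    private final String namespace;
    private final String localName;

    ElementNameMatcher(String namespace, String localName) {
        this.namespace = namespace;
        this.localName = localName;
    }

    /** True when the local name matches and the namespace is equal or wildcarded. */
    boolean matches(String actualNamespace, String actualLocalName) {
        boolean namespaceOk = ANY_NAMESPACE.equals(namespace)
            || namespace.equals(actualNamespace);
        return namespaceOk && localName.equals(actualLocalName);
    }

    public static void main(String[] args) {
        String xhtmlNs = "http://www.w3.org/1999/xhtml";
        ElementNameMatcher strict = new ElementNameMatcher("", "html");
        ElementNameMatcher any = new ElementNameMatcher(ANY_NAMESPACE, "html");
        System.out.println("strict vs xhtml: " + strict.matches(xhtmlNs, "html"));
        System.out.println("any vs xhtml:    " + any.matches(xhtmlNs, "html"));
        System.out.println("any vs plain:    " + any.matches("", "html"));
    }
}
```

The trade-off of this approach is that the matcher no longer distinguishes between namespaces at all, which may be why it is described as probably not the best way to avoid the issue.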
|
|
[jira] Commented: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632992#action_12632992 ]

Sami Siren commented on TIKA-140:
---------------------------------

Is there something preventing us from committing this fix as it is, Jukka?
|
|
[jira] Commented: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632993#action_12632993 ]

Jukka Zitting commented on TIKA-140:
------------------------------------

Sorry for the long silence... No problems, the patch is OK.
|
|
[jira] Resolved: (TIKA-140) HTML parser unable to extract text

[ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-140.
--------------------------------

Resolution: Fixed

Resolved in a somewhat different manner in revision 698028. Instead of adding the special "*" wildcard to the XPath matcher, I created a new XHTMLDowngradeHandler decorator class that makes sure that all incoming (X)HTML is uniformly structured.
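For readers unfamiliar with the decorator approach, the sketch below shows the general shape of a SAX handler that normalizes events before passing them on. It is not the code of XHTMLDowngradeHandler from revision 698028. The class name NamespaceStrippingHandler, the choice to drop the namespace URI, lower-case element names, and skip namespaced attributes such as xml:lang are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator; only element and character events are forwarded.
final class NamespaceStrippingHandler extends DefaultHandler {

    private final ContentHandler delegate;

    NamespaceStrippingHandler(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String name = localName.toLowerCase(Locale.ROOT);
        AttributesImpl plain = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!atts.getURI(i).isEmpty()) {
                continue; // assumption: drop namespaced attributes such as xml:lang
            }
            String attName = atts.getLocalName(i);
            plain.addAttribute("", attName, attName, atts.getType(i), atts.getValue(i));
        }
        delegate.startElement("", name, name, plain); // always report an empty namespace
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String name = localName.toLowerCase(Locale.ROOT);
        delegate.endElement("", name, name);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        delegate.characters(ch, start, length);
    }

    /** Parses markup through the decorator and describes the first element seen downstream. */
    static String firstElementSeen(String markup) {
        final StringBuilder seen = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                if (seen.length() == 0) {
                    seen.append('{').append(uri).append('}').append(localName)
                        .append(" attrs=").append(atts.getLength());
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(markup)),
                new NamespaceStrippingHandler(recorder));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstElementSeen("<html xmlns=\"http://www.w3.org/1999/xhtml\" "
            + "xml:lang=\"en\" lang=\"en\"><body/></html>"));
        System.out.println(firstElementSeen("<html lang=\"en\"><body/></html>"));
    }
}
```

With a decorator of this kind in front of it, a downstream matcher sees the same element names regardless of whether the input declared the XHTML namespace, so the matcher itself needs no wildcard.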