[jira] Created: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

HTML parser unable to extract text
-----------------------------------

                 Key: TIKA-140
                 URL: https://issues.apache.org/jira/browse/TIKA-140
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.2-incubating
            Reporter: julien nioche
             Fix For: 0.2-incubating


At revision 648732

The attached file is not parsed properly by the current HTML parser, which returns an empty string when ParseUtils.getStringContent() is called. Saving the same document as .txt from Firefox yields some text.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated TIKA-140:
-------------------------------

    Attachment: 1.html


[jira] Commented: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591616#action_12591616 ]

julien nioche commented on TIKA-140:
------------------------------------

This is actually a regression. The same document was successfully processed with tika-0.1-incubating.jar.


[jira] Assigned: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting reassigned TIKA-140:
----------------------------------

    Assignee: Jukka Zitting


[jira] Commented: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592314#action_12592314 ]

julien nioche commented on TIKA-140:
------------------------------------

I had a closer look at the problem and found that it is due to the HTML element having attributes defined in a namespace (<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">).
Having attributes without an explicit namespace works fine (<html lang="en">).
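
To illustrate the difference between the two forms, here is a minimal sketch that uses only the JDK's namespace-aware SAX parser. It is not Tika's HTML parser, and the class and method names (NamespaceDemo, elementNames) are hypothetical. It assumes the relevant effect is that the xmlns declaration places the elements in the XHTML namespace, so a matcher that expects elements without a namespace no longer sees the names it is looking for.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it uses the JDK's XML parser, not Tika's HTML parser.
class NamespaceDemo {

    /** Returns "{namespaceURI}localName" for each element start event. */
    static List<String> elementNames(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final List<String> names = new ArrayList<>();
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        names.add("{" + uri + "}" + localName);
                    }
                });
        return names;
    }

    public static void main(String[] args) throws Exception {
        String namespaced = "<html xmlns=\"http://www.w3.org/1999/xhtml\""
                + " xml:lang=\"en\" lang=\"en\"><body>text</body></html>";
        String plain = "<html lang=\"en\"><body>text</body></html>";
        // Elements of the first document carry the XHTML namespace URI.
        System.out.println(elementNames(namespaced));
        // Elements of the second document have an empty namespace URI.
        System.out.println(elementNames(plain));
    }
}
```

A matcher written against the second form (empty namespace URI) would find no "html" or "body" element in the first document, which is consistent with the empty string reported above.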


[jira] Updated: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

julien nioche updated TIKA-140:
-------------------------------

    Attachment: anynamespace.diff

Probably not the best way to avoid this issue, but at least it works. This patch modifies the HTML parser so that it defines a namespace "*", which matches any namespace actually found in the documents.
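
The idea of a wildcard namespace can be sketched as follows. This is not the content of anynamespace.diff and not Tika's XPath matcher; ElementMatcher and its fields are hypothetical names used only to show how a "*" namespace could be treated as matching any namespace URI.

```java
// Hypothetical sketch of wildcard namespace matching; not Tika's actual matcher.
class WildcardNamespaceSketch {

    static final class ElementMatcher {
        static final String ANY_NAMESPACE = "*";

        private final String namespace;
        private final String localName;

        ElementMatcher(String namespace, String localName) {
            this.namespace = namespace;
            this.localName = localName;
        }

        /** True if the element name matches; "*" accepts any namespace URI. */
        boolean matches(String uri, String name) {
            boolean namespaceOk =
                    ANY_NAMESPACE.equals(namespace) || namespace.equals(uri);
            return namespaceOk && localName.equals(name);
        }
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";
        ElementMatcher strict = new ElementMatcher("", "body");
        ElementMatcher wildcard = new ElementMatcher("*", "body");
        System.out.println(strict.matches(xhtml, "body"));   // false
        System.out.println(wildcard.matches(xhtml, "body")); // true
        System.out.println(wildcard.matches("", "body"));    // true
    }
}
```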


[jira] Commented: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632992#action_12632992 ]

Sami Siren commented on TIKA-140:
---------------------------------

Is there something preventing us from committing this fix as it is, Jukka?


[jira] Commented: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632993#action_12632993 ]

Jukka Zitting commented on TIKA-140:
------------------------------------

Sorry for the long silence... No problems, the patch is OK.


[jira] Resolved: (TIKA-140) HTML parser unable to extract text

by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-140.
--------------------------------

    Resolution: Fixed

Resolved in a somewhat different manner in revision 698028.

Instead of adding the special "*" wildcard to the XPath matcher, I created a new XHTMLDowngradeHandler decorator class that makes sure that all incoming (X)HTML is uniformly structured.
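
The decorator approach can be sketched roughly as below. This is not the code committed in revision 698028. It assumes that "uniformly structured" means element names are lower-cased and reported without a namespace, and that only attributes without a namespace are kept. DowngradeSketch, DowngradeHandler and trace are hypothetical names, and the JDK's SAX parser stands in for the HTML parser.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a "downgrading" SAX decorator; not Tika's actual class.
class DowngradeSketch {

    /** Forwards events to a delegate with namespaces removed and names lower-cased. */
    static class DowngradeHandler extends DefaultHandler {
        private final ContentHandler delegate;

        DowngradeHandler(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            String name = localName.toLowerCase(Locale.ENGLISH);
            AttributesImpl plain = new AttributesImpl();
            for (int i = 0; i < atts.getLength(); i++) {
                // Keep only attributes that are not in a namespace (drops xml:lang).
                if (atts.getURI(i).isEmpty()) {
                    String attName = atts.getLocalName(i);
                    plain.addAttribute("", attName, attName,
                            atts.getType(i), atts.getValue(i));
                }
            }
            delegate.startElement("", name, name, plain);
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            String name = localName.toLowerCase(Locale.ENGLISH);
            delegate.endElement("", name, name);
        }

        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            delegate.characters(ch, start, length);
        }
        // Prefix mappings are not forwarded: DefaultHandler ignores them.
    }

    /** Parses xml through the decorator and records what the delegate sees. */
    static String trace(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                out.append("<{").append(uri).append('}').append(localName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DowngradeHandler(recorder));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String namespaced = "<html xmlns=\"http://www.w3.org/1999/xhtml\""
                + " xml:lang=\"en\" lang=\"en\"><body>text</body></html>";
        String plain = "<html lang=\"en\"><body>text</body></html>";
        // Both documents reach the delegate in the same namespace-free form.
        System.out.println(trace(namespaced));
        System.out.println(trace(plain));
    }
}
```

With a decorator of this kind in front of it, downstream matching code can keep expecting plain element names and does not need a wildcard for namespaces.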
