[issue4100] xml.etree.ElementTree does not read xml-text over page bonderies

by STINNER Victor


New submission from roland rehmnert <roland.rehmnert@...>:

XML text fields are not read properly when they are encountered in a
'start' event.

During a 'start' event, elem.text returns None if the text string crosses
a page boundary of the file. (The page size is platform dependent; a
typical value is 8K, i.e. 8192 bytes.)



This line causes an error if the page size is 8192:
<a>this is a text where X has position 8192 in the file</a>

In most cases this erroneous behaviour can be avoided, because elem.text
always returns the proper value at the 'end' event.
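For readers who want to try this without the attachments, here is a rough sketch of a reproduction. It is not the submitted bug.py; the helper name text_seen_per_event and the 70000-character padding are assumptions chosen only so that the text is longer than any plausible read buffer. Whether elem.text is None at the 'start' event depends on the Python version and its internal chunk size, so the sketch only prints what it observes there.

```python
import io
import xml.etree.ElementTree as ET


def text_seen_per_event(xml_bytes):
    """Record elem.text as it looks at every 'start' and 'end' event."""
    seen = []
    source = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(source, events=("start", "end")):
        seen.append((event, elem.tag, elem.text))
    return seen


# Assumption: 70000 characters is longer than the parser's read buffer,
# so the text of <a> has to straddle at least one buffer boundary.
long_text = "x" * 70000
doc = ("<root><a>" + long_text + "</a></root>").encode("ascii")

for event, tag, text in text_seen_per_event(doc):
    # Print only the length; at 'start' the value may be None or partial.
    print(event, tag, None if text is None else len(text))
```

The value reported at the 'end' event of <a> is the one to rely on; the value at 'start' is whatever the builder happened to have filled in so far.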


Two files are submitted:
bug.py: An excerpted script that produces the error with the submitted
XML file.

bug.xml: An XML file, a little more than 8200 bytes long. If the page
size is greater than 8K, the file should be enlarged. The important
point is that the text must cross the page boundary; tags, attributes,
and attribute values that cross it are handled correctly.

 
I might have misunderstood the documentation of etree, because there are
situations that I have not tested.
/roland

----------
components: Library (Lib)
messages: 74635
nosy: roland
severity: normal
status: open
title: xml.etree.ElementTree does not read xml-text over page bonderies
type: behavior
versions: Python 2.5

_______________________________________
Python tracker <report@...>
<http://bugs.python.org/issue4100>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/lists%40nabble.com



Changes by roland rehmnert <roland.rehmnert@...>:


Added file: http://bugs.python.org/file11762/bug.py



Changes by roland rehmnert <roland.rehmnert@...>:


Added file: http://bugs.python.org/file11763/bug.xml



Hirokazu Yamamoto <ocean-city@...> added the comment:

The minimal script to reproduce this issue is the "bug.py" I've attached.
I think this issue can be fixed with
"fix_cross_boundary_on_ElementTree.patch". I'll attach the test case for
this issue as "test.py". (I wanted to integrate the test into
test_xml_etree_c.py, but it uses doctest, which I'm not familiar with.)

/////////////////////////
// Cause of issue

TreeBuilder#start() and TreeBuilder#end() are handlers driven by
self._parser.feed(data) in iterparse.next(), and iterparse stores the
elements returned by these functions.

But an element is not fully initialized at that moment. element.text
cannot be determined when the start tag is found, and likewise
element.tail cannot be determined when the end tag is found. We can
only say "the element is initialized" once the next element is
encountered or the TreeBuilder is closed.

As a result, iterparse's _events queue may contain uninitialized
elements, so my patch waits until the element has been initialized.
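The feed-driven behaviour described above can be observed directly with a small sketch. It uses xml.etree.ElementTree.XMLPullParser, which exists in newer Python versions than the ones this report targets, and the helper name feed_in_chunks is made up for illustration. It is not the attached patch or test; it only snapshots elem.text each time an event surfaces after a feed() call. Depending on the Python and Expat versions, the 'start' event may surface after the first or the second chunk, so no particular timing is assumed.

```python
import xml.etree.ElementTree as ET


def feed_in_chunks(chunks):
    """Feed chunks one by one; log (chunk index, event, tag, text snapshot)."""
    parser = ET.XMLPullParser(events=("start", "end"))
    log = []
    for index, chunk in enumerate(chunks):
        parser.feed(chunk)
        # Drain whatever events the parser has produced so far.
        for event, elem in parser.read_events():
            log.append((index, event, elem.tag, elem.text))
    parser.close()
    # Events that only became available once the parser was closed.
    for event, elem in parser.read_events():
        log.append((len(chunks), event, elem.tag, elem.text))
    return log


# The text of <a> is deliberately split across two feed() calls.
log = feed_in_chunks(["<a>first half, ", "second half</a>"])
for entry in log:
    print(entry)
```

Only the snapshot taken at the 'end' event is guaranteed to hold the complete text; the snapshot at 'start' shows whatever state the element was in when that event was drained.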

----------
components: +XML
nosy: +ocean-city
versions: +Python 2.6, Python 3.0
Added file: http://bugs.python.org/file11764/bug.py



Changes by Hirokazu Yamamoto <ocean-city@...>:


----------
keywords: +patch
Added file: http://bugs.python.org/file11765/fix_cross_boundary_on_ElementTree.patch



Changes by Hirokazu Yamamoto <ocean-city@...>:


Added file: http://bugs.python.org/file11766/test.py



Changes by Hirokazu Yamamoto <ocean-city@...>:


Removed file: http://bugs.python.org/file11765/fix_cross_boundary_on_ElementTree.patch



Changes by Hirokazu Yamamoto <ocean-city@...>:


Added file: http://bugs.python.org/file11769/fix_cross_boundary_on_ElementTree_v2.patch



Changes by Hirokazu Yamamoto <ocean-city@...>:


Removed file: http://bugs.python.org/file11766/test.py



Changes by Hirokazu Yamamoto <ocean-city@...>:


Added file: http://bugs.python.org/file11770/test_v2.py



roland rehmnert <roland.rehmnert@...> added the comment:

We have to be careful about how we handle this.

http://effbot.org/zone/element-iterparse.htm

A note on that page says the following:

Note: The tree builder and the event generator are not necessarily
synchronized; the latter usually lags behind a bit. This means that when
you get a “start” event for an element, the builder may already have
filled that element with content. You cannot rely on this, though — a
“start” event can only be used to inspect the attributes, not the
element content. For more details, see
http://mail.python.org/pipermail/xml-sig/2005-January/010838.html.

I understand that elem.text may be undefined at 'start'.

I have not investigated how iterparse handles this situation across
boundaries:

<a> text <b> text </b> text </a>
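A small sketch of the mixed-content case, for anyone who wants to check it. The three text runs are numbered (text1, text2, text3) only so they can be told apart; that numbering is my addition, not part of the example above. The document here is tiny, so it does not exercise a buffer boundary. It only shows where each run ends up once the 'end' event of the outer element arrives, which is a conservative point to read them.

```python
import io
import xml.etree.ElementTree as ET

# Same shape as the example above, with the text runs numbered.
doc = b"<a> text1 <b> text2 </b> text3 </a>"

snapshots = {}
for event, elem in ET.iterparse(io.BytesIO(doc), events=("end",)):
    if elem.tag == "a":
        b = elem[0]  # the only child of <a>
        snapshots = {
            "a.text": elem.text,  # text before the first child
            "b.text": b.text,     # text inside <b>
            "b.tail": b.tail,     # text after </b>, stored on b as its tail
        }

print(snapshots)
```

Note that the text following </b> is stored as the tail of b rather than on a, so it is read here from the parent's 'end' event instead of from b's own.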



Changes by Hirokazu Yamamoto <ocean-city@...>:


----------
assignee:  -> effbot
nosy: +effbot



Fredrik Lundh <effbot@...> added the comment:

Roland's right - "iterparse" only guarantees that it has seen the ">"
character of a starting tag when it emits a "start" event, so the
attributes are defined, but the contents of the text and tail attributes
are undefined at that point.  The same applies to the element children;
they may or may not be present.

If you need a fully populated element, look for "end" events instead.
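As a sketch of that advice: read only the attributes at 'start' and take the content at 'end'. The function name iter_items and the sample document are invented for illustration and are not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag):
    """Yield (attributes, text) pairs for every element named `tag`."""
    pending = []  # attributes captured at 'start', one entry per open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != tag:
            continue
        if event == "start":
            # Only the attributes are defined at this point.
            pending.append(dict(elem.attrib))
        else:
            # The element is fully populated at 'end'.
            yield pending.pop(), elem.text
            elem.clear()  # drop content that is no longer needed


sample = b'<root><item id="1">one</item><item id="2">two</item></root>'
for attrs, text in iter_items(io.BytesIO(sample), "item"):
    print(attrs, text)
```

The attributes are also still available at 'end', so the 'start' branch mainly matters when a decision has to be made before the content has been read.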



Changes by Hirokazu Yamamoto <ocean-city@...>:


----------
resolution:  -> invalid
status: open -> closed



Hirokazu Yamamoto <ocean-city@...> added the comment:

I propose noting this behavior in the documentation. I'll attach the
patch. (I just inserted your comment into the documentation.)

----------
components: +Documentation -Library (Lib), XML
resolution: invalid ->
status: closed -> open
Added file: http://bugs.python.org/file11925/ElementTree_iterparse_doc.patch
