[issue3574] compile() cannot decode Latin-1 source encodings

by Weeble-2

New submission from Brett Cannon <brett@...>:

The following leads to a SyntaxError in 3.0:

  compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec')

That is not the case in Python 2.6.
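
A minimal sketch for checking the report on a given interpreter. The names `source`, `namespace` and `result` are illustrative only and are not part of the original report. On an interpreter affected by this issue the `except` branch is taken; on one where the issue is fixed, `u` should come out as U+00C7.

```python
# Source bytes carrying a PEP 263 coding cookie; 0xC7 is "Ç" in Latin-1.
source = b'# coding: latin-1\nu = "\xC7"\n'

namespace = {}
try:
    code = compile(source, '<dummy>', 'exec')
except SyntaxError as exc:
    # The behaviour described in this report for 3.0.
    result = 'SyntaxError: %s' % exc
else:
    # The cookie was honoured, so the literal decodes as Latin-1.
    exec(code, namespace)
    result = namespace['u']

print(ascii(result))
```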

----------
messages: 71251
nosy: brett.cannon
severity: normal
status: open
title: compile() cannot decode Latin-1 source encodings

_______________________________________
Python tracker <report@...>
<http://bugs.python.org/issue3574>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/lists%40nabble.com


Brett Cannon <brett@...> added the comment:

Looks like Parser/tokenizer.c:check_coding_spec() considered Latin-1 a
raw encoding just like UTF-8. Patch is in the works.


Brett Cannon <brett@...> added the comment:

Here is a potential fix. It broke test_imp because the test assumed that
Latin-1 source files would be encoded as Latin-1 instead of UTF-8 when
returned by imp.find_module(). Doesn't seem like a critical change as the
file is still properly decoded.

----------
keywords: +patch
Added file: http://bugs.python.org/file11130/fix_latin.diff


Brett Cannon <brett@...> added the comment:

Attached is a test for test_pep3120 (since that is what most likely
introduced the breakage). It's a separate patch since the source file is
marked as binary and thus can't be diffed by ``svn diff``.

----------
components: +Interpreter Core
priority:  -> critical
versions: +Python 3.0
Added file: http://bugs.python.org/file11131/pep3120_test.diff


Changes by Brett Cannon <brett@...>:


----------
type:  -> behavior


Brett Cannon <brett@...> added the comment:

Can someone double-check this patch for me? I don't have much experience
with the parser so I want to make sure I am not doing anything wrong.


Brett Cannon <brett@...> added the comment:

There is a potential dependency on issue3594 as it would change how
imp.find_module() acts and thus make test_imp no longer fail in the way
it has.

----------
dependencies: +PyTokenizer_FindEncoding() never succeeds


Benjamin Peterson <musiccomposition@...> added the comment:

That line dates back to the PEP 263 implementation. Martin?

----------
nosy: +benjamin.peterson, loewis


Changes by Brett Cannon <brett@...>:


----------
priority: critical -> release blocker


Changes by Brett Cannon <brett@...>:


----------
keywords: +needs review


Martin v. Löwis <martin@...> added the comment:

Since this is marked "release blocker", I'll provide a shallow comment:

I don't think it should be a release blocker. It's a bug in the compile
function, and there are various work-arounds (such as saving the bytes
to a temporary file and executing that one, or decoding the byte string
to a Unicode string, and then compiling the Unicode string). It is
sufficient to fix it in 3.0.1.
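
The two work-arounds mentioned above can be sketched roughly as follows. This is only an illustration on a current Python 3: the file name `latin1_example.py`, the variable names, and the use of `runpy.run_path` to "execute that one" are assumptions, not something the comment prescribes. The first variant also assumes the caller already knows the encoding.

```python
import os
import runpy
import tempfile

source = b'# coding: latin-1\nu = "\xC7"\n'

# Work-around 1: decode the byte string to a Unicode string first,
# then compile the str. The encoding is supplied by the caller here.
text = source.decode('latin-1')
ns_decoded = {}
exec(compile(text, '<dummy>', 'exec'), ns_decoded)

# Work-around 2: save the bytes to a temporary file and execute that file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'latin1_example.py')  # hypothetical name
    with open(path, 'wb') as f:
        f.write(source)
    ns_file = runpy.run_path(path)

print(ascii(ns_decoded['u']), ascii(ns_file['u']))
```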

I don't think the patch is right: as the test had to be changed, it
means that somewhere, the detection of the encoding declaration now
fails. This is clearly a new bug, but I don't have the time to analyse
the cause further.

In principle, there is nothing wrong with the tokenizer treating latin-1
as "raw" - that only means we don't go through a codec.


Brett Cannon <brett@...> added the comment:

Actually, the tests don't have to change; if issue 3594 gets applied
then that change cascades into this issue and negates the need to change
the tests themselves.

As for treating Latin-1 as a raw encoding, how can that be theoretically
okay if the parser assumes UTF-8 and UTF-8 is not a superset of Latin-1?
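
The byte-level mismatch behind this question can be seen in a small sketch. The variable names are illustrative; the point is only that the same character has different byte sequences in the two encodings, and that a lone Latin-1 byte above 127 is not valid UTF-8.

```python
char = '\xc7'  # "Ç", the character used in the original report

latin1_bytes = char.encode('latin-1')  # a single byte
utf8_bytes = char.encode('utf-8')      # a two-byte sequence

# Try to read the Latin-1 byte as if it were UTF-8.
try:
    latin1_bytes.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(latin1_bytes, utf8_bytes, decodes_as_utf8)
```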


Martin v. Löwis <martin@...> added the comment:

> As for treating Latin-1 as a raw encoding, how can that be theoretically
> okay if the parser assumes UTF-8 and UTF-8 is not a superset of Latin-1?

The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all
non-ASCII bytes on to the AST, which then needs to deal with them;
it then could (but apparently doesn't) take into account whether the
internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode
for some remains of that.

The other case (besides string literals) where bytes > 127 matter is
tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but
could be easily extended to support Latin-1 as well).

The third case where non-ASCII bytes are allowed is comments; there
they are entirely ignored (i.e. it is not even verified that the
comment is well-formed UTF-8).

Removal of the special case should simplify the code; I would agree
that any speedup gained by not going through a codec is irrelevant.
I'm still puzzled why test_imp fails if the special case is removed.


Brett Cannon <brett@...> added the comment:

The test_imp stuff has to do with PyTokenizer_FindEncoding().
imp.find_module() only opens the file, passes the file descriptor to
PyTokenizer_FindEncoding() and then returns a file object with the found
encoding.

Problem is that (as issue 3594 points out), PyTokenizer_FindEncoding()
always fails. That means it assumes only the raw encodings are okay.
With Latin-1 being one of them, it returns the file opened as Latin-1,
which is correct. Removing that case here means PyTokenizer_FindEncoding()
fails, and thus assumes only UTF-8 as a legitimate encoding and opens
the files with the UTF-8 encoding. It took a while to find these two
bugs obviously. =)
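
For readers who want to see what "finding the encoding" of a source file looks like, here is a rough Python-level analogue using the standard library's `tokenize.detect_encoding()`. It is not the C function PyTokenizer_FindEncoding() discussed above, only a sketch of the same idea: look for a coding cookie and fall back to UTF-8 when none is found. The variable names are illustrative.

```python
import io
import tokenize

with_cookie = b'# coding: latin-1\nu = "\xC7"\n'
without_cookie = b'u = 1\n'

# detect_encoding() takes a readline callable that yields bytes.
enc_cookie, _ = tokenize.detect_encoding(io.BytesIO(with_cookie).readline)
enc_default, _ = tokenize.detect_encoding(io.BytesIO(without_cookie).readline)

# Decode the source with whatever encoding was detected.
text = with_cookie.decode(enc_cookie)

print(enc_cookie, enc_default, ascii(text))
```

If the detection step failed and UTF-8 were used for the first source instead, decoding the 0xC7 byte would raise an error, which is in line with the kind of mismatch described in this comment.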


Brett Cannon <brett@...> added the comment:

I have attached a new version of the patch with the changes to test_imp
removed as issue 3594 fixed the need for the change. I have also
directly uploaded test_pep3120.py since it is flagged as binary and thus
cannot be diffed by svn.

Added file: http://bugs.python.org/file11398/fix_latin.diff


Changes by Brett Cannon <brett@...>:


Removed file: http://bugs.python.org/file11130/fix_latin.diff


Changes by Brett Cannon <brett@...>:


Added file: http://bugs.python.org/file11399/test_pep3120.py


Changes by Brett Cannon <brett@...>:


Removed file: http://bugs.python.org/file11131/pep3120_test.diff


Changes by Barry A. Warsaw <barry@...>:


----------
priority: release blocker -> deferred blocker


Changes by Barry A. Warsaw <barry@...>:


----------
priority: deferred blocker -> release blocker

_______________________________________
Py