Bugs item #1908443, was opened at 2008-03-05 19:01
Message generated for change (Comment added) made by jenglish
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=112997&aid=1908443&group_id=12997
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 99. Other
Group: obsolete: 8.5.1
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Joe English (jenglish)
Assigned to: Joe English (jenglish)
Summary: Composed characters in UTF-8 locale
Initial Comment:
Observed on Debian Sarge after installing UTF-8 locales: ISO8859-1 characters entered with compose key sequences end up wrong.
Setup: xmodmap -e 'keysym Super_L = Multi_key Super_L'
(this makes the Windows key into a Compose key).
export LC_ALL=en_US.UTF-8
export XMODIFIERS=@im=local
Run wish; verify that [encoding system] is utf-8
Press e.g., <Compose> <c> <comma>.
This should turn into ç (c-cedilla, \UE7). Instead, it shows up as \UFFE7.
I think I've narrowed this down to sometime between 8.4.12 and 8.4.13. Problem appears to be in Tcl, not Tk.
This looks like improper sign extension.
----------------------------------------------------------------------
>Comment By: Joe English (jenglish)
Date: 2008-07-04 12:19
Message:
Logged In: YES
user_id=68433
Originator: YES
Backported tclEncoding.c fix to core-8-4 and core-8-5 branches. Closing.
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-06-10 18:31
Message:
Logged In: YES
user_id=68433
Originator: YES
Patch#1986818 committed to Tk CVS HEAD and core-8-5 branch, which should
fix the underlying problem.
Patch to generic/tclEncoding.c committed to Tcl CVS HEAD; will backport if
nobody's compiler complains (gcc does not; I suspect MSVC might want
another unnecessary cast or three).
----------------------------------------------------------------------
Comment By: Konstantin Khomoutov (flatworm)
Date: 2008-06-07 07:21
Message:
Logged In: YES
user_id=1350198
Originator: NO
See also 1967075 -- maybe they're related.
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-06-06 12:56
Message:
Logged In: YES
user_id=68433
Originator: YES
Proposed fix: Patch#1986818
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-03-05 20:16
Message:
Logged In: YES
user_id=68433
Originator: YES
Sorry, correction to initial report: test was run with LC_ALL=en_US.
(Setting LC_ALL=en_US.UTF-8 yields different results, possibly correct.)
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-03-05 20:10
Message:
Logged In: YES
user_id=68433
Originator: YES
Quoth TFM: "The XmbLookupString and XwcLookupString functions return text
in the encoding of the locale bound to the input method of the specified
input context."
It appears that Xlib is using a different set of heuristics to determine
the encoding of a locale than Tcl (and glibc) does. Xlib apparently uses
the table in /usr/lib/X11/locale/locale.alias, while Tcl uses nl_langinfo.
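One way to see the nl_langinfo side of that comparison is to ask the C library which codeset it associates with a locale name. This is a minimal sketch, assuming a POSIX system; codeset_for_locale is a hypothetical helper written for illustration and is not Tcl's actual code. It says nothing about what Xlib derives from locale.alias, which has to be checked separately.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from Tcl): switch the process to the named
 * locale and report the codeset the C library associates with it.
 * Returns NULL if the locale is not installed. Note that it changes
 * the process-wide locale as a side effect.
 */
const char *codeset_for_locale(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL) {
        return NULL;
    }
    return nl_langinfo(CODESET);
}
```

Calling it with "en_US" and with "en_US.UTF-8" on the affected machine would show whether the C library's answer differs between the two names, which is the distinction the correction to the initial report turns on.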
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-03-05 19:48
Message:
Logged In: YES
user_id=68433
Originator: YES
This is one of the nasty surprises in C89: when casting a (signed) char to
an unsigned short, it gets widened to a signed short first, then converted
to an unsigned short. Sign extension happens in the first step.
Changing the line from:
ch = (Tcl_UniChar) *src;
to
ch = (unsigned char) *src;
prevents sign extension and makes things behave as expected. (You don't
need to say "(Tcl_UniChar)(unsigned char)*src", since the usual integral
promotions apply. MSVC might complain though.)
This masks the problem, but does not fix it: the real problem is that
UtfToUtfProc is getting called in the first place.
XmbLookupString() is apparently returning ISO8859-1 text, but Tk believes
this is in the "system" encoding, which is utf-8. IOW, Xlib's idea of "the
system encoding" is different from Tcl's.
More research required.
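The sign-extension effect can be reproduced outside Tcl. The sketch below assumes a 16-bit unsigned stand-in for Tcl_UniChar (UniChar16 is a name made up here), and the two function names are hypothetical; the input is the value -25, which is what the byte 0xE7 reads as on platforms where plain char is signed.

```c
#include <stdint.h>

/* Assumed stand-in for Tcl_UniChar: a 16-bit unsigned type. */
typedef uint16_t UniChar16;

/*
 * Mirrors "ch = (Tcl_UniChar) *src;" when *src is a signed char:
 * a negative value is converted modulo 65536, so -25 becomes 0xFFE7.
 */
UniChar16 widen_via_signed(signed char c)
{
    return (UniChar16) c;
}

/*
 * Mirrors "ch = (unsigned char) *src;": the value is first reduced to
 * the range 0..255, so -25 becomes 0xE7 and then widens to 0x00E7.
 */
UniChar16 widen_via_unsigned(signed char c)
{
    return (UniChar16) (unsigned char) c;
}
```

With an argument of (signed char) -25, the first function yields 0xFFE7 and the second yields 0x00E7, matching the \UFFE7 versus \UE7 difference in the initial report.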
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-03-05 19:40
Message:
Logged In: YES
user_id=68433
Originator: YES
Specifically, this part:
generic/tclEncoding.c r1.16.2.9 -> r1.16.2.10
@@ -2083,13 +2083,23 @@ UtfToUtfProc(clientData, src, srcLen, flags, statePtr, dst, dstLen,
*/
*dst++ = 0;
src += 2;
+ } else if (!Tcl_UtfCharComplete(src, srcEnd - src)) {
+ /* Always check before using Tcl_UtfToUniChar. Not doing
+ * can so cause it run beyond the endof the buffer! If we
+ * happen such an incomplete char its byts are made to
+ * represent themselves.
+ */
+
+ ch = (Tcl_UniChar) *src;
^^^^^^^^^^^^^^^^^^^ here
+ src += 1;
+ dst += Tcl_UniCharToUtf(ch, dst);
} else {
src += Tcl_UtfToUniChar(src, &ch);
dst += Tcl_UniCharToUtf(ch, dst);
}
}
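To see why a lone ISO8859-1 byte would land in the new branch: 0xE7 has the bit pattern of a three-byte UTF-8 lead byte, so a one-byte buffer holding it looks like an incomplete character. The sketch below is a simplified illustration of that kind of completeness check, not Tcl's implementation of Tcl_UtfCharComplete; both function names are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes a UTF-8 sequence should occupy,
 * judged only from its first byte. Continuation and invalid lead bytes
 * are treated as single units.
 */
size_t utf8_expected_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1;
}

/*
 * Hypothetical helper: nonzero if the buffer holds at least as many
 * bytes as the lead byte calls for.
 */
int utf8_char_complete(const unsigned char *src, size_t available)
{
    return available > 0 && available >= utf8_expected_length(src[0]);
}
```

Under this sketch, a buffer containing only 0xE7 is reported as incomplete, while the two-byte UTF-8 encoding of c-cedilla (0xC3 0xA7) is reported as complete.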
----------------------------------------------------------------------
Comment By: Joe English (jenglish)
Date: 2008-03-05 19:18
Message:
Logged In: YES
user_id=68433
Originator: YES
`git bisect` narrows it down to this commit:
Author: andreas_kupries <andreas_kupries>
Date: Wed Apr 5 00:05:53 2006 +0000
* generic/tclIO.c (ReadChars): Added check and panic and
commentary to a piece of code which relies on BUFFER_PADDING to
create enough space at the beginning of each buffer for the
insertion of partial multi-byte data at the beginning of a
buffer. To explain why this code is ok, and as precaution if
someone twiddled the BUFFER_PADDING into uselessness.
* generic/tclIO.c (ReadChars): [SF Tcl Bug 1462248]. Added code
temporarily suppress the use of TCL_ENCODING_END set when eof
was reached while the buffer we are converting is not truly the
last buffer in the queue. Together with the Utf bug below it was
possible to completely bollox the buffer data structures,
eventually crashing Tcl.
* generic/tclEncoding.c (UtfToUtfProc): Fixed problem where the
function accessed memory beyond the end of the input
buffer. When TCL_ENCODING_END is set and the last bytes of the
buffer start a multi-byte sequence. This bug contributed to [SF
Tcl Bug 1462248].
----------------------------------------------------------------------
Tcl-Bugs mailing list
Tcl-Bugs@...
https://lists.sourceforge.net/lists/listinfo/tcl-bugs