The size of a character

The size of a character

by Larry W. Virden-2
 
Just in the past few days, a developer contacted me about a production problem. They were processing some Unicode data in a Tcl script, using tDOM. The file contained 𝒜, a script capital A, which is code point U+1D49C. When tDOM hit this character, it replied:

"tcldom_AppendEscaped: can only handle UTF-8 chars up to 3 bytes length"
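
For reference, the limit in that message lines up with the character's UTF-8 length: U+1D49C lies outside the Basic Multilingual Plane (BMP), so it needs four bytes, while every BMP character fits in three. A minimal sketch in Python, used here purely for illustration and not taken from tDOM or Tcl:

```python
# U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) lies above U+FFFF, outside the BMP.
script_a = chr(0x1D49C)

utf8 = script_a.encode("utf-8")
print(len(utf8))       # 4
print(utf8.hex(" "))   # f0 9d 92 9c

# The highest BMP code point still fits in three UTF-8 bytes.
bmp_max = "\uffff".encode("utf-8")
print(len(bmp_max))    # 3
```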

I contacted the tdom author, who tells me this is because the default Tcl only handles 3 byte unicode.

With the increasing use of Unicode around the world, and Tcl having one of the premier libraries to manipulate such data, are there any technical reasons not to extend Tcl's support from 3-byte Unicode to whatever the next level of Unicode character size might be?

I've been told that one could always create a custom version of Tcl, modifying the define in tcl.h (I believe). However, I was just wondering what the consequences of that might be. If everything is certain to work, then what forces are at play that would prevent Tcl from shipping with the value set higher?

I'm trying to figure out what our next step should be. Tell authors and publishers they can't use all of Unicode? Write the code in some other language, or at least, some sort of pre-filter that encodes the bigger characters?

This latter of course is not on topic for this list. I'll work on that elsewhere. The on-topic questions are the first few in this message: what impact we should expect, and what might be preventing the shipping Tcl distribution from using a larger character size.



--
Tcl - The glue of a new generation.   http://wiki.tcl.tk/
Larry W. Virden http://www.purl.org/net/lvirden/
http://www.xanga.com/lvirden/
Even if explicitly stated to the contrary, nothing in this posting
should be construed as representing my employer's opinions.

-------------------------------------------------------------------------
Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
Studies have shown that voting for your favorite open source project,
along with a healthy diet, reduces your potential for chronic lameness
and boredom. Vote Now at http://www.sourceforge.net/community/cca08
_______________________________________________
Tcl-Core mailing list
Tcl-Core@...
https://lists.sourceforge.net/lists/listinfo/tcl-core

Re: The size of a character

by Larry W. Virden-2



Larry W. Virden wrote:

> Just in the past few days, a developer contacted me about a production
> problem. They were processing some unicode data in a tcl script, using tdom.
> The file contained 𝒜 - a script-A, which is 0x1D49C. Tdom when hitting
> this character replied:
>
> "tcldom_AppendEscaped: can only handle UTF-8 chars up to 3 bytes length"
 
 
I want to apologize. I misquoted the tdom author, due to some sort of misunderstanding with regard to the true nature of the issue. See http://tech.groups.yahoo.com/group/tdom/message/1864 for the specifics of the response as well as http://tech.groups.yahoo.com/group/tdom/message/1866 for a subsequent response when I was asking about opinions on the impact on Tcl to raise the limits.
 
From the sound of it, this is something that would only be done at a major version change (due to the underlying assumptions that are changed) - am I correct? Has anyone considered this as useful for Tcl 9.0?
 
Just curious. For people working in the publishing industry (and perhaps other segments), full unicode support would be very beneficial.
 
 

Re: The size of a character

by Donal K. Fellows-2

Larry W. Virden wrote:

> I want to apologize. I misquoted the tdom author, due to some sort of
> misunderstanding with regard to the true nature of the issue. See
> http://tech.groups.yahoo.com/group/tdom/message/1864 for the specifics
> of the response as well as
> http://tech.groups.yahoo.com/group/tdom/message/1866 for a subsequent
> response when I was asking about opinions on the impact on Tcl to raise
> the limits.
>  
>  From the sound of it, this is something that would only be done at a
> major version change (due to the underlying assumptions that are
> changed) - am I correct? Has anyone considered this as useful for Tcl 9.0?
>  
> Just curious. For people working in the publishing industry (and perhaps
> other segments), full unicode support would be very beneficial.

This is just my off-the-cuff opinion, but it should be possible to
support UNICODE outside the BMP by using surrogate pairs. Yes, it does
mean that we'd lose the [string length $anyChar]==1 property, but it
would be fairly straight-forward to implement otherwise as it is just a
change to the encoding system. As long as high-UNICODE are infrequent,
that should be a reasonable compromise.
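
The surrogate-pair arithmetic behind this suggestion can be sketched as follows. This is an illustrative Python sketch of the standard UTF-16 rule, not Tcl's encoding code; the function name `to_surrogate_pair` is made up for the example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D49C)
print(hex(high), hex(low))              # 0xd835 0xdc9c

# Counted in 16-bit units, the single character has length 2.
units = len(chr(0x1D49C).encode("utf-16-le")) // 2
print(units)                            # 2
```

The last line is the lost [string length $anyChar]==1 property in miniature: one character, two 16-bit units.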

Longer-term, increasing the size of Tcl_UniChar to 4 bytes is probably
the only way to solve this. That's a Tcl-9 thing for sure, as it is a
binary-incompatible change, but it's more feasible these days as memory
sizes have increased a lot since the inception of Tcl 8.1.
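
To put rough numbers on the memory trade-off, here is a small Python sketch comparing the storage needed by 1-, 2- and 4-byte code units. The sample text is arbitrary and the byte counts ignore any per-object overhead, so treat it as an illustration only.

```python
ascii_text = "set x [string length $y]\n" * 1000

# Storage for the same text in 1-, 2- and 4-byte code units.
sizes = {enc: len(ascii_text.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, size in sizes.items():
    print(f"{enc:10} {size:7} bytes")

# With 4-byte units every code point is exactly one unit, including U+1D49C.
script_a = "\U0001d49c"
units_16 = len(script_a.encode("utf-16-le")) // 2
units_32 = len(script_a.encode("utf-32-le")) // 4
print(units_16, units_32)   # 2 1
```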

Or am I off-base here?

Donal.

Re: The size of a character

by Kevin Kenny-3

Donal K. Fellows wrote:

> This is just my off-the-cuff opinion, but it should be possible to
> support UNICODE outside the BMP by using surrogate pairs. Yes, it does
> mean that we'd lose the [string length $anyChar]==1 property, but it
> would be fairly straight-forward to implement otherwise as it is just a
> change to the encoding system. As long as high-UNICODE are infrequent,
> that should be a reasonable compromise.
>
> Longer-term, increasing the size of Tcl_UniChar to 4 bytes is probably
> the only way to solve this. That's a Tcl-9 thing for sure, as it is a
> binary-incompatible change, but it's more feasible these days as memory
> sizes have increased a lot since the inception of Tcl 8.1.
>
> Or am I off-base here?

Dead on ... except for the "only way" bit.

I'd much prefer the approach of using UTF-8 wherever possible, and
reserving the Tcl_UniChar stuff for interfaces that unquestionably
require it.  Right now, the Tcl_UniChar string is used mostly because
we have no way to do operations like [string index] and [string range]
in constant (or sublinear) time without it.

But that's fixable. We might consider replacing the UTF-16 string
for most uses with an index that locates the start of some UTF-8
characters in the string. (I say "some" because I'm not yet ready
to commit to "every Nth", or "at least every Nth", or anything
along those lines without a bit more analysis.)  Operations like
[string range], [string index], ... would go through this
data structure to locate a nearby starting point, and then, in
constant time, locate the substring of interest.  Operations like
[string first], [regexp], ... would, upon finding a match point,
do a reverse lookup in the data structure to find the character
number corresponding to a nearby byte position and then, in
constant time, locate the precise character position of interest.
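
One way to picture such an index is sketched below, assuming the simplest "every Nth character" policy (which the message explicitly does not commit to) and valid UTF-8 input. The class name `SparseUtf8Index` and its methods are invented for this illustration, and Python stands in for the C that a real implementation would use.

```python
import bisect


class SparseUtf8Index:
    """Checkpoint the byte offset of every `stride`-th character of a UTF-8 buffer."""

    def __init__(self, data, stride=64):
        self.data = data
        self.stride = stride
        self.offsets = []  # offsets[k] is the byte offset of character k * stride
        count = 0
        for pos, byte in enumerate(data):
            if byte & 0xC0 != 0x80:  # ASCII or lead byte: a character starts here
                if count % stride == 0:
                    self.offsets.append(pos)
                count += 1
        self.length = count  # number of characters

    def byte_offset(self, char_index):
        """Character index -> byte offset, scanning fewer than `stride` characters."""
        if not 0 <= char_index <= self.length:
            raise IndexError(char_index)
        if char_index == self.length:
            return len(self.data)
        pos = self.offsets[char_index // self.stride]
        for _ in range(char_index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:  # skip continuation bytes
                pos += 1
        return pos

    def char_index(self, byte_pos):
        """Byte offset -> character index (the reverse lookup)."""
        k = bisect.bisect_right(self.offsets, byte_pos) - 1
        if k < 0:
            return 0  # empty buffer
        index = k * self.stride
        for pos in range(self.offsets[k] + 1, byte_pos + 1):
            if pos == len(self.data) or self.data[pos] & 0xC0 != 0x80:
                index += 1
        return index

    def char_at(self, char_index):
        """Rough analogue of [string index]."""
        start = self.byte_offset(char_index)
        end = self.byte_offset(char_index + 1)
        return self.data[start:end].decode("utf-8")


text = "na\u00efve \U0001d49c " * 100
idx = SparseUtf8Index(text.encode("utf-8"), stride=16)
print(idx.char_at(6) == "\U0001d49c")        # True
print(idx.char_index(idx.byte_offset(500)))  # 500
```

Both lookups scan at most one stride's worth of characters past a checkpoint, so their cost is bounded by the stride rather than by the string length. Building the checkpoints is a linear pass in this sketch, and a real version would also have to keep them up to date when the string changes, which is the maintenance cost the next paragraph refers to.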

The cost of maintaining such a data structure would unquestionably
be considerable. Nevertheless, the cost of shimmering into and
out of the UTF-16 (UCS-2? UCS-4?) internal representation is also
considerable, so the whole thing may just come out in the wash.

--
73 de ke9tv/2, Kevin

Re: The size of a character

by Frédéric Bonnet

Kevin Kenny wrote:

> Donal K. Fellows wrote:
>> Longer-term, increasing the size of Tcl_UniChar to 4 bytes is probably
>> the only way to solve this. That's a Tcl-9 thing for sure, as it is a
>> binary-incompatible change, but it's more feasible these days as memory
>> sizes have increased a lot since the inception of Tcl 8.1.
>>
>> Or am I off-base here?
>
> Dead on ... except for the "only way" bit.
>
> I'd much prefer the approach of using UTF-8 wherever possible, and
> reserving the Tcl_UniChar stuff for interfaces that unquestionably
> require it.  Right now, the Tcl_UniChar string is used mostly because
> we have no way to do operations like [string index] and [string range]
> in constant (or sublinear) time without it.
>
> But that's fixable. We might consider replacing the UTF-16 string
> for most uses with an index that locates the start of some UTF-8
> characters in the string. (I say "some" because I'm not yet ready
> to commit to "every Nth", or "at least every Nth", or anything
> along those lines without a bit more analysis.)  Operations like
> [string range], [string index], ... would go through this
> data structure to locate a nearby starting point, and then, in
> constant time, locate the substring of interest.  Operations like
> [string first], [regexp], ... would, upon finding a match point,
> do a reverse lookup in the data structure to find the character
> number corresponding to a nearby byte position and then, in
> constant time, locate the precise character position of interest.
>
> The cost of maintaining such a data structure would unquestionably
> be considerable. Nevertheless, the cost of shimmering into and
> out of the UTF-16 (UCS-2? UCS-4?) internal representation is also
> considerable, so the whole thing may just come out in the wash.

Or better, use a rope structure instead of flat strings. In addition to
allowing fast insertion, removal and slicing, ropes could be made of
string nodes with distinct UCS encoding forms. This would make byte
arrays practically useless for string operations, and hence remove one
of the most common cases of shimmering.
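
A minimal sketch of the idea, assuming a plain binary concatenation tree whose leaves each pick the narrowest fixed-width encoding form that fits their content. The `Leaf` and `Concat` classes are hypothetical and written in Python for illustration; they are not the Cloverfield design, which the message says is still on paper.

```python
class Leaf:
    """A flat run of characters stored in one fixed-width encoding form."""

    def __init__(self, text):
        widest = max(map(ord, text), default=0)
        if widest < 0x100:
            self.encoding, self.width = "latin-1", 1
        elif widest < 0x10000:
            self.encoding, self.width = "utf-16-le", 2  # BMP only, no surrogate pairs
        else:
            self.encoding, self.width = "utf-32-le", 4
        self.data = text.encode(self.encoding)
        self.length = len(text)

    def char_at(self, i):
        chunk = self.data[i * self.width:(i + 1) * self.width]
        return chunk.decode(self.encoding)


class Concat:
    """Interior node: the concatenation of two ropes."""

    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

    def char_at(self, i):
        if i < self.left.length:
            return self.left.char_at(i)
        return self.right.char_at(i - self.left.length)


pieces = ["plain ASCII, ", "Ελληνικά, ", "\U0001d49c"]
leaves = [Leaf(p) for p in pieces]
rope = Concat(leaves[0], Concat(leaves[1], leaves[2]))

print([leaf.encoding for leaf in leaves])  # ['latin-1', 'utf-16-le', 'utf-32-le']
print(rope.char_at(rope.length - 1) == "\U0001d49c")  # True
```

Only the leaf that actually contains a character outside the BMP pays for 4-byte units, and indexing inside any leaf stays a constant-time slice.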

This is the area I'm currently working on for Cloverfield. But my work
could also be used as a replacement for the flat strings in the current
core, with the help of the right utility procs of course (in the same
way as when Tcl8.1 introduced UTF-8).

Don't ask for code yet, I'm still in the pen and paper phase. But I can
give information about my design if people are interested.

Re: The size of a character

by Donal K. Fellows-2

Kevin Kenny wrote:
> Dead on ... except for the "only way" bit.
[...]
> But that's fixable.
[...]

To bring this back to brass tacks, what is the likelihood of your more
complex data structure being deployed within Tcl 8.6? If the chances are
not good, the surrogate pair method will remain the best technique for
getting a fix into users' hands sooner rather than later, since (as
Larry reports) we're now seeing real requirements for these sorts of
things. We can do something better later on, but letting "good enough"
wait on "perfect" is a recipe for inaction.

Donal.

Re: The size of a character

by Brian Griffin

Frédéric Bonnet wrote:

> Kevin Kenny wrote:
>  
>> Donal K. Fellows wrote:
>>    
>>> Longer-term, increasing the size of Tcl_UniChar to 4 bytes is probably
>>> the only way to solve this. That's a Tcl-9 thing for sure, as it is a
>>> binary-incompatible change, but it's more feasible these days as memory
>>> sizes have increased a lot since the inception of Tcl 8.1.
>>>
>>> Or am I off-base here?
>>>      
>> Dead on ... except for the "only way" bit.
>>
>> ...
> Or better, use a rope structure instead of flat strings. In addition to
> allowing fast insertion, removal and slicing, ropes could be made of
> string nodes with distinct UCS encoding forms. This would make byte
> arrays practically useless for string operations, and hence remove one
> of the most common cases of shimmering.
>  

In the spirit of letting no good idea go unpunished, ropes sound
amazingly close to the internal data structure of the Text widget.  This
could open up the idea of a -textvariable option for the text widget,
similar to the -listvariable currently on the listbox widget.  Just
gotta figure out what to do with those tags...

-Brian

--
# "Don't be ridiculous. Everyone knows there are no Secret
#  Tcl Illuminati."
#                                         -- Donal Fellows
-------------------------------------------------------------
--                 Mentor Graphics Corp.                   --
-- 8005 SW Boeckman Road                  503.685.7000 tel --
-- Wilsonville, OR 97070 USA              503.685.0921 fax --
-------------------------------------------------------------
-- Technical support ............ mailto:support@... --
-- Sales and marketing info ....... mailto:sales@... --
-- Licensing .................... mailto:license@... --
-- Home Page ........................ http://www.model.com --
-------------------------------------------------------------


Re: The size of a character

by Frédéric Bonnet

Donal K. Fellows wrote:
> To bring this back to brass tacks, what is the likelihood of your more
> complex data structure being deployed within Tcl 8.6?

Of course it depends on the timeframe for Tcl 8.6, but I think that I
can finish the design and get a working implementation within a couple
of months (as a background, free time task).

But the main concern is compatibility on the API level, as changing the
format of the string rep would impact virtually all existing code. A
possible tradeoff would be to convert ropes to flat strings (a
potentially expensive operation for large strings) whenever the old
rope-unaware API is used, whereas newer rope-aware code would have to
use the new API for maximum performance. I think that plugging the code
into the core would take a similar effort. Converting existing code to
the new API could then be done in an incremental fashion.
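
The trade-off described above might look roughly like this: a rope that is flattened, once, the first time rope-unaware code asks for a flat string. This is a Python sketch under that assumption; `Rope`, `flatten` and `legacy_length` are hypothetical names, not an existing or proposed Tcl API.

```python
class Rope:
    """Concatenation tree over flat strings, with a cached flat form."""

    def __init__(self, text=None, left=None, right=None):
        self.left = left
        self.right = right
        self._flat = text  # leaves are born flat; concatenations start as None

    @classmethod
    def concat(cls, left, right):
        return cls(left=left, right=right)

    def flatten(self):
        """Return the flat string, building it at most once."""
        if self._flat is None:
            parts, stack = [], [self]
            while stack:  # iterative left-to-right traversal
                node = stack.pop()
                if node._flat is not None:  # leaf, or subtree flattened earlier
                    parts.append(node._flat)
                else:
                    stack.append(node.right)
                    stack.append(node.left)
            self._flat = "".join(parts)  # the potentially expensive step
        return self._flat


def legacy_length(flat_string):
    """Stand-in for an old, rope-unaware API that only accepts flat strings."""
    return len(flat_string)


rope = Rope.concat(Rope("ab"), Rope.concat(Rope("cd"), Rope("ef")))
print(legacy_length(rope.flatten()))     # 6
print(rope.flatten() is rope.flatten())  # True: the second call reuses the cache
```

Rope-aware code would work on the tree directly and never trigger the copy; only callers of the old interface pay for it.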

FYI my design will be inspired by Boehm's C Cords, which have limited
forward compatibility with plain C strings. See:

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

Re: The size of a character

by Joe English-2


fbonnet wrote:
> Of course it depends on the timeframe for Tcl 8.6, but I think that I
> can finish the design and get a working implementation within a couple
> of months (as a background, free time task).
>
> But the main concern is compatibility on the API level, as changing the
> format of the string rep would impact virtually all existing code.

Changing the format of the string rep is a nonstarter.
It would break everything.

However, you don't need to change the format of the string rep.
You can use ropes for the *internal* representation.
IOW, this wouldn't replace the string rep -- it would replace
the tclStringType Tcl_ObjType.
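
The distinction can be illustrated with a toy model of a dual-representation value: a string rep whose format never changes, plus one cached internal rep that is rebuilt whenever a different type is requested. The Python below is only a model; the `Value` class and its type names are invented here and do not mirror the real Tcl_Obj layout.

```python
class Value:
    """Toy dual-representation value: fixed string rep, swappable internal rep."""

    def __init__(self, string_rep):
        self.string_rep = string_rep  # flat string; its format is never changed
        self.int_type = None          # name of the cached internal rep's type
        self.int_rep = None
        self.conversions = 0

    def get(self, type_name, build):
        """Return the internal rep of the requested type, rebuilding it from
        the string rep when the cached one has a different type."""
        if self.int_type != type_name:
            self.int_rep = build(self.string_rep)
            self.int_type = type_name
            self.conversions += 1
        return self.int_rep


v = Value("a b c")
v.get("string", list)      # build an indexable character representation
v.get("string", list)      # cached, no conversion
v.get("list", str.split)   # a different type replaces the cached rep
v.get("string", list)      # and has to be rebuilt on the way back
print(v.conversions)       # 3
```

Replacing the builder behind the "string" type (a flat array here, possibly a rope) changes nothing about `string_rep`, which is the point being made. The conversions when switching types still happen, however, so this model does not by itself reduce shimmering.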


--Joe English

  jenglish@...

Re: The size of a character

by Donal K. Fellows-2

Frédéric Bonnet wrote:
> Of course it depends on the timeframe for Tcl 8.6,

See TIP#311.

> but I think that I
> can finish the design and get a working implementation within a couple
> of months (as a background, free time task).

That would be *really* tight. Any slippage and you'd miss 8.6b1, and
you'll be missing a2 for sure. By comparison, adapting the encoding
converters can be done in a few days as the change is very localized.

Donal.

Re: The size of a character

by Frédéric Bonnet

Donal K. Fellows wrote:
> Frédéric Bonnet wrote:
>> Of course it depends on the timeframe for Tcl 8.6,
>
> See TIP#311.

Thx!

>> but I think that I can finish the design and get a working
>> implementation within a couple of months (as a background, free time
>> task).
>
> That would be *really* tight. Any slippage and you'd miss 8.6b1, and
> you'll be missing a2 for sure. By comparison, adapting the encoding
> converters can be done in a few days as the change is very localized.

So be it. OTOH this means that I can polish my design and implementation
without compatibility or time constraints. But nothing prevents
targeting Tcl 8.7 ;-)

Re: The size of a character

by Frédéric Bonnet

Joe English wrote:
> Changing the format of the string rep is a nonstarter.
> It would break everything.
>
> However, you don't need to change the format of the string rep.
> You can use ropes for the *internal* representation.
> IOW, this wouldn't replace the string rep -- it would replace
> the tclStringType Tcl_ObjType.

OTOH it would defeat one of the main purposes, that is, shimmering
avoidance.

Re: The size of a character

by Donal K. Fellows-2

Frédéric Bonnet wrote:
> Joe English wrote:
>> However, you don't need to change the format of the string rep.
>> You can use ropes for the *internal* representation.
>> IOW, this wouldn't replace the string rep -- it would replace
>> the tclStringType Tcl_ObjType.
>
> OTOH it would defeat one of the main purposes, that is, shimmering
> avoidance.

Changing one implementation of the internal representation for another
causes significant problems? How so? (We already have the tclStringType
type. Have done for ages now.)

Donal.
