Proper handling of unicode strings

Proper handling of unicode strings

by LCID Fire

I'm currently writing an application that needs to support Unicode, but
I'm still a little confused about how to handle it properly. Maybe
someone can help me out here.

First off, is it valid to assume that e.g. UTF-8 strings are
NUL-terminated? Would it be valid to call g_strdup on a UTF-8 string?

If not (and this is done quite often in the Unicode part of GLib), I
assume I have to pass the byte length along with each string, right
(which will bloat function declarations)?
_______________________________________________
gtk-list mailing list
gtk-list@...
http://mail.gnome.org/mailman/listinfo/gtk-list

Re: Proper handling of unicode strings

by Milosz Derezynski

Yes, a UTF-8 string is a NUL-terminated, ASCII-compatible string. For all purposes except where you need to read it character by character (e.g. Gtk+/Pango "reading" the string to display it), you can just treat it like a normal ASCII string.
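
To illustrate why a byte-wise copy such as g_strdup is fine here, the following is a minimal sketch in plain C, without GLib. The helper name utf8_dup is hypothetical and not part of any library. It relies only on the fact that UTF-8 never uses a zero byte inside a character, so the terminating NUL still marks the end of the string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a GLib function): copy a NUL-terminated
 * UTF-8 string byte for byte, the same way one would copy ASCII.
 * Returns NULL if the allocation fails; the caller frees the result. */
char *utf8_dup(const char *s)
{
    assert(s != NULL);

    /* strlen() counts bytes up to the NUL, which is exactly what
     * malloc() needs; add one for the terminator itself. */
    size_t nbytes = strlen(s) + 1;
    char *copy = malloc(nbytes);

    if (copy != NULL)
        memcpy(copy, s, nbytes);
    return copy;
}
```

Nothing in the copy needs to know where one character ends and the next begins, which is why the multi-byte sequences survive intact.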

Re: Proper handling of unicode strings

by LCID Fire

That's great - simplifies a lot of things. But since one character might
need more space than a gchar, is it safe to call strlen on that string?

Thanks

Re: Proper handling of unicode strings

by Milosz Derezynski

It's "safe" in the aforementioned sense, but if you want to properly count characters in the UTF-8 string, you should use g_utf8_strlen() instead.
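
As a rough illustration of the difference between the two counts, here is a plain C sketch that needs no GLib. The helper name utf8_char_count is hypothetical, and it assumes the input is valid UTF-8. In GLib, g_utf8_strlen(str, -1) gives the character count for a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a GLib function): count the characters
 * (code points) in a NUL-terminated string. Assumes valid UTF-8. */
size_t utf8_char_count(const char *s)
{
    assert(s != NULL);

    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx; every
         * other byte is the first byte of a character. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "na\xC3\xAFve" (the word "naive" spelled with an i-diaeresis), strlen() reports 6 bytes while the helper reports 5 characters.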

Re: Proper handling of unicode strings

by Chris Vine

On Mon, 7 Jul 2008 12:01:36 +0200
"Milosz Derezynski" <internalerror@...> wrote:

> It's "safe" in the aforementioned sense, but if you want to properly
> count characters in the UTF-8 string, you should use g_utf8_strlen()
> instead.
>
> 2008/7/7 LCID Fire <lcid-fire@...>:
>
> > That's great - simplifies a lot of things. But since one character
> > might need more space than a gchar, is it safe to call strlen on
> > that string?

It is not just "safe" in the sense described above, but required if you
need to know the byte length (say to allocate storage on the heap).

If you need to know the byte length, use strlen().  If you need to know
the number of characters (which will be rare, unless you are thinking of
converting, say, to UCS-4), then use g_utf8_strlen().  If you want to
iterate over the string, then g_utf8_next_char() is handy.
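
What such an iteration does under the hood can be sketched in plain C as follows. The helper names utf8_next and utf8_to_ucs4 are hypothetical stand-ins written for this illustration, and both assume valid UTF-8 (no validation is done, so malformed input is not handled). GLib has its own routines for this, such as g_utf8_next_char(), g_utf8_get_char() and g_utf8_to_ucs4(), which are what real code should use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the character starting at p into *out
 * and return a pointer to the start of the next character.
 * Assumes valid UTF-8; performs no validation. */
const char *utf8_next(const char *p, uint32_t *out)
{
    assert(p != NULL && out != NULL);

    const unsigned char *u = (const unsigned char *)p;
    uint32_t cp;
    int extra;  /* number of continuation bytes that follow */

    /* The lead byte tells us how long the sequence is. */
    if (u[0] < 0x80)      { cp = u[0];        extra = 0; }
    else if (u[0] < 0xE0) { cp = u[0] & 0x1F; extra = 1; }
    else if (u[0] < 0xF0) { cp = u[0] & 0x0F; extra = 2; }
    else                  { cp = u[0] & 0x07; extra = 3; }

    /* Each continuation byte contributes its low six bits. */
    for (int i = 1; i <= extra; i++)
        cp = (cp << 6) | (u[i] & 0x3F);

    *out = cp;
    return p + extra + 1;
}

/* Hypothetical helper: walk a UTF-8 string and store each character
 * as a 32-bit value (UCS-4) in a caller-provided buffer of cap
 * elements. Returns the number of characters written. */
size_t utf8_to_ucs4(const char *s, uint32_t *dst, size_t cap)
{
    assert(s != NULL && dst != NULL);

    size_t n = 0;
    while (*s != '\0' && n < cap)
        s = utf8_next(s, &dst[n++]);
    return n;
}
```

This is also where the character count matters: the destination buffer is sized in characters, while the source is still measured in bytes.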

Chris
