Unicode Transliteration & ICU

by David Arakelian

Hi,

I have been told that PHP6 has implemented transliteration in some form,
but no specifics were given.

Does anyone know anything about transliteration in PHP6? I have noticed
that PHP has an ICU extension. ICU has a very comprehensive
transliteration/transform module that is not documented.

Currently I am using iconv and a PECL extension to transliterate, but
they are neither comprehensive nor widely supported.

There is also another method you can use to do transliteration, which
involves NFD normalisation, but this is a very poor option.
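
For reference, the NFD route might look roughly like the sketch below. It
assumes PHP 7 or later with the intl extension (for the Normalizer class) and
PCRE built with UTF-8 support; strip_marks_nfd() is a made-up name. It also
shows why the approach is weak: only characters with a canonical decomposition
lose their accents, so a letter such as "Æ" passes through untouched.

```php
<?php
// Sketch only: strip accents by decomposing to NFD and dropping combining marks.
// Assumes the intl extension (Normalizer) is loaded. The function name is hypothetical.

function strip_marks_nfd(string $text): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Normalisation failed (for example invalid UTF-8); return the input unchanged.
        return $text;
    }
    // \p{Mn} matches nonspacing marks; /u makes PCRE read the subject as UTF-8.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_marks_nfd("Héllo Thìs"), "\n"; // Hello This
echo strip_marks_nfd("Ælfred"), "\n";     // Ælfred ("Æ" has no decomposition)
```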

--
      ,'/:.          David Arakelian
    ,'-/::::.        http://www.theatons.com/
  ,'--/::(@)::.      Web Developer
,'---/::::::::::.    Wales
____/:::::::::::::.  
  T H E A T O N S  


--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: Unicode Transliteration & ICU

by Darren Cook

> Does anyone know anything about transliteration in PHP6? I have noticed
> that PHP has an ICU extension. ICU has a very comprehensive
> transliteration/transform module that is not documented.

It is documented here:
  http://www.icu-project.org/userguide/Transform.html

(But I don't think Transform is in the PHP intl extension?)

No Arabic support, though; Arabic is the transliteration code I'm working
on at the moment (in native PHP; it'll be in the next release of the
MIT-licensed open-source fclib library).

I'm also not sure the Japanese one will be useful, as it sounds like
they do things slightly differently from normal Hepburn romaji to allow
the conversions to be reversible (which also suggests they don't
transliterate the katakana long vowel but keep it as a hyphen?).

> Currently I am using iconv and a PECL extension to transliterate, but
> they are neither comprehensive nor widely supported.

Which languages are you trying to transliterate for?

Darren


--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)



Re: Unicode Transliteration & ICU

by Stanislav Malyshev

Hi!

> It is documented here:
>   http://www.icu-project.org/userguide/Transform.html
>
> (But I don't think Transform is in the PHP intl extension?)

No, not yet.
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...



Re: Unicode Transliteration & ICU

by Darren Cook

> Thanks for your very informative reply, Darren. I guess that maybe
> PHP6 has implemented this from ICU. I was told by a PECL developer
> that there is something in PHP6 but he didn't elaborate.

The intl extension:
  http://pecl.php.net/package/intl/
You can use it from PHP 5.2.4 onwards (or 5.2.3 with some
modifications). Also see php|a magazine, Mar 2008.

> The one I am using at the moment is:
> http://derickrethans.nl/translit.php

Thanks, I'd not heard of that. The Chinese conversion seems to be done
by a huge lookup table, which is interesting.

> Your work sounds interesting. I have downloaded your library, but am
> having trouble navigating through it.

Yes, fclib is quite informal :-).

> What files should I be looking at for the transliteration?

utf8.inc, e.g. fclib_katakana_to_hepburn_romaji().
See also my articles in php|a, Aug and Sep 2007.

> I would like to be able to transliterate absolutely everything in
> unicode. I have no idea if that is unreasonable as I am just getting
> into character sets. I want them to make a bulletproof string to url
> function for search engine friendliness and I also believe it is not
> really a good thing to have high unicode in the url. For example
>
> Héllo Thìs is a URL Ælfred => hello-this-is-a-url-aelfred

If URLs are the only concern I think I'd do this using urlencode(). What
does a transliteration approach gain you?
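
To make the trade-off concrete, here is a rough sketch of both routes for the
example title. slugify_sketch() and its replacement table are hypothetical; the
table covers only the sample input and is nowhere near a full transliteration
scheme. Anything not in the table is simply collapsed into a hyphen.

```php
<?php
// Sketch only: a table-driven slug next to plain urlencode().
// The function name and the tiny $map are made up for illustration.

function slugify_sketch(string $text): string
{
    // Hand-written fallbacks; a real table would be far larger.
    $map = [
        'é' => 'e', 'è' => 'e', 'ì' => 'i', 'í' => 'i',
        'Æ' => 'AE', 'æ' => 'ae',
    ];
    $ascii = strtolower(strtr($text, $map));
    // Collapse every run of bytes outside a-z and 0-9 into a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $ascii);
    return trim($slug, '-');
}

$title = "Héllo Thìs is a URL Ælfred";

echo slugify_sketch($title), "\n"; // hello-this-is-a-url-aelfred
echo urlencode($title), "\n";      // H%C3%A9llo+Th%C3%ACs+is+a+URL+%C3%86lfred
```

urlencode() is lossless and needs no tables, but the result is hard to read;
the slug is readable but throws information away.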

> Another thing that I started working on was a strtoupper, strtolower
> and ucfirst function for cyrillic and anything else that can be upper
> and lower case. However, being new to character set and unicode I am
> having trouble converting the hex codes to actual character and
> cannot get preg_replace to work with high unicode.

See fclib_utf8_chr() and uniord() in utf8.inc, which are UTF-8 versions
of PHP's chr() and ord() functions.
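
For anyone who wants to see the idea without digging through the library, a
minimal sketch of such helpers follows. These are not the fclib functions:
utf8_chr_sketch() and utf8_ord_sketch() are hypothetical names, and the code
does no validation (no checks for surrogates, overlong forms, or truncated
input).

```php
<?php
// Sketch only: code point <-> UTF-8 conversion, assuming well-formed input.

function utf8_chr_sketch(int $cp): string
{
    if ($cp < 0x80) {                       // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {                      // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))          // 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

function utf8_ord_sketch(string $char): int
{
    // Decode the first character of $char; the lead byte gives the length.
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if ($b0 < 0xE0) {
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if ($b0 < 0xF0) {
        return (($b0 & 0x0F) << 12)
             | ((ord($char[1]) & 0x3F) << 6)
             | (ord($char[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18)
         | ((ord($char[1]) & 0x3F) << 12)
         | ((ord($char[2]) & 0x3F) << 6)
         | (ord($char[3]) & 0x3F);
}

echo bin2hex(utf8_chr_sketch(0x0628)), "\n";      // d8a8
echo dechex(utf8_ord_sketch("\xD8\xA8")), "\n";   // 628
```

For the case conversion itself, the mbstring extension's mb_strtoupper() and
mb_strtolower() accept an encoding argument such as 'UTF-8' and cover Cyrillic,
which may save writing the tables by hand.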

I'm not sure about using preg as I'm not sure I've done it that way. The
manual http://jp2.php.net/manual/en/regexp.reference.php has a section
on unicode, but still doesn't seem to support giving a 4-character hex
code. Perhaps you just use \x twice in a row? E.g.
  \x06\x28
to match U+0628 (Arabic BEH).
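
A quick sketch of how this behaves in practice, assuming PCRE was built with
UTF-8 support: two \x escapes in a row match the raw bytes 0x06 and 0x28, which
is not how U+0628 is stored in UTF-8 (it is the byte pair D8 A8). With the /u
pattern modifier the subject is treated as UTF-8, and \x{...} takes a full code
point.

```php
<?php
// Sketch only: matching a code point above U+00FF with preg_*.

$beh = "\xD8\xA8"; // U+0628 ARABIC LETTER BEH as UTF-8 bytes

// /u switches PCRE to UTF-8 mode, so \x{0628} names the code point.
var_dump(preg_match('/\x{0628}/u', $beh)); // int(1)

// Without /u, each \xHH is a single byte; 0x06 0x28 never occurs in $beh.
var_dump(preg_match('/\x06\x28/', $beh));  // int(0)

// Ranges work too: replace anything in U+0600..U+06FF with "?".
echo preg_replace('/[\x{0600}-\x{06FF}]/u', '?', "a{$beh}b"), "\n"; // a?b
```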

Darren


--
Darren Cook, Software Researcher/Developer
http://dcook.org/mlsn/ (English-Japanese-German-Chinese-Arabic
                        open source dictionary/semantic network)
http://dcook.org/work/ (About me and my work)
http://dcook.org/work/charts/  (My flash charting demos)




Re: Unicode Transliteration & ICU

by Andrei Zmievski

The full text transformation support is not there yet, but there is a
simple transliteration function - str_transliterate().

-Andrei

