proposal: unification of the grapheme_extract functions

proposal: unification of the grapheme_extract functions

by Ed Batutis

Hi,

I am proposing to unify the three grapheme_extract functions this way:

string grapheme_extract  ( string $haystack  ,
                           int $size
                           [, int $extract_type  
                           [, string $start  ]] )

where $extract_type is:

GRAPHEME_EXTR_COUNT - $size is number of graphemes (default)
GRAPHEME_EXTR_MAXBYTES - $size is maximum number of bytes to extract
GRAPHEME_EXTR_MAXCHARS - $size is maximum number of UTF-8 characters to
extract

and the other arguments are as in the current set of extract functions.
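
As an illustration only, here is a minimal sketch of how the three modes differ. It assumes the grapheme_extract() that PHP's intl extension provides today, called with the argument order proposed above; the sample string is made up for the example.

```php
<?php
// Requires the intl extension.
// Sample string: "a" + COMBINING RING ABOVE (U+030A, bytes CC 8A), "b", "c".
// That is 3 graphemes, 4 UTF-8 characters, 5 bytes.
$haystack = "a\xCC\x8Abc";

// GRAPHEME_EXTR_COUNT: $size is a number of graphemes (the default type).
$byCount = grapheme_extract($haystack, 2, GRAPHEME_EXTR_COUNT);

// GRAPHEME_EXTR_MAXBYTES: $size is a byte budget; only whole graphemes come back.
$byBytes = grapheme_extract($haystack, 4, GRAPHEME_EXTR_MAXBYTES);

// GRAPHEME_EXTR_MAXCHARS: $size is a budget in UTF-8 characters; whole graphemes only.
$byChars = grapheme_extract($haystack, 2, GRAPHEME_EXTR_MAXCHARS);

echo bin2hex($byCount), "\n"; // 61cc8a62 (a-ring, b)
echo bin2hex($byBytes), "\n"; // 61cc8a62 (3 bytes + 1 byte fit in 4)
echo bin2hex($byChars), "\n"; // 61cc8a   (a-ring alone uses both characters)
```

In each case the result ends on a grapheme boundary; only the unit in which $size is measured changes.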

Sorry if I missed someone's proposal for this - I am only on the php-i18n
list at this point. Please post your proposal to this list, if possible.

Thanks,

=Ed



--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> I am proposing to unify the three grapheme_extract functions this way:
>
> string grapheme_extract  ( string $haystack  ,
>                            int $size
>                            [, int $extract_type  
>                            [, string $start  ]] )
>
> where $extract_type is:
>
> GRAPHEME_EXTR_COUNT - $size is number of graphemes (default)
> GRAPHEME_EXTR_MAXBYTES - $size is maximum number of bytes to extract
> GRAPHEME_EXTR_MAXCHARS - $size is maximum number of UTF-8 characters to
> extract

I think it looks good.

--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis

> > I am proposing to unify the three grapheme_extract functions this way:
> > ...

The change is checked in.

=Ed



Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> The change is checked in.

Thanks!
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Texin, Tex

Ed,

If I use GRAPHEME_EXTR_MAXBYTES, does it return exactly that number of bytes, or the maximum number of whole Unicode characters or whole graphemes that can be extracted without exceeding the max bytes?

I assume it is the max # of whole graphemes that do not exceed the max bytes.
The use case is a limited storage area for Unicode text, such as a filename that is limited to 8 bytes, or mail with a subject heading limited to 60 (pick a number) bytes. The result needs to be a proper Unicode string that displays meaningfully. So if I extract from a larger field, I want the max number of graphemes that fit in my storage space.

Is that what it does?

Also, is the $start value in byte, character, or grapheme units for each of the types?

tex


> -----Original Message-----
> From: Ed Batutis [mailto:ed@...]
> Sent: Friday, May 02, 2008 2:49 PM
> To: php-i18n@...
> Subject: [PHP-I18N] proposal: unification of the
> grapheme_extract functions


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis


>  If I use GRAPHEME_EXTR_MAXBYTES, does it return ...

> I assume it is the max # of whole graphemes that do not exceed the max
> bytes.

Yes. It works just like the old grapheme_extractb.

> Also, the $start value is that in byte, character or grapheme units for
> each of the types?

The start value is always bytes. I was unsure if this made sense, really,
but it is consistent (and easy to implement).
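
A small sketch of what "start is always bytes" means in practice, assuming the grapheme_extract() in PHP's intl extension with the byte offset as the fourth argument. The subject string and the 4-byte limit are invented for the example.

```php
<?php
// Requires the intl extension.
// "e" + COMBINING ACUTE ACCENT (U+0301, bytes CC 81) is one grapheme of 3 bytes.
$subject = "re\xCC\x81sume\xCC\x81 attached";

// Hypothetical storage limit, in bytes.
$maxBytes = 4;

// Start at byte 0 and take as many whole graphemes as fit in $maxBytes.
$head = grapheme_extract($subject, $maxBytes, GRAPHEME_EXTR_MAXBYTES, 0);

// $start is a byte offset, so strlen() of the piece just returned is all
// the bookkeeping needed to continue from where the last call stopped.
$start = strlen($head);
$tail = grapheme_extract($subject, $maxBytes, GRAPHEME_EXTR_MAXBYTES, $start);

echo bin2hex($head), "\n"; // 7265cc81 ("r" plus accented "e", 4 bytes)
echo $tail, "\n";          // sum (the next accented "e" would exceed 4 bytes)
```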

=Ed



RE: proposal: unification of the grapheme_extract functions

by Texin, Tex

Thanks Ed. I remember the discussion now.
Personally I don't think it makes sense.
It is an option that should be offered, because it is good for performance, but it is more tedious programming and harder to migrate programs to use this functionality.

The tradeoff is like this:

Let's say a program is on the third character in a string. Today the program knows it is at an index of 3.
In a multibyte world, if the start value is a character count, the first thing the function does is scan the string to find the byte offset where the 3rd character begins.
However, it is likely that this same byte position is already known from immediately prior work on the string. So passing byte offsets around saves frequent rescanning of the string.
An important caveat is that if the string is modified, the byte offsets have to be thrown away, or at least those after the point of modification.

On the other hand, most existing code does character count arithmetic, and changing it means replacing simple indexing with function calls to get byte offsets.
It is harder to convert the code. It is of course possible to make the intl extension much smarter and have it remember index-to-byte mappings, but we didn't have time in the initial version.

$start = 3;
// Does stuff at 3, then wants to do stuff 4 characters after this position.
$extractbegin = $start + 4;
$ext = substr($mystr, $extractbegin, $len);

Becomes code that has to:
- call a function to find the byte offset of character 3 in the string (by scanning);
- keep 2 variables to remember both the current character count and the byte count;
- call a function to find the byte offset of character 7, by either scanning from the beginning of the string or starting from the known offset of character 3.
$ext=graphemeextract....
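
The extra bookkeeping described above might look roughly like the following sketch. The helper utf8_byte_offset() is hypothetical (it is not part of intl or mbstring), it counts UTF-8 characters rather than graphemes, and it assumes valid UTF-8 input.

```php
<?php
/**
 * Hypothetical helper: byte offset at which the $charIndex-th UTF-8
 * character (0-based) begins. Scanning starts at byte $fromByte, which
 * must be the start of character number $fromChar.
 */
function utf8_byte_offset(string $s, int $charIndex, int $fromByte = 0, int $fromChar = 0): int
{
    $len = strlen($s);
    $byte = $fromByte;
    $char = $fromChar;
    while ($char < $charIndex && $byte < $len) {
        $byte++;
        // Skip continuation bytes (10xxxxxx) of the current character.
        while ($byte < $len && (ord($s[$byte]) & 0xC0) === 0x80) {
            $byte++;
        }
        $char++;
    }
    return $byte;
}

$mystr = "na\xC3\xAFve caf\xC3\xA9 menu"; // "naive cafe menu" with two accented letters
$start = 3;                               // character index, as in the old code

// First variable pair: character index 3 and its byte offset (found by scanning).
$startByte = utf8_byte_offset($mystr, $start);

// Four characters later: reuse the known offset instead of rescanning from byte 0.
$extractbegin = $start + 4;
$extractByte = utf8_byte_offset($mystr, $extractbegin, $startByte, $start);

echo $startByte, " ", $extractByte, "\n"; // 4 8
```

The byte offset in $extractByte is what a byte-based $start would then receive.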

My preference is for start to optionally be a grapheme or character count, so the migration is quick, and then to add optimizations to the extension to recognize strings that are ASCII, cache recently used offsets, etc.
But that's just me...

For most programs the performance gain from using byte offsets is offset by the extra function calls, especially for typically short strings.
(Scanning large buffers repeatedly for offsets into the last few characters can hurt, but can usually be worked around through other optimizations.)

And making the migration difficult will reduce the number of programs that actually support languages that need graphemes...

This wasn't your decision so no reflection on you of course. Next version should add in support for start values to be grapheme counts....
tex


> -----Original Message-----
> From: Ed Batutis [mailto:ed@...]
> Sent: Thursday, May 08, 2008 12:42 PM
> To: Texin, Tex; php-i18n@...
> Subject: RE: [PHP-I18N] proposal: unification of the
> grapheme_extract functions


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis


> It is an option that should be offered, because it is good for
> performance, but it is more tedious programming and harder to migrate
> programs to use this functionality.

I understand what you are saying. However, everyone should move towards
using break iterator, I believe, for performance reasons!
 
I could add more options to the extract call to allow the user to specify
what $start means:

GRAPHEME_EXTR_START_BYTE_COUNT
GRAPHEME_EXTR_START_CHAR_COUNT
GRAPHEME_EXTR_START_GRAPHEME_COUNT
 
The next question would be - what is the default? Should it be 'byte count'
in all cases, or should it match the $extract_type - graphemes if 'count',
bytes if 'max bytes' etc. I suspect the latter, if I understand your
use-case.

Thoughts?
 
=Ed



Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> I could add more options to the extract call to allow the user to specify
> what $start means:
>
> GRAPHEME_EXTR_START_BYTE_COUNT
> GRAPHEME_EXTR_START_CHAR_COUNT
> GRAPHEME_EXTR_START_GRAPHEME_COUNT

I'm afraid that might be an overkill - if you need to start at N
graphemes, why not do grapheme_substr then?
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis

> I'm afraid that might be an overkill - if you need to start at N
> graphemes, why not do grapheme_substr then?

I agree - grapheme_extract starting at N graphemes and returning a count of
graphemes is exactly the same functionality as grapheme_substr, but the
other combinations are not covered elsewhere, I think. I cannot think of a
solid basis for excluding any of the others. And I don't know how to change
the API to make that particular combination 'special' so it can be excluded.

=Ed



Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> I agree - grapheme_extract starting at N graphemes and returning a count of
> graphemes is exactly the same functionality as grapheme_substr, but the
> other combinations are not covered elsewhere, I think. I cannot think of a
> solid basis for excluding any of the others. And I don't know how to change
> the API to make that particular combination 'special' so it can be excluded.

Doesn't mb_substr implement the character stuff?
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis

>
> Doesn't mb_substr implement the character stuff?

Yes, that covers a character offset and a character count to return. mb
calls don't know anything about graphemes, of course. (At one point I
considered adding a 'grapheme mode' to the mb API though.) grapheme_extract
is different, I think, because it is bridging graphemes and
bytes/characters. I don't know of an mb function like that - there's no
mb_extract.
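
To make the "bridging" point concrete, here is a small comparison. It is only a sketch, assuming the mbstring and intl extensions as they exist in current PHP; the sample string is made up.

```php
<?php
// Requires the mbstring and intl extensions.
// "b", then "a" + COMBINING RING ABOVE (U+030A, bytes CC 8A), then "c".
$s = "ba\xCC\x8Ac";

// Character offset and character count: cuts between "a" and its combining mark.
$byChar = mb_substr($s, 0, 2, 'UTF-8');

// Grapheme offset and grapheme count: returns two whole graphemes.
$byGrapheme = grapheme_substr($s, 0, 2);

// Character budget, grapheme-safe result: the a-ring grapheme needs two
// characters, and only one is left after "b", so it is left out.
$bridged = grapheme_extract($s, 2, GRAPHEME_EXTR_MAXCHARS);

echo bin2hex($byChar), "\n";     // 6261
echo bin2hex($byGrapheme), "\n"; // 6261cc8a
echo bin2hex($bridged), "\n";    // 62
```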

=Ed


Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> considered adding a 'grapheme mode' to the mb API though.) grapheme_extract
> is different, I think, because it is bridging graphemes and
> bytes/characters. I don't know of an mb function like that - there's no
> mb_extract.

Maybe I just misunderstand the use case for the extract function - what
it's supposed to do that substr, mb_substr and grapheme_substr can't or
do worse?
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis

> Maybe I just misunderstand the use case for the extract function - what
> it's supposed to do that substr, mb_substr and grapheme_substr can't or
> do worse?

Tex could probably answer this better than I could, but I'll have a go.

Use case 1: You have a buffer that is a fixed number of bytes long. You need
to fill it up as far as you can with whole graphemes. You are probably
sending that buffer to another API that might not be grapheme - or even
Unicode - aware. You are in a loop so you are tracking your position in the
original string. This is how the discussion got started about how the
'start' parameter is defined - it isn't clear how the position would be
tracked. I assumed a byte count because the user can simply do a strlen on
the return string to update his position, but Tex thinks this isn't as handy
as it should be. It depends on the details of the algorithm I guess.

Use case 2: Same as above except in this case it is an Oracle database
buffer where your columns are defined as being N Unicode characters (not
bytes or graphemes) long.

Use case 3 (a generalization of use case 1 really): You have some code that
knows about bytes or Unicode characters but nothing about graphemes. You
want to update the code so it is grapheme aware. You can't completely
abandon a byte count or character count in the code for some reason, but you
want to easily update the code to process whole graphemes.
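
Use case 1 could be sketched as the loop below. The helper name chunk_by_bytes() is hypothetical, and the sketch assumes the intl extension's grapheme_extract() with a byte-offset start, advanced by strlen() of each returned piece.

```php
<?php
// Requires the intl extension.
/**
 * Hypothetical helper: split $text into pieces of at most $bufferBytes
 * bytes each, never cutting through a grapheme.
 */
function chunk_by_bytes(string $text, int $bufferBytes): array
{
    $chunks = [];
    $start = 0;              // byte offset into $text
    $total = strlen($text);
    while ($start < $total) {
        $chunk = grapheme_extract($text, $bufferBytes, GRAPHEME_EXTR_MAXBYTES, $start);
        if ($chunk === false || $chunk === '') {
            // The next grapheme does not fit in the buffer at all;
            // stop rather than loop forever.
            break;
        }
        $chunks[] = $chunk;
        $start += strlen($chunk); // track position in bytes
    }
    return $chunks;
}

// a-ring (3 bytes), "b", "c", "d", e-acute (3 bytes), "f": 10 bytes in total.
$text = "a\xCC\x8Abcde\xCC\x81f";
$chunks = chunk_by_bytes($text, 4);

foreach ($chunks as $chunk) {
    echo bin2hex($chunk), "\n"; // 61cc8a62, then 6364, then 65cc8166
}
```

The second piece is only two bytes long because the following grapheme would not fit in the remaining space.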


=Ed



Re: proposal: unification of the grapheme_extract functions

by Stanislav Malyshev

Hi!

> Tex could probably answer this better than I could, but I'll have a go.

OK, thanks! The picture is much clearer to me now. If you do it in a loop,
then I think bytes should be enough, since you'd have to do
strlen/grapheme_strlen in any case to know how much you received, so
doing strlen is no worse than doing anything else, and you could always
work with bytes there. And since grapheme_extract would always stop on a
grapheme boundary, I think you don't need to worry about bytes not being
good enough for graphemes inside the string.

As an alternative, we could update $start with new "position" - i.e. old
$start+how many bytes we returned, but I'm not sure if it's the best way.

Tex, what do you think?
--
Stanislav Malyshev, Zend Software Architect
stas@...   http://www.zend.com/
(408)253-8829   MSN: stas@...


RE: proposal: unification of the grapheme_extract functions

by Texin, Tex

Hi,

I disagree with case 2 as it is described. You don't want to truncate in the middle of a grapheme, if you in fact have graphemes.

Basically and ideally, there should be only 3 use cases:

A) You are working with graphemes, and ideally you would program with grapheme indexes and counts (start and length in grapheme units).

B) Because of fixed width buffers, you need to specify a max length in bytes, but the function should only extract whole graphemes (start in graphemes, length in bytes).

C) You are not working with text but bytes (which really shouldn't be in this discussion, but for completeness...) and so start and length in bytes.

But we don't live in an ideal world.
Grapheme based processing is more expensive than character processing, and this is more expensive than byte processing.

If this weren't so, we would only have to deal with A, B, and C and programming would be simpler.

It is a bad assumption for i18n, but if you are not dealing with Indic or Middle Eastern languages, then you know you have characters not graphemes, and so why pay the cost?
Also, graphemes are newly supported, so existing code is character based.

Therefore we need to offer character support. Analogous to A and B, we need CA (start and length in character units) and CB (start in character units and length in bytes).

But again performance rears its ugly head. Having start be in character or grapheme units means the function always scans through start number of units to find the beginning offset. Hence the desire to offer a start position in bytes, giving us a version of A and CA that starts with bytes, and a version of B and CB that specifies start and length in bytes but returns a whole number of graphemes or characters, as appropriate.

The final ugliness is we have some of these functions in the plain (or non-mb) flavor and the mb_string flavor.
So, we could say for graphemes use grapheme_substr, for characters use the mb functions, and for bytes use the plain functions (or the other way around; I think mb overloads the plain with the character based and provides the byte versions in the mb form... I always have to look to check).

But some of the mb functions are not implemented well, so I don't trust them, which you can chalk up to my personal idiosyncrasy. The more salient point is that it is confusing for people to have to sort through all the function flavors with different names. I would prefer to have the choices in one function with options and an explanation of when to use what, perhaps derived from the above logic. And I would deprecate the related functions in mb and plain.

That said, if this is all that's holding up the release, I would release with the byte start and add the other flavors in the next version.
People can always use grapheme_length/mb_length (or whatever it is) to get the starting byte position and perhaps write their own function to calculate the byte start and call the grapheme_substr function.
It is a nuisance but if they understand that they can migrate easily.

Let's wrap this up.


tex


> -----Original Message-----
> From: Ed Batutis [mailto:ed@...]
> Sent: Monday, May 12, 2008 1:01 PM
> To: 'Stanislav Malyshev'
> Cc: Texin, Tex; php-i18n@...
> Subject: RE: [PHP-I18N] proposal: unification of the
> grapheme_extract functions


RE: proposal: unification of the grapheme_extract functions

by Ed Batutis


> I disagree with case 2 as it is described. You don't want to truncate in
> the middle of a grapheme, if you in fact have graphemes.

I didn't intend to say that - the only difference between 1 and 2 is that in
2 the buffer is a character-length buffer and presumably you'd have a
character index that you'd like to use in $start. But grapheme_extract
always returns whole graphemes regardless of any option or there's no point
to it.

Stas brought up the idea of having $start be a reference so the routine
could update it to the next position. I think that might solve some problems
in the caller's code. $start could still be defined as any of bytes,
characters, or graphemes and it would be updated respecting that. What do
you think? If we do that, the user might be perfectly happy with only a
"byte flavor" of $start in many simple cases since they don't need to do
anything extra to iterate through the original string - they can always get
a grapheme count or character count if they need it by making a function
call.
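
For comparison, a sketch of iterating with a position that the routine updates for the caller. It assumes the optional fifth, by-reference argument ($next) that grapheme_extract() has in current PHP, which returns the next byte position separately and leaves $start itself unchanged; that is not something this thread had settled on at this point.

```php
<?php
// Requires the intl extension.
// a-ring (3 bytes), "b", "c", "d", e-acute (3 bytes), "f": 10 bytes in total.
$text = "a\xCC\x8Abcde\xCC\x81f";

$pieces = [];
$start = 0; // byte offset passed in
$next = 0;  // set by grapheme_extract() to the byte offset after the piece

while ($start < strlen($text)) {
    $piece = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES, $start, $next);
    if ($piece === false || $piece === '') {
        break; // nothing fits; avoid an endless loop
    }
    $pieces[] = $piece;
    $start = $next; // no strlen() bookkeeping needed in the caller
}

echo count($pieces), " pieces, ended at byte ", $next, "\n"; // 3 pieces, ended at byte 10
```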

=Ed



RE: proposal: unification of the grapheme_extract functions

by Texin, Tex

Hi,
On $start being a reference, I like the idea, especially if we do that consistently for all functions (eventually, I guess). (Otherwise, it may cause bugs to have $start change unexpectedly for a single function.)

It also makes migration a little harder, as people will need to adjust code that does not expect $start to change.

A variation of the proposal might be to have the end value be an optional argument at the end of the arg list for returning the end position.
That is easy to migrate to: it takes a conscious change to receive the updated value, and the variable only needs updating if it will in fact be used.

All in all either approach is fine.

(I am out of the office and don't have the specs in front of me - sorry for not being more precise.)

But it doesn't really fix the fundamental issue of needing to involve PHP programmers in a byte vs char vs grapheme choice.

The right solution (to my mind) is to have some metadata maintained with strings, storing position and other info about the strings, to improve performance without involving programmers and to let people program with strings without caring about encoding or architecture. That is an opportunity for PHP 6.




> -----Original Message-----
> From: Ed Batutis [mailto:ed@...]
> Sent: Tuesday, May 13, 2008 8:25 AM
> To: Texin, Tex; 'Stanislav Malyshev'
> Cc: php-i18n@...
> Subject: RE: [PHP-I18N] proposal: unification of the
> grapheme_extract functions