Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

I was just notified (crunch [bug 4560]) that
Protocols.HTTP.http_encode_string doesn't work right for chars wider
than 7 bits:

According to RFCs 3986 (URI) and 3987 (IRI), chars should be utf-8
encoded followed by the http %XX encoding. http_encode_string instead
leaves 8-bit chars unencoded and uses that strange %uXXXX encoding for
wider chars, a form that has no grounds in standards at all as far as
I've been able to tell. (Must say I'm curious where it comes from. A
comment says it's some kind of Safari encoding. My limited googling
suggests that Safari at least nowadays uses the RFC method.)
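
To make the difference concrete, here is a minimal sketch in Python (not Pike, so as not to suggest any real Pike or Roxen API). The function names are made up for illustration, and legacy_encode is only a model of the behaviour described above, not the actual implementation.

```python
from urllib.parse import quote

def rfc3986_encode(s: str) -> str:
    # RFC 3986/3987 style: UTF-8 encode first, then %XX-escape each octet.
    return quote(s, safe="", encoding="utf-8")

def legacy_encode(s: str) -> str:
    # Hypothetical model of the reported behaviour: 8-bit chars pass
    # through unencoded, wider chars become the nonstandard %uXXXX form.
    out = []
    for ch in s:
        cp = ord(ch)
        if cp > 0xFF:
            out.append("%%u%04X" % cp)
        elif cp >= 0x80:
            out.append(ch)
        else:
            out.append(quote(ch, safe=""))
    return "".join(out)

print(rfc3986_encode("å"))        # %C3%A5
print(rfc3986_encode("\u20ac"))   # %E2%82%AC
print(legacy_encode("å\u20ac"))   # å%u20AC
```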

The corresponding functions in Roxen have been corrected since 4.0.
Encoding/decoding functions are always hazardous to change, so it's
perhaps not an ideal time to do it right now. Otoh it would be rather
nice to have correctly working functions in Pike instead of only in
Roxen. So what do you say about changing it now?

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

I'd say this _is_ the ideal time to fix it.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

Btw, maybe Standards.URI()->http_encode should be fixed at the same
time?  It doesn't seem to encode wide characters at all, and encodes
8-bit characters as %XX (iso-8859-1).

Also, is there a http_decode_string() function somewhere?
Standards.URI()->path et al. seem to return the string with escapes
still in it.  Is that the correct behaviour?

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

There is a _Roxen.http_decode_string which decodes the %XX escapes
themselves (along with those peculiar %uXXXX) but it doesn't do the
subsequent utf-8 decoding. I was planning on making a
Protocols.HTTP.http_decode() (losing the superfluous "_string" suffix
at the same time) which wraps both together.

It's not entirely safe to assume that any %XX-encoded string is
utf-8-encoded underneath however, as the whole elaborate
"magic_roxen_automatic_charset_variable" system in Roxen shows
(although this is getting better since nonconforming browsers are
starting to get rare). Still, I think Pike modules should allow the
user to choose a different interpretation.

As for Standards.URI.path, it wouldn't be safe to decode all %XX
escapes there since the caller then wouldn't be able to tell a quoted
"/" inside a path segment from a path segment separator.

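The point about quoted "/" can be illustrated with a small Python sketch (the path is a made-up example): decoding the whole path before splitting loses the distinction, while splitting on the raw separators first keeps it.

```python
from urllib.parse import unquote

raw_path = "/docs/a%2Fb/c"

# Decoding the whole path first: the quoted "/" becomes a separator.
decoded_first = unquote(raw_path).split("/")
print(decoded_first)   # ['', 'docs', 'a', 'b', 'c']

# Splitting on the real separators first, then decoding each segment.
split_first = [unquote(seg) for seg in raw_path.split("/")]
print(split_first)     # ['', 'docs', 'a/b', 'c']
```
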
Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

As for your first question, Standards.URI.http_encode appears to be
correct since only 7-bit chars are allowed in URIs (it does however
also encode some 8-bit chars that it really doesn't have to do). To
follow the standards accurately, we should rather add a Standards.IRI
(see RFC 3987) which handles wider chars and transformation to/from
URIs.

Same reasoning can be applied to Protocols.HTTP, btw: The http scheme
is only defined for URIs and hence simply can't handle chars wider
than 7 bits. But in that case it's practical to implicitly "switch" to
IRI when wider chars are detected and automatically do the
transformation to URI.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

Is there actually any benefit to having different classes for URI and
IRI?  It seems to me it just increases the possibility of selecting
the wrong one.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

What I mean is, if Standards.URI->create() is passed an IRI, can't it
just convert that IRI to the corresponding URI and initialize the
object with that?  Shouldn't that be sufficient to handle IRIs as
well?  Why would we need a Standards.IRI?

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

Well, one could argue that it'd help people pick the right one and
realize that they actually aren't using URIs anymore if they go
outside US-ASCII, which probably is a widespread misconception.

But there would also be practical differences:

o  An IRI class can decode the utf-8 sequences, which an URI class
   can't. (More precisely, it must do this precisely in the
   transformation from an URI.)
   
o  An IRI class doesn't necessarily have to do the transformation to
   URI since an IRI can contain wide chars in unencoded form. I.e. it
   should be able to put together and pick apart the IRI syntax with 8
   bit and wider chars on both sides.

o  As for the encoding side, extending the URI class to automatically
   do an IRI-to-URI conversion for wider chars is safe from a
   standards perspective (i.e. it wouldn't break the URI standard).
   But in practice it wouldn't be strictly compatible since
   Standards.URI currently treats 8-bit chars differently.

Last argument applies to the proposed change to
Protocols.HTTP.http_encode_string too, btw.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

>Well, one could argue that it'd help people pick the right one and

How so?  If there is only one, then that one is always the right one.


>realize that they actually aren't using URIs anymore if they go
>outside US-ASCII, which probably is a widespread misconception.

Since the IRIs can be mapped to URIs, they can be made to actually use
URIs without them having to realize.


>But there would also be practical differences:
>
>o  An IRI class can decode the utf-8 sequences, which an URI class
>   can't. (More precisely, it must do this precisely in the
>   transformation from an URI.)

When must one transform from an URI then?  Using the URI
representation seems more powerful since it can represent both URIs
and IRIs.

Of course, having a function to decode the utf-8 sequences is
something we want, but this should be possible (and done in the same
way) regardless of whether you start with an IRI or an IRI mapped into
an URI, IMO.


>o  An IRI class doesn't necessarily have to do the transformation to
>   URI since an IRI can contain wide chars in unencoded form. I.e. it
>   should be able to put together and pick apart the IRI syntax with 8
>   bit and wider chars on both sides.

No, but I don't think the performance issue warrants a confusing split
in the namespace.  Whether the "picked apart" pieces contain wider
chars or not seems irrelevant since you need to decode it anyway
(%25, %2f).  The decoding should give you wide chars regardless of
whether you start with an IRI or an IRI mapped into an IRI (see
above).


>o  As for the encoding side, extending the URI class to automatically
>   do an IRI-to-URI conversion for wider chars is safe from a
>   standards perspective (i.e. it wouldn't break the URI standard).
>   But in practice it wouldn't be strictly compatible since
>   Standards.URI currently treats 8-bit chars differently.

Yes, but that is a bug, AFAICT, just as http_encode_string() is
currently bugged.  The behaviour we'd be removing is wrong, from a
standards point of view, so removing it from the Standards module
seems the right thing to do.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

Of course, it's perfectly reasonable to keep the old behaviour of both
Protocols.HTTP.http_encode_string and Standards.URI->create() in
compat mode, to avoid breaking existing applications.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

> How so?  If there is only one, then that one is always the right one.

It's not quite so simple when communicating with the outside world
which doesn't unify the two concepts.

The W3C chose to make IRI a separate standard instead of extending
URI. They've obviously pondered that approach at length, so I guess
they did it with good reason.

> >realize that they actually aren't using URIs anymore if they go
> >outside US-ASCII, which probably is a widespread misconception.
>
> Since the IRIs can be mapped to URIs, they can be made to actually use
> URIs without them having to realize.

The receiving side might not do the same transformation back. E.g.
when the URI is passed as an url over http there is no obligation -
not even in the latest standards - to do the reverse URI-to-IRI
transformation on the url after receiving the request. This makes it
good to be aware of what is happening, so one can judge better how the
receiver might (mis)behave.

> When must one transform from an URI then?

Huh? To process it, of course. E.g. unicode data sent in a web form,
where the de-facto behavior of modern browsers is to do an IRI-to-URI
transformation first. It'd be nice to have that decoding built into
the class.

> Using the URI representation seems more powerful since it can
> represent both URIs and IRIs.

The problem is that a URI can't fully represent an IRI. It can only
contain a (transformed) IRI, just like an octet string can contain a
URI.

> Of course, having a function to decode the utf-8 sequences is
> something we want, but this should be possible (and done in the same
> way) regardless of whether you start with an IRI or an IRI mapped into
> an URI, IMO.

Perhaps, but not if you start with an URI that isn't a transformed
IRI. Or are you suggesting that the URI class should just try to
decode it as an IRI and silently continue without the utf-8 decode if
that fails?

> /.../ I don't think the performance issue warrants a confusing split
> in the namespace.

I didn't say it was a performance issue either, rather one of
functionality. An IRI can e.g. contain "œôôÌ" in a unicode context
without any encoding whatsoever, whereas a URI can't. When writing
documents containing IRIs in a unicode environment it is of course
nice to see and handle the real glyphs directly. Hence the class
should be able to both produce and parse IRIs without escaping the
non-US-ASCII chars.

> Whether the "picked apart" pieces contain wider chars or not seems
> irrelevant since you need to decode it anyway (%25, %2f).

When used as I described above, the wider chars wouldn't be encoded to
begin with.

But besides, more functionality to alleviate the user from decoding
%XX escapes is in order.

> The decoding should give you wide chars regardless of whether you
> start with an IRI or an IRI mapped into an IRI (see above).

I assume at least one of the "IRI" there should be "URI". Decoding a
URI in general can't produce wide chars since it can't assume that the
URI is a transformed IRI.

Footnote: Now my pike discussion quota is used up for at least today.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

>Perhaps, but not if you start with an URI that isn't a transformed
>IRI. Or are you suggesting that the URI class should just try to
>decode it as an IRI and silently continue without the utf-8 decode if
>that fails?

As far as I can see, there are 5 cases (assuming we start with an URI
we got from somewhere, and not with an IRI):

1) There are no (encoded) non-ASCII characters involved
2) The URI is an IRI with non-ASCII characters which has been mapped
   to an URI
3) The URI has not been mapped from an IRI, but contains non-ASCII
   characters encoded as UTF-8 anyway
4) The URI has not been mapped from an IRI, and contains non-ASCII
   characters encoded as ISO-8859-1
5) The URI has not been mapped from an IRI, and contains non-ASCII
   characters encoded as something which is neither UTF-8 nor
   ISO-8859-1

If we start with case 5, there is no way to decode that correctly
(without additional context information), since we can't know what
character encoding to use.  The reasonable approach here would be to
throw an error.  However, this case may very well be
indistinguishable from case 2-4.  So in order to guarantee an error
here, we'd have to always give an error for non-ASCII characters.  But
that would be bad, because we should at least handle case 2
correctly, since 1 and 2 are the sane cases.

Case 2 and 3 can be handled in the same way, so there is no need to
distinguish between them.  Case 4 can be distinguished from case 3 (and
2) from the fact that a printable string encoded as ISO-8859-1 never
is valid UTF-8, and vice versa, due to the range 0x80-0x9f being
mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in
ISO-8859-1.

So I see two options:

A) Decode as UTF-8 when possible, and throw an error otherwise.
   This gives correct results for case 1, 2, and 3, and throws an
   error in case 4.  Case 5 would usually give an error, but might
   give an incorrect result in some rare cases.

B) Decode as UTF-8 when possible, and decode as ISO-8859-1 otherwise
   (this is identical to the approach you mention).  This gives
   correct results for case 1, 2, 3, and 4, but always gives an
   incorrect result for case 5.
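
Options A and B can be sketched as follows. This is Python rather than Pike, and decode_strict and decode_lenient are hypothetical names for illustration only, not proposed API.

```python
from urllib.parse import unquote_to_bytes

def decode_strict(escaped: str) -> str:
    """Option A: decode as UTF-8, raise UnicodeDecodeError otherwise."""
    return unquote_to_bytes(escaped).decode("utf-8")

def decode_lenient(escaped: str) -> str:
    """Option B: decode as UTF-8 when possible, else as ISO-8859-1."""
    octets = unquote_to_bytes(escaped)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("iso-8859-1")

print(decode_lenient("r%C3%A4ka"))   # cases 2 and 3 (UTF-8 underneath)
print(decode_lenient("r%E4ka"))      # case 4 (ISO-8859-1 underneath)
```

Both calls print the same word; decode_strict("r%E4ka") would raise instead of guessing.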

It would be nice to allow the user to specify an encoding, so that
case 5 could also be handled correctly, but if no such specification
is given I think the default behaviour should be either A or B,
depending on the relative frequency of case 4 and case 5 in the real
world.  After all, the purpose of the standard class is to provide
the user with a service, so some kind of best effort is in order
here.  If the user has information that would allow him to do a better
job (which will usually not be the case, I predict), it's better that
this information is provided to the standard code.


>I didn't say it was a performance issue either, rather one of
>functionality. An IRI can e.g. contain "œôôÌ" in a unicode context
>without any encoding whatsoever, whereas a URI can't. When writing
>documents containing IRIs in a unicode environment it is of course
>nice to see and handle the real glyphs directly. Hence the class
>should be able to both produce and parse IRIs without escaping the
>non-US-ASCII chars.

It can contain the character unencoded, but it can also contain them
encoded.  So in order to see nice characters, you should always
decode.


>> Whether the "picked apart" pieces contain wider chars or not seems
>> irrelevant since you need to decode it anyway (%25, %2f).
>
>When used as I described above, the wider chars wouldn't be encoded to
>begin with.

Did you actually read the sentence you commented here?  I said that
you need to decode "%" and "/", which are not wider characters, and
which will be encoded even in an IRI.  Since you need to call a decode
function, it doesn't matter much if characters are encoded in the
input to said function, as long as they aren't in the output.


>> The decoding should give you wide chars regardless of whether you
>> start with an IRI or an IRI mapped into an IRI (see above).
>
>I assume at least one of the "IRI" there should be "URI".

Indeed.  And if you followed the suggestion to "see above", you can
probably guess which one.  :-)


>Decoding a URI in general can't produce wide chars since it can't
>assume that the URI is a transformed IRI.

Neither can it assume that it's not.  See the case study above.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

+1

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

> Btw, maybe Standards.URI()->http_encode should be fixed at the same
> time?  It doesn't seem to encode wide characters at all, and encodes
> 8-bit characters as %XX (iso-8859-1).

I think me and Johan Schön chickened out when making Standards.URI and
only aimed for the basic principle of taking URI:s apart and putting
them together again, without losing data or precision. The latter is a
bug, especially today, and probably ought to be fixed with prior utf-8
encoding.

I would very much welcome an improved variant with getters and setters
doing automatic encode/decode translation, perhaps in the form of a
Standards.URL, where such behaviours are more well defined (especially
for schemes http, https, ftp, ftps and maybe a few others) than for
the generic case of URIs, or abominations like the javascript: scheme.

Doing it in the form of an inheriting Standards.URL would have a bonus
benefit of not fscking up prior code. In practice you rarely have URIs
that are not URLs too, anyway, so getting a kick-ass Standards.URL for
such matters would be an improvement, and afford more useful defaults.

For tinkering with the URI parts, setting and getting them raw, the
low level Standards.URI could stay mostly as is, while most API users
would instead adopt tools better equipped for playing with URLs.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

> /.../ the fact that a printable string encoded as ISO-8859-1 never
> is valid UTF-8, and vice versa, due to the range 0x80-0x9f being
> mandatory in valid UTF-8 (unless it's all ASCII) and non-printable in
> ISO-8859-1.

What do you mean they're mandatory in valid utf-8? They can occur in
valid utf-8, but they need not occur. Take the utf-8 encoding of "å",
for instance: 0xc3 0xa5 (i.e. the all too familiar "Ã¥").

Granted, such odd sequences of characters practically never occur in
other 8-bit character sets. So in practice it's fairly safe to just
try utf-8 decode and fall back to "unspecified 8-bit charset" if it
doesn't work.

It's however not completely safe, and another danger with that
approach is that if there is a utf-8 encoding error somewhere (e.g. a
variable truncated in the middle of a utf-8 sequence) then suddenly
nothing gets decoded and there's no error.
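
Both points, the "å" counterexample and the truncation hazard, can be checked with a short Python sketch. try_utf8 is a made-up helper that models the "try utf-8, fall back to an unspecified 8-bit charset" approach, with ISO-8859-1 standing in for the fallback.

```python
def try_utf8(octets: bytes) -> str:
    """Decode as UTF-8 if possible, else fall back to ISO-8859-1."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("iso-8859-1")

a_ring = "å".encode("utf-8")
print(a_ring.hex())                             # c3a5
print(any(0x80 <= b <= 0x9F for b in a_ring))   # False
print(a_ring.decode("iso-8859-1"))              # Ã¥

whole = "smörgå".encode("utf-8")
cut = whole[:-1]         # drop the last octet of the final "å"
print(try_utf8(whole))   # smörgå
print(try_utf8(cut))     # smÃ¶rgÃ
```

The truncated value falls through to the 8-bit fallback without any error, so every non-ASCII char in it comes out wrong, not just the damaged one.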

Anyway, I agree on this point: Most of the time - when the URI comes
from the outside - it's probably a good idea to just try utf-8
decoding and silently ignore errors, but not all the time. I.e. it
could be a default that the user may override.

> >/.../ Hence the class should be able to both produce and parse IRIs
> >without escaping the non-US-ASCII chars.
>
> It can contain the character unencoded, but it can also contain them
> encoded.  So in order to see nice characters, you should always
> decode.

That'd be rather clumsy. In this use case they wouldn't get encoded
and they wouldn't have to be decoded. If the unicode sequence "Ã¥" does
happen to occur (as uncommon as it might be) then it should still be
intact on the other side.

/.../
> Did you actually read the sentence you commented here?  I said that
> you need to decode "%" and "/", which are not wider characters, and
> which will be encoded even in an IRI.  Since you need to call a decode
> function, it doesn't matter much if characters are encoded in the
> input to said function, as long as they aren't in the output.

Yes I did read it. Perhaps you've missed the point, namely that I
could very well be able to use it without decoding afterwards at all,
as long as the wider chars are kept intact and I don't mind that the
special chars are kept encoded.

I think there's merit to jhs' reasoning in 16642703, namely that
Standards.URI tries to stay out of the charset issue altogether, at
least by default. It only does what it has to do to parse and format a
URI. That means encoding only the US-ASCII chars that would be
misinterpreted otherwise, and decoding nothing.

This way the user can afterwards, on the complete URI/IRI, choose to
encode chars outside US-ASCII if it's going to be used somewhere where
that's required.

More encoding and decoding services should be optional. It could be in
another class or perhaps enabled by an optional "charset" property.
That charset property could also take a special value for the "dwim
try-utf-8" approach discussed earlier.

So to sum up, with this reasoning Standards.URI.http_encode and
Standards.URI.quote currently encode too much - by default they
shouldn't touch 8-bit chars.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

>What do you mean they're mandatory in valid utf-8? They can occur in
>valid utf-8, but they need not occur.

Yes, you are right.  It was a thought error on my part.  It is UTF-9
which has this property.  Dang.


>Anyway, I agree on this point: Most of the time - when the URI comes
>from the outside - it's probably a good idea to just try utf-8
>decoding and silently ignore errors, but not all the time. I.e. it
>could be a default that the user may override.

Yes.  My suggestion is that the decode function takes a second
argument with a charset name to override the charset heuristic.  We
could also allow something like "raw" as an alias for "iso-8859-1", to
signify "unspecified 8-bit charset" (whenever that would be
useful...).


>[...] If the unicode sequence "Ã¥" does
>happen to occur (as uncommon as it might be) then it should still be
>intact on the other side.

Um, yes?  It would be encoded as %c3%83%c2%a5, which would then be
decoded as "Ã¥" by the decode function.  That's pretty intact, no?
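
The round trip claimed here is easy to verify. A minimal Python sketch (the hex digits come out in upper case, which is equivalent to the lower-case form above):

```python
from urllib.parse import quote, unquote

literal = "\u00c3\u00a5"   # the two characters "Ã" and "¥", meant as such
escaped = quote(literal, safe="")
print(escaped)             # %C3%83%C2%A5

round_trip = unquote(escaped, errors="strict")
print(round_trip == literal)   # True
```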


>Yes I did read it. Perhaps you've missed the point, namely that I
>could very well be able to use it without decoding afterwards at all,
>as long as the wider chars are kept intact and I don't mind that the
>special chars are kept encoded.

I don't see why you wouldn't mind special chars being encoded if you
mind that wide chars are.  As long as something is encoded, it will
neither display nicely, nor be usable in any other context than URL
manipulation.


>I think there's merit to jhs' reasoning in 16642703, namely that
>Standards.URI tries to stay out of the charset issue altogether, at
>least by default. It only does what it has to do to parse and format a
>URI. That means encoding only the US-ASCII chars that would be
>misinterpreted otherwise, and decoding nothing.

This only means that the issue is pushed somewhere else.  It doesn't
make it go away.  A really conservative approach would be to not
supply any Standards.URI at all, that way we can be absolutely sure it
never does anything wrong.  We can also be absolutely sure that we're
not helping the users achieve anything.

I think the default behaviour should be to help the user as much as
possible.


>More encoding and decoding services should be optional. It could be in
>another class or perhaps enabled by an optional "charset" property.
>That charset property could also take a special value for the "dwim
>try-utf-8" approach discussed earlier.

With the API we have now, fully decoded strings can not be returned.
So rather than having a property, I think we should have a decode
function, to which the strings can be passed after the user code
separates them on "/" or whatever URI syntax still remains in the
string.  (In retrospect, it would be better if the URI class actually
parsed all the URI syntax, rather than returning something half
parsed.  That would mean path being array(string) instead of string.
Other fields might also be affected, I haven't checked.)


>So to sum up, with this reasoning Standards.URI.http_encode and
>Standards.URI.quote currently encode too much - by default they
>shouldn't touch 8-bit chars.

And we need a Standards.URI.decode.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

> >[...] If the unicode sequence "Ã¥" does
> >happen to occur (as uncommon as it might be) then it should still be
> >intact on the other side.
>
> Um, yes?  It would be encoded as %c3%83%c2%a5, which would then be
> decoded as "Ã¥" by the decode function.  That's pretty intact, no?

No. With "the other side" I meant in the formatted URI, not when it
has been picked apart into its components again by another object.
I.e. something like this:

  > object o = Standards.URI("http://x.com/");
  > o->path = "recept/räksmörgås.html";
  > (string) o;
  Result: "http://x.com/recept/räksmörgås.html"

This is a perfectly acceptable IRI that can be put into an iso-8859-1
document. Applies when the URI/IRI is parsed too, of course. That's
the reason it can be useful to skip the encoding of chars outside
US-ASCII.
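
For comparison, the optional later step of encoding chars outside US-ASCII on the complete IRI could look roughly like this. It is a Python sketch with a hypothetical iri_to_uri helper; it only covers the percent-encoding part of the RFC 3987 mapping and ignores internationalized host names.

```python
import re
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    # Escape only chars outside US-ASCII (as UTF-8 octets); the ASCII
    # syntax and any existing %XX escapes are left untouched.
    return re.sub(r"[^\x00-\x7f]+",
                  lambda m: quote(m.group(0), safe=""),
                  iri)

print(iri_to_uri("http://x.com/recept/räksmörgås.html"))
# http://x.com/recept/r%C3%A4ksm%C3%B6rg%C3%A5s.html
```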

> /.../ So rather than having a property, I think we should have a
> decode function, to which the strings can be passed after the user
> code separates them on "/" or whatever URI syntax still remains in
> the string.

Sure, why not? Maybe it could take a charset too to know how to handle
the 8-bit chars. If the extra encoding gets likewise optional, it both
gets more symmetric and works in the use case I've been trying to
describe.

> (In retrospect, it would be better if the URI class actually parsed
> all the URI syntax, rather than returning something half parsed.
> That would mean path being array(string) instead of string. /.../

I'm not so sure; a path in array form gets unbearably cumbersome to
handle compared to the standard string form. An alternative is to only
decode as much as possible, i.e. leave only %2F (for "/") and %25 (for
"%"). That's a consistent encoding too that can be decoded the same
way after path splitting, if the user wants to. It's a bit unfortunate
that the "%" chars have to be left encoded too, though.
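
A rough Python sketch of that alternative, decoding everything except %2F and %25. partial_decode is a made-up name, and UTF-8 is assumed for the escaped octets.

```python
import re
from urllib.parse import unquote

# The two escapes that must survive so the path can still be split safely.
_PROTECTED = re.compile(r"(%2[Ff]|%25)")

def partial_decode(path: str) -> str:
    parts = _PROTECTED.split(path)
    # Odd indices hold the protected escapes captured by the group.
    return "".join(p if i % 2 else unquote(p, errors="strict")
                   for i, p in enumerate(parts))

half = partial_decode("r%C3%A4k/a%2Fb/100%25")
print(half)                                    # räk/a%2Fb/100%25

# After splitting, the same decoder finishes the job on each segment.
print([unquote(seg) for seg in half.split("/")])   # ['räk', 'a/b', '100%']
```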

Re: Protocols.HTTP.http_encode_string

by Martin Bähr

On Sun, Jul 20, 2008 at 10:50:02PM +0000, Martin Stjernholm,  Roxen IS @ Pike  developers forum wrote:
> > >[...] If the unicode sequence "Ã¥" does
> > >happen to occur (as uncommon as it might be) then it should still be
> > >intact on the other side.
> > Um, yes?  It would be encoded as %c3%83%c2%a5, which would then be
> > decoded as "Ã¥" by the decode function.  That's pretty intact, no?
> No. With "the other side"

the "other side" of this conversation is missing again in the exported
list...

greetings, martin.

Protocols.HTTP.http_encode_string

by Johan Sundström (Achtung Liebe!) @ Pike (-) developers forum

>> /.../ So rather than havi