The issue is around the User-Supplied Identifiers. OpenID defines them
> Thanks, Johnny. I've had some conversations with a few other people
> who draw the opposite conclusion and believe that the %AB%CD notation
> is the canonical form.
>
> You make a good point about having to unescape the characters from
> the URI just above the transport layer, but I believe you're applying
> section 4.1 to the URL when it should only be applied to the
> key/value pairs. The OpenID Claimed Identifier, which by the spec is
> the last URL to respond without an HTTP redirect, cannot contain raw
> unicode characters, because the URI specification does not allow
> them, whether they would be encoded as UTF-8 or UTF-16.
>
> Name/value pairs passed as part of a querystring may (and as the
> section you quote requires) be encoded as UTF-8, but they are
> subsequently URI encoded as %AB%CD hex characters (thus doubly
> encoded) so they are actually no longer UTF-8 at the transport layer.
> Since the OpenID URL, around which all the identity of OpenID is
> focused (omitting XRIs, which don't suffer from this problem), /is/
> at the transport layer, because of the way the security requirements
> force the claimed identifier to be discovered, I believe it would be
> a mistake to add semantics on top of that and call it canonical.
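
To make the "doubly encoded" point concrete, here is a minimal Python sketch using only the standard library's urllib.parse. The path and the name in it are made up for illustration and do not come from the OpenID spec; the sketch only shows that the text is first turned into UTF-8 bytes and that each non-ASCII byte is then written as a %XX escape.

```python
from urllib.parse import quote, unquote

# Hypothetical identifier path; "\u00e9" is a precomposed 'é'.
path = "/users/Jos\u00e9"

# Step 1: the text as UTF-8 bytes ('é' becomes the two bytes C3 A9).
utf8_bytes = path.encode("utf-8")

# Step 2: percent-escape every byte that is not allowed as-is in a URI path.
escaped = quote(utf8_bytes, safe="/")
print(escaped)  # /users/Jos%C3%A9

# The receiving side reverses both steps to recover the original text.
decoded = unquote(escaped, encoding="utf-8")
print(decoded == path)  # True
```

On the wire only the ASCII string with %C3%A9 ever appears; the unicode character exists again only after the receiver decodes it.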
>
> What I also realized from some other conversations is that this
> doesn't really matter. As long as an OP or RP is consistent within
> itself in storing and comparing Claimed Identifiers, whether it
> stores and compares %AB%CD or the unicode equivalent character won't
> matter to anyone, since on the protocol/wire level it is always
> %AB%CD. However, I think unescaping the URL and getting the original
> unicode characters back is very useful and should be done for
> purposes of displaying to the user.
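
A rough sketch of the "store one form, display another" idea, assuming an RP that chooses to keep the escaped form as its key. The class name and the example identifier are hypothetical; an RP could just as well store the decoded form, as long as it does so consistently.

```python
from urllib.parse import unquote


class ClaimedIdentifierRecord:
    """Hypothetical RP-side record that keeps the escaped (wire) form."""

    def __init__(self, wire_form: str):
        # Stored exactly as it appeared on the wire.
        self.wire_form = wire_form

    def matches(self, other_wire_form: str) -> bool:
        # Comparison always uses the stored form, so the RP is
        # consistent within itself.
        return self.wire_form == other_wire_form

    def display(self) -> str:
        # Decode %XX escapes (as UTF-8) only for showing to the user.
        return unquote(self.wire_form)


record = ClaimedIdentifierRecord("http://example.com/%C3%81lvaro")
print(record.display())  # http://example.com/Álvaro
```

Note that a plain string comparison like this treats two differently escaped spellings of the same visible text as different identifiers.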
>
> I think for the security and guaranteed identity of the protocol,
> there is a meaningful side to this though. It has nothing to do with
> how the claimed identifier is stored, but rather with how a unicode
> string is escaped for URI transport. A given piece of text may be
> represented by more than one sequence of code points, and therefore
> by more than one series of bytes in UTF-8 or UTF-16, /for the same
> character/. Therefore someone who types their OpenID url into a site
> using one method during one visit, and then types it into the same
> site using a different method on a subsequent visit, will only be
> identified by the RP as the same visitor if OpenID requires that the
> RP transform whatever unicode string the user gives to a canonical
> form defined by the unicode standard before transit. For example,
> the letter 'Á' can be encoded as a single character or using
> composition by adding an accent to the A character. Both are legal,
> but the unicode standard defines one as canonical (I think). But if a
> string containing this
> character is not canonicalized first, then although the character is
> equivalent to the user and to unicode, the encoded %AB%CD string will
> be different, resulting in security problems for OpenID because
> people could overload a single Identifier just by using different
> encodings at an OP, or fail to log into an RP depending on how they
> craft their string. By the way, I say 'unicode' in the strict sense,
> applying to UTF-8, UTF-16, etc. Unicode is commonly used to refer to
> just UTF-16, but this problem applies to every unicode encoding.
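
The 'Á' example can be reproduced with Python's standard library. This is only a sketch: the helper name escape_segment is made up, and the choice of NFC as the normalization form is an assumption for illustration, not something the OpenID spec mandates.

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u00c1"   # 'Á' as a single code point
decomposed = "A\u0301"   # 'A' followed by a combining acute accent

# Both render as the same character, but they escape differently.
print(quote(precomposed))  # %C3%81
print(quote(decomposed))   # A%CC%81


def escape_segment(text: str) -> str:
    """Hypothetical helper: normalize to NFC (an assumed choice), then escape."""
    return quote(unicodedata.normalize("NFC", text))


# After normalization both spellings produce the same escaped string.
print(escape_segment(precomposed) == escape_segment(decomposed))  # True
```

Without the normalization step, an RP comparing the escaped strings would treat the two spellings as two different identifiers.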
>
> So I think OpenID should be more explicit about its unicode support
> for Identifiers, including mandating a canonical Unicode form.
>
> On Tue, Jul 8, 2008 at 9:41 PM, Johnny Bufu <johnny.bufu@...> wrote:
>
>
> On 08/07/08 03:01 PM, Andrew Arnott wrote:
>
> What is the canonical form of an OpenID URL? One with the %AB%CD hex
> encoding for unicode chars in the URL or with the actual unicode
> chars? For the purposes of displaying to the user and storing in the
> RP's database.
>
> The spec doesn't seem to have anything to say on this.
>
>
> I believe it does say:
>
> 4.1. Protocol Messages The OpenID Authentication protocol messages
> are mappings of plain-text keys to plain-text values. The keys and
> values permit the full Unicode character set (UCS). When the keys and
> values need to be converted to/from bytes, they MUST be encoded
> using UTF-8 [RFC3629].
>
>
> http://openid.net/specs/openid-authentication-2_0.html#anchor4
>
> The reason I think it's not a simple automatic answer is the unicode
> chars may be what the user typed in and what exists on the server,
> but in transit, these characters are translated to %AB%CD in order to
> be validly escaped URI strings.
>
>
> The receiving party must decode them to the original form when they
> are extracted from the transport layer.
>
>
> So one could argue that the unicode characters are never part of the
> protocol
>
>
> One would then be ignoring the parts of the protocol that do not deal
> with the transport layer directly.
>
>
> Johnny
>
>