ASCII-level Unicode Support for SableCC 4.0

by Etienne M. Gagnon

To make a long story short, I'll simply spell out my suggestion, instead
of explaining all the reflection that led me to it.

I- Character Set

SableCC 4.0 grammars are written in a subset of ASCII (0..127). More
precisely, outside comments and literals, the allowed set is:
(ID_Start + ID_Nonstart + Pattern_Syntax + Pattern_White_Space) Intersection ASCII.

In comments and literals, all missing ASCII characters >= 32 are added
to the above set.

Character 127 (DEL) counts as one character. Tab (HT, 9) counts as one
character (which is consistent with the recommendation to increase line
count by one for vertical tab VT).

SableCC 4.0 accepts all valid Unicode identifiers made of ASCII
characters only, i.e. (ID_Start ID_Nonstart*) Intersection (ASCII+).

This means, in other words, that characters such as 0 (null) and 8
(backspace) cannot appear in the grammar source code.
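
As a rough illustration of the character set described in this section, here is a minimal Python sketch. The names (`GRAMMAR_CHARS`, `COMMENT_CHARS`, `first_illegal`) are hypothetical and not part of SableCC; the sets list the ASCII members of each Unicode property.

```python
# ASCII members of the Unicode properties named in the proposal.
ID_START = set(range(ord("A"), ord("Z") + 1)) | set(range(ord("a"), ord("z") + 1))
ID_NONSTART = set(range(ord("0"), ord("9") + 1)) | {ord("_")}
PATTERN_SYNTAX = (
    set(range(0x21, 0x30))    # ! " # $ % & ' ( ) * + , - . /
    | set(range(0x3A, 0x41))  # : ; < = > ? @
    | set(range(0x5B, 0x5F))  # [ \ ] ^
    | {0x60}                  # `
    | set(range(0x7B, 0x7F))  # { | } ~
)
PATTERN_WHITE_SPACE = set(range(0x09, 0x0E)) | {0x20}  # HT, LF, VT, FF, CR, space

# Allowed outside comments and literals.
GRAMMAR_CHARS = ID_START | ID_NONSTART | PATTERN_SYNTAX | PATTERN_WHITE_SPACE
# Inside comments and literals, every missing ASCII character >= 32 is added.
COMMENT_CHARS = GRAMMAR_CHARS | set(range(32, 128))


def first_illegal(text, in_comment_or_literal=False):
    """Return the index of the first disallowed character, or None."""
    allowed = COMMENT_CHARS if in_comment_or_literal else GRAMMAR_CHARS
    for index, ch in enumerate(text):
        if ord(ch) not in allowed:
            return index
    return None
```

Under this reading, the base set works out to characters 9..13 and 32..126, and the only character that comments and literals add is 127.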

II- Keywords and Identifiers

Keywords retain their traditional form: Lexer, Parser. Identifiers are
case-sensitive.

For consistency with future Unicode support, SableCC does a
case-insensitive collision detection between identifiers, but only a
case-sensitive collision detection between identifiers and keywords.

In other words:

Lexer
 lexer = 'Lexer'; // OK (id vs keyword => case sensitive)

 ab = 'ab';
 AB = 'AB'; // error : AB is too similar to ab (case insensitive similarity)
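
The two rules in the example above could be sketched as follows. This is only an illustration in Python, not SableCC code: `KEYWORDS` holds just the two keywords named in this message, and `check_declarations` is a hypothetical helper.

```python
KEYWORDS = {"Lexer", "Parser"}  # assumption: only the keywords named in the post


def check_declarations(names):
    """Return error messages for a list of declared identifiers."""
    errors = []
    seen = {}  # lower-cased spelling -> first spelling declared
    for name in names:
        # Identifier vs keyword: case-sensitive comparison.
        if name in KEYWORDS:
            errors.append(f"{name} is a keyword")
            continue
        # Identifier vs identifier: case-insensitive comparison.
        key = name.lower()
        if key in seen and seen[key] != name:
            errors.append(f"{name} is too similar to {seen[key]}")
        else:
            # Exact redeclarations are not this sketch's concern.
            seen.setdefault(key, name)
    return errors
```

For the declarations in the example, `check_declarations(["lexer", "ab", "AB"])` reports only the `AB` / `ab` collision.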


The general principle that justifies this is how I see identifiers (and
keywords) being managed when Unicode support is added: SableCC would be
case-sensitive, using NFC normalization for relating identifiers to
their declaration and detecting keywords.

On the other hand, SableCC would explicitly reject "similar"
identifiers. The similarity of two identifiers A and B would be computed
taking into account: case folding, NFKD, and mixed-script confusables
mapping. If (transformed(A) == transformed(B) ) && (NFC(A) != NFC(B)) =>
SableCC raises an error.
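
A minimal sketch of this similarity test, using Python's standard `unicodedata` module. The function names `transformed` and `too_similar` are hypothetical, and the mixed-script confusables mapping is left out because the Python standard library does not provide one; only case folding and NFKD are applied.

```python
import unicodedata


def transformed(identifier):
    """Case folding plus NFKD (confusables mapping omitted in this sketch)."""
    decomposed = unicodedata.normalize("NFKD", identifier)
    # Normalize again after folding, since folding can produce new sequences.
    return unicodedata.normalize("NFKD", decomposed.casefold())


def too_similar(a, b):
    """True when a and b are distinct identifiers that transform identically."""
    nfc_a = unicodedata.normalize("NFC", a)
    nfc_b = unicodedata.normalize("NFC", b)
    return transformed(a) == transformed(b) and nfc_a != nfc_b
```

With this sketch, `too_similar("ab", "AB")` is true, while two spellings that differ only in composed versus decomposed accents are treated as the same identifier and do not raise an error.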

So, this forces you to choose one representation of a word and stick with
it. I do not think that it is unreasonable to reject the use of the same
word with distinct capitalization. On the other hand, case sensitivity
is a good thing:

Lexer
  allo = 'allo';
Parser
  greeting = alLo ALlo;  // equivalent to: greeting = allo allo; !!!

It is definitely not a good idea to allow for this. So, SableCC would be
a strange beast: case-sensitive identifiers with case-insensitive (and
visual similarity) conflict detection.

For the annoying case where a keyword would happen to match exactly a
needed identifier, we could add a keyword:

Lexer
  Identifier(Lexer) = 'Lexer';
  ...

Parser
  grammar = lexer_section parser ...;
  lexer_section = Identifier(Lexer) ...;

We would still encourage the use of old-style identifiers (by providing
the most intuitive mapping for them in the target language), but it
would allow for things such as:

Parser
  grammar = ...;
  AST = ...;
  token = ...;


I don't think that we restrict expression much by rejecting similar
identifiers. It is, anyway, usually not recommended to use the same word
with distinct case as distinct variable names in case-sensitive
languages, even though the compiler accepts it:

void foo(...) {
  int k, K;
  for (k = 0; k < 10; k++)
    for (K = 0; K < k; K++)
      T[k,K] = 3*T[K,k];
}

:-)

Have fun!

Etienne

--
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org


_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: ASCII-level Unicode Support for SableCC 4.0

by Etienne M. Gagnon

It appears that it was late when I wrote this message. Please don't be
misled by the assertive form of the text; I was only making a
proposal. So, what I wrote is fully open to discussion. Please tell me
if you think it is a good idea or not, and propose improvements or
different approaches, if you think there are better approaches.

Thanks,

Etienne


Etienne M. Gagnon wrote:
> To make a long story short, I'll simply spell out my suggestion,
> instead of explaining all the reflection that led me to it.
> [...]

Re: ASCII-level Unicode Support for SableCC 4.0

by Etienne M. Gagnon

Addition to the proposal. I don't think that 4.0 should worry, yet,
about characters (graphemes). [My objective: minimal programming work in
the short term.]

III- Generated lexer

Generated lexers accept, as input:
1- byte streams => byte count (no line counting)
2- UTF-8 streams => code-point count (and rejection of invalid UTF-8) &
   byte (no line) count
3- for Java (?) => UTF-16 streams => code-point count (and rejection of
   invalid UTF-16) & "char" (no line) count

On the right of "=>", above, I indicate the automatically provided counters.
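
To make the UTF-8 case concrete, here is a small Python sketch of a reader that keeps both counters and rejects invalid input. It is an illustration only; `count_utf8` is a hypothetical name, and a generated lexer would of course feed the decoded characters to its automaton rather than just count them.

```python
import codecs


def count_utf8(chunks):
    """Return (byte_count, code_point_count) for an iterable of byte chunks.

    Raises UnicodeDecodeError on invalid or truncated UTF-8.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    byte_count = 0
    code_point_count = 0
    for chunk in chunks:
        byte_count += len(chunk)
        # The incremental decoder buffers sequences split across chunks.
        code_point_count += len(decoder.decode(chunk))
    # Flush: an incomplete trailing sequence is reported as an error here.
    code_point_count += len(decoder.decode(b"", final=True))
    return byte_count, code_point_count
```

The incremental decoder matters because a multi-byte sequence may straddle two reads from the stream.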

Etienne

Etienne M. Gagnon wrote:
> I- Character Set
>
> SableCC 4.0 grammars are written in a subset of ASCII (0..127), more
> precisely, (except in comments and literals):
> ID_Start+ID_Nonstart+Pattern_Syntax+Pattern_White_Space Intersection
> ASCII.
> [...]
>

Re: ASCII-level Unicode Support for SableCC 4.0

by Stephen P Spackman

Please excuse in advance a rant that veers more than a little from the
topic....

FWIW, I firmly believe that case distinction is important to the
intelligibility of the roman alphabet, and should be available to the
programmer wherever this is at all possible. 'NoBend' and 'NobEnd' simply
have nothing in common as pieces of language, and collapsing them is,
well, unhelpful. The deliberate introduction of case ambiguity was a
disaster in Lisp and a disaster in VFAT and it doesn't bear repeating.
Automatic collision detection doesn't help, because it defeats systematic
naming schemes (otherwise we'd still be happy with seven character
identifiers, right?). Usable notations are *not* a hashing problem, even
if back end mappings are. If NoBend and NobEnd are to collide, then put us
out of our misery and say 'no upper case, no titlecase, period.' It will
suck, but it will suck predictably. Give us a few more gnoise characters
for our own names and we can play sys$input like in the old days. What the
programmer truly needs is tools for constructing ad hoc namespaces.

So while I agree that there are problems in automatic case *mapping*
schemes (Unicode failed to take the bull by the horns in several important
places - Turkish I should no more have been folded together with Roman I
than ezh and yogh should have been collapsed (though I think the latter
was eventually resolved after much lobbying because without it you can't
represent the OED for Pete's sake), and sentence initial capitals (a
punctuation mark) are *not the same thing* as the word-inherent capitals
on proper nouns, in mathematical symbols, and in acronyms, and again *not
the same thing* as SHOUTING, which is a font issue - and we all pay for
that), I would really like more use of case given to me rather than taken
away.

It's also easy to go too far in the direction of saving us from visual
ambiguity. In all seriousness, it is more sensible to forbid the use of
'I' and 'l' in the same scope than it is to say that there is an
'ambiguity' between 'b' and 'B'.

> For the annoying case where a keyword would happen to match exactly a
> needed identifier, we could add a keyword:
>
> Lexer
>  Identifier(Lexer) = 'Lexer';

This is an interesting approach, but perhaps the notation could be made
lighter? In particular, if my grammar was generated automatically, I would
probably escape *all* identifiers like this, but that would make the
output tedious to read. A single punctuation mark would do the trick -
perhaps identifiers are preceded by an '@' or (quite traditionally!)
enclosed in '<...>', the decoration becoming optional when it does *not*
conflict with a keyword?

Regards
Stephen


Re: ASCII-level Unicode Support for SableCC 4.0

by Niklas Matthies

On Sat 2008-05-17 at 01:38h, Etienne M. Gagnon wrote on sablecc-discussion:
:
> This means, in other words, that characters such as 0 (null) and 8
> (backspace) cannot appear in the grammar source code.

You might want to allow a BOM (U+FEFF) at the start of a grammar and
a Ctrl+Z (U+001A) at the end of the grammar.

:

> For consistency with future Unicode support, SableCC does a
> case-insensitive collision detection between identifiers, but only a
> case-sensitive collision detection between identifiers and keywords.
>
> In other words:
>
> Lexer
> lexer = 'Lexer'; // OK (id vs keyword => case sensitive)
>
> ab = 'ab';
> AB = 'AB'; // error : AB is too similar to ab (case insensitive similarity)

This is bad. If a future version of SableCC introduces a new keyword,
existing grammars using it as an identifier will stop working.

I think that SableCC should simply (continue to) reserve the namespace
of identifiers starting with an upper-case (or title-case) letter for
keywords.
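
A minimal sketch of that reservation rule, in Python. The helper name `is_reserved_for_keywords` is hypothetical; the check relies on the Unicode general categories Lu (upper-case letter) and Lt (title-case letter).

```python
import unicodedata


def is_reserved_for_keywords(name):
    """True if the name starts with an upper-case or title-case letter."""
    return unicodedata.category(name[0]) in ("Lu", "Lt")
```

Under this rule `Lexer` and `Parser` fall in the reserved namespace while `lexer` and `parser` remain available as identifiers, so a keyword added later cannot break an existing grammar.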

> The general principle that justifies this, is how I see identifiers (and
> keywords) managed when Unicode support is added: SableCC would be
> case-sensitive, using NFC normalization for relating identifiers to
> their declaration and detecting keywords.
>
> On the other hand, SableCC would explicitly reject "similar"
> identifiers. The similarity of two identifiers A and B would be computed
> taking into account: case folding, NFKD, and mixed-script confusables
> mapping. If (transformed(A) == transformed(B) ) && (NFC(A) != NFC(B)) =>
> SableCC raises an error.
>
> So, this forces you to choose one representation of a word and stick with
> it. I do not think that it is unreasonable to reject the use of the same
> word with distinct capitalization. On the other hand, case sensitivity
> is a good thing:
>
> Lexer
>  allo = 'allo';
> Parser
>  greeting = alLo ALlo;  // equivalent to: greeting = allo allo; !!!
>
> It is definitely not a good idea to allow for this. So, SableCC would be
> a strange beast: case-sensitive identifiers with case-insensitive (and
> visual similarity) conflict detection.

I'm not convinced. Back-ends will either ignore the original case when
mapping identifiers to the target language, in which case the ability
to use mixed-case identifiers will only have limited benefit as it
doesn't carry over to the target language, or else the grammar writer
will be compelled to choose the casing in the grammar such that it
matches the naming conventions in the intended target language, which
isn't exactly a good thing either.

If back-ends perform case normalization, users will be confronted with
the fact that "case doesn't matter" anyway, and then I think it's fine
if the grammar identifiers are collectively restricted to one case,
for example lower-case. Basically, I find the current SableCC model to
be preferable.

Furthermore, for characters equal under NFKC normalization that are
truly confusable, it might be difficult for users to correct a
rejected grammar precisely because the characters do look the same to
them. As normalization of such confusable characters in the target
language won't hurt (or else they're not so confusable after all), I
don't see the benefit of prohibiting different representations in the
grammar source. Put slightly differently: When confusable characters
are normalized to the same representation, then they aren't really
confusable characters any more; rather they ARE the same character.

Lastly, case folding and NFKC normalization should not be just lumped
together. Characters that differ just in case are "similar" in quite a
different sense than characters having the same NFKC normalization.

:
> I don't think that we much restrict expression, by rejecting similar
> identifiers. It is, anyway, usually not recommended to use the same word
> with distinct case as distinct variable names in case-sensitive
> languages, even though the compiler accepts it:

It's also not recommended to use identifiers like 'l' and 'O' that may
be confused with '1' and '0', etc. So would it be fine to disallow
them too? :)

-- Niklas Matthies

Re: ASCII-level Unicode Support for SableCC 4.0

by Niklas Matthies

On Sat 2008-05-17 at 11:31h, Etienne M. Gagnon wrote on sablecc-discussion:
:
> Generated lexers accept, as input:
> 1- byte streams => byte count (no line counting)
> 2- utf-8 streams => code-point count (and rejection of invalid UTF-8) &
> byte (no line) count
> 3- for Java (?) => utf-16 stream => code-point count (and rejection of
> invalid utf-16) & "char" (no line) count

1) I would suggest allowing unpaired surrogate code points in UTF-8
and UTF-16 inputs, as otherwise it would be difficult to write
grammars for languages that allow them. Programs that want to reject
those (if not already caught by the grammar) can easily do so by
wrapping the streams.
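
For illustration, Python's codecs expose this choice through the standard `surrogatepass` error handler. The sketch below is not SableCC code, and `decode_input` is a hypothetical helper name.

```python
def decode_input(data, encoding="utf-8", allow_unpaired_surrogates=True):
    """Decode bytes, optionally letting unpaired surrogates through."""
    errors = "surrogatepass" if allow_unpaired_surrogates else "strict"
    return data.decode(encoding, errors)


# A lone high surrogate (U+D800) written in UTF-8 form.
lone_surrogate = b"\xed\xa0\x80"
permissive = decode_input(lone_surrogate)  # yields the single code point U+D800
```

Calling `decode_input(lone_surrogate, allow_unpaired_surrogates=False)` raises `UnicodeDecodeError`, which is the behaviour a program would get by wrapping the stream with a stricter check.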

For 4.1+: Similarly, for grammars defined in terms of 16-bit values,
an input variation would be needed that doesn't combine surrogate
pairs into one character.

2) I would strongly urge providing positional information in terms of
input code units. E.g. byte count for UTF-8 and char count for UTF-16.
Otherwise it will be extra work (and extra opportunity for errors) for
the client code to locate the reported positions within the input.
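
The extra work in question looks roughly like the following Python sketch, which converts code-point positions into code-unit offsets. `code_unit_offsets` is a hypothetical name used only for this example.

```python
def code_unit_offsets(text):
    """For each code point, return its (UTF-8 byte, UTF-16 char) offset."""
    utf8_offset = 0
    utf16_offset = 0
    offsets = []
    for ch in text:
        offsets.append((utf8_offset, utf16_offset))
        code_point = ord(ch)
        # UTF-8 uses 1 to 4 bytes depending on the code point's range.
        if code_point < 0x80:
            utf8_offset += 1
        elif code_point < 0x800:
            utf8_offset += 2
        elif code_point < 0x10000:
            utf8_offset += 3
        else:
            utf8_offset += 4
        # UTF-16 uses a surrogate pair (2 units) above the BMP, else 1 unit.
        utf16_offset += 2 if code_point >= 0x10000 else 1
    return offsets
```

If the lexer reported only code-point counts, every client would need a pass like this over the input before it could seek to a reported position.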

-- Niklas Matthies

Re: ASCII-level Unicode Support for SableCC 4.0

by Etienne M. Gagnon

Hi Stephen,

You win. :-)  [See my other messages. In summary: $keyword, idEntifier
and Identifier can coexist.]

Etienne

Stephen P Spackman wrote:
> FWIW, I firmly believe that case distinction is important to the
> intelligibility of the roman alphabet, and should be available to the
> programmer wherever this is at all possible. [...]
>  
> [...] A single punctuation mark would do the trick -
> perhaps identifiers are preceded by an '@' or (quite traditionally!)
> enclosed in '<...>', the decoration becoming optional when it does *not*
> conflict with a keyword?
>  

Re: ASCII-level Unicode Support for SableCC 4.0

by Etienne M. Gagnon

Niklas Matthies wrote:
> You might want to allow a BOM (U+FEFF) at the start of a grammar and
> a Ctrl+Z (U+001A) at the end of the grammar.
>  

Yes, as soon as we go beyond ASCII. (4.1+)

> Furthermore, for characters equal under NFKC normalization that are
> truly confusable, it might be difficult for users to correct a
> rejected grammar precisely because the characters do look the same to
> them.

As I said in another message, I've been convinced by the various
arguments (now) that SableCC should not worry about visual security; it
is the job of third party tools.

Also, there's no good reason for a good Unicode editor to use "obsolete"
characters, or to show distinct characters similarly on the screen.

I remember a cheap dot matrix printer that I had, on which all of the
following looked similar: 1, l, and I. Yet, I don't remember a
programming language that protected me from this kind of visual confusion.

Have fun!

Etienne
