ASCII-level Unicode Support for SableCC 4.0

To make a long story short, I'll simply spell out my suggestion, instead of explaining all the reflection it took to get to it.

I- Character Set

SableCC 4.0 grammars are written in a subset of ASCII (0..127). More precisely, except in comments and literals, the allowed set is:

    (ID_Start + ID_Nonstart + Pattern_Syntax + Pattern_White_Space) Intersection ASCII

In comments and literals, all missing ASCII characters >= 32 are added to the above set. Character 127 counts as one character. Tab (HT, 9) counts as one character (which is consistent with the recommendation to increase line count by one for vertical tab VT).

SableCC 4.0 accepts all valid Unicode identifiers made of ASCII characters only, i.e.:

    (ID_Start ID_Nonstart*) Intersection (ASCII+)

This means, in other words, that characters such as 0 (null) and 8 (backspace) cannot appear in the grammar source code.

II- Keywords and Identifiers

Keywords retain their traditional form: Lexer, Parser. Identifiers are case-sensitive.

For consistency with future Unicode support, SableCC does a case-insensitive collision detection between identifiers, but only a case-sensitive collision detection between identifiers and keywords.

In other words:

    Lexer
      lexer = 'Lexer'; // OK (id vs keyword => case sensitive)

      ab = 'ab';
      AB = 'AB'; // error : AB is too similar to ab (case insensitive similarity)

The general principle that justifies this is how I see identifiers (and keywords) managed when Unicode support is added: SableCC would be case-sensitive, using NFC normalization for relating identifiers to their declaration and detecting keywords.

On the other hand, SableCC would explicitly reject "similar" identifiers. The similarity of two identifiers A and B would be computed taking into account: case folding, NFKD, and mixed-script confusables mapping. If (transformed(A) == transformed(B)) && (NFC(A) != NFC(B)) => SableCC raises an error.

So, this forces you to choose one representation of a word and stick with it. I do not think that it is unreasonable to reject the use of the same word with distinct capitalization. On the other hand, case sensitivity is a good thing:

    Lexer
      allo = 'allo';
    Parser
      greeting = alLo ALlo; // equivalent to: greeting = allo allo; !!!

It is definitely not a good idea to allow for this. So, SableCC would be a strange beast: case-sensitive identifiers with case-insensitive (and visual similarity) conflict detection.

For the annoying case where a keyword would happen to match exactly a needed identifier, we could add a keyword:

    Lexer
      Identifier(Lexer) = 'Lexer';
      ...
    Parser
      grammar = lexer_section parser ...;
      lexer_section = Identifier(Lexer) ...;

We would still encourage the use of old-style identifiers (by providing the most intuitive mapping for them in the target language), but it would allow for things such as:

    Parser
      grammar = ...;
      AST = ...;
      token = ...;

I don't think that we restrict expression much by rejecting similar identifiers. It is, anyway, usually not recommended to use the same word with distinct case as distinct variable names in case-sensitive languages, even though the compiler accepts it:

    void foo(...) {
      int k, K;
      for (k = 0; k < 10; k++)
        for (K = 0; K < k; K++)
          T[k,K] = 3*T[K,k];
    }

:-)

Have fun!

Etienne

--
Etienne M. Gagnon, Ph.D.
SableCC: http://sablecc.org
SableVM: http://sablevm.org

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion
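The collision rule from section II of the proposal (raise an error when transformed(A) == transformed(B) but NFC(A) != NFC(B)) can be sketched in Java roughly as follows. This is only an illustration under assumptions, not SableCC code: the class and method names are hypothetical, case folding is approximated with an upper-case/lower-case round trip, and the mixed-script confusables mapping is left out entirely.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of SableCC.
final class IdentifierCollisions {

    // NFC form: used to relate an identifier to its declaration (case-sensitive).
    static String canonical(String id) {
        return Normalizer.normalize(id, Normalizer.Form.NFC);
    }

    // Similarity key: NFKD followed by an approximation of case folding.
    // Assumption: the confusables mapping from the proposal is NOT applied here.
    static String similarityKey(String id) {
        String nfkd = Normalizer.normalize(id, Normalizer.Form.NFKD);
        return nfkd.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Reports the first identifier whose similarity key matches an earlier
    // identifier while its NFC form differs from that earlier identifier.
    static Optional<String> firstCollision(List<String> identifiers) {
        Map<String, String> seen = new HashMap<>();
        for (String id : identifiers) {
            String nfc = canonical(id);
            String previous = seen.putIfAbsent(similarityKey(id), nfc);
            if (previous != null && !previous.equals(nfc)) {
                return Optional.of(id + " is too similar to " + previous);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The "ab" / "AB" example from the proposal.
        System.out.println(firstCollision(List.of("ab", "AB")));
        // Repeated uses of the same spelling are not collisions.
        System.out.println(firstCollision(List.of("allo", "greeting", "allo")));
    }
}
```

The case-sensitive check of identifiers against keywords is a separate, plain string comparison and is not shown here.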
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

It appears that it was late when I wrote this message. Please don't be misled by the assertive form of the text; I was only making a proposal. So, what I wrote is fully open to discussion. Please tell me if you think it is a good idea or not, and propose improvements or different approaches, if you think there are better ones.

Thanks,

Etienne

Etienne M. Gagnon wrote:
> To make a long story short, I'll simply spell out my suggestion,
> instead of explaining all the reflection it took to get to it.
> [...]
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

Addition to the proposal. I don't think that 4.0 should worry, yet, about characters (graphemes). [My objective: minimal programming work in the short term.]

III- Generated lexer

Generated lexers accept, as input:

1- byte streams => byte count (no line counting)
2- UTF-8 streams => code-point count (and rejection of invalid UTF-8) & byte (no line) count
3- for Java (?) => UTF-16 stream => code-point count (and rejection of invalid UTF-16) & "char" (no line) count

On the right of "=>", above, I indicate the automatically provided counters.

Etienne

Etienne M. Gagnon wrote:
> I- Character Set
>
> SableCC 4.0 grammars are written in a subset of ASCII (0..127), more
> precisely, (except in comments and literals):
> ID_Start+ID_Nonstart+Pattern_Syntax+Pattern_White_Space Intersection
> ASCII.
> [...]
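As an illustration of item 2 in the list above (UTF-8 input with a code-point count, a byte count, and rejection of invalid UTF-8), here is a minimal Java sketch. It is an assumption-laden example and not the generated lexer: the class name Utf8Counters is hypothetical, and it decodes a whole byte array at once, whereas a real lexer would presumably decode incrementally from a stream.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SableCC: computes the two counters
// listed for UTF-8 input (bytes and code points) over a whole buffer.
final class Utf8Counters {
    final long byteCount;
    final long codePointCount;

    private Utf8Counters(long byteCount, long codePointCount) {
        this.byteCount = byteCount;
        this.codePointCount = codePointCount;
    }

    // Throws CharacterCodingException if the input is not valid UTF-8.
    static Utf8Counters count(byte[] input) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer chars = decoder.decode(ByteBuffer.wrap(input));
        // A surrogate pair in the decoded UTF-16 text counts as one code point.
        long codePoints = chars.codePoints().count();
        return new Utf8Counters(input.length, codePoints);
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] sample = "a\u00E9\u20AC".getBytes(StandardCharsets.UTF_8);
        Utf8Counters c = count(sample);
        System.out.println(c.byteCount + " bytes, " + c.codePointCount + " code points");
    }
}
```

Note that the strict decoder used here also rejects surrogate code points encoded in UTF-8, which matches the "rejection of invalid UTF-8" wording of the proposal.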
|
|
|
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

On Sat 2008-05-17 at 01:38h, Etienne M. Gagnon wrote on sablecc-discussion:
:
> This means, in other words, that characters such as 0 (null) and 8
> (backspace) cannot appear in the grammar source code.

You might want to allow a BOM (U+FEFF) at the start of a grammar and a Ctrl+Z (U+001A) at the end of the grammar.

:
> For consistency with future Unicode support, SableCC does a
> case-insensitive collision detection between identifiers, but only a
> case-sensitive collision detection between identifiers and keywords.
>
> In other words:
>
> Lexer
>   lexer = 'Lexer'; // OK (id vs keyword => case sensitive)
>
>   ab = 'ab';
>   AB = 'AB'; // error : AB is too similar to ab (case insensitive similarity)

This is bad. If a future version of SableCC introduces a new keyword, existing grammars using it as an identifier will stop working. I think that SableCC should simply (continue to) reserve the namespace of identifiers starting with an upper-case (or title-case) letter for keywords.

> The general principle that justifies this is how I see identifiers (and
> keywords) managed when Unicode support is added: SableCC would be
> case-sensitive, using NFC normalization for relating identifiers to
> their declaration and detecting keywords.
>
> On the other hand, SableCC would explicitly reject "similar"
> identifiers. The similarity of two identifiers A and B would be computed
> taking into account: case folding, NFKD, and mixed-script confusables
> mapping. If (transformed(A) == transformed(B)) && (NFC(A) != NFC(B)) =>
> SableCC raises an error.
>
> So, this forces you to choose one representation of a word and stick with
> it. I do not think that it is unreasonable to reject the use of the same
> word with distinct capitalization. On the other hand, case sensitivity
> is a good thing:
>
> Lexer
>   allo = 'allo';
> Parser
>   greeting = alLo ALlo; // equivalent to: greeting = allo allo; !!!
>
> It is definitely not a good idea to allow for this. So, SableCC would be
> a strange beast: case-sensitive identifiers with case-insensitive (and
> visual similarity) conflict detection.

I'm not convinced. Back-ends will either ignore the original case when mapping identifiers to the target language, in which case the ability to use mixed-case identifiers will only have limited benefit as it doesn't carry over to the target language, or else the grammar writer will be compelled to choose the casing in the grammar such that it matches the naming conventions in the intended target language, which isn't exactly a good thing either. If back-ends perform case normalization, users will be confronted with the fact that "case doesn't matter" anyway, and then I think it's fine if the grammar identifiers are collectively restricted to one case, for example lower-case. Basically, I find the current SableCC model to be preferable.

Furthermore, for characters equal under NFKC normalization that are truly confusable, it might be difficult for users to correct a rejected grammar precisely because the characters do look the same to them. As normalization of such confusable characters in the target language won't hurt (or else they're not so confusable after all), I don't see the benefit of prohibiting different representations in the grammar source. Put slightly differently: When confusable characters are normalized to the same representation, then they aren't really confusable characters any more; rather they ARE the same character.

Lastly, case folding and NFKC normalization should not be just lumped together. Characters that differ just in case are "similar" in quite a different sense than characters having the same NFKC normalization.

:
> I don't think that we restrict expression much by rejecting similar
> identifiers. It is, anyway, usually not recommended to use the same word
> with distinct case as distinct variable names in case-sensitive
> languages, even though the compiler accepts it:

It's also not recommended to use identifiers like 'l' and 'O' that may be confused with '1' and '0', etc. So would it be fine to disallow them too? :)

-- Niklas Matthies
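The BOM and Ctrl+Z allowance suggested at the top of this message could be handled as a small pre-pass before the character-set check. The Java sketch below is one possible reading of that suggestion, not something the thread specifies: the class and method names are hypothetical, and it assumes the grammar source is already available as a string.

```java
// Hypothetical helper, not part of SableCC.
final class GrammarSourceTrim {
    static final char BOM = (char) 0xFEFF;    // byte order mark, U+FEFF
    static final char CTRL_Z = (char) 0x001A; // substitute / DOS end-of-file, U+001A

    // Drops at most one leading BOM and at most one trailing Ctrl+Z.
    // Any other occurrence of these characters is left for the
    // character-set check to reject.
    static String stripBomAndCtrlZ(String source) {
        int start = 0;
        int end = source.length();
        if (end > start && source.charAt(start) == BOM) {
            start++;
        }
        if (end > start && source.charAt(end - 1) == CTRL_Z) {
            end--;
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = BOM + "Lexer allo = 'allo';" + CTRL_Z;
        System.out.println(stripBomAndCtrlZ(raw));
    }
}
```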
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

On Sat 2008-05-17 at 11:31h, Etienne M. Gagnon wrote on sablecc-discussion:
:
> Generated lexers accept, as input:
> 1- byte streams => byte count (no line counting)
> 2- utf-8 streams => code-point count (and rejection of invalid UTF-8) &
> byte (no line) count
> 3- for Java (?) => utf-16 stream => code-point count (and rejection of
> invalid utf-16) & "char" (no line) count

1) I would suggest allowing unpaired surrogate code points in UTF-8 and UTF-16 inputs, as otherwise it would be difficult to write grammars for languages that allow them. Programs that want to reject those (if not already caught by the grammar) can easily do so by wrapping the streams.

For 4.1+: Similarly, for grammars defined in terms of 16-bit values, an input variation would be needed that doesn't combine surrogate pairs into one character.

2) I would strongly urge providing positional information in terms of input code units, e.g. byte count for UTF-8 and char count for UTF-16. Otherwise it will be extra work (and extra opportunity for errors) for the client code to locate the reported positions within the input.

-- Niklas Matthies
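Both suggestions in this message can be illustrated for UTF-16 input with a short Java sketch: each input character is reported with its position in code units (Java "char" offset) as well as in code points, and unpaired surrogates are passed through as ordinary code points instead of being rejected. The class and field names are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SableCC.
final class CodeUnitPositions {

    // One input character together with its position in both units.
    static final class Item {
        final int codePoint;      // may be an unpaired surrogate
        final int charOffset;     // position in UTF-16 code units
        final int codePointIndex; // position in code points

        Item(int codePoint, int charOffset, int codePointIndex) {
            this.codePoint = codePoint;
            this.charOffset = charOffset;
            this.codePointIndex = codePointIndex;
        }
    }

    static List<Item> scan(CharSequence input) {
        List<Item> items = new ArrayList<>();
        int charOffset = 0;
        int codePointIndex = 0;
        while (charOffset < input.length()) {
            // Character.codePointAt combines a valid surrogate pair and
            // returns a lone surrogate unchanged, so nothing is rejected.
            int codePoint = Character.codePointAt(input, charOffset);
            items.add(new Item(codePoint, charOffset, codePointIndex));
            charOffset += Character.charCount(codePoint);
            codePointIndex++;
        }
        return items;
    }

    public static void main(String[] args) {
        // 'a', one surrogate pair, one unpaired high surrogate, 'b'
        for (Item item : scan("a\uD83D\uDE00\uD800b")) {
            System.out.printf("U+%04X char=%d cp=%d%n",
                    item.codePoint, item.charOffset, item.codePointIndex);
        }
    }
}
```

With the code-unit offset in hand, client code can index directly into the original input; the code-point index alone would require rescanning from the start.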
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

Hi Stephen,

You win. :-)

[See my other messages. In summary: $keyword, idEntifier and Identifier can coexist.]

Etienne

Stephen P Spackman wrote:
> FWIW, I firmly believe that case distinction is important to the
> intelligibility of the roman alphabet, and should be available to the
> programmer wherever this is at all possible. [...]
>
> [...] A single punctuation mark would do the trick -
> perhaps identifiers are preceded by an '@' or (quite traditionally!)
> enclosed in '<...>', the decoration becoming optional when it does *not*
> conflict with a keyword?
|
|
Re: ASCII-level Unicode Support for SableCC 4.0

Niklas Matthies wrote:
> You might want to allow a BOM (U+FEFF) at the start of a grammar and
> a Ctrl+Z (U+001A) at the end of the grammar.

Yes, as soon as we go beyond ASCII. (4.1+)

> Furthermore, for characters equal under NFKC normalization that are
> truly confusable, it might be difficult for users to correct a
> rejected grammar precisely because the characters do look the same to
> them.

As I said in another message, I've been convinced by the various arguments (now) that SableCC should not worry about visual security; it is the job of third party tools. Also, there's no good reason for a good Unicode editor to use "obsolete" characters, or to show distinct characters similarly on the screen.

I remember a cheap dot matrix printer that I had, on which all of the following looked similar: 1, l, and I. Yet, I don't remember a programming language that protected me from this kind of visual confusion.

Have fun!

Etienne