|
View:
New views
20 Messages
—
Rating Filter:
Alert me
|
| < Prev | 1 - 2 | Next > |
|
|
Lexer generation performance and memory usageIt appears that I've hit a good test case for lexer generation
performance. I'm extending a language to Unicode identifier syntax (see http://www.unicode.org/reports/tr31/) using the XID_Start and XID_Continue character classes. I adopted these fairly straightforward from the Unicode standard's DerivedCoreProperties.txt[1] to SableCC 3.2 syntax: unicode_xid_start = // as defined by DerivedCoreProperties.txt of Unicode 5.1.0, but restricted to the BMP [0x0041..0x005A] | // L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z [0x0061..0x007A] | // L& [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z 0x00AA | // L& FEMININE ORDINAL INDICATOR 0x00B5 | // L& MICRO SIGN 0x00BA | // L& MASCULINE ORDINAL INDICATOR [0x00C0..0x00D6] | // L& [23] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER O WITH DIAERESIS [0x00D8..0x00F6] | // L& [31] LATIN CAPITAL LETTER O WITH STROKE..LATIN SMALL LETTER O WITH DIAERESIS [0x00F8..0x01BA] | // L& [195] LATIN SMALL LETTER O WITH STROKE..LATIN SMALL LETTER EZH WITH TAIL 0x01BB | // Lo LATIN LETTER TWO WITH STROKE [0x01BC..0x01BF] | // L& [4] LATIN CAPITAL LETTER TONE FIVE..LATIN LETTER WYNN ... ~650 more lines like this ... The lexer happens to have five lexer states. After seven long minutes with 100% CPU, while generating the DFA for the third state SableCC bails out with java.lang.OutOfMemoryError: Java heap space. So far so bad. I suspect that SableCC doesn't recognize that the above is still just a character class, so one way to improve the lexer specification would be to use character class syntax with "+" rather than "|" for union. Étienne, do you think that would help? Although I have to admit that I'm not really keen on having to transform the above list into a "+" binary tree... Another thing is that I was planning to eventually add the characters outside of the BMP (basic multilingual plane), which with SableCC 3.2 requires them to be handled with surrogate pairs in the lexer grammar, if I'm not mistaken. So it couldn't be a plain character class any more. Anyway, maybe SableCC 4 could strive to be able to handle something like the above. :) -- Niklas Matthies [1] http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageOn Wed 2008-05-07 at 06:58h, I wrote on sablecc-discussion:
: > I suspect that SableCC doesn't recognize that the above is still just > a character class, so one way to improve the lexer specification would > be to use character class syntax with "+" rather than "|" for union. I just gave this a try, and it really does solve the problem. Though the lexer specifications now looks like LISP with square brackets. :) -- Niklas Matthies _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageHi Niklas,
(1) For the "heap space" message, don't forget that Sun's Java virtual machine limits heap size to 16M (!) by default. So, to handle memory intensive computations, you need to specify a bigger heap : $ java -Xmx1024M -jar sablecc.jar mygrammar.sablecc (2) Effectively, SableCC 3 handles such things less than optimally: it does not minimize the produced DFA, it does not minimize alphabets, using "|" instead of '+' has dramatic effect on lexer computation (and produced lexer size), etc. Not that SableCC 4 corrects all this. In SableCC 4, the explicit character set notation (using "[" and "]") is simply eliminated, and the lexer engine minimizes both the DFA and the alphabet. (3) I've been looking at Unicode, a little bit. I think that SableCC (4, or 5?) should require grammar files to be encoded in UTF-8. Additionally, it should eventually have a "unicode" mode that provides "Extended Unicode Support" (level 2 of http://unicode.org/unicode/reports/tr18/) to handle "grapheme clusters" as single characters. Note that I do not intend to implement Perl regular expressions (discussed in tr18) in SableCC. Perl regular expressions are NOT regular expressions (i.e. they do not describe regular languages, in language theory terms), they are a form of pattern matching. Regular languages have the nice property that they can be matched in linear time. In contrast, pattern matching is much more expensive. If Unicode grapheme cluster handling is desired by the community, I will need help implementing it (i.e. figuring out the various details of the Unicode specification, implementing the code, etc.). For now, I was only aiming to provide level 1 of tr18 in SableCC 4 (or possibly not even that, if it was to delay the release too much). Do not forget that SableCC 4 should generate lexer in a wide variety of target languages, so generated lexers cannot depend on Java standard libraries to accomplish their work. Also, I do not like having dependence on third party libraries, as they can cause head aches to people that worry about licensing issues. Etienne Niklas Matthies wrote: > It appears that I've hit a good test case for lexer generation > performance. I'm extending a language to Unicode identifier syntax > (see http://www.unicode.org/reports/tr31/) using the XID_Start and > XID_Continue character classes. I adopted these fairly straightforward > from the Unicode standard's DerivedCoreProperties.txt[1] to SableCC 3.2 > syntax: > [...] > The lexer happens to have five lexer states. After seven long minutes > with 100% CPU, while generating the DFA for the third state SableCC > bails out with java.lang.OutOfMemoryError: Java heap space. > [...] > I suspect that SableCC doesn't recognize that the above is still just > a character class, so one way to improve the lexer specification would > be to use character class syntax with "+" rather than "|" for union. > Étienne, do you think that would help? > [...] > Another thing is that I was planning to eventually add the characters > outside of the BMP (basic multilingual plane), which with SableCC 3.2 > requires them to be handled with surrogate pairs in the lexer grammar, > if I'm not mistaken. So it couldn't be a plain character class any more. > > Anyway, maybe SableCC 4 could strive to be able to handle something > like the above. :) > Etienne M. Gagnon, Ph.D. SableCC: http://sablecc.org SableVM: http://sablevm.org _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageOn Wed 2008-05-07 at 10:58h, Etienne M. Gagnon wrote on sablecc-discussion:
> Hi Niklas, > > (1) For the "heap space" message, don't forget that Sun's Java virtual > machine limits heap size to 16M (!) by default. The default maximum heap size depends on a lot of factors; with a current client VM under 32-bit Windows using the default GC it's usually 64M. But this is a good point. Trying again, the process succeeded using less than 200M of memory, in 13 minutes on my machine. The lexer.dat is about 2.5 the size of the '+' version. For a running VM one can find out the maximum heap size with jconsole for example (part of the JDK). Jconsole also gives you pretty graphs on the actual memory usage. : > (3) I've been looking at Unicode, a little bit. I think that SableCC (4, > or 5?) should require grammar files to be encoded in UTF-8. Yes, I also think that mandating UTF-8 would be best. UTF-8 makes it likely that grammar files with the wrong encoding can actually be detected (and rejected). It's also a good thing to not make it configurable (à la XML encoding declaration or with a command line switch), because this only leads to situations where the declared encoding does not match the actual encoding, and makes it harder for editors and other tools to process grammar files using the correct encoding. Been through that too often. I have a couple of related suggestions: - Allow Unicode identifiers. From my recent experience with TR31 I would suggest to allow the intersection of non-uppercase and non-titlecase Java identifiers with the XID* set from TR31, and to use NFKC normalization for identifer matching and back end identifier generation. - Non-ASCII characters in character literals should be interpreted as Unicode code points, not as UTF-16 code units. For example the UTF-8 sequence 0xF0 0x9D 0x85 0xA0 (U+1D160) should be interpreted as the single input character 0x1D160, not as the two-character sequence 0xD834 0xDD60 (which is the UTF-16 encoding of U+1D160). Accordingly the default back end should convert the sequence of Java chars to Unicode code points when feeding the input to the lexer. - Provide an escape syntax for Unicode characters that can be used anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as in C and C++.) This way any grammar file can trivially be turned into an exactly equivalent 7-bit ASCII encoded version for robust transfer and exchange. : > If Unicode grapheme cluster handling is desired by the community, I will > need help implementing it (i.e. figuring out the various details of the > Unicode specification, implementing the code, etc.). It would be nice if Unicode support along the lines of TR18 would be made available in the form of "Helper" libraries (generated from the Unicode data files) that can be included in the lexer specification using some sort of include directive. For example instead of special regular expression syntax like "\p{letter}", a helper symbol "unicode_letter" would be made available. This would also make it easy to support different versions of Unicode independent of the SableCC version. Languages are often specified relative to a specific Unicode version, and don't want a newer SableCC version or a different JDK to silently apply a different version of Unicode. -- Niklas Matthies _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageYes, I also think that mandating UTF-8 would be best. UTF-8 makes it likely that grammar files with the wrong encoding can actually be detected (and rejected). It's also a good thing to not make it configurable (à la XML encoding declaration or with a command line switch), because this only leads to situations where the declared encoding does not match the actual encoding, and makes it harder for editors and other tools to process grammar files using the correct encoding. Been through that too often. We agree. - Allow Unicode identifiers. From my recent experience with TR31 I would suggest to allow the intersection of non-uppercase and non-titlecase Java identifiers with the XID* set from TR31, and to use NFKC normalization for identifer matching and back end identifier generation. I think that it would be a big mistake to allow Unicode identifiers. Unicode should be restricted to comments and string literals. Here's why: 1- Readability of grammars: Imagine seeing the following: Lexer ??????? = '???'; ???_??? = '?????'; ...This is likely to show when reading a grammar with identifiers made of characters that are not installed. (Another symbol than '?' might be used by the reader program). I think that it is less harmful if the only unreadable parts are comments and literals. 2- Keeping clear distinction between identifiers and keywords: SableCC requires that identifiers be constructed of lowercase characters, and it reserves uppercase characters for keywords. This only works well on a small subset of Unicode letters, as some characters do not have uppercase/lowercase representations. Worse... According to TR31, the definition of which character is acceptable in identifiers is not predetermined for yet unassigned code points. I definitely do not want to push SableCC into such murky areas. 3- SableCC must be able to generate easily usable identifiers in target languages (the killer argument): Limiting identifiers to the current ASCII lowercase (with digits and underscore separator) allows for the most flexible and intuitive identifier selection for each target language. SableCC can chose to keep lowercases and underscores, or convert identifiers to camel case, or do any other simply manipulation, to generate identifiers that match the conventions of the target language. Imagine what would happen if SableCC had to generate plain ISO C 89 code for a grammar containing: étienne_keyword = 'Étienne';Plain portable C code cannot contain unicode characters. SableCC should be able to generate code in various languages: Java, C#, C++, Ruby, Python, Perl , COBOL, Fortran, Basic, Pascal, Ada, Modula, OCaml, ML, Lisp, Scheme, Haskell, etc. In each language, we want identifiers to be easy to write! We definitely do not want users to have to type something that looks like: {
...
_eacute_tienne__keyword e = ...;
...
}
- Non-ASCII characters in character literals should be interpreted as Unicode code points, not as UTF-16 code units. Of course. As I said, SableCC 4 is not Java-specific. It definitely is not defined in terms of choices made (mostly for backward compatibility with 16-bit char type) by the Java API designers. I also think that in generated lexers for the Java target, SableCC should provide its own utf8 byte stream reader that would convert byte sequences into code points stored into Java "int"s (32-bit), then apply the lexer engine on the code points. For intuitiveness, the String representation of tokens should be encoded according to Java usages (UTF-16, right?). - Provide an escape syntax for Unicode characters that can be used anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as in C and C++.) This way any grammar file can trivially be turned into an exactly equivalent 7-bit ASCII encoded version for robust transfer and exchange. Actually, SableCC is even more flexible, it allows for unbounded character codes using the syntax #1236452837 and #x234876abcdef23894.
You're mostly right. SableCC still needs to print error messages, for erroneous grammars. To do this, it needs to be able to unambiguously specify a line number and a character number on that line. This requires level 2 of TR18. Line numbers only requires level 1, but level 2 would provide the expected behavior for text editors. A consequence of level 2 support would be the rejection of incomplete grapheme clusters in the input grammar. One could not write: Lexer // I use [] around each code point. // This is for illustration only; it is not valid SableCC input. token = '[acute modifier]' // the complete grapheme would be '[acute modifier][e]'So, to put an incomplete grapheme in a regular expression, one would need to use the #123 syntax. The big advantages: 1- Correct display of a grammar in Unicode-aware editors. 2- Unambiguous character number (on a line). 3- No rejection of a grammar file by picky Unicode editors. On the other hand, I do not plan to support grapheme clusters in ranges. This would be best provided as helper libraries. e.g. Helpers // rejected: the lower bound is not a single code point some_character_range = '[acute modifier][e]'..'z'; // accepted: (assuming 'a' and 'z' are single code points) letter = 'a'..'z';Anyway, correct handling ranges of grapheme clusters (from a user point of view) requires tr18 level 3 support. I think that generated lexers need to implement TR18 level 2, too, as to give intuitive line/character numbers to tokens. But, all this requires much work. This is why I think that SableCC 4 should either: 1- Limit input grammars to 7-bit ASCII, pending Unicode support -> forward compatible. 2- Handle TR18 level 1 (i.e. understand code points and identify line numbers) -> not completely forward compatible, as it would allow for incomplete grapheme clusters. We can do more in 4.1+ :-) Etienne -- Etienne M. Gagnon, Ph.D. SableCC: http://sablecc.org SableVM: http://sablevm.org _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageOn Thu 2008-05-08 at 09:14h, Etienne M. Gagnon wrote on sablecc-discussion:
> Niklas Matthies wrote: : > >- Allow Unicode identifiers. From my recent experience with TR31 > >I would suggest to allow the intersection of non-uppercase and > >non-titlecase Java identifiers with the XID* set from TR31, and to > >use NFKC normalization for identifer matching and back end identifier > >generation. > > I think that it would be a big mistake to allow Unicode identifiers. > Unicode should be restricted to comments and string literals. I used to have that opinion too for a long time, but working in multi-language projects (German development team working in French application domains with internal documentation in English and Luxemburgish customers...), discussions like those surrounding http://www.python.org/dev/peps/pep-3131/ and learning Japanese eventually changed my mind. The major reason is that if you write a grammar for a domain-specific language, and the domain happens to be language/culture-specific, then either you have to find translations for the domain-specific terms into English, which can be awkward and difficult to follow for everyone familiar with the original domain terms, or you have to transliterate to US-ASCII, which goes from ugly to unreadable depending on the native alphabet. The other reason is for the use in teaching in another language. For example the Chinese would surely appreciate if they can use Chinese identifiers when teaching parser generators and LALR grammars. It's arguably less typical than for general-purpose programming languages, but I found http://groups.google.de/group/comp.lang.python/msg/ccffec1abd4dd24d to be a telling occurrence. Even in German it is a nuisance to have to write stuff like "schluesselwort" and "grossbuchstabe" instead of "schlüsselwort" ("keyword") and "großbuchstabe" ("uppercase letter"). Now think of people with arabic or cyrillic or asian scripts. Here's why: > > 1- Readability of grammars: Imagine seeing the following: > > Lexer > ??????? = '???'; > ???_??? = '?????'; > ... > > This is likely to show when reading a grammar with identifiers made of > characters that are not installed. (Another symbol than '?' might be > used by the reader program). I think that it is less harmful if the only > /unreadable/ parts are comments and literals. - Converting it to Unicode escape sequences -- there would likely be a standard tool for this -- will make it readable (for suitable values of "readable"). - It's okay if a grammar written for a Japanese application domain, derived from a Japanese specification document and using Japanese identifiers can only be correctly displayed under a Japanese-enabled setup. > 2- Keeping clear distinction between identifiers and keywords: SableCC > requires that identifiers be constructed of lowercase characters, and it > reserves uppercase characters for keywords. This only works well on a > small subset of Unicode letters, as some characters do not have > uppercase/lowercase representations. That's not a problem: Keywords are those that start with an uppercase or titlecase character, and all other identifiers are non-keywords. When you look at the original wording of my suggestion you'll see that I already took this into account. It would only be a problem if you wanted to allow user-definable Unicode keywords. > Worse... According to TR31, the definition of which character is > acceptable in identifiers is not predetermined for yet unassigned code > points. Conforming implementations have to identify the version of TR31 and the version of the Unicode standard, so the set of unassigned code points won't change unless the implementation decides to change that identification. Anyhow the TR31 allows implementations to restrict or extend the set of identifiers as they seem fit. The report only provides a reference point and some guidelines. > 3- SableCC must be able to generate easily usable identifiers in target > languages (the killer argument): Limiting identifiers to the current > ASCII lowercase (with digits and underscore separator) allows for the > most flexible and intuitive identifier selection for each target > language. SableCC can chose to keep lowercases and underscores, or > convert identifiers to camel case, or do any other simply manipulation, > to generate identifiers that match the conventions of the target language. Many present-day target languages do support Unicode identifiers. And not every grammar needs to be intuitively mappable to any language. > Imagine what would happen if SableCC had to generate plain ISO C 89 code > for a grammar containing: > > étienne_keyword = 'Étienne'; Easy, it would generate Latin_Small_Letter_E_With_Acute_tienne_keyword or U00E9tienne_keyword. ;) > Plain portable C code cannot contain unicode characters. Luckily support for universal character names has become quite widespread, so generating \u00E9tienne_keyword might be just fine. But really: Many times a grammar will only be used in a local, restricted environment. The great majority of software development continues to be done in closed-source if not culture-specific environments. I would trust developers to make a choice of identifiers that makes sense for the project or the audience the grammar is targeted at. People won't write their grammar with Thai identifiers if their back end language can't handle it in a useful way, or if the grammar will conceivably end up being used by non-Thai. Just as they will or won't do in the rest of their code in whatever programming lanuage they use. Everything else being equal, people will care much less that their grammar translates ugly to COBOL than being forced to translate or transliterate their domain- or culture-specific terms. > I also think that in generated lexers _*for the Java target*_, SableCC > should provide its own utf8 byte stream reader that would convert byte > sequences into code points stored into Java "int"s (32-bit), IMO it should just take a Reader. I find that, more often than not, what I want to parse is already in Java char sequence form. And even if I do happen to have a UTF-8 byte stream, it's just a matter of new InputStreamReader(myUtf8Stream, "UTF-8") or (to catch invalid UTF-8 sequences) new InputStreamReader(myUtf8Stream, Charset.forName("UTF-8").newDecoder()) > then apply the lexer engine on the code points. For intuitiveness, > the String representation of tokens should be encoded according to > Java usages (UTF-16, right?). Yes; matching the input format. But it's also conceivable that one wants to use SableCC to parse byte- or (machine-)word-based input. In that case a different back end would have to be provided, which for example for the case of byte-based input would take an InputStream, and the tokens would then provide a byte array. > >- Provide an escape syntax for Unicode characters that can be used > >anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as > >in C and C++.) This way any grammar file can trivially be turned into > >an exactly equivalent 7-bit ASCII encoded version for robust transfer > >and exchange. > > Actually, SableCC is even more flexible, it allows for unbounded > character codes using the syntax #1236452837 and #x234876abcdef23894. I'm not talking about character literals. I'm talking about the source representation of the grammar file. For example instead of mot_clé_étienne = 'Étienne'; // lui-même one could write mot_cl\u00E9_\u00E9tienne = '\u00C9tienne'; // lui-m\uu00EAme and possibly also \u0023123\u0036452837 instead of #1236452837 (this is a non-sensical example). : > You're mostly right. SableCC still needs to print error messages, for > erroneous grammars. To do this, it needs to be able to unambiguously > specify a line number and a character number on that line. This requires > level 2 of TR18. Line numbers only requires level 1, but level 2 would > provide the expected behavior for text editors. It would be interesting to validate whether the column index displayed by editors (or the highlighting generated from compiler error messages) actually takes grapheme clusters into account. I suspect that many editors still just use their internal character count. -- Niklas Matthies _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageSo far, both sides have interesting points. By the way, many languages have romanisations which cripple their use in technical applicaitons. E.g.: In Chinese, on average 10 to 20 words share the same ASCII romanisation. Like, "buy keyword" and "sell keyword" have the EXACT same typographical romanisation. On the other hand, I know firsthand the frustration of ??? = ??? --non-conformant languages often roll their own encodings (e.g., Myanmar) which is frustrating. The problem as I see it is this: in general purpose languages like C, translating "printf" and "malloc" wouldn't make the language significantly easier to learn. But parser-generators are all about domain-specific custom languages. Asian languages in particular have valid justification for demanding unicode identifiers, while Etienne has sound justification for avoiding them. I vote in favor of them. I feel the user community will simply ignore grammars that use unicode identifiers without good reason. Plus I'm a big fan of well-worded code serving as its own documentation. Is this actually open for discussion, by the way? Please discard my remarks if it's not. :) -->Seth ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageS'orlok Reaves wrote:
> I vote in favor of them. I feel the user community > will simply ignore grammars that use unicode > identifiers without good reason. Plus I'm a big fan of > well-worded code serving as its own documentation. > > Is this actually open for discussion, by the way? > Please discard my remarks if it's not. :) > Of course, it's open for discussion. At this point, what is closed for discussion is the lexer/parser engines of SableCC 4 and anything that would require me to rework the theory. I've done enough of this lately (due to discussions that were not hosted on this list) to dramatically increase the power of the lexer[1]. It is most likely that SableCC 4.0 will be limited to pure ASCII grammar files (due to the amount of work required to support grammars written using Unicode), ensuring upward compatibility with the 4.x (or 5.x) version which will support one form or another of Unicode, depending on the conclusions of discussions as this one. Etienne [1] **** WARNING: This is a teaser. I do not have time to explain all of its details yet. **** In its latest version, the SableCC 4.0 syntax allows for defining recursive comments declaratively, without any user-provided code, e.g. Language test_language; Lexer letter = 'a'..'z' | 'A'..'Z'; digit = '0'..'9'; number = digit+; identifier = letter (letter | digit)*; tab = #9; eol = #13 | #10 | #13 #10; blank = (' ' | tab | eol)+; comment_text = Any+ Exclusion ('/*' | '*/'); Ignored comment, blank; // ignored in the default context Parser file = statement*; statement = identifier '=' exp ';'; exp = number | '(' exp ')' | {binary:} exp op exp; Priority Left Associative binary; op = '+' | '-'; Context comment_body: Token comment = recursive_comment; recursive_comment = '/*' comment_part* '*/'; comment_part = comment_text | recursive_comment; Yep, it's cool. :-) -- Etienne M. Gagnon, Ph.D. SableCC: http://sablecc.org SableVM: http://sablevm.org _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageHi Niklas,
Many of your arguments are quite convincing. The only problem, is that I think that proper handling will require more work than I am ready to put before releasing (finally) 4.0. So, I propose a safe alternative for 4.0: grammars written using 7-bit ASCII only. Then, adding well thought Unicode support in later versions. This does not mean that we should stop discussion. On the contrary! Now is the time to do such a discussion so that when 4.0 is released we can start working on implementation, as the design will be ready. See below. Niklas Matthies wrote: > - Converting it to Unicode escape sequences -- there would likely be a > standard tool for this -- will make it readable (for suitable values > of "readable"). > > - It's okay if a grammar written for a Japanese application domain, > derived from a Japanese specification document and using Japanese > identifiers can only be correctly displayed under a Japanese-enabled > setup. > The escape-sequence representation might actually be the solution. Yet, maybe we need more: a mode that converts back and forth from a specific escape representation that would also allow for specifying code-blocks that should be unescaped (the other characters would be escaped). This would be useful, for example, to a Japanese teacher that wants to make sure students did not obfuscate the source using some characters in non-ASCII & non-Japanese blocks. > That's not a problem: Keywords are those that start with an uppercase > or titlecase character, and all other identifiers are non-keywords. > When you look at the original wording of my suggestion you'll see that > I already took this into account. It would only be a problem if you > wanted to allow user-definable Unicode keywords. > Actually, it is still a problem, as some letters do not have two forms (upper/lower). SableCC, so far, converted identifiers to camel case, to create production class names. e.g. some_prod = ...; => PSomeProd So, maybe we should rethink the whole identifier generation approach: some_prod = ...; => P_some_prod // Maybe? But, then, as SableCC keywords will be limited to ASCII, maybe we could allow for "TR31 identifiers" (I know, it's vague, as we can remove many subsets) that do not contain ASCII uppercase letters. We should make sure, though, that we stop doing things such as camel casing identifiers, as it will break as soon as general unicode letters will be introduced. SableCC 4.0 could still use the old approach; name conversion, to adapt a program to a new naming strategy, can easily be automated (a good old "find/replace" usually works quite well; a little sed script can be faster, even). > Conforming implementations have to identify the version of TR31 and > the version of the Unicode standard, so the set of unassigned code > points won't change unless the implementation decides to change that > identification. Anyhow the TR31 allows implementations to restrict or > extend the set of identifiers as they seem fit. The report only > provides a reference point and some guidelines. > Actually, I am more inclined to get SableCC to simply reject unassigned code points (in grammars). As far as I can understand, this provides most stability with evolving Unicode specifications. As for generated lexers, I think that they should either treat input as binary (no line/column), or, again, reject unassigned code points. This requires no specific SableCC front-end support; the flexible code generator back-end can handle this. > But really: Many times a grammar will only be used in a local, > restricted environment. The great majority of software development > continues to be done in closed-source if not culture-specific > environments. I would trust developers to make a choice of identifiers > that makes sense for the project or the audience the grammar is > targeted at. People won't write their grammar with Thai identifiers if > their back end language can't handle it in a useful way, or if the > grammar will conceivably end up being used by non-Thai. Just as they > will or won't do in the rest of their code in whatever programming > lanuage they use. > > Everything else being equal, people will care much less that their > grammar translates ugly to COBOL than being forced to translate or > transliterate their domain- or culture-specific terms. > Right. > IMO it should just take a Reader. I find that, more often than not, > what I want to parse is already in Java char sequence form. And even > if I do happen to have a UTF-8 byte stream, it's just a matter of > Just to make sure we understand each other: SableCC generated lexers work with code points (actually, with 'single' numeric values). So, if input is in UTF-16, it should be converted to pure numeric code points prior to feeding to the lexer engine. This can be hidden in the lexer constructor or whatever that simplifies the life of the programmer, I don't care, but I don't want the lexer table generator to worry about UTF encoding. > But it's also conceivable that one wants to use SableCC to parse > byte- or (machine-)word-based input. In that case a different back end > would have to be provided, which for example for the case of byte-based > input would take an InputStream, and the tokens would then provide a > byte array. > The lexer engine is not tied to any specific integer size (nor any specific source encoding), so this is not a conceptual problem. For the implementation, providing this flexibility can be achieved in 2 ways: 1- As the lexer itself does not care about encoding, the Java target back-end would generate two constructors (or more); one that expect a Reader (that wraps the Reader into a UTF-16 decoder), and one that expects a stream of int (or BigNumber, if int is too small). 2- Provide distinct Java-Unicode & Java-Binary back-ends. > It would be interesting to validate whether the column index displayed > by editors (or the highlighting generated from compiler error messages) > actually takes grapheme clusters into account. I suspect that many > editors still just use their internal character count. > I definitely think that if we are to support Unicode, we should enforve coherent level-2 input (using only assigned code points) in source code. Open question: how flexible should we be with: 1- The content of comments? Do we allow for any valid level-2 input (without unassigned c.p.), even if it contains potentially obfuscating control codes? 2- Should we put some additional restrictions on string literals? I am asking this even for grammars that were first (un)escaped (for specific blocks). For illustration, I do not think that it would be advisable to allow for ASCII characters in the range 0..31 in string literals. Should we put such restrictions to only allow specific classes of characters in strings (note that I have not fully read the Unicode specifications; its quite a huge things...). Etienne -- Etienne M. Gagnon, Ph.D. SableCC: http://sablecc.org SableVM: http://sablevm.org _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usage> 2- Should we put some additional restrictions on > string literals? Just a heads-up: characters like U+200C and U+200D (zero-width joiner/non-joiner) are often stripped by string-matching algorithms, so they should probably be excluded from literals. (Note: be aware that there are a very small number of cases where such characters ARE used improperly for different semantics (e.g., Sinhala)). Actually, I have a proposal: why not allow the same characters as in Unicode http addresses? (http://www.w3.org/International/articles/idn-and-iri/) Recall that, for security reasons, these guidelines limit characters which have duplicate ASCII-level equivalents, which helps us for readability. They also have a well-formed normalization function to translate to (predictable) ASCII -maybe this could be an alternative to Camel-case? Thoughts? -->Seth ____________________________________________________________________________________ Be a better friend, newshound, and know-it-all with Yahoo! Mobile. Try it now. http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageS'orlok Reaves wrote: > Just a heads-up: characters like U+200C and U+200D > (zero-width joiner/non-joiner) are often stripped by > string-matching algorithms, so they should probably be > excluded from literals. (Note: be aware that there are > a very small number of cases where such characters ARE > used improperly for different semantics (e.g., > Sinhala)). > I definitely do not want SableCC to start worrying about "automagically" ignoring code points that appear in string literals. I would say that, if we allow these characters and a user decides to put some of them in a string literal, then they will be matched. But, yes, this is exactly the kind of things I'd like help deciding. Should we allow them or not? Should we worry about "security issues" (TR36)? One possible answer is: No, as a worried user can always convert the input grammar into "specific" escape-sequence encoding and then easily detect any use of code points outside of expected blocks. Another possible answer is: Yes, as a user can always express a regular expression containing banned control code points using numeric #1234 notation. Regardless of the above answer, we will need to prohibit the use of certain characters in string literals: ASCII 0..31, any representation of a new-line, and the apostrophe character " ' ". Now, should we try to detect all non-ASCII " ' "-looking characters and ban them, to prevent visual spoofing, or should we simply let users rely on escape-sequence conversion to protect themselves from such spoofing? That's the question I have no clear arguments for and against. Just to be clear about the level of integration of Unicode I see in SableCC generate lexers: The only autotomatic (and somehow overridable) processing should be the computation of [line:character] token locations, which require TR18 level-2 support in generated lexers. Anything else should be done by the grammar writer (by hand, or using a library of helpers), and/or at the application level (i.e. in the application that uses tree walkers to visit the AST). > Actually, I have a proposal: why not allow the same > characters as in Unicode http addresses? > (http://www.w3.org/International/articles/idn-and-iri/) > > Recall that, for security reasons, these guidelines > limit characters which have duplicate ASCII-level > equivalents, which helps us for readability. They also > have a well-formed normalization function to translate > to (predictable) ASCII -maybe this could be an > alternative to Camel-case? > source to be normalized (NFC for comments/literals, NFKC for identifiers), and then use some easy encoding, as suggested by Niklas, e.g. étienne -> U00E9tienne The problem of CamelCase (or, more exactly: of lowercase identifier parts separated by '_') is: 1- It requires letters to have two unambiguous forms: lowercase and uppercase. This is not true in general, for unicode letters. 2- Its use had a nice side effect: it allowed for a clear visual (and name space) distinction between SableCC keywords and identifiers. 3- It allowed to generate various forms of identifiers that matched the contextual conventions in generated code: CamelCase, lowerUpperUpper, lower_lower_lower, Upperlower_Upperlower, etc. By allowing Unicode identifiers, we eliminate the possibility of camel case transformations, but, we can still do plenty of transformations. The real problem is that of the SableCC keyword name space. It's just that the SableCC 2-3 limitation of lowercase characters, in identifiers, made sense, given that general use of camel case (to eliminate the '_' separator in generated identifiers). In SableCC 4.1+, it would be much less justified. Why? Because of the same arguments that were used to justify the use Unicode identifiers: in a domain specific context, the use of uppercase letters in an identifier might make sense. Worse: we can't anymore depend on camel casing to trick the system in generating a properly uppercased form of the identifier. So, then, if we were to allow uppercase characters in identifiers, how should we encode SableCC keywords? Personally, I dislike the "traditional" approach for SableCC (specifically, not in general). In other words, I would not like to allow all identifiers, except keywords as is traditionally done in programming langages. Why? Because it make expressing the SableCC grammar in SableCC a big pain. It's so nice to write: Parser ... parser = ... ... In SableCC 2-3, all identifiers are allowed within the identifier name space. There are no special identifiers that are banned. [There's only one ugly exception to this rule to avoid a name clash with java.lang.Object.getClass()]. So, if we are to allow Unicode identifiers, we should probably find a new name space for SableCC keywords, such as not to arbitrarily restrict the use of some identifiers. We should also rethink the form of generated identifiers. This means that in generated Java code, the old approach cannot be used anymore: # cannot be done: some_prod -> abstract PSomeProd {...} # maybe??? some_prod -> abstract P_some_prod {...} Now, this also means that characters such as "_" would not be special anymore. I would be in favor of allowing as large a set as possible of Unicode identifiers as specified by TR31, yet restricting this set as to ensure maximum stability. Then, we need a way to generate unambiguous convention-abiding valid identifiers in target languages. For respect of our user base, we should aim at generating easily usable identifiers in Java, and try to depart as little as possible from the old approach, when possible. For example, we could do the following: prod1 -> Pprod1 prod2_with_tail -> Pprod2_with_tail While not being identical to old names, Pprod1 is close enough to the old behavior. Now, the neat thing is that if we were to allow uppercase letter, the grammar writer could write: Prod1 -> PProd1 Prod2WithTail -> PProd2WithTail This would make the transition of a project from using SableCC 2-3 to SableCC 4 much more convenient. But, then, this does require that we solve the keyword name space problem... So many things to think about. I do need your input, Niklas, S'orlok, and all those who know about Unicode. I am not a Unicode specialist, yet, I would like SableCC to provide Unicode support the right way. For practical reasons, Unicode support should not be tied to the lexer generation engine in any way. We should also try to make SableCC robust in face of an ever changing Unicode specification. This means, in my view, limiting support to things such as "assigned code points". Thanks for your input, Etienne -- Etienne M. Gagnon, Ph.D. SableCC: http://sablecc.org SableVM: http://sablevm.org _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageOn Thu 2008-05-08 at 11:21h, S'orlok Reaves wrote on sablecc-discussion:
: > In Chinese, on average 10 to 20 words share the same > ASCII romanisation. Like, "buy keyword" and "sell > keyword" have the EXACT same typographical > romanisation. I've been told that Chinese programmers use digits as tone marks (as in http://en.wikipedia.org/wiki/Pinyin#Numbers_in_place_of_tone_marks) in identifiers to disambiguate. It's certainly quite a hassle. The other problem with Chinese romanizations is that spoken Chinese is actually a rather diverse family of languages or dialects, hence the romanization can be very different from the regional pronounciation. Chinese characters don't have that problem, as they are independent of the pronounciation. For example even a Japanese who doesn't speak Chinese at all would be able to understand Chinese identifiers. In this regard Chinese characters can actually be more "portable" than ASCII. -- Niklas Matthies _______________________________________________ SableCC-Discussion mailing list SableCC-Discussion@... http://lists.sablecc.org/listinfo/sablecc-discussion |
|
|
Re: Lexer generation performance and memory usageby Niklas Matthies |