Lexer generation performance and memory usage

View: New views
20 Messages — Rating Filter:   Alert me  
< Prev | 1 - 2 | Next >

Lexer generation performance and memory usage

by Niklas Matthies :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

It appears that I've hit a good test case for lexer generation
performance. I'm extending a language to Unicode identifier syntax
(see http://www.unicode.org/reports/tr31/) using the XID_Start and
XID_Continue character classes. I adopted these fairly straightforward
from the Unicode standard's DerivedCoreProperties.txt[1] to SableCC 3.2
syntax:

    unicode_xid_start =     // as defined by DerivedCoreProperties.txt of Unicode 5.1.0, but restricted to the BMP
        [0x0041..0x005A] |  // L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
        [0x0061..0x007A] |  // L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
         0x00AA          |  // L&       FEMININE ORDINAL INDICATOR
         0x00B5          |  // L&       MICRO SIGN
         0x00BA          |  // L&       MASCULINE ORDINAL INDICATOR
        [0x00C0..0x00D6] |  // L&  [23] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER O WITH DIAERESIS
        [0x00D8..0x00F6] |  // L&  [31] LATIN CAPITAL LETTER O WITH STROKE..LATIN SMALL LETTER O WITH DIAERESIS
        [0x00F8..0x01BA] |  // L& [195] LATIN SMALL LETTER O WITH STROKE..LATIN SMALL LETTER EZH WITH TAIL
         0x01BB          |  // Lo       LATIN LETTER TWO WITH STROKE
        [0x01BC..0x01BF] |  // L&   [4] LATIN CAPITAL LETTER TONE FIVE..LATIN LETTER WYNN
   
    ... ~650 more lines like this ...

The lexer happens to have five lexer states. After seven long minutes
with 100% CPU, while generating the DFA for the third state SableCC
bails out with java.lang.OutOfMemoryError: Java heap space.

So far so bad.

I suspect that SableCC doesn't recognize that the above is still just
a character class, so one way to improve the lexer specification would
be to use character class syntax with "+" rather than "|" for union.
Étienne, do you think that would help?
Although I have to admit that I'm not really keen on having to
transform the above list into a "+" binary tree...

Another thing is that I was planning to eventually add the characters
outside of the BMP (basic multilingual plane), which with SableCC 3.2
requires them to be handled with surrogate pairs in the lexer grammar,
if I'm not mistaken. So it couldn't be a plain character class any more.

Anyway, maybe SableCC 4 could strive to be able to handle something
like the above. :)

-- Niklas Matthies

[1] http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt


_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Niklas Matthies :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed 2008-05-07 at 06:58h, I wrote on sablecc-discussion:
:
> I suspect that SableCC doesn't recognize that the above is still just
> a character class, so one way to improve the lexer specification would
> be to use character class syntax with "+" rather than "|" for union.

I just gave this a try, and it really does solve the problem. Though
the lexer specifications now looks like LISP with square brackets. :)

-- Niklas Matthies

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Etienne M. Gagnon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Niklas,

(1)  For the "heap space" message, don't forget that Sun's Java virtual
machine limits heap size to 16M (!) by default. So, to handle memory
intensive computations, you need to specify a bigger heap :

$ java -Xmx1024M -jar sablecc.jar mygrammar.sablecc

(2) Effectively, SableCC 3 handles such things less than optimally: it
does not minimize the produced DFA, it does not minimize alphabets,
using "|" instead of '+' has dramatic effect on lexer computation (and
produced lexer size), etc. Not that SableCC 4 corrects all this. In
SableCC 4, the explicit character set notation (using "[" and "]") is
simply eliminated, and the lexer engine minimizes both the DFA and the
alphabet.

(3) I've been looking at Unicode, a little bit. I think that SableCC (4,
or 5?) should require grammar files to be encoded in UTF-8.
Additionally, it should eventually have a "unicode" mode that provides
"Extended Unicode Support" (level 2 of
http://unicode.org/unicode/reports/tr18/) to handle "grapheme clusters"
as single characters.

Note that I do not intend to implement Perl regular expressions
(discussed in tr18) in SableCC. Perl regular expressions are NOT regular
expressions (i.e. they do not describe regular languages, in language
theory terms), they are  a form of pattern matching. Regular languages
have the nice property that they can be matched in linear time. In
contrast, pattern matching is much more expensive.

If Unicode grapheme cluster handling is desired by the community, I will
need help implementing it (i.e. figuring out the various details of the
Unicode specification, implementing the code, etc.). For now, I was only
aiming to provide level 1 of tr18 in SableCC 4 (or possibly not even
that, if it was to delay the release too much). Do not forget that
SableCC 4 should generate lexer in a wide variety of target languages,
so generated lexers cannot depend on Java standard libraries to
accomplish their work. Also, I do not like having dependence on third
party libraries, as they can cause head aches to people that worry about
licensing issues.

Etienne

Niklas Matthies wrote:

> It appears that I've hit a good test case for lexer generation
> performance. I'm extending a language to Unicode identifier syntax
> (see http://www.unicode.org/reports/tr31/) using the XID_Start and
> XID_Continue character classes. I adopted these fairly straightforward
> from the Unicode standard's DerivedCoreProperties.txt[1] to SableCC 3.2
> syntax:
> [...]
> The lexer happens to have five lexer states. After seven long minutes
> with 100% CPU, while generating the DFA for the third state SableCC
> bails out with java.lang.OutOfMemoryError: Java heap space.
> [...]
> I suspect that SableCC doesn't recognize that the above is still just
> a character class, so one way to improve the lexer specification would
> be to use character class syntax with "+" rather than "|" for union.
> Étienne, do you think that would help?
> [...]
> Another thing is that I was planning to eventually add the characters
> outside of the BMP (basic multilingual plane), which with SableCC 3.2
> requires them to be handled with surrogate pairs in the lexer grammar,
> if I'm not mistaken. So it couldn't be a plain character class any more.
>
> Anyway, maybe SableCC 4 could strive to be able to handle something
> like the above. :)
>  
--
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org




_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

signature.asc (265 bytes) Download Attachment

Re: Lexer generation performance and memory usage

by Niklas Matthies :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Wed 2008-05-07 at 10:58h, Etienne M. Gagnon wrote on sablecc-discussion:
> Hi Niklas,
>
> (1)  For the "heap space" message, don't forget that Sun's Java virtual
> machine limits heap size to 16M (!) by default.

The default maximum heap size depends on a lot of factors; with a
current client VM under 32-bit Windows using the default GC it's
usually 64M.

But this is a good point. Trying again, the process succeeded using
less than 200M of memory, in 13 minutes on my machine. The lexer.dat
is about 2.5 the size of the '+' version.

For a running VM one can find out the maximum heap size with jconsole
for example (part of the JDK). Jconsole also gives you pretty graphs
on the actual memory usage.

:
> (3) I've been looking at Unicode, a little bit. I think that SableCC (4,
> or 5?) should require grammar files to be encoded in UTF-8.

Yes, I also think that mandating UTF-8 would be best. UTF-8 makes it
likely that grammar files with the wrong encoding can actually be
detected (and rejected). It's also a good thing to not make it
configurable (à la XML encoding declaration or with a command line
switch), because this only leads to situations where the declared
encoding does not match the actual encoding, and makes it harder for
editors and other tools to process grammar files using the correct
encoding. Been through that too often.

I have a couple of related suggestions:

- Allow Unicode identifiers. From my recent experience with TR31
I would suggest to allow the intersection of non-uppercase and
non-titlecase Java identifiers with the XID* set from TR31, and to
use NFKC normalization for identifer matching and back end identifier
generation.

- Non-ASCII characters in character literals should be interpreted as
Unicode code points, not as UTF-16 code units. For example the UTF-8
sequence 0xF0 0x9D 0x85 0xA0 (U+1D160) should be interpreted as the
single input character 0x1D160, not as the two-character sequence
0xD834 0xDD60 (which is the UTF-16 encoding of U+1D160). Accordingly
the default back end should convert the sequence of Java chars to
Unicode code points when feeding the input to the lexer.

- Provide an escape syntax for Unicode characters that can be used
anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as
in C and C++.) This way any grammar file can trivially be turned into
an exactly equivalent 7-bit ASCII encoded version for robust transfer
and exchange.

:
> If Unicode grapheme cluster handling is desired by the community, I will
> need help implementing it (i.e. figuring out the various details of the
> Unicode specification, implementing the code, etc.).

It would be nice if Unicode support along the lines of TR18 would be
made available in the form of "Helper" libraries (generated from the
Unicode data files) that can be included in the lexer specification
using some sort of include directive. For example instead of special
regular expression syntax like "\p{letter}", a helper symbol
"unicode_letter" would be made available. This would also make it easy
to support different versions of Unicode independent of the SableCC
version. Languages are often specified relative to a specific Unicode
version, and don't want a newer SableCC version or a different JDK to
silently apply a different version of Unicode.

-- Niklas Matthies

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Etienne M. Gagnon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Niklas Matthies wrote:
Yes, I also think that mandating UTF-8 would be best. UTF-8 makes it
likely that grammar files with the wrong encoding can actually be
detected (and rejected). It's also a good thing to not make it
configurable (à la XML encoding declaration or with a command line
switch), because this only leads to situations where the declared
encoding does not match the actual encoding, and makes it harder for
editors and other tools to process grammar files using the correct
encoding. Been through that too often.
  

We agree.

- Allow Unicode identifiers. From my recent experience with TR31
I would suggest to allow the intersection of non-uppercase and
non-titlecase Java identifiers with the XID* set from TR31, and to
use NFKC normalization for identifer matching and back end identifier
generation.
  

I think that it would be a big mistake to allow Unicode identifiers. Unicode should be restricted to comments and string literals. Here's why:

1- Readability of grammars: Imagine seeing the following:
Lexer
 ??????? = '???';
 ???_??? = '?????';
...
This is likely to show when reading a grammar with identifiers made of characters that are not installed. (Another symbol than '?' might be used by the reader program). I think that it is less harmful if the only unreadable parts are comments and literals.

2- Keeping clear distinction between identifiers and keywords: SableCC requires that identifiers be constructed of lowercase characters, and it reserves uppercase characters for keywords. This only works well on a small subset of Unicode letters, as some characters do not have uppercase/lowercase representations.

Worse... According to TR31, the definition of which character is acceptable in identifiers is not predetermined for yet unassigned code points. I definitely do not want to push SableCC into such murky areas.

3- SableCC must be able to generate easily usable identifiers in target languages (the killer argument): Limiting identifiers to the current ASCII lowercase (with digits and underscore separator) allows for the most flexible and intuitive identifier selection for each target language. SableCC can chose to keep lowercases and underscores, or convert identifiers to camel case, or do any other simply manipulation, to generate identifiers that match the conventions of the target language.

Imagine what would happen if SableCC had to generate plain ISO C 89 code for a grammar containing:
 étienne_keyword = 'Étienne';
Plain portable C code cannot contain unicode characters. SableCC should be able to generate code in various languages: Java, C#, C++, Ruby, Python, Perl , COBOL, Fortran, Basic, Pascal, Ada, Modula, OCaml, ML, Lisp, Scheme, Haskell, etc. In each language, we want identifiers to be easy to write! We definitely do not want users to have to type something that looks like:
{
  ...
  _eacute_tienne__keyword e = ...;
  ...
}

- Non-ASCII characters in character literals should be interpreted as
Unicode code points, not as UTF-16 code units.

Of course. As I said, SableCC 4 is not Java-specific. It definitely is not defined in terms of choices made (mostly for backward compatibility with 16-bit char type) by the Java API designers.

I also think that in generated lexers for the Java target, SableCC should provide its own utf8 byte stream reader that would convert byte sequences into code points stored into Java "int"s (32-bit), then apply the lexer engine on the code points. For intuitiveness, the String representation of tokens should be encoded according to Java usages (UTF-16, right?).

- Provide an escape syntax for Unicode characters that can be used
anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as
in C and C++.) This way any grammar file can trivially be turned into
an exactly equivalent 7-bit ASCII encoded version for robust transfer
and exchange.
  

Actually, SableCC is even more flexible, it allows for unbounded character codes using the syntax #1236452837 and #x234876abcdef23894.


It would be nice if Unicode support along the lines of TR18 would be
made available in the form of "Helper" libraries (generated from the
Unicode data files) that can be included in the lexer specification
using some sort of include directive. For example instead of special
regular expression syntax like "\p{letter}", a helper symbol
"unicode_letter" would be made available. This would also make it easy
to support different versions of Unicode independent of the SableCC
version. Languages are often specified relative to a specific Unicode
version, and don't want a newer SableCC version or a different JDK to
silently apply a different version of Unicode.
  

You're mostly right. SableCC still needs to print error messages, for erroneous grammars. To do this, it needs to be able to unambiguously specify a line number and a character number on that line. This requires level 2 of TR18. Line numbers only requires level 1, but level 2 would provide the expected behavior for text editors. A consequence of level 2 support would be the rejection of incomplete grapheme clusters in the input grammar. One could not write:
Lexer
 // I use [] around each code point.
 // This is for illustration only; it is not valid SableCC input.
 token = '[acute modifier]'  // the complete grapheme would be '[acute modifier][e]'
So, to put an incomplete grapheme in a regular expression, one would need to use the #123 syntax. The big advantages:
1- Correct display of a grammar in Unicode-aware editors.
2- Unambiguous character number (on a line).
3- No rejection of a grammar file by picky Unicode editors.

On the other hand, I do not plan to support grapheme clusters in ranges. This would be best provided as helper libraries. e.g.
Helpers
 
 // rejected: the lower bound is not a single code point
 some_character_range = '[acute modifier][e]'..'z';
 
 // accepted: (assuming 'a' and 'z' are single code points)
 letter = 'a'..'z';
Anyway, correct handling ranges of grapheme clusters (from a user point of view) requires tr18 level 3 support.

I think that generated lexers need to implement TR18 level 2, too, as to give intuitive line/character numbers to tokens.

But, all this requires much work. This is why I think that SableCC 4 should either:
1- Limit input grammars to 7-bit ASCII, pending Unicode support -> forward compatible.
2- Handle TR18 level 1 (i.e. understand code points and identify line numbers) -> not completely forward compatible, as it would allow for incomplete grapheme clusters.

We can do more in 4.1+ :-)

Etienne
-- 
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org


_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

signature.asc (265 bytes) Download Attachment

Re: Lexer generation performance and memory usage

by Niklas Matthies :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu 2008-05-08 at 09:14h, Etienne M. Gagnon wrote on sablecc-discussion:
> Niklas Matthies wrote:
:
> >- Allow Unicode identifiers. From my recent experience with TR31
> >I would suggest to allow the intersection of non-uppercase and
> >non-titlecase Java identifiers with the XID* set from TR31, and to
> >use NFKC normalization for identifer matching and back end identifier
> >generation.
>
> I think that it would be a big mistake to allow Unicode identifiers.
> Unicode should be restricted to comments and string literals.

I used to have that opinion too for a long time, but working in
multi-language projects (German development team working in French
application domains with internal documentation in English and
Luxemburgish customers...), discussions like those surrounding
http://www.python.org/dev/peps/pep-3131/ and learning Japanese
eventually changed my mind.

The major reason is that if you write a grammar for a domain-specific
language, and the domain happens to be language/culture-specific, then
either you have to find translations for the domain-specific terms
into English, which can be awkward and difficult to follow for
everyone familiar with the original domain terms, or you have to
transliterate to US-ASCII, which goes from ugly to unreadable
depending on the native alphabet.

The other reason is for the use in teaching in another language. For
example the Chinese would surely appreciate if they can use Chinese
identifiers when teaching parser generators and LALR grammars. It's
arguably less typical than for general-purpose programming languages,
but I found http://groups.google.de/group/comp.lang.python/msg/ccffec1abd4dd24d
to be a telling occurrence.

Even in German it is a nuisance to have to write stuff like
"schluesselwort" and "grossbuchstabe" instead of "schlüsselwort"
("keyword") and "großbuchstabe" ("uppercase letter"). Now think of
people with arabic or cyrillic or asian scripts.

Here's why:

>
> 1- Readability of grammars: Imagine seeing the following:
>
> Lexer
> ??????? = '???';
> ???_??? = '?????';
> ...
>
> This is likely to show when reading a grammar with identifiers made of
> characters that are not installed. (Another symbol than '?' might be
> used by the reader program). I think that it is less harmful if the only
> /unreadable/ parts are comments and literals.

- Converting it to Unicode escape sequences -- there would likely be a
  standard tool for this -- will make it readable (for suitable values
  of "readable").

- It's okay if a grammar written for a Japanese application domain,
  derived from a Japanese specification document and using Japanese
  identifiers can only be correctly displayed under a Japanese-enabled
  setup.

> 2- Keeping clear distinction between identifiers and keywords: SableCC
> requires that identifiers be constructed of lowercase characters, and it
> reserves uppercase characters for keywords. This only works well on a
> small subset of Unicode letters, as some characters do not have
> uppercase/lowercase representations.

That's not a problem: Keywords are those that start with an uppercase
or titlecase character, and all other identifiers are non-keywords.
When you look at the original wording of my suggestion you'll see that
I already took this into account. It would only be a problem if you
wanted to allow user-definable Unicode keywords.

> Worse... According to TR31, the definition of which character is
> acceptable in identifiers is not predetermined for yet unassigned code
> points.

Conforming implementations have to identify the version of TR31 and
the version of the Unicode standard, so the set of unassigned code
points won't change unless the implementation decides to change that
identification. Anyhow the TR31 allows implementations to restrict or
extend the set of identifiers as they seem fit. The report only
provides a reference point and some guidelines.

> 3- SableCC must be able to generate easily usable identifiers in target
> languages (the killer argument): Limiting identifiers to the current
> ASCII lowercase (with digits and underscore separator) allows for the
> most flexible and intuitive identifier selection for each target
> language. SableCC can chose to keep lowercases and underscores, or
> convert identifiers to camel case, or do any other simply manipulation,
> to generate identifiers that match the conventions of the target language.

Many present-day target languages do support Unicode identifiers. And
not every grammar needs to be intuitively mappable to any language.

> Imagine what would happen if SableCC had to generate plain ISO C 89 code
> for a grammar containing:
>
> étienne_keyword = 'Étienne';

Easy, it would generate Latin_Small_Letter_E_With_Acute_tienne_keyword
or U00E9tienne_keyword. ;)

> Plain portable C code cannot contain unicode characters.

Luckily support for universal character names has become quite
widespread, so generating \u00E9tienne_keyword might be just fine.

But really: Many times a grammar will only be used in a local,
restricted environment. The great majority of software development
continues to be done in closed-source if not culture-specific
environments. I would trust developers to make a choice of identifiers
that makes sense for the project or the audience the grammar is
targeted at. People won't write their grammar with Thai identifiers if
their back end language can't handle it in a useful way, or if the
grammar will conceivably end up being used by non-Thai. Just as they
will or won't do in the rest of their code in whatever programming
lanuage they use.

Everything else being equal, people will care much less that their
grammar translates ugly to COBOL than being forced to translate or
transliterate their domain- or culture-specific terms.

> I also think that in generated lexers _*for the Java target*_, SableCC
> should provide its own utf8 byte stream reader that would convert byte
> sequences into code points stored into Java "int"s (32-bit),

IMO it should just take a Reader. I find that, more often than not,
what I want to parse is already in Java char sequence form. And even
if I do happen to have a UTF-8 byte stream, it's just a matter of

   new InputStreamReader(myUtf8Stream, "UTF-8")

or (to catch invalid UTF-8 sequences)

   new InputStreamReader(myUtf8Stream, Charset.forName("UTF-8").newDecoder())

> then apply the lexer engine on the code points. For intuitiveness,
> the String representation of tokens should be encoded according to
> Java usages (UTF-16, right?).

Yes; matching the input format.

But it's also conceivable that one wants to use SableCC to parse
byte- or (machine-)word-based input. In that case a different back end
would have to be provided, which for example for the case of byte-based
input would take an InputStream, and the tokens would then provide a
byte array.

> >- Provide an escape syntax for Unicode characters that can be used
> >anywhere in the grammar file. (For example \uxxxx and \Uxxxxxxxx as
> >in C and C++.) This way any grammar file can trivially be turned into
> >an exactly equivalent 7-bit ASCII encoded version for robust transfer
> >and exchange.
>
> Actually, SableCC is even more flexible, it allows for unbounded
> character codes using the syntax #1236452837 and #x234876abcdef23894.

I'm not talking about character literals. I'm talking about the
source representation of the grammar file. For example instead of

   mot_clé_étienne = 'Étienne';  // lui-même

one could write

   mot_cl\u00E9_\u00E9tienne = '\u00C9tienne';  // lui-m\uu00EAme

and possibly also \u0023123\u0036452837 instead of #1236452837 (this
is a non-sensical example).

:
> You're mostly right. SableCC still needs to print error messages, for
> erroneous grammars. To do this, it needs to be able to unambiguously
> specify a line number and a character number on that line. This requires
> level 2 of TR18. Line numbers only requires level 1, but level 2 would
> provide the expected behavior for text editors.

It would be interesting to validate whether the column index displayed
by editors (or the highlighting generated from compiler error messages)
actually takes grapheme clusters into account. I suspect that many
editors still just use their internal character count.

-- Niklas Matthies

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by S'orlok Reaves :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


So far, both sides have interesting points. By the
way, many languages have romanisations which cripple
their use in technical applicaitons. E.g.:

In Chinese, on average 10 to 20 words share the same
ASCII romanisation. Like, "buy keyword" and "sell
keyword" have the EXACT same typographical
romanisation.

On the other hand, I know firsthand the frustration of
??? = ??? --non-conformant languages often roll their
own encodings (e.g., Myanmar) which is frustrating.


The problem as I see it is this: in general purpose
languages like C, translating "printf" and "malloc"
wouldn't make the language significantly easier to
learn. But parser-generators are all about
domain-specific custom languages. Asian languages in
particular have valid justification for demanding
unicode identifiers, while Etienne has sound
justification for avoiding them.

I vote in favor of them. I feel the user community
will simply ignore grammars that use unicode
identifiers without good reason. Plus I'm a big fan of
well-worded code serving as its own documentation.

Is this actually open for discussion, by the way?
Please discard my remarks if it's not. :)
-->Seth



      ____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Etienne M. Gagnon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

S'orlok Reaves wrote:
> I vote in favor of them. I feel the user community
> will simply ignore grammars that use unicode
> identifiers without good reason. Plus I'm a big fan of
> well-worded code serving as its own documentation.
>
> Is this actually open for discussion, by the way?
> Please discard my remarks if it's not. :)
>  
Of course, it's open for discussion.

At this point, what is closed for discussion is the lexer/parser engines
of SableCC 4 and anything that would require me to rework the theory.
I've done enough of this lately (due to discussions that were not hosted
on this list) to dramatically increase the power of the lexer[1].

It is most likely that SableCC 4.0 will be limited to pure ASCII grammar
files (due to the amount of work required to support grammars written
using Unicode), ensuring upward compatibility with the 4.x (or 5.x)
version which will support one form or another of Unicode, depending on
the conclusions of discussions as this one.

Etienne

[1] **** WARNING: This is a teaser. I do not have time to explain all of
its details yet. ****
In its latest version, the SableCC 4.0 syntax allows for defining
recursive comments declaratively, without any user-provided code, e.g.

Language test_language;

Lexer

  letter = 'a'..'z' | 'A'..'Z';
  digit = '0'..'9';
  number = digit+;
  identifier = letter (letter | digit)*;
  tab = #9;
  eol = #13 | #10 | #13 #10;
  blank = (' ' | tab | eol)+;
  comment_text = Any+ Exclusion ('/*' | '*/');

  Ignored
   comment, blank;  // ignored in the default context

Parser

  file = statement*;

  statement = identifier '=' exp ';';

  exp =
    number | '(' exp ')' |
    {binary:} exp op exp;

   Priority
    Left Associative binary;

  op = '+' | '-';

 Context comment_body:

  Token comment = recursive_comment;

  recursive_comment = '/*' comment_part* '*/';

  comment_part = comment_text | recursive_comment;

Yep, it's cool. :-)

--
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org




_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

signature.asc (265 bytes) Download Attachment

Re: Lexer generation performance and memory usage

by Etienne M. Gagnon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

Hi Niklas,

Many of your arguments are quite convincing. The only problem, is that I
think that proper handling will require more work than I am ready to put
before releasing (finally) 4.0. So, I propose a safe alternative for
4.0: grammars written using 7-bit ASCII only. Then, adding well thought
Unicode support in later versions.

This does not mean that we should stop discussion. On the contrary! Now
is the time to do such a discussion so that when 4.0 is released we can
start working on implementation, as the design will be ready.

See below.

Niklas Matthies wrote:
> - Converting it to Unicode escape sequences -- there would likely be a
>   standard tool for this -- will make it readable (for suitable values
>   of "readable").
>
> - It's okay if a grammar written for a Japanese application domain,
>   derived from a Japanese specification document and using Japanese
>   identifiers can only be correctly displayed under a Japanese-enabled
>   setup.
>  
The escape-sequence representation might actually be the solution. Yet,
maybe we need more: a mode that converts back and forth from a specific
escape representation that would also allow for specifying code-blocks
that should be unescaped (the other characters would be escaped).

This would be useful, for example, to a Japanese teacher that wants to
make sure students did not obfuscate the source using some characters in
non-ASCII & non-Japanese blocks.

> That's not a problem: Keywords are those that start with an uppercase
> or titlecase character, and all other identifiers are non-keywords.
> When you look at the original wording of my suggestion you'll see that
> I already took this into account. It would only be a problem if you
> wanted to allow user-definable Unicode keywords.
>  
Actually, it is still a problem, as some letters do not have two forms
(upper/lower).

SableCC, so far, converted identifiers to camel case, to create
production class names. e.g.

 some_prod = ...;
=>
 PSomeProd

So, maybe we should rethink the whole identifier generation approach:

 some_prod = ...;
=>
 P_some_prod  // Maybe?

But, then, as SableCC keywords will be limited to ASCII, maybe we could
allow for "TR31 identifiers" (I know, it's vague, as we can remove many
subsets) that do not contain ASCII uppercase letters. We should make
sure, though, that we stop doing things such as camel casing
identifiers, as it will break as soon as general unicode letters will be
introduced.

SableCC 4.0 could still use the old approach; name conversion, to adapt
a program to a new naming strategy, can easily be automated (a good old
"find/replace" usually works quite well; a little sed script can be
faster, even).

> Conforming implementations have to identify the version of TR31 and
> the version of the Unicode standard, so the set of unassigned code
> points won't change unless the implementation decides to change that
> identification. Anyhow the TR31 allows implementations to restrict or
> extend the set of identifiers as they seem fit. The report only
> provides a reference point and some guidelines.
>  

Actually, I am more inclined to get SableCC to simply reject unassigned
code points (in grammars). As far as I can understand, this provides
most stability with evolving Unicode specifications.

As for generated lexers, I think that they should either treat input as
binary (no line/column), or, again, reject unassigned code points. This
requires no specific SableCC front-end support; the flexible code
generator back-end can handle this.

> But really: Many times a grammar will only be used in a local,
> restricted environment. The great majority of software development
> continues to be done in closed-source if not culture-specific
> environments. I would trust developers to make a choice of identifiers
> that makes sense for the project or the audience the grammar is
> targeted at. People won't write their grammar with Thai identifiers if
> their back end language can't handle it in a useful way, or if the
> grammar will conceivably end up being used by non-Thai. Just as they
> will or won't do in the rest of their code in whatever programming
> lanuage they use.
>  
OK.

> Everything else being equal, people will care much less that their
> grammar translates ugly to COBOL than being forced to translate or
> transliterate their domain- or culture-specific terms.
>  

Right.

> IMO it should just take a Reader. I find that, more often than not,
> what I want to parse is already in Java char sequence form. And even
> if I do happen to have a UTF-8 byte stream, it's just a matter of
>  

Just to make sure we understand each other: SableCC generated lexers
work with code points (actually, with 'single' numeric values). So, if
input is in UTF-16, it should be converted to pure numeric code points
prior to feeding to the lexer engine. This can be hidden in the lexer
constructor or whatever that simplifies the life of the programmer, I
don't care, but I don't want the lexer table generator to worry about
UTF encoding.

> But it's also conceivable that one wants to use SableCC to parse
> byte- or (machine-)word-based input. In that case a different back end
> would have to be provided, which for example for the case of byte-based
> input would take an InputStream, and the tokens would then provide a
> byte array.
>  

The lexer engine is not tied to any specific integer size (nor any
specific source encoding), so this is not a conceptual problem.

For the implementation, providing this flexibility can be achieved in 2
ways:
1- As the lexer itself does not care about encoding, the Java target
back-end would generate two constructors (or more); one that expect a
Reader (that wraps the Reader into a UTF-16 decoder), and one that
expects a stream of int (or BigNumber, if int is too small).
2- Provide distinct Java-Unicode & Java-Binary back-ends.

> It would be interesting to validate whether the column index displayed
> by editors (or the highlighting generated from compiler error messages)
> actually takes grapheme clusters into account. I suspect that many
> editors still just use their internal character count.
>  

I definitely think that if we are to support Unicode, we should enforve
coherent level-2 input (using only assigned code points) in source code.

Open question: how flexible should we be with:
1- The content of comments? Do we allow for any valid level-2 input
(without unassigned c.p.), even if it contains potentially obfuscating
control codes?
2- Should we put some additional restrictions on string literals?

I am asking this even for grammars that were first (un)escaped (for
specific blocks).

For illustration, I do not think that it would be advisable to allow for
ASCII characters in the range 0..31 in string literals. Should we put
such restrictions to only allow specific classes of characters in
strings (note that I have not fully read the Unicode specifications; its
quite a huge things...).

Etienne

--
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org




_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

signature.asc (265 bytes) Download Attachment

Re: Lexer generation performance and memory usage

by S'orlok Reaves :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


> 2- Should we put some additional restrictions on
> string literals?
Just a heads-up: characters like U+200C and U+200D
(zero-width joiner/non-joiner) are often stripped by
string-matching algorithms, so they should probably be
excluded from literals. (Note: be aware that there are
a very small number of cases where such characters ARE
used improperly for different semantics (e.g.,
Sinhala)).


Actually, I have a proposal: why not allow the same
characters as in Unicode http addresses?
(http://www.w3.org/International/articles/idn-and-iri/)

Recall that, for security reasons, these guidelines
limit characters which have duplicate ASCII-level
equivalents, which helps us for readability. They also
have a well-formed normalization function to translate
to (predictable) ASCII -maybe this could be an
alternative to Camel-case?

Thoughts?
-->Seth



      ____________________________________________________________________________________
Be a better friend, newshound, and
know-it-all with Yahoo! Mobile.  Try it now.  http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Etienne M. Gagnon :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message



S'orlok Reaves wrote:
> Just a heads-up: characters like U+200C and U+200D
> (zero-width joiner/non-joiner) are often stripped by
> string-matching algorithms, so they should probably be
> excluded from literals. (Note: be aware that there are
> a very small number of cases where such characters ARE
> used improperly for different semantics (e.g.,
> Sinhala)).
>  

I definitely do not want SableCC to start worrying about "automagically"
ignoring code points that appear in string literals. I would say that,
if we allow these characters and a user decides to put some of them in a
string literal, then they will be matched.

But, yes, this is exactly the kind of things I'd like help deciding.
Should we allow them or not? Should we worry about "security issues" (TR36)?

One possible answer is: No, as a worried user can always convert the
input grammar into "specific" escape-sequence encoding and then easily
detect any use of code points outside of expected blocks.

Another possible answer is: Yes, as a user can always express a regular
expression containing banned control code points using numeric #1234
notation.

Regardless of the above answer, we will need to prohibit the use of
certain characters in string literals: ASCII 0..31, any representation
of a new-line, and the apostrophe character " ' ". Now, should we try to
detect all non-ASCII " ' "-looking characters and ban them, to prevent
visual spoofing, or should we simply let users rely on escape-sequence
conversion to protect themselves from such spoofing? That's the question
I have no clear arguments for and against.


Just to be clear about the level of integration of Unicode I see in
SableCC generate lexers: The only autotomatic (and somehow overridable)
processing should be the computation of [line:character] token
locations, which require TR18 level-2 support in generated lexers.
Anything else should be done by the grammar writer (by hand, or using a
library of helpers), and/or at the application level (i.e. in the
application that uses tree walkers to visit the AST).

> Actually, I have a proposal: why not allow the same
> characters as in Unicode http addresses?
> (http://www.w3.org/International/articles/idn-and-iri/)
>
> Recall that, for security reasons, these guidelines
> limit characters which have duplicate ASCII-level
> equivalents, which helps us for readability. They also
> have a well-formed normalization function to translate
> to (predictable) ASCII -maybe this could be an
> alternative to Camel-case?
>  
Transformation to ASCII is not a problem in itself; we can mandate the
source to be normalized (NFC for comments/literals, NFKC for
identifiers), and then use some easy encoding, as suggested by Niklas, e.g.

étienne -> U00E9tienne

The problem of CamelCase (or, more exactly: of lowercase identifier
parts separated by '_') is:
1- It requires letters to have two unambiguous forms: lowercase and
uppercase. This is not true in general, for unicode letters.
2- Its use had a nice side effect: it allowed for a clear visual (and
name space) distinction between SableCC keywords and identifiers.
3- It allowed to generate various forms of identifiers that matched the
contextual conventions in generated code: CamelCase, lowerUpperUpper,
lower_lower_lower, Upperlower_Upperlower, etc.

By allowing Unicode identifiers, we eliminate the possibility of camel
case transformations, but, we can still do plenty of transformations.

The real problem is that of the SableCC keyword name space. It's just
that the SableCC 2-3 limitation of lowercase characters, in identifiers,
made sense, given that general use of camel case (to eliminate the '_'
separator in generated identifiers). In SableCC 4.1+, it would be much
less justified. Why? Because of the same arguments that were used to
justify the use Unicode identifiers: in a domain specific context, the
use of uppercase letters in an identifier might make sense. Worse: we
can't anymore depend on camel casing to trick the system in generating a
properly uppercased form of the identifier.

So, then, if we were to allow uppercase characters in identifiers, how
should we encode SableCC keywords?

Personally, I dislike the "traditional" approach for SableCC
(specifically, not in general). In other words, I would not like to
allow all identifiers, except keywords as is traditionally done in
programming langages. Why? Because it make expressing the SableCC
grammar in SableCC a big pain. It's so nice to write:

Parser
...
  parser = ...
...

In SableCC 2-3, all identifiers are allowed within the identifier name
space. There are no special identifiers that are banned. [There's only
one ugly exception to this rule to avoid a name clash with
java.lang.Object.getClass()].

So, if we are to allow Unicode identifiers, we should probably find a
new name space for SableCC keywords, such as not to arbitrarily restrict
the use of some identifiers.

We should also rethink the form of generated identifiers. This means
that in generated Java code, the old approach cannot be used anymore:

# cannot be done:
some_prod  ->  abstract PSomeProd {...}

# maybe???
some_prod  ->  abstract P_some_prod {...}

Now, this also means that characters such as "_" would not be special
anymore. I would be in favor of allowing as large a set as possible of
Unicode identifiers as specified by TR31, yet restricting this set as to
ensure maximum stability.

Then, we need a way to generate unambiguous convention-abiding valid
identifiers in target languages. For respect of our user base, we should
aim at generating easily usable identifiers in Java, and try to depart
as little as possible from the old approach, when possible. For example,
we could do the following:

prod1 -> Pprod1
prod2_with_tail -> Pprod2_with_tail

While not being identical to old names, Pprod1 is close enough to the
old behavior. Now, the neat thing is that if we were to allow uppercase
letter, the grammar writer could write:

Prod1 -> PProd1
Prod2WithTail -> PProd2WithTail

This would make the transition of a project from using SableCC 2-3 to
SableCC 4 much more convenient. But, then, this does require that we
solve the keyword name space problem...

So many things to think about. I do need your input, Niklas, S'orlok,
and all those who know about Unicode. I am not a Unicode specialist,
yet, I would like SableCC to provide Unicode support the right way. For
practical reasons, Unicode support should not be tied to the lexer
generation engine in any way. We should also try to make SableCC robust
in face of an ever changing Unicode specification. This means, in my
view, limiting support to things such as "assigned code points".

Thanks for your input,

Etienne

--
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org




_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

signature.asc (265 bytes) Download Attachment

Re: Lexer generation performance and memory usage

by Niklas Matthies :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

On Thu 2008-05-08 at 11:21h, S'orlok Reaves wrote on sablecc-discussion:
:
> In Chinese, on average 10 to 20 words share the same
> ASCII romanisation. Like, "buy keyword" and "sell
> keyword" have the EXACT same typographical
> romanisation.

I've been told that Chinese programmers use digits as tone marks (as
in http://en.wikipedia.org/wiki/Pinyin#Numbers_in_place_of_tone_marks)
in identifiers to disambiguate. It's certainly quite a hassle.

The other problem with Chinese romanizations is that spoken Chinese is
actually a rather diverse family of languages or dialects, hence the
romanization can be very different from the regional pronounciation.
Chinese characters don't have that problem, as they are independent of
the pronounciation. For example even a Japanese who doesn't speak
Chinese at all would be able to understand Chinese identifiers. In
this regard Chinese characters can actually be more "portable" than
ASCII.

-- Niklas Matthies

_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Re: Lexer generation performance and memory usage

by Niklas Matthies