Silly error (?)

by Sebastián PEÑA SALDARRIAGA

Hello,

I'm currently implementing a parser for UNIPEN files
(http://hwr.nici.kun.nl/unipen/uptools3/general/unipen-def.html) with
SableCC. My grammar produces errors like these:

[3,10] expecting: string
[2,8] expecting: number

for a file that starts with:

.ENCODING UTF8
.COORD X Y P T
.SEGMENT ? ? ? "EC COMMISSION DETAILS (...)"
.PEN_DOWN

My productions for the coord and segment statements look like this:

unipen_file = simple_statement*
    ;
simple_statement = {extra} extra_statement (...)
    | {mandatory} mandatory_statement
    | (...)
    | {annotation} annotation_statement
    | (...)
    ;
extra_statement = encoding string
    ;
mandatory_statement = (...)
    | {coordinates} coord coord_types+
    ;
annotation_statement = (...)
    | {segment} segment string delimit quality label
    | (...)
    ;

where coord_types is something like 'X' | 'Y' etc. Strings and labels are
defined as specified by the definition file:
escape_seq = ('\"' | '\n' | '\t');
lbl_char   = [all - quote] | escape_seq;
lbl        = lbl_char+;
not_cr_lf  = [all - [cr + lf]];
not_tab    = [not_cr_lf - tab];
str_char   = [not_tab - ' '];
str        = ([str_char - '.'] str_char*);
label       = quote lbl quote;


If I use the * operator instead of + with coord_types, I get a "[2,8]
expecting: EOF" error. I'm using SableCC 3.2. Why is the encoding statement
parsed correctly while the others are not? Does anyone have any ideas?

Sebastian PENA


_______________________________________________
SableCC-Discussion mailing list
SableCC-Discussion@...
http://lists.sablecc.org/listinfo/sablecc-discussion

Debug Lexer [Was: Silly error (?)]

by Etienne M. Gagnon

Hi Sebastian,

Have you first checked that the lexer is doing what you expect? Please use a debugging lexer:

    * http://lists.sablecc.org/pipermail/sablecc-user/msg00004.html

Often it is the lexer that is wrong when the parser seems to act inappropriately. :-)

Have fun!

Etienne

-- 
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org



Re: Debug Lexer [Was: Silly error (?)]

by Sebastián PEÑA SALDARRIAGA

Hi Étienne,

Well, it seems that the lexer is indeed the problem. When more than one
regular expression matches, the longest match is selected, so the free_text
token always wins. The definition of string/free text/label in the original
format definition is actually quite loose, which leads to lots of
ambiguities.
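
The longest-match behaviour described above can be reproduced outside SableCC. The following is only a sketch, not SableCC's generated lexer: it re-expresses a handful of the grammar's tokens as java.util.regex patterns (the class name LongestMatchDemo, the pattern translations and the "name=text" output format are all assumptions made for illustration) and picks, at each position, the longest match, with the earlier declaration winning a tie.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Hand-translated approximations of the grammar's helpers (assumption):
    // str    = one non-blank, non-dot character followed by non-blank characters
    // spaces = exactly one of cr, lf, tab or ' '
    private static final String STR = "[^\\r\\n\\t .][^\\r\\n\\t ]*";
    private static final String SP = "[\\r\\n\\t ]";

    // Insertion order mirrors declaration order: on equal length, first wins.
    private static final Map<String, Pattern> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("string", Pattern.compile(STR));
        TOKENS.put("free_text", Pattern.compile("(?:" + STR + SP + ")+" + STR));
        TOKENS.put("x", Pattern.compile("X"));
        TOKENS.put("blank", Pattern.compile(SP + "+"));
        TOKENS.put("coord", Pattern.compile("\\.COORD"));
        TOKENS.put("encoding", Pattern.compile("\\.ENCODING"));
    }

    /** Returns the non-blank tokens as "name=text", using longest match. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> e : TOKENS.entrySet()) {
                Matcher m = e.getValue().matcher(input).region(pos, input.length());
                // Strict '>' keeps the earlier-declared token on a tie.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = e.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no token at offset " + pos);
            }
            if (!bestName.equals("blank")) {
                out.add(bestName + "=" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        tokenize(".ENCODING UTF8\n.COORD X Y P T\n").forEach(System.out::println);
    }
}
```

Under these assumptions, UTF8 comes out as a string because the next word starts with a dot and so cannot continue a free_text, whereas "X Y P T" is swallowed as a single free_text token. That would explain why the encoding statement parses while the coord statement fails at [2,8]. Note also that a lone X is classified as string rather than x, because string is declared first and both matches have the same length.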

Here's a simplified version of my grammar. Do you see any workarounds?

Package fr.lina.phdteshis.unipen.parsing;

Helpers
 
  /** End of line */
  cr  = 13;
  lf  = 10;
  eol = (cr lf | cr | lf);
 
  /** General helpers */
  tab        = 9;
  quote      = '"';
  all        = [0 .. 0xFFFF];
  digit      = ['0' .. '9'];
  spaces     = (cr | lf | ' ' | tab);
  escape_seq = ('\"' | '\n' | '\t');
  lbl_char   = [all - quote] | escape_seq;
  lbl        = lbl_char+;
  not_cr_lf  = [all - [cr + lf]];
  not_tab    = [not_cr_lf - tab];
  str_char   = [not_tab - ' '];
  str        = ([str_char - '.'] str_char*);
  dot        = '.';
  sign       = ('+' | '-');
 
  /** Argument bases */
  unknown   = '?';
  quality_h = ('BAD' | 'OK' | 'GOOD');
 
  /** Stroke delineation */
  comma         = ',';
  minus         = '-';
  base_nbr      = digit+;
  nbr_in_stroke = ':' base_nbr;
  stroke_nbr    = base_nbr nbr_in_stroke?;
  delimit_h     = (((stroke_nbr minus stroke_nbr) | (base_nbr)) comma)*
                  (((stroke_nbr minus stroke_nbr) | (base_nbr)));
 
  /** Keyword declarations */
  coord_h     = 'COORD';
  pen_down_h  = 'PEN_DOWN';
  pen_up_h    = 'PEN_UP';
  segment_h   = 'SEGMENT';
  enconding_h = 'ENCODING';
  comment_h   = 'COMMENT';
 
Tokens
 
  /** Argument types */
  number    = sign? digit* dot? digit+;
  label     = quote lbl quote;
  string    = str;
  free_text = (str spaces)+ str;
  delimit   = delimit_h | unknown;
  quality   = quality_h | unknown;
 
  /** Coord types */
  x      = 'X';
  y      = 'Y';
  p      = 'P';
  t      = 'T';
  z      = 'Z';
  b      = 'B';
  button = 'BUTTON';
  rho    = 'RHO';
  theta  = 'THETA';
  phi    = 'PHI';

 
  /** Ignored keywords as tokens */
  comment      = dot comment_h (str spaces)* str;
  blank        = spaces+;
 
  /** Keywords as tokens */
  coord    = dot coord_h;
  pen_down = dot pen_down_h;
  pen_up   = dot pen_up_h;
  segment  = dot segment_h;
  encoding = dot enconding_h;
 
Ignored Tokens
  blank,
  comment;
 
Productions
 
  unipen_file = simple_statement*
    ;
 
  simple_statement = {encoding} encoding string
    | {coordinates} coord x? y? z? p? t? b? rho? theta? button? phi?
    | {pen} pen_down number+ pen_up
    | {segment} segment string delimit quality label
    ;

Thanks in advance,

Sebastian


Re: Debug Lexer [Was: Silly error (?)]

by Sebastián PEÑA SALDARRIAGA

I tried to cheat by disallowing digits and ? as the first character of a
string. I also moved the 'unknown' token into the Productions, and that
works well, but since the parameters of the .COORD statement are letters,
there is no way to avoid the default behaviour of the lexer. That shi**y
format cannot be parsed unless there is some way to prioritize tokens
according to the production rules they appear in. For now, the only
solution I see is hand-crafted spaghetti code.

Sebastian


RE: Debug Lexer [Was: Silly error (?)]

by Christopher Van Kirk

What is the purpose of the free_text token? That seems to be the root
of your problem. Usually such broadly defined tokens are delimited in some
way, e.g. with quotations, braces or parens.


RE: Debug Lexer [Was: Silly error (?)]

by Christopher Van Kirk

Actually, thinking about it a bit more, perhaps what you need to do is
exclude your keywords from the free_text/string domain. That might give you
the delimiter you need.


Re: Debug Lexer [Was: Silly error (?)]

by Sebastián PEÑA SALDARRIAGA

Its main purpose is multi-line comments, I guess. There is another kind
of token, called label, that is quotation-delimited, but it has a
specific function.
After thinking it through, I decided to handle everything as a string
(no more free_text) and to handle the reserved words such as X, Y and THETA
in my implementation of the depth-first analysis (DepthFirstAdapter).
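
A minimal sketch of that approach, assuming the coordinates alternative is changed to something like "coord string+" so that the lexer only ever produces string tokens there. The class CoordTypes, the enum CoordType and the method classify are hypothetical names, not part of SableCC or of the grammar above. In a real tree walker the texts would presumably come from the string tokens' getText() inside a DepthFirstAdapter subclass.

```java
import java.util.ArrayList;
import java.util.List;

public class CoordTypes {
    // Reserved words that may follow .COORD, taken from the grammar's
    // "Coord types" token list.
    enum CoordType { X, Y, P, T, Z, B, BUTTON, RHO, THETA, PHI }

    /**
     * Maps the texts of generic string tokens to coordinate types.
     * Rejects anything that is not a reserved word, which is the check
     * the lexer can no longer perform once everything is a string.
     */
    static List<CoordType> classify(List<String> stringTokens) {
        List<CoordType> result = new ArrayList<>();
        for (String text : stringTokens) {
            try {
                result.add(CoordType.valueOf(text));
            } catch (IllegalArgumentException e) {
                throw new IllegalArgumentException(
                        "not a coordinate type: " + text, e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of("X", "Y", "P", "T")));
    }
}
```

The trade-off is that the keyword check moves from the lexer to the semantic pass, so a misspelled coordinate name is reported after parsing rather than as a syntax error.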

Sebastian

Christopher Van Kirk a écrit :

> What is the purpose of the free_text production? That seems to be the root
> of your problem. Usually such broadly defined tokens are delimited in some
> way, e.g. with quotations, braces or parens.
>
> -----Original Message-----
> From:
> sablecc-discussion-bounces+chris.vankirk=fdcjapan.com@...
> [mailto:sablecc-discussion-bounces+chris.vankirk=fdcjapan.com@....
> org] On Behalf Of Sebastián PEÑA SALDARRIAGA
> Sent: Thursday, February 14, 2008 7:06 PM
> To: Discussion mailing list for the SableCC project.
> Subject: Re: Debug Lexer [Was: Silly error (?)]
>
>
> Hi Étienne,
>
> Well it seems that the lexer is wrong. Cause when more than one regular
> expression matches a token, the longest match is selected, the free_text
> token is always the first one found. The definition of string/free
> text/label in the original format definition is actually quite loose, it
> leads to lots of ambiguities.
>
> Here's a simplified version of my grammar, do you see any workarounds ?
>
> Package fr.lina.phdteshis.unipen.parsing;
>
> Helpers
>  
>   /** End of line */
>   cr  = 13;
>   lf  = 10;
>   eol = (cr lf | cr | lf);
>  
>   /** General helpers */
>   tab        = 9;
>   quote      = '"';
>   all        = [0 .. 0xFFFF];
>   digit      = [0 .. 9];
>   spaces     = (cr | lf | ' ' | tab);
>   escape_seq = ('\"' | '\n' | '\t');
>   lbl_char   = [all - quote] | escape_seq;
>   lbl        = lbl_char+;
>   not_cr_lf  = [all - [cr + lf]];
>   not_tab    = [not_cr_lf - tab];
>   str_char   = [not_tab - ' '];
>   str        = ([str_char - '.'] str_char*);
>   dot        = '.';
>   sign       = ('+' | '-');
>  
>   /** Argument bases */
>   unknown   = '?';
>   quality_h = ('BAD' | 'OK' | 'GOOD');
>  
>   /** Stroke delineation */
>   comma         = ',';
>   minus         = '-';
>   base_nbr      = digit+;
>   nbr_in_stroke = ':' base_nbr;
>   stroke_nbr    = base_nbr nbr_in_stroke?;
>   delimit_h     = (((stroke_nbr minus stroke_nbr) | (base_nbr)) comma)*
> (((stroke_nbr minus stroke_nbr) | (base_nbr)));
>  
>   /** Keyword declarations */
>   coord_h     = 'COORD';
>   pen_down_h  = 'PEN_DOWN';
>   pen_up_h    = 'PEN_UP';
>   segment_h   = 'SEGMENT';
>   enconding_h = 'ENCODING';
>   comment_h   = 'COMMENT';
>  
> Tokens
>  
>   /** Argument types */
>   number    = sign? digit* dot? digit+;
>   label     = quote lbl quote;
>   string    = str;
>   free_text = (str spaces)+ str;
>   delimit   = delimit_h | unknown;
>   quality   = quality_h | unknown;
>  
>   /** Coord types */
>   x      = 'X';
>   y      = 'Y';
>   p      = 'P';
>   t      = 'T';
>   z      = 'Z';
>   b      = 'B';
>   button = 'BUTTON';
>   rho    = 'RHO';
>   theta  = 'THETA';
>   phi    = 'PHI';
>
>  
>   /** Ignored keywords as tokens */
>   comment      = dot comment_h (str spaces)* str;
>   blank        = spaces+;
>  
>   /** Keywords as tokens */
>   coord    = dot coord_h;
>   pen_down = dot pen_down_h;
>   pen_up   = dot pen_up_h;
>   segment  = dot segment_h;
>   encoding = dot enconding_h;
>  
> Ignored Tokens
>   blank,
>   comment;
>  
> Productions
>  
>   unipen_file = simple_statement*
>     ;
>  
>   simple_statement = {encoding} encoding string
>     | {coordinates} coord x? y? z? p? t? b? rho? theta? button? phi?
>     | {pen} pen_down number+ pen_up
>     | {segment} segment string delimit quality label
>     ;
>
> Thanks in advance,
>
> Sebastian

Re: Debug Lexer [Was: Silly error (?)]

by Sebastián PEÑA SALDARRIAGA

How can I do that? I know how to exclude characters, but not words.

Christopher Van Kirk wrote:

> Actually, thinking about it a bit more, perhaps what you need to do is
> exclude your keywords from the free_text/string domain. That might give you
> the delimiter you need.

Re: Debug Lexer [Was: Silly error (?)]

by Etienne M. Gagnon

Hi Sebastian,

The lexer works as follows:
  1. It first finds the longest string that corresponds to any token definition.
  2. If this longest string corresponds to more than one token, it selects the token that appears first in the Tokens section.
As a consequence, it is always recommended to put keywords and shorter tokens first, in the Tokens section.

As an example, a typical programming language has reserved words and identifiers. Keywords do look like identifiers, but we usually want to give precedence to keywords. Here is how we would do this:
Helpers

  letter = ['a'..'z'];

Tokens

  // keywords
  if = 'if';
  then = 'then';
  else = 'else';

  // identifier
  id = letter+;
As you can see, the string "if" corresponds to both the if and id tokens. Because the if token appears first in the Tokens section, the lexer selects it.

On the other hand, the string "iff" is longer and only corresponds to the id token. So, if the input file contains this string, the id token will be selected (instead of the shorter if token).
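
The two rules above can be tried out outside SableCC. What follows is a minimal Python sketch, not SableCC's actual implementation: the `TOKENS` list and the `next_token` helper are hypothetical names, and Python regular expressions stand in for the token definitions of the example.

```python
import re

# Token definitions, in the same order as the Tokens section of the example.
# (Assumption: plain regular expressions stand in for SableCC definitions.)
TOKENS = [
    ("if", re.compile(r"if")),
    ("then", re.compile(r"then")),
    ("else", re.compile(r"else")),
    ("id", re.compile(r"[a-z]+")),
]


def next_token(text, pos=0):
    """Return (name, lexeme) for the token starting at pos, or None."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        # Only a strictly longer match replaces the current candidate, so on
        # equal length the definition listed first is the one that is kept.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


print(next_token("if"))   # keyword: same length as id, but listed first
print(next_token("iff"))  # identifier: the longest match wins
```

Moving the `id` entry to the top of `TOKENS` would make `next_token("if")` return the identifier instead, which is the situation the ordering advice is meant to avoid.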

SableCC 4 (unavailable) changes this behavior. It detects all conflicts between token definitions and requires the language designer to state token precedence explicitly. This avoids the token precedence problem, which is a recurring one for new users of parsing tools.

In the particular case of your grammar, you should definitely move your longer tokens (number, string, free_text, etc.) to the end of the Tokens list. But you should also pay attention to the desired results: SableCC will always select the longest match first, then the first listed token. If there are different contexts where a string should match distinct tokens, then you need to use lexer states. (That's another source of problems that SableCC 4 attacks.) If you can get away without using lexer states, do it! :-)
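
To illustrate how ordering and longest match interact in this grammar, here is a rough Python sketch that applies the same two rules to a handful of the posted tokens. It is an approximation under stated assumptions: the regular expressions are hand translations of the `spaces`, `str`, `string`, `free_text`, `x`, `y`, `blank` and `coord` definitions, the keywords are already listed first, and `TOKENS`, `next_token` and `tokenize` are hypothetical names rather than anything SableCC generates.

```python
import re

# Hand-translated helpers (assumption: close to, not identical with, the grammar).
SPACE = r"[ \t\r\n]"             # spaces: one whitespace character
STR = r"[^ \t\r\n.][^ \t\r\n]*"  # str: no leading dot, no whitespace inside

# Keywords and short tokens first, longer tokens last.
TOKENS = [
    ("coord", re.compile(r"\.COORD")),
    ("x", re.compile(r"X")),
    ("y", re.compile(r"Y")),
    ("blank", re.compile(SPACE + "+")),
    ("string", re.compile(STR)),
    ("free_text", re.compile("(?:" + STR + SPACE + ")+" + STR)),
]


def next_token(text, pos):
    """Longest match first; on a tie, the token listed first."""
    best = None
    for name, pattern in TOKENS:
        m = pattern.match(text, pos)
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best


def tokenize(text):
    """Split text into (name, lexeme) pairs using next_token."""
    pos, out = 0, []
    while pos < len(text):
        tok = next_token(text, pos)
        if tok is None:
            raise ValueError(f"no token matches at offset {pos}")
        out.append((tok[0], tok[1]))
        pos = tok[2]
    return out


# A single coordinate: x and string tie on length, so x (listed first) wins.
print(tokenize(".COORD X\n"))
# Several coordinates: free_text matches "X Y P T" as one longer token.
print(tokenize(".COORD X Y P T\n"))
```

In this sketch the ordering only settles the tie between `x` and `string`. With several coordinates on the line, `free_text` still wins at column 8 because it is the longest match, which seems consistent with the error reported at [2,8]. That is the kind of case where restricting `free_text` or using lexer states would be needed.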

Have fun!

Etienne


-- 
Etienne M. Gagnon, Ph.D.
SableCC:                                            http://sablecc.org
SableVM:                                            http://sablevm.org

