greediness in lexer rules


by citromatik

Hi all,

I'm trying to filter a file that contains records of data separated as paragraphs (i.e. by two '\n' characters).
What I want to do is get the records that begin with "CO".
This is the beginning of a sample file:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AS  121169 847728  

CO contig00001 108 2 1 U
ggggggggCAAGAGACATAATTTTtGATACCAGAGTAATATGAACACTGC
CAGGTTCTGTATCTTACCCGTAaCTACCGgTATCTACTCCAAGAAACACG
CAATaaaa

BQ
20 20 20 20 20 20 16 16 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 27 64 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64 64 64 59 64 64 64 64 64
64 64 64 64 64 49 64 45 64 64 64 64 64 64 64 64 64 64 53 64 64 64 36 49 64 64 64 64 64 36 64 64 64
64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 35 35 35 35

AF E60LTDW02F58YM U 1
AF EQPIWTT02H8H8V.124-231 U -122

BS 1 108 E60LTDW02F58YM

[...]
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

But there are more of those "CO" records throughout the file (each one always followed by a "BQ" record).

I'm trying to use a lexer. I tried the following rule:

rule contig = parse
  | "CO"_*"\n\n"  { Lexing.lexeme lexbuf }
  | _ { contig lexbuf }

But it appears that this matches from the first "CO" to the end of the file, maybe because of the greediness of the match.

How can I specify a non-greedy match in the lexer? Should I take a different strategy to accomplish this?

Thank you very much in advance,

M;

Re: "ocaml_beginners"::[] greediness in lexer rules

by Martin Jambon

On Tue, 8 Jul 2008, citromatik wrote:

> I'm trying to use a lexer. I tried the following rule:
>
> rule contig = parse
>  | "CO"_*"\n\n"  { Lexing.lexeme lexbuf }
>  | _ { contig lexbuf }
>
> But it appears that this matches from the first "CO" to the end of the file,
> maybe because of the greediness of the match.
>
> How can I specify a non-greedy match in the lexer? Should I take a different
> strategy to accomplish this?


Please check by yourself, but I think you can use "shortest" instead of
"parse". This would apply to the whole rule.
Alternatively, you can use [^'\n']* instead of _*.
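
To make the "shortest" suggestion concrete, here is a minimal sketch rather than a tested solution. As I read the manual, "shortest" selects the shortest prefix matched by any case of the rule, so a one-character fallback such as `_` would always win over the "CO" case. The sketch therefore has no such fallback: it splits the input into paragraphs and filters them in plain OCaml. The names `paragraph`, `is_co` and `co_records`, and the file name, are made up for illustration, and it assumes records are separated by exactly one blank line.

```ocaml
(* paragraphs.mll: hypothetical file name; run ocamllex on it first. *)

(* Under "shortest", each call returns the shortest matching prefix,
   so the second case stops at the first "\n\n" it meets. *)
rule paragraph = shortest
  | eof         { None }                          (* nothing left to read *)
  | _* "\n\n"   { Some (Lexing.lexeme lexbuf) }   (* one record, separator included *)
  | _+ eof      { Some (Lexing.lexeme lexbuf) }   (* last record, no blank line after it *)

{
  (* True when [s] begins with the two characters "CO". *)
  let is_co s = String.length s >= 2 && String.sub s 0 2 = "CO"

  (* Collect every record of [lexbuf] that begins with "CO", in file order. *)
  let co_records lexbuf =
    let rec loop acc =
      match paragraph lexbuf with
      | None -> List.rev acc
      | Some p -> loop (if is_co p then p :: acc else acc)
    in
    loop []
}
```

Something like `co_records (Lexing.from_channel ic)` on an open input channel `ic` should then return the "CO" records as a list of strings.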

Note that your second case may not be what you want: _ means a single
character. Instead you may want to match a whole line with something like:

    [^'\n']* ( '\n' | eof )
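
One caveat with that pattern, as far as I can tell: it also matches the empty string at end of input, so an action that just calls the rule again would never stop. A small sketch that gives eof its own case (the rule name `next_line` is made up for illustration):

```ocaml
(* A sketch: return the next line of input, or None once the input is exhausted. *)
rule next_line = parse
  | ([^'\n']* as line) '\n'   { Some line }   (* a full line; the newline is dropped *)
  | ([^'\n']+ as line) eof    { Some line }   (* a final line with no trailing newline *)
  | eof                       { None }        (* separate case, so a caller's loop can stop *)
```

A caller can then gather lines until it sees an empty one, which marks the end of a record.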


Martin


--
http://wink.com/profile/mjambon
http://mjambon.com/

Re: "ocaml_beginners"::[] greediness in lexer rules

by citromatik

Martin Jambon wrote:
> Please check by yourself, but I think you can use "shortest" instead of
> "parse". This would apply to the whole rule.

Yes, "shortest" works perfectly for this. Thanks a lot for the tip; I overlooked that in the docs.

> Alternatively, you can use [^'\n']* instead of _*.

Hmm, I don't agree. The OCaml manual says that "_" matches any character, whereas "[^'\n']" matches any character except a newline. Note that the records I want to match are multi-line.

M;

Re: "ocaml_beginners"::[] greediness in lexer rules

by Martin Jambon

On Tue, 8 Jul 2008, citromatik wrote:

>> Alternatively, you can use [^'\n']* instead of _*.
>>
> hmmm, I don't agree. The OCaml manual says that "_" matches any character,
> but "[^'\n']*" matches any character except a newline. Note that the records
> I want to match are multi-line.

Right, I didn't realize that.
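
Since the records span several lines, a different strategy that avoids the lexer altogether is to split the text on blank lines and keep the paragraphs that start with "CO". This is only a sketch using the standard library; the function names are made up for illustration, and it assumes the whole file fits in memory as one string.

```ocaml
(* Split [text] into paragraphs: maximal runs of non-empty lines. *)
let paragraphs (text : string) : string list =
  (* Push the lines gathered so far onto [acc] as one paragraph. *)
  let flush cur acc =
    if cur = [] then acc else String.concat "\n" (List.rev cur) :: acc
  in
  let rec go cur acc = function
    | [] -> List.rev (flush cur acc)
    | "" :: rest -> go [] (flush cur acc) rest   (* a blank line ends a paragraph *)
    | line :: rest -> go (line :: cur) acc rest
  in
  go [] [] (String.split_on_char '\n' text)

(* True when [s] begins with the two characters "CO". *)
let starts_with_co (s : string) : bool =
  String.length s >= 2 && String.sub s 0 2 = "CO"

(* Keep only the paragraphs whose first line starts with "CO". *)
let co_records (text : string) : string list =
  List.filter starts_with_co (paragraphs text)
```

Unlike the lexer rule, the records returned here do not include the trailing blank line. With OCaml 4.14 or later the input string can be obtained with `In_channel.with_open_bin path In_channel.input_all`, where `path` is the file name.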


Martin
