greediness in lexer rules

Hi all,

I'm trying to filter a file that has some records of data separated as paragraphs (i.e. with two '\n' characters). What I want to do is to get the records that begin with "CO". This is the beginning of a sample file:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AS 121169 847728

CO contig00001 108 2 1 U
ggggggggCAAGAGACATAATTTTtGATACCAGAGTAATATGAACACTGC
CAGGTTCTGTATCTTACCCGTAaCTACCGgTATCTACTCCAAGAAACACG
CAATaaaa

BQ
20 20 20 20 20 20 16 16 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 27 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 59 64 64 64 64 64
64 64 64 64 64 49 64 45 64 64 64 64 64 64 64 64 64 64 53 64 64 64 36 49 64 64 64 64 64 36 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
64 64 64 64 35 35 35 35

AF E60LTDW02F58YM U 1
AF EQPIWTT02H8H8V.124-231 U -122

BS 1 108 E60LTDW02F58YM

[...]
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

But there are more of those "CO" records throughout the file (always followed by a "BQ" record).

I'm trying to use a lexer. I tried the following rule:

rule contig = parse
  | "CO"_*"\n\n" { Lexing.lexeme lexbuf }
  | _ { contig lexbuf }

But it appears that this matches from the first "CO" to the end of the file, maybe because of the greediness of the match.

How can I specify a non-greedy match in the lexer? Should I take a different strategy to accomplish this?

Thank you very much in advance,

M;

Re: "ocaml_beginners"::[] greediness in lexer rules

On Tue, 8 Jul 2008, citromatik wrote:

> I'm trying to use a lexer. I tried the following rule:
>
> rule contig = parse
>   | "CO"_*"\n\n" { Lexing.lexeme lexbuf }
>   | _ { contig lexbuf }
>
> But it appears that this matches from the first "CO" to the end of the file,
> maybe because of the greediness of the match.
>
> How can I specify a non-greedy match in the lexer? Should I take a different
> strategy to accomplish this?

Please check by yourself, but I think you can use "shortest" instead of "parse". This would apply to the whole rule.

Alternatively, you can use [^'\n']* instead of _*.

Note that your second case may not be what you want: _ means a single character. Instead you may want to match a whole line with something like:

  [^'\n']* ( '\n' | eof )

Martin

--
http://wink.com/profile/mjambon
http://mjambon.com/

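As an illustration of what a shortest match means for this input, here is a minimal sketch in plain OCaml that does not use ocamllex at all. It is not the lexer rule from the thread. The helper names `find_from`, `starts_record` and `co_records` are hypothetical, and the sketch assumes that a record starts with "CO" at the beginning of a line and ends at the first "\n\n" after it (or at the end of the input).

```ocaml
(* Index of the first occurrence of [sub] in [s] at or after [from],
   or None if there is none.  Hypothetical helper, stdlib only. *)
let find_from (s : string) (from : int) (sub : string) : int option =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go from

(* Assumption: a record starts with "CO" at the very beginning of the
   input or right after a newline. *)
let starts_record (s : string) (i : int) : bool =
  i + 2 <= String.length s
  && String.sub s i 2 = "CO"
  && (i = 0 || s.[i - 1] = '\n')

(* Collect every "CO" record.  Each record ends at the FIRST blank line
   after its start (shortest match), not at the last blank line in the
   input (which is what a longest match of _* would reach). *)
let co_records (s : string) : string list =
  let n = String.length s in
  let rec scan i acc =
    if i >= n then List.rev acc
    else if starts_record s i then
      match find_from s i "\n\n" with
      | Some j ->
        (* Keep the terminating "\n\n", as the lexeme in the rule would. *)
        scan (j + 2) (String.sub s i (j + 2 - i) :: acc)
      | None ->
        (* Last record without a trailing blank line. *)
        List.rev (String.sub s i (n - i) :: acc)
    else scan (i + 1) acc
  in
  scan 0 []
```

Calling `co_records` on the contents of the file gives one string per "CO" record; the "AS", "BQ", "AF" and "BS" paragraphs are skipped one character at a time, much like the `_` case in the original rule.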
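Re: "ocaml_beginners"::[] greediness in lexer rules

Martin Jambon wrote:
> Please check by yourself, but I think you can use "shortest" instead of
> "parse". This would apply to the whole rule.

Yes, "shortest" works perfectly on this. Thanks a lot for the tip, I overlooked that from the docs.

> Alternatively, you can use [^'\n']* instead of _*.

hmmm, I don't agree. The OCaml manual says that "_" matches any character, but "[^'\n']*" matches any character except a newline. Note that the records I want to match are multi-line.

M;
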
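The objection about multi-line records can be checked with a small sketch in plain OCaml. `non_newline_run` is a hypothetical helper that mimics how far a pattern like [^'\n']* can reach from a given position; the shortened `record` value is made up for illustration and is not taken from the sample file verbatim.

```ocaml
(* Length of the longest run of characters other than '\n' in [s],
   starting at position [i].  This mimics the reach of [^'\n']*. *)
let non_newline_run (s : string) (i : int) : int =
  let n = String.length s in
  let rec go j = if j < n && s.[j] <> '\n' then go (j + 1) else j - i in
  go i

(* A shortened, made-up multi-line record in the shape of the sample. *)
let record = "CO contig00001 108 2 1 U\nggggggggCAAGAG\nCAATaaaa\n\n"

(* Only the header line is covered: the run stops at the first '\n'. *)
let covered = String.sub record 0 (non_newline_run record 0)

(* After that run, "CO"[^'\n']*"\n\n" would need "\n\n" immediately, but
   the character after the first newline belongs to the sequence line. *)
let blank_line_follows =
  let k = non_newline_run record 0 in
  k + 1 < String.length record && record.[k] = '\n' && record.[k + 1] = '\n'
```

Here `covered` holds only the header line and `blank_line_follows` is false, so a newline-excluding class cannot span a record whose body takes several lines.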
Re: "ocaml_beginners"::[] greediness in lexer rules

On Tue, 8 Jul 2008, citromatik wrote:

> Martin Jambon wrote:
>> Alternatively, you can use [^'\n']* instead of _*.
>
> hmmm, I don't agree. The OCaml manual says that "_" matches any character,
> but "[^'\n']*" matches any character except a newline. Note that the records
> I want to match are multi-line.

Right, I didn't realize that.

Martin