|
View:
New views
5 Messages
—
Rating Filter:
Alert me
|
|
|
Certain non-zero-length non-matching regexes run forever on SaxonHi Saxon users,
I don't usually do any cross-posts, but this time I think I have to, my apologies beforehand. I abridged the text a bit, sorry if it is still rather wieldy. Original subject line: "Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?" [original, altered message] I happen to run into an XSLT 2 behavior that I am uncertain of if it is desired or expected behavior of implementations, if it is undesired, but still expected, or if it should be prohibited. Now, suppose this smallest possible match is zero length. In this situation, with classic regex parsers, the engine will try forever on a non-matching string (or at least very long). In XSLT, this behavior seems to happen even when the "smallest possible match" is of non-zero length. Note that this applies to subexpressions only, because zero-length matching regex (in its entirety) is not allowed in XSLT. In XSLT this is shown as follows, where I use the example of matching a CSV quoted string which may escape the quote by doubling it: Good csv string examples: "well quoted csv string" "well ""quoted"" csv string Bad csv string example: not well "quoted csv string The regular expression to match a quoted CSV string is (include quotes): "([^"]+|"")*" This gives trouble in the following XSLT (showing a non-matching string), which runs for a very long time (exponential performance chart) <xsl:variable name="ltr"> before " after and some more text after </xsl:variable> <xsl:variable name="regex">"([^"]+|"")*"</xsl:variable> <xsl:analyze-string select="$ltr" regex="{$regex}"> <xsl:matching-substring> <found><xsl:value-of select="." /></found> </xsl:matching-substring> <xsl:non-matching-substring> <not-found><xsl:value-of select="." /></not-found> </xsl:non-matching-substring> </xsl:analyze-string> The above example ran for > 140 minutes and still runs (it does not give any problems with matching strings, but does give problems with strings that have a large non-matching part with a unequal quote count in it). A simpler regex (but with less practicability) that shows the same behavior (quotes are again part of the regex): "([^"]+)*" I am under the impression that such behavior is not desirable, but I am unsure if there is anything in the specs that says something about how implementations should/must deal with this. As a comparison, I tried the example with Perl, which gave no noticeable performance troubles. I am aware of the fact that Saxon relies havily on Sun's Java regex engine, but I also know that it does a translation beforehand. Note that repeating a possible-empty match is dangerous and should be avoided. But in the example above, I use a subexpression that never matches an empty string (the + quantifier means 1 or more) and as such should not happen to fall into the eternal-loop problem. I use Saxon 8.8 with Java 1.5. I did not yet try 8.8.0.4 Note that it may or may not be possible to optimize the regex in such a way that it fails earlier on the given string. But my problem is with the (imo) unpredictability of the performance (or even lockup) with any less-then-very-trivial regex. Cheers, -- Abel Braaksma http://www.nuntia.nl ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ saxon-help mailing list saxon-help@... https://lists.sourceforge.net/lists/listinfo/saxon-help |
|
|
Re: Certain non-zero-length non-matching regexes run forever on Saxon--On Tuesday, January 23, 2007 17:18:56 +0100 Abel Braaksma wrote:
> I am > aware of the fact that Saxon relies havily on Sun's Java regex engine, > but I also know that it does a translation beforehand. In this case, the problem is in the Java regex engine but according to <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051> and the related entries in the Java bug database some matches have exponential time complexity, at least with the algorithms Java uses so this is not a bug they intend to fix. (I don't know if there are other algorithms that can handle such cases.) An alternative for the particular case that seems not to cause the problem is: <xsl:variable name="regex">("[^"]*")+</xsl:variable> -- Owen Rees Hewlett Packard Laboratories, Bristol, UK ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ saxon-help mailing list saxon-help@... https://lists.sourceforge.net/lists/listinfo/saxon-help |
|
|
Re: Certain non-zero-length non-matching regexes run forever on SaxonOwen Rees wrote:
> --On Tuesday, January 23, 2007 17:18:56 +0100 Abel Braaksma wrote: > > In this case, the problem is in the Java regex engine but according to > <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6393051> and the > related entries in the Java bug database some matches have exponential time > complexity, at least with the algorithms Java uses so this is not a bug > they intend to fix. (I don't know if there are other algorithms that can > handle such cases.) > There are, plenty. Try Ruby, Perl, Python etc. None of these seem to have the same problem. > An alternative for the particular case that seems not to cause the problem > is: > > <xsl:variable name="regex">("[^"]*")+</xsl:variable> > Thanks, Owen. That surely solves this particular problem. I didn't want to go into too much detail, but I run into this problem more often than I could hope for. I always make sure that the total regex is never zero-length, but I cannot be sure of the sub matches (not always, at least). The problem gets more complex in light of bidi text, where this originated. However, since this seems to be a Java problem, I'll just have to make sure that I thoroughly test every regex for production quality... -- Abel ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ saxon-help mailing list saxon-help@... https://lists.sourceforge.net/lists/listinfo/saxon-help |
|
|
Re: Certain non-zero-length non-matching regexes run forever onSaxonThe specs have nothing to say about performance.
It's useful to have and to share information about badly-performing constructs, but on the evidence of one example I don't think I plan to do anything about it. Saxon uses the underlying regex engine in Java or .NET, and these seems to perform reasonably well most of the time. I could raise examples of poor performance with the JDK developers, but I'm not sure whether this would lead to anything: they presumably have a lot more information about regex performance than I do. Michael Kay http://www.saxonica.com/ > -----Original Message----- > From: saxon-help-bounces@... > [mailto:saxon-help-bounces@...] On Behalf > Of Abel Braaksma > Sent: 23 January 2007 16:19 > To: Mailing list for SAXON XSLT queries > Subject: [saxon] Certain non-zero-length non-matching regexes > run forever onSaxon > > Hi Saxon users, > > I don't usually do any cross-posts, but this time I think I > have to, my apologies beforehand. I abridged the text a bit, > sorry if it is still rather wieldy. Original subject line: > "Backtracking and eternal loops caused by regular expressions > matching: what to expect from implementations?" > > [original, altered message] > > I happen to run into an XSLT 2 behavior that I am uncertain > of if it is desired or expected behavior of implementations, > if it is undesired, but still expected, or if it should be prohibited. > > Now, suppose this smallest possible match is zero length. In > this situation, with classic regex parsers, the engine will > try forever on a non-matching string (or at least very long). > In XSLT, this behavior seems to happen even when the > "smallest possible match" is of non-zero length. Note that > this applies to subexpressions only, because zero-length > matching regex (in its entirety) is not allowed in XSLT. > > In XSLT this is shown as follows, where I use the example of > matching a CSV quoted string which may escape the quote by > doubling it: > Good csv string examples: > "well quoted csv string" > "well ""quoted"" csv string > > Bad csv string example: > not well "quoted csv string > > The regular expression to match a quoted CSV string is > (include quotes): > "([^"]+|"")*" > > This gives trouble in the following XSLT (showing a > non-matching string), which runs for a very long time > (exponential performance chart) > > <xsl:variable name="ltr"> > before " after and some more text after </xsl:variable> > <xsl:variable name="regex">"([^"]+|"")*"</xsl:variable> > > <xsl:analyze-string select="$ltr" regex="{$regex}"> > <xsl:matching-substring> > <found><xsl:value-of select="." /></found> > </xsl:matching-substring> > <xsl:non-matching-substring> > <not-found><xsl:value-of select="." /></not-found> > </xsl:non-matching-substring> > </xsl:analyze-string> > > > The above example ran for > 140 minutes and still runs (it > does not give any problems with matching strings, but does > give problems with strings that have a large non-matching > part with a unequal quote count in it). A simpler regex (but > with less practicability) that shows the same behavior > (quotes are again part of the regex): "([^"]+)*" > > I am under the impression that such behavior is not > desirable, but I am unsure if there is anything in the specs > that says something about how implementations should/must > deal with this. As a comparison, I tried the example with > Perl, which gave no noticeable performance troubles. I am > aware of the fact that Saxon relies havily on Sun's Java > regex engine, but I also know that it does a translation beforehand. > > Note that repeating a possible-empty match is dangerous and > should be avoided. But in the example above, I use a > subexpression that never matches an empty string (the + > quantifier means 1 or more) and as such should not happen to > fall into the eternal-loop problem. > > I use Saxon 8.8 with Java 1.5. I did not yet try 8.8.0.4 > > Note that it may or may not be possible to optimize the regex > in such a way that it fails earlier on the given string. But > my problem is with the (imo) unpredictability of the > performance (or even lockup) with any less-then-very-trivial regex. > > > Cheers, > -- Abel Braaksma > http://www.nuntia.nl > > -------------------------------------------------------------- > ----------- > Take Surveys. Earn Cash. Influence the Future of IT Join > SourceForge.net's Techsay panel and you'll get the chance to > share your opinions on IT & business topics through brief > surveys - and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge > &CID=DEVDEV > _______________________________________________ > saxon-help mailing list > saxon-help@... > https://lists.sourceforge.net/lists/listinfo/saxon-help ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ saxon-help mailing list saxon-help@... https://lists.sourceforge.net/lists/listinfo/saxon-help |
|
|
Re: Certain non-zero-length non-matching regexes run forever onSaxonMichael Kay wrote:
> The specs have nothing to say about performance. > > It's useful to have and to share information about badly-performing > constructs, but on the evidence of one example I don't think I plan to do > anything about it. Saxon uses the underlying regex engine in Java or .NET, > and these seems to perform reasonably well most of the time. I could raise > examples of poor performance with the JDK developers, but I'm not sure > whether this would lead to anything: they presumably have a lot more > information about regex performance than I do. Hi Michael, I used "one example" for the sake of discussion, I don't want to flood you with all kinds of regexes just proofing my point. But here's perhaps the simplest I could think of that makes Saxon (or Java for that matter) run forever. It is a classical example of an exponentially (or super-linearly) performing regular expression. However, most engines nowadays short-circuit this (almost) eternal backtracking, because it is fairly easy to detect (i.e., in general, when all positions of a string are tested, it is no use to test them again, in which case the engine can stop and report 'no match', but Java does (string-length)! tests, which grows pretty quickly). In Java, it looks like this: Pattern ptrn = Pattern.compile("#(.+)+#"); Matcher matcher = ptrn.matcher("# this text does not match and runs very long"); while(matcher.find()) { System.out.println(matcher.group()); } System.out.println("end"); The regex is designed to find strings anchored by '#' and '#'. However, the (.+)+ construct is used to illustrate exponential looping. The above code runs pretty long, maybe hours or days (I stopped it after 5 minutes). Same for XSLT, of course. Many regexes have a hidden multiple quantifier selection, causing a lockup of the engine. Page 247 in "Mastering Regular Expressions 2nd Ed" by J. Friedl, he explains what can be done about it. In Chapter 6, he explains in depth how backtracking works. Page 226 is about "Exponential matches". I understand that this is a Java problem. If you are in need of several real world examples illustrating this behavior, I can give you plenty. Often, it is a sign of badly crafted regexes (Owen Rees showed me a better one) but sometimes there is little avoidance possible. And even if the regex is badly crafted, should the engine lockup? -- Abel Braaksma http://www.nuntia.nl ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ saxon-help mailing list saxon-help@... https://lists.sourceforge.net/lists/listinfo/saxon-help |
| Free Forum Powered by Nabble | Forum Help |