« Return to Thread: Certain non-zero-length non-matching regexes run forever on Saxon

Re: Certain non-zero-length non-matching regexes run forever onSaxon

by Abel Braaksma (online) :: Rate this Message:

Reply to Author | View in Thread

Michael Kay wrote:
> The specs have nothing to say about performance.
>
> It's useful to have and to share information about badly-performing
> constructs, but on the evidence of one example I don't think I plan to do
> anything about it. Saxon uses the underlying regex engine in Java or .NET,
> and these seems to perform reasonably well most of the time. I could raise
> examples of poor performance with the JDK developers, but I'm not sure
> whether this would lead to anything: they presumably have a lot more
> information about regex performance than I do.

Hi Michael,

I used "one example" for the sake of discussion, I don't want to flood
you with all kinds of regexes just proofing my point. But here's perhaps
the simplest I could think of that makes Saxon (or Java for that matter)
run forever. It is a classical example of an exponentially (or
super-linearly) performing regular expression.

However, most engines nowadays short-circuit this (almost) eternal
backtracking, because it is fairly easy to detect (i.e., in general,
when all positions of a string are tested, it is no use to test them
again, in which case the engine can stop and report 'no match', but Java
does (string-length)! tests, which grows pretty quickly).

In Java, it looks like this:

            Pattern ptrn = Pattern.compile("#(.+)+#");
            Matcher matcher = ptrn.matcher("# this text does not match
and runs very long");
           
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
            System.out.println("end");



The regex is designed to find strings anchored by '#' and '#'. However,
the (.+)+ construct is used to illustrate exponential looping. The above
code runs pretty long, maybe hours or days (I stopped it after 5
minutes). Same for XSLT, of course.

Many regexes have a hidden multiple quantifier selection, causing a
lockup of the engine. Page 247 in "Mastering Regular Expressions 2nd Ed"
by J. Friedl, he explains what can be done about it. In Chapter 6, he
explains in depth how backtracking works. Page 226 is about "Exponential
matches".

I understand that this is a Java problem. If you are in need of several
real world examples illustrating this behavior, I can give you plenty.
Often, it is a sign of badly crafted regexes (Owen Rees showed me a
better one) but sometimes there is little avoidance possible. And even
if the regex is badly crafted, should the engine lockup?

-- Abel Braaksma
   http://www.nuntia.nl

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

 « Return to Thread: Certain non-zero-length non-matching regexes run forever on Saxon