« Return to Thread: Certain non-zero-length non-matching regexes run forever on Saxon

Certain non-zero-length non-matching regexes run forever on Saxon

by Abel Braaksma (online) :: Rate this Message:

Reply to Author | View in Thread

Hi Saxon users,

I don't usually do any cross-posts, but this time I think I have to, my
apologies beforehand. I abridged the text a bit, sorry if it is still
rather wieldy. Original subject line: "Backtracking and eternal loops
caused by regular expressions matching: what to expect from
implementations?"

[original, altered message]

I happen to run into an XSLT 2 behavior that I am uncertain of if it is
desired or expected behavior of implementations, if it is undesired, but
still expected, or if it should be prohibited.

Now, suppose this smallest possible match is zero length. In this
situation, with classic regex parsers, the engine will try forever on a
non-matching string (or at least very long). In XSLT, this behavior
seems to happen even when the "smallest possible match" is of non-zero
length. Note that this applies to subexpressions only, because
zero-length matching regex (in its entirety) is not allowed in XSLT.

In XSLT this is shown as follows, where I use the example of matching a
CSV quoted string which may escape the quote by doubling it:
Good csv string examples:
"well quoted csv string"
"well ""quoted"" csv string

Bad csv string example:
not well "quoted csv string

The regular expression to match a quoted CSV string is (include quotes):
"([^"]+|"")*"

This gives trouble in the following XSLT (showing a non-matching
string), which runs for a very long time (exponential performance chart)

<xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>

<xsl:analyze-string select="$ltr" regex="{$regex}">
   <xsl:matching-substring>
      <found><xsl:value-of select="." /></found>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <not-found><xsl:value-of select="." /></not-found>
   </xsl:non-matching-substring>
</xsl:analyze-string>


The above example ran for > 140 minutes and still runs (it does not give
any problems with matching strings, but does give problems with strings
that have a large non-matching part with a unequal quote count in it). A
simpler regex (but with less practicability) that shows the same
behavior (quotes are again part of the regex): "([^"]+)*"

I am under the impression that such behavior is not desirable, but I am
unsure if there is anything in the specs that says something about how
implementations should/must deal with this. As a comparison, I tried the
example with Perl, which gave no noticeable performance troubles. I am
aware of the fact that Saxon relies havily on Sun's Java regex engine,
but I also know that it does a translation beforehand.

Note that repeating a possible-empty match is dangerous and should be
avoided. But in the example above, I use a subexpression that never
matches an empty string (the + quantifier means 1 or more) and as such
should not happen to fall into the eternal-loop problem.

I use Saxon 8.8 with Java 1.5. I did not yet try 8.8.0.4

Note that it may or may not be possible to optimize the regex in such a
way that it fails earlier on the given string. But my problem is with
the (imo) unpredictability of the performance (or even lockup) with any
less-then-very-trivial regex.


Cheers,
-- Abel Braaksma
  http://www.nuntia.nl

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
saxon-help mailing list
saxon-help@...
https://lists.sourceforge.net/lists/listinfo/saxon-help

 « Return to Thread: Certain non-zero-length non-matching regexes run forever on Saxon

LightInTheBox - Buy quality products at wholesale price