Dear eXistentialists,
A few weeks ago I posted an inquiry about getting eXist to index new
Unicode 5.1 characters properly. Since eXist relies on ischar() and the
versions of Java running on our machines have only a Unicode 5.0 sense
of what is and isn't a character, eXist was failing to treat the new 5.1
characters like the alphabetic characters they are. With help from the
eXist developers, I patched SimpleTokenizer.java so that searches using
the XPath contains() function and range indexes would work as I needed
them to.
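
A minimal sketch of this kind of workaround, for illustration only. It is
not the actual change made to SimpleTokenizer.java, and it does not use
any eXist API. The class name ExtraLetters, the methods isWordChar() and
tokenize(), and the single code-point range listed are all hypothetical;
the idea is simply to fall back on a hand-maintained table of ranges when
Java's own Character class does not yet know a character is a letter.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtraLetters {
    // Illustrative assumption: inclusive {first, last} code-point ranges that
    // should count as letters even if the running JVM's Unicode tables
    // predate them. Replace with the ranges your own data needs.
    private static final int[][] EXTRA_LETTER_RANGES = {
        {0xA640, 0xA65F},
    };

    // True if the JVM already treats the code point as a letter or digit,
    // or if it falls inside one of the extra ranges above.
    static boolean isWordChar(int codePoint) {
        if (Character.isLetterOrDigit(codePoint)) {
            return true;
        }
        for (int[] range : EXTRA_LETTER_RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Splits text into runs of word characters, iterating by code point so
    // that characters outside the Basic Multilingual Plane are not broken up.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (isWordChar(codePoint)) {
                current.appendCodePoint(codePoint);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With a check like this, a string such as "\uA640a b" would be split into
two tokens, with the newer character kept inside the first word instead of
being treated as a separator. Every code path that decides what counts as
a word character has to use the same check for the behavior to be
consistent.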
Today I tried to use the full-text index on the same files and ran into
a nasty surprise: despite my patch, the full-text index is treating the
new Unicode characters as if they were non-alphabetic. This means that
when I search for a word that contains, for example, "xy", where "x" is
a Unicode 5.1 character and "y" is an older Unicode character,
//path[contains(.,'xy')] does what I want but //path[. &= 'xy*']
doesn't. I realize that these two expressions will return somewhat
different results in any case, since the contains() search isn't
anchored at the start of the word, but the problem I'm reporting is that
the latter returns elements that don't contain an 'x' at all.
I'd be grateful for suggestions about how to get the full-text indexer
to treat characters as alphabetic when Java doesn't recognize them as such.
Sincerely,
David
djbpitt+xml@...