unicode 5.1 again

Dear eXistentialists,

A few weeks ago I posted an inquiry about getting eXist to
index new Unicode 5.1 characters properly. Since eXist relies on ischar() and
the versions of Java running on our machines have only a Unicode 5.0 sense of
what is and isn't a character, eXist was failing to treat the new 5.1
characters like the alphabetic characters they are. With help from the eXist
developers, I patched SimpleTokenizer.java so that searches using the XPath
contains() function and range indexes would work as I needed them to.

Today I tried to use the full-text index on the same files
and ran into a nasty surprise: despite my patch, the full-text index is
treating the new Unicode characters as if they were non-alphabetic. This means
that when I search for a word that contains, for example, "xy", where
"x" is a Unicode 5.1 character and "y" is an older Unicode
character, //path[contains(.,'xy')] does what I want but //path[. &= 'xy*']
doesn't. I realize that these two expressions will return somewhat different
results in any case, since the contains() search isn't anchored at the start of
the word, but the problem I'm reporting is that the latter returns elements
that don't contain an 'x' at all.

I'd be grateful for suggestions about how to get the
full-text indexer to treat as alphabetic characters items that Java doesn't
recognize as such.

Sincerely,
David

_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open
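Whether a given JVM treats a code point as alphabetic can be checked directly. The following is a minimal sketch that uses only java.lang.Character; the class name CodePointReport is made up for illustration, and what it reports for U+A641 depends on the Unicode tables shipped with the Java runtime it runs on.

```java
// Minimal diagnostic sketch: report how the running JVM classifies a code point.
// The class name is hypothetical; only java.lang.Character is used.
final class CodePointReport {

    /** Describes one code point: letter status and general category number. */
    static String describe(int codePoint) {
        return String.format("U+%04X letter=%b type=%d",
                codePoint,
                Character.isLetter(codePoint),
                Character.getType(codePoint));
    }

    public static void main(String[] args) {
        // U+0430 has been in Unicode since version 1.0.
        System.out.println(describe(0x0430));
        // U+A641 was added in Unicode 5.1; the result depends on the JVM.
        System.out.println(describe(0xA641));
    }
}
```

If the second line reports letter=false, the runtime's character tables predate Unicode 5.1 for that code point, and any code relying on Character.isLetter() will treat it as a non-letter.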
Re: unicode 5.1 again

Hi David,
just a quick answer before I need to leave: the standard XPath contains() function will *not* use SimpleTokenizer. It just compares strings based on their codepoints. By contrast, the full-text index tokenizes the string and uses SimpleTokenizer for that. This probably means that your modification to SimpleTokenizer never worked, and you will have to check the class. You could try to use text:index-terms to print out the actual index contents (I'd like to show an example, but I don't have time - searching the mail archive should provide some help though).

Wolfgang

> A few weeks ago I posted an inquiry about getting eXist to index new
> Unicode 5.1 characters properly. Since eXist relies on ischar() and the
> versions of Java running on our machines have only a Unicode 5.0 sense
> of what is and isn't a character, eXist was failing to treat the new 5.1
> characters like the alphabetic characters they are. With help from the
> eXist developers, I patched SimpleTokenizer.java so that searches using
> the XPath contains() function and range indexes would work as I needed
> them to.
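The difference between codepoint comparison and tokenization can be illustrated with a small sketch. This is not eXist's SimpleTokenizer, only a simplified stand-in: the class ContainsVsTokens, the method tokenize(), and the predicates LEGACY and PATCHED are made up for illustration. LEGACY simulates a runtime with Unicode 5.0 tables by rejecting the range U+A640..U+A897 mentioned in this thread; PATCHED accepts that range explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch only: a simplified tokenizer, not eXist's SimpleTokenizer.
final class ContainsVsTokens {

    /** True for the code point range added to the tokenizer in this thread. */
    static boolean inPatchedRange(int cp) {
        return cp >= 0xA640 && cp <= 0xA897;
    }

    // Simulates a JVM whose tables do not know the Unicode 5.1 additions.
    static final IntPredicate LEGACY =
            cp -> Character.isLetter(cp) && !inPatchedRange(cp);

    // Accepts ordinary letters plus the explicitly listed range.
    static final IntPredicate PATCHED =
            cp -> Character.isLetter(cp) || inPatchedRange(cp);

    /** Splits text into maximal runs of code points accepted by isWordChar. */
    static List<String> tokenize(String text, IntPredicate isWordChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isWordChar.test(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A rejected code point ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "\u0430\uA641\u0430";
        // Substring search never consults a tokenizer.
        System.out.println("contains: " + word.contains("\uA641\u0430"));
        System.out.println("legacy tokens: " + tokenize(word, LEGACY).size());
        System.out.println("patched tokens: " + tokenize(word, PATCHED).size());
    }
}
```

With LEGACY the three-character word is split into two one-character tokens around U+A641, while PATCHED keeps it as a single token. The substring test succeeds in both cases, which is why a contains() search can look correct even when the tokenizer change has no effect.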
Re: unicode 5.1 again

Dear Wolfgang (cc eXist-open),
> This probably means that your modification to SimpleTokenizer never
> worked and you have to check the class. You could try to use
> text:index-terms to print out the actual index contents (I'd like to
> show an example, but I don't have time - searching the mail archive
> should provide some help though).

Thank you for the quick response. I found a model for text:index-terms on the mailing list, modified it to fit my project, and got what appear to be inconsistent results, as follows.

1. I'm running Sun Java version 1.6.0_05 on a Windows XP machine.

2. I modified SimpleTokenizer as follows:

54c54,56
< } else if (Character.isLetter(ch) || is_mark(ch) || nonBreakingChar(ch) || (allowWildcards && isWildcard(ch))) {
---
> } else if (Character.isLetter(ch) || is_mark(ch) || nonBreakingChar(ch)
> || (ch == '\u2E3A') || (ch >= '\uA640' && ch <= '\uA897')
> || (allowWildcards && isWildcard(ch))) {

I then recompiled.

3. I first searched for index entries beginning with u+0430 followed by u+A641. The first of these characters has been in Unicode since version 1.0 (1991); the second was added when version 5.1 was released last month, and for that reason is not recognized as alphabetic by Java on my machine. The modification to SimpleTokenizer above is intended to tell the indexer to index on this character (and selected others added in Unicode 5.1). This search correctly returned all seven items that begin with this sequence, which I think means that my index recognizes u+A641 as an indexable character.

4. I then reversed the sequence and looked for index entries beginning with u+A641 followed by u+0430. There should be some, since there are words in the document that begin this way. It returned nothing. This seems to tell me that my index does not recognize u+A641 as indexable.

The results of #3 and #4 above look inconsistent to me.
I had expected my modification either to make u+A641 indexable (result #3, which is what I wanted) or not (result #4, which would mean that my modification had failed). I don't understand how u+A641 can be indexable inside a string but not at the beginning. Is there perhaps an additional modification (or even more than one) that I needed to make to SimpleTokenizer? Am I catching the new characters only when they follow other alphabetic characters, but missing them when they *begin* an indexable string?

Sincerely,
David
djbpitt+xml@...
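One way such an asymmetry could arise, offered purely as a hypothetical illustration and not as a description of eXist's actual code, is a tokenizer that tests the first character of a token in one place and the following characters in another. If only the second test is patched, a new character is accepted inside a word but skipped at the start of one. In the sketch below every name (StartVsContinue, CAN_START, CAN_CONTINUE, tokenize) is invented; CAN_START simulates an unpatched check on a runtime with Unicode 5.0 tables, and CAN_CONTINUE mirrors the patched condition for the range U+A640..U+A897.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; not taken from the eXist source.
final class StartVsContinue {

    static boolean inPatchedRange(int cp) {
        return cp >= 0xA640 && cp <= 0xA897;
    }

    // Unpatched test applied to the first character of a token
    // (simulating tables that predate Unicode 5.1).
    static final IntPredicate CAN_START =
            cp -> Character.isLetter(cp) && !inPatchedRange(cp);

    // Patched test applied to every later character of a token.
    static final IntPredicate CAN_CONTINUE =
            cp -> Character.isLetter(cp) || inPatchedRange(cp);

    /**
     * Builds tokens using canStart for the first code point and canContinue
     * for the rest. Assumes every code point accepted by canStart is also
     * accepted by canContinue, so a rejected code point never starts a token.
     */
    static List<String> tokenize(String text, IntPredicate canStart,
                                 IntPredicate canContinue) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean accepted = current.length() == 0
                    ? canStart.test(cp)
                    : canContinue.test(cp);
            if (accepted) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // New character in second position: the whole word survives.
        System.out.println(tokenize("\u0430\uA641\u0431", CAN_START, CAN_CONTINUE)
                .get(0).length());
        // New character in first position: it is dropped from the token.
        System.out.println(tokenize("\uA641\u0430\u0431", CAN_START, CAN_CONTINUE)
                .get(0).length());
    }
}
```

Under these assumptions a word with U+A641 in second position is indexed whole, while a word beginning with U+A641 is indexed without its first character, so a lookup for entries beginning with U+A641 finds nothing. That would match results #3 and #4, but whether SimpleTokenizer really has a second, unpatched character test can only be confirmed by reading the class.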