unicode 5.1 again

by Birnbaum, David J

Dear eXistentialists,

A few weeks ago I posted an inquiry about getting eXist to index new Unicode 5.1 characters properly. Since eXist relies on ischar() and the versions of Java running on our machines have only a Unicode 5.0 sense of what is and isn't a character, eXist was failing to treat the new 5.1 characters like the alphabetic characters they are. With help from the eXist developers, I patched SimpleTokenizer.java so that searches using the XPath contains() function and range indexes would work as I needed them to.

Today I tried to use the full-text index on the same files and ran into a nasty surprise: despite my patch, the full-text index is treating the new Unicode characters as if they were non-alphabetic. This means that when I search for a word that contains, for example, "xy", where "x" is a Unicode 5.1 character and "y" is an older Unicode character, //path[contains(.,'xy')] does what I want but //path[. &= 'xy*'] doesn't. I realize that these two expressions will return somewhat different results in any case, since the contains() search isn't anchored at the start of the word, but the problem I'm reporting is that the latter returns elements that don't contain an 'x' at all.

I'd be grateful for suggestions about how to get the full-text indexer to treat as alphabetic characters items that Java doesn't recognize as such.

Sincerely,

David
djbpitt+xml@...

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: unicode 5.1 again

by Wolfgang Meier-2

Hi David,

just a quick answer before I need to leave: the standard XPath
contains() function will *not* use SimpleTokenizer. It just compares
strings based on their codepoints. By contrast, the full-text index
tokenizes the string and uses SimpleTokenizer for that.
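
To make the distinction concrete, here is a minimal Java sketch. It is not eXist code: the class ContainsVersusTokens, its methods, and the helper oldJvmIsLetter are all hypothetical. oldJvmIsLetter imitates a JVM without Unicode 5.1 data by treating U+A640..U+A69F as non-letters, so the behaviour is the same on any modern JVM.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy contrast between substring matching and token-based matching. */
final class ContainsVersusTokens {

    /** Assumption: stand-in for a letter test that predates Unicode 5.1. */
    static boolean oldJvmIsLetter(char ch) {
        boolean pretendUnknown = ch >= '\uA640' && ch <= '\uA69F';
        return Character.isLetter(ch) && !pretendUnknown;
    }

    /** Splits text into runs of characters that pass the letter test. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toCharArray()) {
            if (oldJvmIsLetter(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Prefix search over tokens, loosely in the spirit of a 'xy*' query. */
    static boolean anyTokenStartsWith(String text, String prefix) {
        for (String token : tokenize(text)) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "\uA641\u0430\u0431 \u0432\u0433";
        String query = "\uA641\u0430";
        // Substring comparison never consults the letter test.
        System.out.println("contains: " + text.contains(query));
        // Token-based search only sees what the tokenizer let through.
        System.out.println("tokens: " + tokenize(text));
        System.out.println("prefix match: " + anyTokenStartsWith(text, query));
    }
}
```

Under these assumptions the substring test succeeds, while the token list never contains the new character, so a prefix search over tokens cannot match it.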

This probably means that your modification to SimpleTokenizer never
worked and you have to check the class. You could try to use
text:index-terms to print out the actual index contents (I'd like to
show an example, but I don't have time - searching the mail archive
should provide some help though).

Wolfgang


> A few weeks ago I posted an inquiry about getting eXist to index new
> Unicode 5.1 characters properly. Since eXist relies on ischar() and the
> versions of Java running on our machines have only a Unicode 5.0 sense
> of what is and isn't a character, eXist was failing to treat the new 5.1
> characters like the alphabetic characters they are. With help from the
> eXist developers, I patched SimpleTokenizer.java so that searches using
> the XPath contains() function and range indexes would work as I needed
> them to.

Re: unicode 5.1 again

by Birnbaum, David J

Dear Wolfgang (cc eXist-open),

> This probably means that your modification to SimpleTokenizer never
> worked and you have to check the class. You could try to use
> text:index-terms to print out the actual index contents (I'd like to
> show an example, but I don't have time - searching the mail archive
> should provide some help though).

Thank you for the quick response. I found a model for text:index-terms on the mailing list, modified it to fit my project, and got what appear to be inconsistent results, as follows.

1. I'm running Sun Java version 1.6.0_05 on a Windows XP machine.

2. I modified SimpleTokenizer as follows:

54c54,56
<                       } else if (Character.isLetter(ch) || is_mark(ch) || nonBreakingChar(ch) || (allowWildcards && isWildcard(ch))) {
---
>                       } else if (Character.isLetter(ch) || is_mark(ch) || nonBreakingChar(ch)
>                       || (ch == '\u2E3A') || (ch >= '\uA640' && ch <= '\uA897')
>                       || (allowWildcards && isWildcard(ch))) {

I then recompiled.

3. I first searched for index entries beginning with U+0430 followed by U+A641. The first of these characters has been in Unicode since version 1.0 (1991); the second was added when version 5.1 was released last month, and for that reason Java on my machine does not recognize it as alphabetic. The modification to SimpleTokenizer above is intended to tell the indexer to index on this character (and selected others added in Unicode 5.1). This search correctly returned all seven items that begin with this sequence, which I think means that my index recognizes U+A641 as an indexable character.

4. I then reversed the sequence and looked for index entries beginning with U+A641 followed by U+0430. There should be some, since there are words in the document that begin this way. It returned nothing. This seems to tell me that my index does not recognize U+A641 as indexable.
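
For anyone who wants to check what their own JVM thinks of these characters, the following is a small Java sketch. The class PatchedLetterTest and its methods are hypothetical names; isIndexable copies only the extra character ranges from the patch in step 2 and leaves out is_mark, nonBreakingChar and the wildcard test, which belong to SimpleTokenizer itself.

```java
/** Sketch of the widened letter test from step 2, plus a JVM diagnostic. */
final class PatchedLetterTest {

    /** Assumption: mirrors only the ranges added by the patch. */
    static boolean isIndexable(char ch) {
        return Character.isLetter(ch)
                || ch == '\u2E3A'
                || (ch >= '\uA640' && ch <= '\uA897');
    }

    /** Reports how the running JVM classifies a character. */
    static String describe(char ch) {
        return String.format("U+%04X type=%d isLetter=%b indexable=%b",
                (int) ch,
                Character.getType(ch),   // 0 is Character.UNASSIGNED
                Character.isLetter(ch),
                isIndexable(ch));
    }

    public static void main(String[] args) {
        // The isLetter column depends on the Unicode version of the JVM;
        // the indexable column does not for the patched ranges.
        System.out.println(describe('\u0430'));
        System.out.println(describe('\uA641'));
        System.out.println(describe('\u2E3A'));
    }
}
```

If isLetter is false and the type is 0 for U+A641, the JVM predates that character, and only the explicit ranges make it indexable.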

The results of #3 and #4 above look inconsistent to me. I had expected my modification either to make U+A641 indexable (result #3, which is what I wanted) or to leave it unindexable (result #4, which would mean that my modification had failed). I don't understand how U+A641 can be indexable inside a string but not at the beginning. Is there perhaps an additional modification (or even more than one) that I needed to make to SimpleTokenizer? Am I catching the new characters only when they follow other alphabetic characters, but missing them when they *begin* an indexable string?
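
The guess in the last question can be sketched in Java as follows. This is purely an illustration of the hypothesis and says nothing about how SimpleTokenizer is really structured: the class TokenStartSketch and both helper methods are invented, and oldJvmIsLetter simulates a pre-5.1 JVM by treating U+A640..U+A69F as non-letters.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer with separate "start" and "continue" letter tests. */
final class TokenStartSketch {

    /** Assumption: stand-in for a letter test that predates Unicode 5.1. */
    static boolean oldJvmIsLetter(char ch) {
        return Character.isLetter(ch) && !(ch >= '\uA640' && ch <= '\uA69F');
    }

    /** Widened test, used here only to extend a token already begun. */
    static boolean patchedIsLetter(char ch) {
        return oldJvmIsLetter(ch) || (ch >= '\uA640' && ch <= '\uA897');
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Hypothetical unpatched branch: may a token start here?
            if (!oldJvmIsLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i++;
            // Patched branch: does the current token continue?
            while (i < text.length() && patchedIsLetter(text.charAt(i))) {
                i++;
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // New character inside a word: kept, as in result #3.
        System.out.println(tokenize("\u0430\uA641\u0431"));
        // New character at the start of a word: dropped, as in result #4.
        System.out.println(tokenize("\uA641\u0430\u0431"));
    }
}
```

If the real tokenizer has a second, unpatched test of this kind, it would produce exactly the pattern in #3 and #4; whether it does can only be settled by reading the class.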

Sincerely,

David
djbpitt+xml@...

