Full-Text Search Questions
I'm asking the eXist forum for advice on best practices in implementing full-text, Google-like searching across multiple monolithic documents. I have the following in my /db/collection.xconf file:
<fulltext default="all" attributes="false" alphanum="false"/>
If I invoke the query:
//*[. &= 'information']
I get results that include every ancestor node up the chain. What I want is the closest ancestor to the text which possesses an @KEY attribute.
A colleague of mine doing research on this topic eventually came up with a query using a set operation to exclude all nodes except CHAPTER, PGBLK and TASK. This isn't entirely right because the TASK is a child of PGBLK which is a child of CHAPTER. I only want the CHAPTER if the text node is an immediate child of CHAPTER, not when it's contained in the descendant TASK or PGBLK. I also find this a rather complex operation to implement full-text searching. The opinion of my colleague is that I should burst the document into fragments so that this set operation can be eliminated. I don't want to do that just yet as my hope is full-text search in eXist can be made simpler.
First Requirement: Generate a list of the closest ancestor nodes to the matched text that possess an @KEY attribute.
Second Requirement: I need to page the results 10 at a time.
Third Requirement: I need a hit count per @KEY anchor node.
Fourth Requirement (optional): I need the result sorted descending by hit count.
My colleague's solution to these two issues involves sorting the result and taking a subsequence of 10 items and then counting the number of instances in the result for each of the 10 nodes. It's all a bit crazy and quite slow. There has to be a better way. His complete script is as follows:
(: First, this runs findRoots(), which finds all the leaf elements that contain a match, and then converts that to the set of CHAPTER, PGBLK, and TASK elements that contain matches. This is roughly equivalent to finding the matches in the virtually burst documents.
findLocalHits() then allows us to find the nodes that match within each root. This is used to provide a hit count as well as a list of matching nodes.
sort() puts the nodes in order.
This is fairly complex to create and to understand because of the way full-text search works. Full text search matches on any node where any node beneath it contains a match. To make this work in a full AMM, we have to filter the ancestor nodes out from the leaf collection, then walk back up the ancestors to find the closest one to the match.
Because of the complexity of the queries, we run into a performance problem at the point where we get the hit count. We have to find the number of hits and then sort on them. The way this works, that can't be optimized into the initial search query. :)
declare function local:findRoots($s as xs:string) {
    ((collection('amm')/descendant::element()[. &= $s])
        except (collection('amm')/descendant::element()[. &= $s]/ancestor::element()))
    /ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
};

declare function local:findLocalHits($root, $s as xs:string) {
    for $i in ($root/descendant::element()[. &= $s])
        except ($root/descendant::element()[. &= $s]/ancestor::element())
    where $root eq $i/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
    return $i
};

declare function local:sort($roots, $s) {
    for $i in $roots
    order by count(local:findLocalHits($i, $s)) descending
    return $i
};

declare function local:search($s as xs:string, $loc as xs:integer) {
    <root count="{count(local:findRoots($s))}">
    {
        for $i in subsequence(local:sort(local:findRoots($s), $s), $loc, 20)
        let $hits := local:findLocalHits($i, $s)
        return
            <match tag="{name($i)}" key="{$i/@KEY}" title="{$i/TITLE}" hits="{count($hits)}">
            { for $j in $hits return <hit>{$j}</hit> }
            </match>
    }
    </root>
};
local:search("information", 20)
-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Exist-open mailing list Exist-open@... https://lists.sourceforge.net/lists/listinfo/exist-open
|
|
Re: Full-Text Search Questions
I believe that full-text is returning the single closest node and not repeating the hit for each ancestor. So the except logic my colleague included is unnecessary.
I'm looking at creating a full-text index entry that returns @KEY nodes. Something like this:
<create path="*[@KEY]" type="xs:string"/>
Is this the right approach?
|
|
Re: Full-Text Search Questions
I ran this query:
let $c := //*[. &= 'information']
for $x in $c return $x/ancestor::*[@KEY][1]
and it took 432447ms finding 11743 items.
This query:
let $c := //*[. &= 'information']
for $x in $c return $x
takes 937ms.
So why does it take so long finding the ancestor?
|
|
Re: Full-Text Search Questions
> let $c := //*[. &= 'information']
> for $x in $c return $x/ancestor::*[@KEY][1]
The "for" loop is unnecessary and expensive:
let $c := //*[. &= 'information']
return $c/ancestor::*[@KEY][1]
should produce the same result.
Wolfgang
|
|
Re: Full-Text Search Questions
> I'm looking at creating a full-text index entry that returns @KEY nodes.
> Something like this:
>
> <create path="*[@KEY]" type="xs:string"/>
No, the index definition syntax is much simpler than XPath and doesn't allow filters. Either define an index on path="@KEY" or create one on qname="@KEY". The qname definition is usually faster as it is better supported by the optimizer.
Wolfgang
|
|
Re: Full-Text Search Questions
Hi,
I won't give a complete answer, just a few hints...
Todd Gochenour wrote:
> I'm asking the eXist forum
This is not a forum (where the users go by themselves to read your mail); it's a mailing list (where your mail goes by itself to the users). Never mind :-)
> declare function local:findRoots($s as xs:string)
> {
> ((collection('amm')/descendant::element()[. &= $s]) except
> (collection('amm')/descendant::element()[. &= $s]/ancestor::element()))/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
> };
Why not factor out:
(collection('amm')/descendant::element()[. &= $s])
...bind it to a variable, say $i, reference it, and thus avoid double evaluation, i.e.:
let $i := collection('amm')/descendant::element()[. &= $s]
return ($i except $i/ancestor::element())/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
From there, you might prefer working with:
$i/ancestor::element()))/ancestor::CHAPTER |
$i/ancestor::element()))/ancestor::PGBLK |
$i/ancestor::element()))/ancestor::TASK
...rather than iterating over each context item (aka "."). (Factoring out the common part is also possible here, of course.)
And then:
collection('amm')/descendant::element()[. &= $s]
[empty(
./ancestor::element()))/ancestor::CHAPTER |
./ancestor::element()))/ancestor::PGBLK |
./ancestor::element()))/ancestor::TASK
)]
Swapping the two filtering expressions might also be considered, depending on your selectivity. We hope such a choice can be made automatically in the future.
My (untested) 2 cents,
p.b.
|
|
Re: Full-Text Search Questions
Doesn't an index on qname="@KEY" only index the content of the KEY attribute? Since I want the index to occur on CHAPTER, SECTION, SUBJECT, PGBLK, TASK, SUBTASK, and GRAPHIC, all of which have @KEY attributes, I take it I'll have to specify each of these individually. I don't really want to do this as I'm trying for a generic solution that works with all documents without the configuration hassle, but I guess if I have to iterate all these nodes in the index to make this work, that's what I'll have to do. Maybe this isn't the way to go. Maybe I should leave the index on all nodes and try to filter in the query. Maybe...
|
|
Re: Full-Text Search Questions
> Doesn't an index on qname="@KEY" only index the content of the KEY
> attribute? Since I want the index to occur on CHAPTER, SECTION, SUBJECT,
> PGBLK, TASK, SUBTASK, and GRAPHIC, all which have @KEY attributes, I take it
> I'll have to specify each of these individually.
Yes, that's true. It will certainly lead to a huge index and a complex index definition document (we should implement some shortcuts here), but the expected performance gain might be big enough to justify the waste of disk space.
Wolfgang
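Spelled out, the per-element definition discussed in this exchange might look like the fragment below. This is only a sketch: the element names are the ones listed in the question, the fulltext attribute values are copied from the configuration quoted at the start of the thread, and nothing in the thread confirms that this exact file is complete or that a wildcard query such as //*[. &= '...'] would make use of these per-qname indexes.

```xml
<!-- Sketch only: one full-text index entry per element that carries @KEY. -->
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false" alphanum="false">
            <create qname="CHAPTER"/>
            <create qname="SECTION"/>
            <create qname="SUBJECT"/>
            <create qname="PGBLK"/>
            <create qname="TASK"/>
            <create qname="SUBTASK"/>
            <create qname="GRAPHIC"/>
        </fulltext>
        <!-- Index on the KEY attribute value, as suggested earlier in the thread -->
        <create qname="@KEY" type="xs:string"/>
    </index>
</collection>
```

The trade-off named in the reply applies directly: every added qname entry grows the index, so the list is worth keeping to the elements that are actually searched.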
|
|
Re: Full-Text Search Questions
The query:
let $c := //*[. &= 'information']
return $c/ancestor::*[@KEY][1]
...took 137556ms and returned only one item.
|
|
Re: Full-Text Search Questions
> let $c := //*[. &= 'information']
> return $c/ancestor::*[@KEY][1]
>
> ...took 137556ms and returned only one item.
Ok, that can't work then. Anyway, I confirm that $c/ancestor::*[@KEY] is much too slow. Have to check with a profiler.
Wolfgang
|
|
Re: Full-Text Search Questions
I have configured my index as follows:
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false" alphanum="false">
            <create qname="PGBLK"/>
        </fulltext>
        <create qname="@KEY" type="xs:string"/>
        <create qname="@ATACODE" type="xs:string"/>
        <create qname="@id" type="xs:string"/>
    </index>
</collection>
The only node with a full-text index is PGBLK.
I issued the query:
let $c := //*[. &= 'information'] return $c
and I get 0 hits.
But I know the word "information" exists under a PGBLK, so why isn't it producing a hit?
If I understand the @content='mixed' attribute correctly, child element tags are normally treated as whitespace, whereas they are ignored when @content='mixed' is set. Based upon my reading, I want tags to be treated as word boundaries, but now I'm wondering whether text in a child node of a PGBLK is even included in the index unless @content='mixed' is set. I'm off to try this now...
|
|
Re: Full-Text Search Questions
@content='mixed' on PGBLK also does not produce any hits. I must be missing something simple.
|
|
Re: Full-Text Search Questions
The query:
let $c := //*[. &= 'information']
for $x in $c return $x
...returns in 938ms, returning 10 items as configured in the Admin Client Query Dialog.
But the query:
let $c := //*[. &= 'information']
for $x in $c return element {name($x)} {<x/>}
times out by exceeding the 10000 size limit. Evidently the display max value in the Query Dialog no longer applies.
If I issue this query:
let $c := //*[. &= 'information']
for $x in $c[position() < 10] return element {name($x)} {<x/>}
then I get back a response in 297ms.
Queries don't always work the way I think they should, I'm discovering.
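One detail about the last query is worth noting: the predicate [position() < 10] selects the first nine items, not ten. A minimal sketch of an explicit ten-item page, using the standard subsequence() function and the same search term, could look like this. Whether it avoids the size limit in the same way as the predicate form is an assumption, not something measured in this thread.

```xquery
(: Sketch: build result elements for the first ten hits only. :)
let $c := //*[. &= 'information']
(: subsequence($seq, $start, $length) is 1-based, so this takes items 1 to 10 :)
for $x in subsequence($c, 1, 10)
return element { name($x) } { <x/> }
```

Later pages follow by changing the start position, for example subsequence($c, 11, 10) for the second page.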
|
|
Re: Full-Text Search Questions
It appears that the *[@KEY] is the problem. If I invoke the query:
//*[. &= 'information']/ancestor::PGBLK
I get a response in 219ms.
If instead I invoke the query:
//*[. &= 'information']/ancestor::*[@KEY][1]
It takes 160320ms. Lovely.
Looks like I can only match one specific QName in the document rather than the first QNAME ancestor with an @KEY attribute. I don't believe this will fly with my higher-ups, unfortunately. Does this mean I have to burst the document into fragments? Shoot, I was so close to having it solved without this extra complexity.
|
|
Re: Full-Text Search Questions
> Looks like I can only match one specific QName in the document rather than
> the first QNAME ancestor with an @KEY attribute.
No, eXist obviously has a problem with ancestor::*[@KEY]. The profiler shows that 90% of the time is spent in the attribute lookup, not the ancestor step. I still have to find out why.
Wolfgang
|
|
Rest Character Encoding
It appears that certain characters are being encoded twice in eXist URIs. I have a file called "test.xml" within a collection called "test@test". The first time I tried to retrieve test.xml through the REST interface, I encoded the URI (UTF-8) and the retrieval failed. After playing around a bit, I went through the eXist webapp, browsed through my collections and clicked on the test.xml link to see how the webapp created the URI. To my surprise, the "@" symbol was encoded as %2540 and not %40. Has this bug been addressed, or is it being addressed? If not, is this something I could fix and resubmit for you?
Thanks,
John Vogt
|
|
Re: Full-Text Search Questions
> No, eXist obviously has a problem with ancestor::*[@KEY]. The profiler
> shows that 90% of the time is spent in the attribute lookup, not the
> ancestor step. I still have to find out why.
I fixed this issue in trunk. The ancestor::*[@KEY] expression should now be very fast.
Wolfgang
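With the ancestor lookup fixed, the four requirements from the first message (closest @KEY ancestor, paging, hit count per anchor, descending sort) might be sketched as below. This is an untested illustration, not the poster's or the developers' solution: the function name local:search-by-key and its parameters are made up for the example, it assumes that hits are not repeated for each ancestor (as suggested in the second message), and it keeps the per-hit "for" loop because the set-based form returned only one item earlier in the thread.

```xquery
xquery version "1.0";

(: Hypothetical helper: $term is the search word, $start the 1-based
   position of the first result, $size the page size. :)
declare function local:search-by-key(
    $term as xs:string, $start as xs:integer, $size as xs:integer
) as element(root) {
    (: Elements matching the term, using the &= operator from this thread :)
    let $hits := collection('amm')//*[. &= $term]
    (: Closest ancestor carrying @KEY, one entry per hit (duplicates kept) :)
    let $per-hit := for $h in $hits return $h/ancestor::*[@KEY][1]
    (: Union with the empty sequence removes duplicate nodes :)
    let $anchors := $per-hit | ()
    let $sorted :=
        for $a in $anchors
        (: Hit count = number of hits whose closest anchor is this node :)
        let $n := count($per-hit[. is $a])
        order by $n descending
        return <match tag="{name($a)}" key="{$a/@KEY}" hits="{$n}"/>
    return
        <root count="{count($anchors)}">{ subsequence($sorted, $start, $size) }</root>
};

(: First page of ten anchors for the term used throughout the thread :)
local:search-by-key('information', 1, 10)
```

The count per anchor is computed by scanning the per-hit list once for every anchor, which is simple but quadratic in the worst case; for large result sets this step would need measuring before relying on it.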
|
|
Re: Full-Text Search Questions |