Full-Text Search Questions


by ToddG

I'm asking the eXist forum for advice on best practices for implementing Google-like full-text search across multiple monolithic documents.  I have the following in my /db/collection.xconf file:
 
<fulltext default="all" attributes="false" alphanum="false"/>
 
If I invoke the query:
 
//*[. &= 'information']
 
I get results that include every ancestor node up the chain.  What I want is the closest ancestor to the text which possesses an @KEY attribute. 
 
A colleague of mine doing research on this topic eventually came up with a query using a set operation to exclude all nodes except CHAPTER, PGBLK and TASK. This isn't entirely right, because TASK is a child of PGBLK, which is a child of CHAPTER.  I only want the CHAPTER if the text node is an immediate child of CHAPTER, not when it's contained in a descendant TASK or PGBLK.   I also find this a rather complex way to implement full-text searching.  My colleague's opinion is that I should burst the document into fragments so that this set operation can be eliminated.  I don't want to do that just yet, as my hope is that full-text search in eXist can be made simpler.
 
First Requirement: Generate a list of the closest ancestor nodes of the matched text that possess an @KEY attribute.
Second Requirement: I need to page the results 10 at a time.
Third Requirement: I need a hit count per @KEY anchor node.
Fourth Requirement (optional): I need the result sorted descending by hit count.
 
My colleague's solution to these two issues involves sorting the result and taking a subsequence of 10 items and then counting the number of instances in the result for each of the 10 nodes.  It's all a bit crazy and quite slow.  There has to be a better way.  His complete script is as follows:
 

(:
    First, this runs findRoots(), which finds all the leaf elements that contain a match, and then converts that to the set of CHAPTER, PGBLK, and TASK
    elements that contain matches.  This is roughly equivalent to finding the matches in the virtually burst documents.

    findLocalHits() then allows us to find the nodes that match within each root.  This is used to provide a hit count as well as a list of matching nodes.

    sort() puts the nodes in order.

    This is fairly complex to create and to understand because of the way full-text search works.  Full-text search matches on any node where any node beneath
    it contains a match.  To make this work in a full AMM, we have to filter the ancestor nodes out from the leaf collection, then walk back up the ancestors to
    find the closest one to the match.

    Because of the complexity of the queries, we run into a performance problem at the point where we get the hit count.  We have to find the number of hits and
    then sort on them.  The way this works, that can't be optimized into the initial search query.
:)

declare function local:findRoots($s as xs:string)
{
   ((collection('amm')/descendant::element()[. &= $s]) except (collection('amm')/descendant::element()[. &= $s]/ancestor::element()))/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
};

declare function local:findLocalHits($root, $s as xs:string)
{
    for $i in ($root/descendant::element()[. &= $s]) except ($root/descendant::element()[. &= $s]/ancestor::element())
    where $root eq $i/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
    return $i
};

declare function local:sort($roots, $s)
{
    for $i in $roots
    order by count(local:findLocalHits($i, $s)) descending
    return $i
};

declare function local:search($s as xs:string, $loc as xs:integer)
{
    <root count="{count(local:findRoots($s))}">
    {
        for $i in subsequence(local:sort(local:findRoots($s), $s), $loc, 20)
        let $hits := local:findLocalHits($i, $s)
        return
            <match tag="{name($i)}" key="{$i/@KEY}" title="{$i/TITLE}" hits="{count($hits)}">
                {
                    for $j in $hits
                    return <hit>{$j}</hit>
                }
            </match>
    }
    </root>
};

local:search("information", 20)
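
For comparison, the four requirements listed earlier might be expressed more directly by anchoring on @KEY instead of on a fixed list of element names. The following is only a rough, untested sketch: the function names local:anchor and local:search-by-key are hypothetical, it assumes the same 'amm' collection and the eXist-specific &= operator used in the script above, and it uses ancestor-or-self so that a matching element which itself carries @KEY counts as its own anchor. It still counts hits for every anchor before paging, so it does not by itself remove the performance concern.

```xquery
(: Hypothetical helper: closest element at or above $n that has a @KEY attribute. :)
declare function local:anchor($n as node()) as element()?
{
    $n/ancestor-or-self::*[@KEY][1]
};

(: Hypothetical entry point: $start is the 1-based position of the first
   anchor to return, $size is the page length. :)
declare function local:search-by-key($s as xs:string, $start as xs:integer, $size as xs:integer)
{
    (: Keep only the innermost matches: drop any match that is an
       ancestor of another match. :)
    let $matches := collection('amm')//*[. &= $s]
    let $leaves := $matches except $matches/ancestor::*
    (: A path expression returns distinct nodes, so each anchor appears once. :)
    let $anchors := $leaves/ancestor-or-self::*[@KEY][1]
    let $sorted :=
        for $a in $anchors
        (: Hit count = innermost matches whose closest @KEY element is $a. :)
        let $n := count($leaves[local:anchor(.) is $a])
        order by $n descending
        return <match tag="{name($a)}" key="{$a/@KEY}" hits="{$n}"/>
    return
        <root count="{count($anchors)}">
        { subsequence($sorted, $start, $size) }
        </root>
};

(: First page of ten anchors. :)
local:search-by-key('information', 1, 10)
```

The per-anchor count compares every innermost match against every anchor, so the cost grows with the product of the two; that trade-off is accepted here to keep the sketch short.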

 
 
 
 
 
 
 

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Full-Text Search Questions

by ToddG

I believe that full-text is returning the single closest node and not repeating the hit for each ancestor.  So the except logic my colleague included is unnecessary. 
 
I'm looking at creating a full-text index entry that returns @KEY nodes.  Something like this:
 
<create path="*[@KEY]" type="xs:string"/>
 
Is this the right approach?

Re: Full-Text Search Questions

by ToddG

I ran this query:
 
let $c := //*[. &= 'information']
for $x in $c return $x/ancestor::*[@KEY][1]
 
and it took 432447ms finding 11743 items.
 
This query:
 
let $c := //*[. &= 'information']
for $x in $c return $x
 
takes 937ms.
 
So why does it take so long finding the ancestor?

Re: Full-Text Search Questions

by Wolfgang Meier-2

> let $c := //*[. &= 'information']
> for $x in $c return $x/ancestor::*[@KEY][1]

The "for" loop is unnecessary and expensive:

let $c := //*[. &= 'information']
return $c/ancestor::*[@KEY][1]

should produce the same result.

Wolfgang

Re: Full-Text Search Questions

by Wolfgang Meier-2

> I'm looking at creating a full-text index entry that returns @KEY nodes.
> Something like this:
>
> <create path="*[@KEY]" type="xs:string"/>

No, the index definition syntax is much simpler than XPath and doesn't
allow filters. Either define an index on path="@KEY" or create one on
qname="@KEY". The qname definition is usually faster as it is better
supported by the optimizer.

Wolfgang

Re: Full-Text Search Questions

by Pierrick Brihaye

Hi,

I won't give a complete answer, just a few hints...

Todd Gochenour wrote:

> I'm asking the eXist forum

This is not a forum (where the users go by themselves to read your mail);
it's a mailing list (where your mail goes by itself to the users).
Never mind :-)

> declare function local:findRoots($s as xs:string)
>
> {
>
>    ((collection('amm')/descendant::element()[. &= $s]) except
> (collection('amm')/descendant::element()[. &=
> $s]/ancestor::element()))/ancestor::*[name(.) = ('CHAPTER', 'PGBLK',
> 'TASK')][1]
>
> };

Why not factor out:
(collection('amm')/descendant::element()[. &= $s])
... bind it to a variable, say $i, reference it, and thus avoid double
evaluation, i.e.:

let $i := collection('amm')/descendant::element()[. &= $s]
return ($i except $i/ancestor::element())/ancestor::*[name(.) =
('CHAPTER', 'PGBLK', 'TASK')][1]

From there, you might prefer working with:

($i except $i/ancestor::element())/ancestor::CHAPTER |
($i except $i/ancestor::element())/ancestor::PGBLK |
($i except $i/ancestor::element())/ancestor::TASK

...rather than iterating over each context item (aka ".").

(factorization is also possible here, of course)

and then :

collection('amm')/descendant::element()[. &= $s] [empty(
./ancestor::element()))/ancestor::CHAPTER |
./ancestor::element()))/ancestor::PGBLK |
./ancestor::element()))/ancestor::'TASK'
)]

Swapping the 2 filtering expressions might also be considered, depending
on your selectivity. We hope such a choice can be made automatically in
the future.

My (untested) 2 cents,

p.b.


Re: Full-Text Search Questions

by ToddG

Doesn't an index on qname="@KEY" only index the content of the KEY attribute?  Since I want the index to occur on CHAPTER, SECTION, SUBJECT, PGBLK, TASK, SUBTASK, and GRAPHIC, all which have @KEY attributes, I take it I'll have to specify each of these individually.  I don't really want to do this as I'm trying for a generic solution that works with all documents without the configuration hassle, but I guess if I have to iterate all these nodes in the index to make this work, that's what I'll have to do.  Maybe this isn't the way to go.  Maybe I should leave the index on all nodes and try to filter in the query. Maybe...
 

Re: Full-Text Search Questions

by Wolfgang Meier-2

> Doesn't an index on qname="@KEY" only index the content of the KEY
> attribute?  Since I want the index to occur on CHAPTER, SECTION, SUBJECT,
> PGBLK, TASK, SUBTASK, and GRAPHIC, all which have @KEY attributes, I take it
> I'll have to specify each of these individually.

Yes, that's true. It will certainly lead to a huge index and a complex
index definition document (we should implement some shortcuts here),
but the expected performance gain might be big enough to justify the
waste of disk space.

Wolfgang

Re: Full-Text Search Questions

by ToddG

The query:
 
let $c := //*[. &= 'information']
return $c/ancestor::*[@KEY][1]
 
...took 137556ms and returned only one item.

Re: Full-Text Search Questions

by Wolfgang Meier-2

> let $c := //*[. &= 'information']
> return $c/ancestor::*[@KEY][1]
>
> ...took 137556ms and returned only one item.

Ok, that can't work then. Anyway, I confirm that $c/ancestor::*[@KEY]
is much too slow. Have to check with a profiler.

Wolfgang

Re: Full-Text Search Questions

by ToddG

I have configured my index as follows:
 
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false" alphanum="false">
            <create qname="PGBLK"/>
        </fulltext>
        <create qname="@KEY" type="xs:string"/>
        <create qname="@ATACODE" type="xs:string"/>
        <create qname="@id" type="xs:string"/>
    </index>
</collection>
 
The only node with a full-text index is PGBLK. 
 
I issued the query:
 
let $c := //*[. &= 'information'] return $c
 
and I get 0 hits.
 
But I know the word "information" exists under a PGBLK, so why isn't it producing a hit?

If I understand the @content='mixed' attribute correctly, the difference is that child nodes are normally treated as whitespace, and are ignored when @content='mixed' is set.  Based upon my reading, I want tags to be treated as word boundaries, but now I'm wondering whether text in a child node of a PGBLK is even included in the index unless @content='mixed' is set.   I'm off to try this now...

Re: Full-Text Search Questions

by ToddG

@content='mixed' on PGBLK also does not produce any hits.  I must be missing something simple.

Re: Full-Text Search Questions

by ToddG

The query:
 
let $c := //*[. &= 'information']
for $x in $c return $x
 
...returns in 938ms with 10 items, as configured in the Admin Client Query Dialog.
 
 
But the query:
 
let $c := //*[. &= 'information']
for $x in $c return element {name($x)} {<x/>}
 
times out by exceeding the 10000 size limit.  Evidently the display max value in the Query Dialog no longer applies.
 
If I issue this query:
 
let $c := //*[. &= 'information']
for $x in $c[position() < 10] return element {name($x)} {<x/>}
 
Then I get back a response in 297ms.
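
One detail of the predicate above: $c[position() < 10] keeps nine items, not ten. A minimal sketch of the same query using the standard fn:subsequence function to take exactly the first ten hits follows; the start position and length are illustrative values only, and no timing is implied.

```xquery
let $c := //*[. &= 'information']
(: subsequence($seq, 1, 10) returns items 1 through 10,
   the same items as $c[position() le 10]. :)
for $x in subsequence($c, 1, 10)
return element { name($x) } { <x/> }
```

For later pages only the start position changes, for example subsequence($c, 11, 10) for the second page of ten.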
 
Queries don't always work the way I think they should, I'm discovering.
 

Re: Full-Text Search Questions

by ToddG

It appears that the *[@KEY] is the problem.  If I invoke the query:
 
//*[. &= 'information']/ancestor::PGBLK
 
I get a response in 219ms.
 
If instead I invoke the query:
 
//*[. &= 'information']/ancestor::*[@KEY][1]
 
It takes 160320ms.  Lovely.
 
Looks like I can only match one specific QName in the document rather than the first QNAME ancestor with an @KEY attribute.  I don't believe this will fly with my higher-ups, unfortunately.  Does this mean I have to burst the document into fragments?  Shoot, I was so close to having it solved without this extra complexity.

Re: Full-Text Search Questions

by Wolfgang Meier-2

> Looks like I can only match one specific QName in the document rather than
> the first QNAME ancestor with an @KEY attribute.

No, eXist obviously has a problem with ancestor::*[@KEY]. The profiler
shows that 90% of the time is spent in the attribute lookup, not the
ancestor step. I still have to find out why.

Wolfgang

Rest Character Encoding

by John Vogt

It appears that certain characters are being encoded twice in eXist
URIs.  I have a file called "test.xml" within a collection called
"test@test".  The first time I tried to retrieve test.xml through the
REST interface, I percent-encoded the path (UTF-8), but the retrieval
failed.  After playing around a bit I went through the eXist
webapp, browsed through my collections and just clicked on the
test.xml link to see how the webapp created the URI.  To my surprise,
the "@" symbol was encoded as %2540 and not %40.
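
The %2540 form is what results when an already percent-encoded %40 is encoded a second time, because the "%" sign itself becomes %25. A small sketch using the standard fn:encode-for-uri function reproduces the symptom; it only illustrates the double encoding and makes no claim about where in eXist the second encoding step happens.

```xquery
let $name  := 'test@test'
(: First pass: "@" becomes %40. :)
let $once  := encode-for-uri($name)
(: Second pass over the encoded string: "%" becomes %25, giving %2540. :)
let $twice := encode-for-uri($once)
return ($once, $twice)
(: returns ("test%40test", "test%2540test") :)
```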

Has this bug been addressed or is it being addressed?  If not, is this  
something I could fix and resubmit for you?

Thanks,

John Vogt

Re: Full-Text Search Questions

by Wolfgang Meier-2

>  No, eXist obviously has a problem with ancestor::*[@KEY]. The profiler
>  shows that 90% of the time is spent in the attribute lookup, not the
>  ancestor step. I still have to find out why.

I fixed this issue in trunk. The ancestor::*[@KEY] expression should
now be very fast.

Wolfgang

Re: Full-Text Search Questions

by ToddG