
Full-Text Search Questions

by ToddG

I'm asking the eXist forum for advice on best practices for implementing full-text, Google-like searching across multiple monolithic documents. I have the following in my /db/collection.xconf file:
 
<fulltext default="all" attributes="false" alphanum="false"/>
 
If I invoke the query:
 
//*[. &= 'information']
 
I get results that include every ancestor node up the chain. What I want is the closest ancestor of the matching text that has an @KEY attribute.
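
A rough sketch of what "closest ancestor with an @KEY" amounts to, under stated assumptions: the sample data is made up (only the element names come from this post), and the standard contains() function stands in for the &= operator so the snippet runs on any XQuery processor. Whether eXist evaluates &= against text() nodes with the full-text index is an assumption that would need checking.

```xquery
xquery version "1.0";

(: Hypothetical sample data; element names follow the post, content is invented. :)
declare variable $doc :=
    <CHAPTER KEY="c1">
        <TITLE>Intro</TITLE>
        <PARA>General information</PARA>
        <PGBLK KEY="p1">
            <TASK KEY="t1">
                <PARA>Task information</PARA>
                <PARA>More information</PARA>
            </TASK>
        </PGBLK>
    </CHAPTER>;

(: Match text nodes instead of elements, so no ancestor elements are returned,
   then step up to the nearest ancestor carrying a KEY attribute.
   contains() is a stand-in for the full-text operator. :)
for $anchor in $doc//text()[contains(., 'information')]/ancestor::*[@KEY][1]
return string($anchor/@KEY)
```

On this sample the query should return c1 and t1: the CHAPTER appears only because of its own PARA, not because of the text inside the TASK, and the path expression removes the duplicate TASK.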
 
A colleague of mine researching this topic eventually came up with a query that uses a set operation to exclude all nodes except CHAPTER, PGBLK and TASK. This isn't entirely right, because TASK is a child of PGBLK, which is a child of CHAPTER. I only want the CHAPTER if the text node is an immediate child of CHAPTER, not when it's contained in a descendant TASK or PGBLK. I also find this a rather complex way to implement full-text searching. My colleague's opinion is that I should burst the documents into fragments so that the set operation can be eliminated. I don't want to do that just yet, as I hope full-text search in eXist can be made simpler.
 
First Requirement: Generate a list of the closest ancestor nodes of the matched text that possess an @KEY attribute.
Second Requirement: I need to page the results 10 at a time.
Third Requirement: I need a hit count per @KEY anchor node.
Fourth Requirement (optional): I need the result sorted descending by hit count.
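
Taken together, the four requirements could be sketched roughly as follows. This is an untested illustration, not a known-good eXist solution: the sample data is invented, local:hits and this version of local:search are hypothetical helpers, and contains() again stands in for &= so the snippet is processor-independent.

```xquery
xquery version "1.0";

(: Hypothetical sample data; element names follow the post, content is invented. :)
declare variable $doc :=
    <CHAPTER KEY="c1">
        <TITLE>Intro</TITLE>
        <PARA>General information</PARA>
        <PGBLK KEY="p1">
            <TASK KEY="t1">
                <PARA>Task information</PARA>
                <PARA>More information</PARA>
            </TASK>
        </PGBLK>
    </CHAPTER>;

(: Hypothetical helper: all text nodes under $root that match the term. :)
declare function local:hits($root as node(), $s as xs:string) as text()*
{
    $root//text()[contains(., $s)]
};

(: Runs the search once, then derives anchors, counts, order and page from it. :)
declare function local:search($root as node(), $s as xs:string,
                              $start as xs:integer, $size as xs:integer) as element(root)
{
    let $hits := local:hits($root, $s)
    (: Requirement 1: nearest @KEY ancestor of each hit, duplicates removed. :)
    let $anchors := $hits/ancestor::*[@KEY][1]
    let $sorted :=
        for $a in $anchors
        (: Requirement 3: hits whose nearest @KEY ancestor is this anchor. :)
        let $n := count($hits[ancestor::*[@KEY][1] is $a])
        (: Requirement 4: most hits first. :)
        order by $n descending
        return <match tag="{name($a)}" key="{$a/@KEY}" hits="{$n}"/>
    return
        (: Requirement 2: one page of $size results starting at $start. :)
        <root count="{count($anchors)}">{ subsequence($sorted, $start, $size) }</root>
};

local:search($doc, 'information', 1, 10)
```

On the sample data this should list t1 with 2 hits ahead of c1 with 1 hit. The counting step still compares every hit against every anchor, so it is not obviously faster on a large collection; the point is only that the search itself runs once.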
 
My colleague's solution to the paging and hit-count requirements involves sorting the result, taking a subsequence of 10 items, and then counting the number of instances in the result for each of the 10 nodes. It's all a bit crazy and quite slow. There has to be a better way. His complete script is as follows:
 

(:
    First, this runs findRoots(), which finds all the leaf elements that contain a match and then converts them to the set of CHAPTER, PGBLK, and TASK elements that contain matches. This is roughly equivalent to finding the matches in the virtually burst documents.

    findLocalHits() then allows us to find the nodes that match within each root. This is used to provide a hit count as well as a list of matching nodes.

    sort() puts the nodes in order.

    This is fairly complex to create and to understand because of the way full-text search works. Full-text search matches on any node where any node beneath it contains a match. To make this work in a full AMM, we have to filter the ancestor nodes out from the leaf collection, then walk back up the ancestors to find the closest one to the match.

    Because of the complexity of the queries, we run into a performance problem at the point where we get the hit count. We have to find the number of hits and then sort on them. The way this works, that can't be optimized into the initial search query.
:)

 

(: Leaf elements matching $s, mapped to their closest CHAPTER, PGBLK, or TASK ancestor. :)
declare function local:findRoots($s as xs:string)
{
    ((collection('amm')/descendant::element()[. &= $s])
        except (collection('amm')/descendant::element()[. &= $s]/ancestor::element()))
        /ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
};

(: Leaf matches under $root whose closest CHAPTER, PGBLK, or TASK ancestor is $root itself. :)
declare function local:findLocalHits($root, $s as xs:string)
{
    for $i in ($root/descendant::element()[. &= $s])
        except ($root/descendant::element()[. &= $s]/ancestor::element())
    (: 'is' compares node identity; 'eq' would compare atomized string values :)
    where $root is $i/ancestor::*[name(.) = ('CHAPTER', 'PGBLK', 'TASK')][1]
    return $i
};

(: Orders the roots by hit count, highest first; re-runs findLocalHits() for every root. :)
declare function local:sort($roots, $s)
{
    for $i in $roots
    order by count(local:findLocalHits($i, $s)) descending
    return $i
};

(: Returns one page of results starting at position $loc. :)
declare function local:search($s as xs:string, $loc as xs:integer)
{
    <root count="{count(local:findRoots($s))}">
    {
        (: the page size used here is 20 :)
        for $i in subsequence(local:sort(local:findRoots($s), $s), $loc, 20)
        let $hits := local:findLocalHits($i, $s)
        return
            <match tag="{name($i)}" key="{$i/@KEY}" title="{$i/TITLE}" hits="{count($hits)}">
                {
                    for $j in $hits
                    return <hit>{$j}</hit>
                }
            </match>
    }
    </root>
};

local:search("information", 20)

 
 
 
 
 
 
 

_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open
