
Re: Re: [ANN] Searchable Plugin 0.4.1 released

by Ted Dunning-3


On 4/22/08 5:05 PM, "Barzilai Spinak" <barcho@...> wrote:
 
> I think we are using a different definition of term frequency. To me
> it's the number of occurrences of the *term*. However, the termFreqs
> method is returning the number of *documents* (instances of domain
> classes, in Grails) where the term occurs, disregarding the occurrences
> of the term itself.

Your definition is relatively natural, but it goes against common practice in
text retrieval.  Experiments on retrieval performance have generally borne
out the value of the "document count" definition over the "word count"
definition that you suggest.  This probably has much to do with the average
size of the documents under test, combined with the fact that you want to
weight terms based on their prevailing frequency without much contribution
from documents that are particularly related to the term.
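
To make the two definitions concrete, here is a minimal sketch in plain
Python (not Lucene or Compass code).  The toy corpus and the function names
word_count and document_count are made up for illustration only.

```python
from collections import Counter

# Hypothetical toy corpus: each string stands in for one indexed document.
docs = [
    "john met john and john again",
    "mary met john",
    "mary went home",
]

def word_count(term, docs):
    """'Word count' definition: total occurrences across the collection."""
    return sum(Counter(d.split())[term] for d in docs)

def document_count(term, docs):
    """'Document count' definition: documents containing the term at least once."""
    return sum(1 for d in docs if term in d.split())

print(word_count("john", docs))      # 4
print(document_count("john", docs))  # 2
```

The first document mentions "john" three times, which inflates the word
count but adds only one to the document count.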

> Of course, when searching, a Paragraph object/document where the term
> "John" appears three times will rank higher than a Paragraph where it
> appears only once. So, of course, this information is stored somewhere
> in the index.

Only indirectly.  There is a per-term weight vector stored on each document,
but the weights don't depend only on the number of occurrences of that term.
The details vary depending on how you index the document.  Some details are
available in the javadoc for Lucene's Similarity class.
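
As a rough illustration of why the occurrence count is only one input, here
is a simplified sketch in plain Python.  The three formulas are the defaults
as I recall them from Lucene's DefaultSimilarity, so treat them as an
assumption and check the javadoc.  The term_weight function is hypothetical
and leaves out boosts, the coordination factor, query normalization, and the
lossy encoding of the stored norm.

```python
import math

def tf(freq):
    # Assumed default: square root of the in-document occurrence count.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Assumed default: rarer terms (low document count) get more weight.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def length_norm(num_terms):
    # Assumed default: longer fields are penalized.
    return 1.0 / math.sqrt(num_terms)

def term_weight(freq, num_terms, doc_freq, num_docs):
    # Hypothetical simplification of one term's contribution to a score.
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms)

# Three occurrences in a 100-term field vs. one occurrence in a 10-term field.
long_doc = term_weight(freq=3, num_terms=100, doc_freq=5, num_docs=1000)
short_doc = term_weight(freq=1, num_terms=10, doc_freq=5, num_docs=1000)
print(short_doc > long_doc)  # True
```

Under these assumptions the short document with a single occurrence
outweighs the long one with three, because the length normalization more
than offsets the extra occurrences.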

> On a completely unrelated, but more important note:
>    How would you describe the query performance of Compass/Lucene versus
> searching in the relational database using normal GORM/HSQL?

For what it does, Lucene is vastly faster.  If you want to search
semi-structured data, take Lucene.  If you want the best few elements of a
ranked list (ranked according to a Lucene-computable score), choose Lucene.
If you want joins, aggregates, and referential integrity, pick the RDBMS.


