wizard for search in Lucene

wizard for search in Lucene

by Albert Juhe

Hi,

I want to build a wizard that helps users find n-gram terms.
For example:
If I search for "History", then after I type it the system should propose the following searches:
history europe
history spain
history .....
by consulting the indexed terms.

Does this exist in Lucene?

thank you,
Albert

Re: wizard for search in Lucene

by Asbjørn A. Fellinghaug

Albert Juhe:

>
> Hi,
>
> I want to make a wizard that can help to find n-grams terms.
> For example:
> If i want to search History, after write it the system propose you the
> following searches:
> history europe
> history spain
> history .....
> Consulting the terms indexed.
>
> Does it exits in Lucene?

Hi.

I take your question to mean that you want autocompletion in your
search system. In that case, I believe there are some Analyzers in the
'contrib' package which do this. Also, I created an Analyzer which
produces "bigrams" (n-grams of size 2) for my master's thesis.
Feel free to download it from this page:
http://asbjorn.fellinghaug.com/blog/2008/08/the-code-for-my-master-thesis/

Also, have a look at the package org.apache.lucene.analysis.ngram:
http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/ngram/package-summary.html
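
To make the bigram idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene analyzer API; the class name BigramSuggester and its methods are made up for illustration. It counts word pairs in the text it is given and proposes the words that most often follow a given word.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Lucene.
class BigramSuggester {
    // first word -> (following word -> number of times the pair was seen)
    private final Map<String, Map<String, Integer>> followers = new HashMap<>();

    // Splits the text into lower-cased words and counts every adjacent pair.
    // Note: "\\W+" only treats ASCII letters, digits and '_' as word characters.
    void addText(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int i = 0; i + 1 < words.length; i++) {
            if (words[i].isEmpty() || words[i + 1].isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            followers.computeIfAbsent(words[i], k -> new HashMap<>())
                     .merge(words[i + 1], 1, Integer::sum);
        }
    }

    // Returns up to 'limit' following words, most frequent first,
    // ties broken alphabetically.
    List<String> suggest(String word, int limit) {
        Map<String, Integer> counts =
                followers.getOrDefault(word.toLowerCase(), Collections.<String, Integer>emptyMap());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

After feeding it the indexed text with addText, calling suggest("history", 5) would return the words seen most often directly after "history". In a real index the same effect would come from indexing word bigrams as terms, so that the suggestions are a prefix look-up on the term list.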

--
Asbjørn A. Fellinghaug
asbjorn@...

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@...
For additional commands, e-mail: java-user-help@...


Re: wizard for search in Lucene

by Aleksander M. Stensby

From what I understand, you want to enter the word "history" and then
get "related" terms proposed in combination with your input query.
In essence this would be a "look-up" of the top terms in the subset of
documents matching the initial query "history". I am not certain exactly
how you would do this, but I suggest you read up on term frequency and
the tf-idf scheme.

Also: take a look at the org.apache.lucene.search.similar package:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/similar/package-summary.html
and read the motivation email listed in the first segment of
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/similar/MoreLikeThis.html

I couldn't really see how you would autocomplete after the word "history"
without listing a bunch of uninteresting terms as suggestions... but I
might be wrong. Of course, if it was autocompletion you were looking
for, Asbjørn answered that one just fine :)
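
A rough sketch of the top-terms look-up in plain Java, under stated assumptions: documents are already tokenized into lists of lower-cased words, and the class and method names (RelatedTerms, topRelated) are hypothetical, not Lucene API. Each term found in the documents matching the query is scored by the number of matching documents it appears in, multiplied by an idf factor, so terms that occur everywhere sink to the bottom.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class RelatedTerms {
    // Ranks terms that co-occur with 'query', using
    // (matching docs containing the term) * ln(total docs / docs containing the term).
    static List<String> topRelated(List<List<String>> docs, String query, int limit) {
        Map<String, Integer> df = new HashMap<>();        // document frequency, whole collection
        Map<String, Integer> subsetDf = new HashMap<>();  // document frequency, matching docs only
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc);
            boolean matches = unique.contains(query);
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
                if (matches && !term.equals(query)) {
                    subsetDf.merge(term, 1, Integer::sum);
                }
            }
        }
        final Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : subsetDf.entrySet()) {
            double idf = Math.log((double) docs.size() / df.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }
        List<String> ranked = new ArrayList<>(score.keySet());
        // Highest score first; ties broken alphabetically.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

With this weighting, a word such as "of" that appears in every document gets an idf of zero and drops below more specific terms, which is one way to keep uninteresting terms out of the suggestions.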

Best regards,
  Aleksander M. Stensby


On Thu, 09 Oct 2008 18:49:26 +0200, Asbjørn A. Fellinghaug  
<asbjorn@...> wrote:

> Albert Juhe:
>>
>> Hi,
>>
>> I want to make a wizard that can help to find n-grams terms.
>> For example:
>> If i want to search History, after write it the system propose you the
>> following searches:
>> history europe
>> history spain
>> history .....
>> Consulting the terms indexed.
>>
>> Does it exits in Lucene?
>
> Hi.
>
> I interpret your question in such a way that you want autocompletion in
> your search system? In that case, I believe there are some Analyzer's
> which does this in the 'contrib' package. Also, I've created an Analyzer
> which creates "bigrams" (n-gram of size 2) in my master thesis.
> Feel free to download it from this page:
> http://asbjorn.fellinghaug.com/blog/2008/08/the-code-for-my-master-thesis/
>
> Also, have a look at the package org.apache.lucene.analysis.ngram:
> http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/ngram/package-summary.html
>



--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
aleksander.stensby@...



Re: wizard for search in Lucene

by Albert Juhe

Hi,

This is my first version. It isn't fast, because I want to get this information without modifying the index.
I'm now working on improving it (including FreeLing).

import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositions;

// Returns an HTML report of the terms found next to 'terme' in the "contingut" field.
public String docsTerme(IndexReader reader, String terme) {
        String resultat = "";
        TermPositions tP;
        List<infoTerme> alDocs = new ArrayList<infoTerme>();
        long start = new Date().getTime();
        int veinsTrobats = 0; // neighbours found

        // Step 1: find the documents and positions where the term occurs
        try {
            tP = reader.termPositions(new Term("contingut", terme));
            while (tP.next()) {
                infoTerme it = new infoTerme(terme, tP.doc(), tP.freq());
                resultat += it.toString();
                for (int i = 0; i < it.getFrequencia(); i++) {
                    it.add(tP.nextPosition());
                }
                alDocs.add(it); // we store: term, document id, positions
                resultat += "(" + it.toStringPosicions() + ")<br/>";
            }
            tP.close();
        } catch (IOException e) {
            System.out.println("Error finding documents for the term: " + e);
            return null;
        }

        // Step 2: for each of those documents, test every term in it for adjacency
        for (int i = 0; i < alDocs.size(); i++) {
            infoTerme iT = alDocs.get(i); // term, document id and positions
            resultat += "<br/>" + iT.getId_document() + ":<br/>"; // document id
            try {
                // All term vectors of the document; null if none were stored at index time
                TermFreqVector[] tfv = reader.getTermFreqVectors(iT.getId_document());
                if (tfv == null || tfv.length == 0) {
                    continue;
                }
                String[] llistatTermes = tfv[0].getTerms(); // only the first stored vector is examined
                int paraulesAnalitzades = 0; // words analysed so far
                veinsTrobats = 0;
                while (veinsTrobats < iT.getFrequencia() && paraulesAnalitzades < llistatTermes.length) {
                    resultat += "," + llistatTermes[paraulesAnalitzades];
                    // Documents where the candidate term appears
                    TermPositions termP = reader.termPositions(new Term("contingut", llistatTermes[paraulesAnalitzades]));
                    while (termP.next()) {
                        if (termP.doc() == iT.getId_document()) { // same document, so possibly a neighbour
                            boolean veins = false;
                            int ind = 0;
                            while (!veins && ind < termP.freq()) {
                                int posicio = termP.nextPosition();
                                if (iT.sonVeins(posicio)) {
                                    veins = true;
                                    resultat += "<br/>" + veinsTrobats + "/" + iT.getFrequencia() + " They are neighbours (proximity 1): " + iT.getTerme() + " and " + llistatTermes[paraulesAnalitzades] + "(" + posicio + ")<br/>";
                                    veinsTrobats++;
                                } else {
                                    ind++;
                                }
                            }
                        }
                    }
                    termP.close();
                    paraulesAnalitzades++;
                }

            } catch (IOException e) {
                System.out.println("Error: cannot find terms: " + e);
                return null;
            }
        }
        long end = new Date().getTime();
        resultat += "<br/>Time elapsed: " + (end - start) + "ms";
        return resultat;
    }
infoTerme.java
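
The attached class is not reproduced here, so the following is only a guess at what a "proximity 1" check on stored positions could look like. The class TermOccurrences and its methods are hypothetical names, not the contents of infoTerme.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; NOT the attached infoTerme class.
class TermOccurrences {
    // Token positions at which the term occurs within one document.
    private final List<Integer> positions = new ArrayList<>();

    void add(int position) {
        positions.add(position);
    }

    // True if 'other' is directly before or directly after a recorded position.
    boolean isNeighbour(int other) {
        for (int p : positions) {
            if (Math.abs(p - other) == 1) {
                return true;
            }
        }
        return false;
    }

    // True only if 'other' is the position immediately after a recorded one,
    // which is the case that matters for suggestions like "history europe".
    boolean isFollowedBy(int other) {
        for (int p : positions) {
            if (p + 1 == other) {
                return true;
            }
        }
        return false;
    }
}
```

Checking only the following position would restrict the proposals to words that come right after the search term, instead of words on either side of it.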

thank you,
Albert
