The solution to search related problems: operators, quotes, phrase, chinese.

The solution to search related problems: operators, quotes, phrase, chinese.

by zhuhuazha

Hi Acarboni, Ticheler, GN developers,
Have you run into search problems in the web Advanced Search?
 I have.
 
 1. Problem.
 These are the search-related problems I found:
 1) operators: the operators (and, or, not) have no effect.
 2) quotes: quotes also have no effect.
 3) phrase queries: these require quotes, but quotes do not work (see 2).
 4) queries in Asian languages such as Chinese:
  the results are not exact. GN finds metadata that contains each individual character of the query, not the query phrase.
  The effect is like searching for "any more" and having GeoNetwork find "any" and "more".
 
 2. WHY?
 So what is the reason?
 The analyzer is the main cause of these problems.
 In the Java class services.main.Search,
 the query string is sent to the MainUtil.splitWord function to be split into words, like this:
  if (any != null)
   any.setText(MainUtil.splitWord(any.getText()));
 Looking at the splitWord function, it uses StandardAnalyzer:
 public static String splitWord(String requestStr)
 {
  Analyzer a = new StandardAnalyzer();
  .....
 }
 StandardAnalyzer filters out stop words such as "and", "or", "not", "as"...,
 and it also removes the quotes ("), so the string returned by this function contains neither the operators nor the quotes.
 Since the default operator is "and", GN then combines all remaining terms with "and" when it queries Lucene.
 That is how the problems arise.
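 A rough sketch of the effect, for illustration only. It does not call Lucene: naiveAnalyze and its four-word stop list are hypothetical stand-ins that mimic just the two behaviours described above (stop words dropped, quotes dropped).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the analysis step; this is NOT Lucene's StandardAnalyzer.
final class StopWordDemo {

    // Illustrative subset of stop words, taken from the examples in the text.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("and", "or", "not", "as"));

    // Lowercases the query, drops double quotes, splits on whitespace
    // and removes every token that is a stop word.
    static String naiveAnalyze(String query) {
        String cleaned = query.toLowerCase().replace("\"", " ").trim();
        List<String> kept = new ArrayList<>();
        for (String token : cleaned.split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(naiveAnalyze("roads or rivers"));
        System.out.println(naiveAnalyze("\"land use\" not forest"));
    }
}
```

 With this stand-in, roads or rivers comes back as roads rivers, and "land use" not forest comes back as land use forest, so neither the operator nor the phrase survives.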
 
 3. Solution.
 How can this be resolved?
 Simply drop the StandardAnalyzer? No, we still need it to analyze the query text, for example
 the phrase inside the quotes. So we must find the quotes before analysis and send only the phrase between
 the quotes to the analyzer. My solution makes quotes, operators and phrases take effect,
 and it also makes search work for Chinese. Here it is:
 
 if (any != null)
 { 
  any.setText( splitWord(any.getText()) );
 }
 
 Call this new splitWord instead of MainUtil.splitWord; MainUtil.splitWord is still used inside it.
 Below is the splitWord function in Search.java
 
 //code starts here; this code goes in the .service.main.Search.java file
 private static final String OPER_AND = " and ";
 private static final String OPER_OR = " or ";
 private static final String OPER_NOT = " not ";
 
 private String splitWord( String strValue )
 { 
  //basic string processing: convert quotes, trim, collapse whitespace
  String  strQuoteSg = "\'";
  String strQuoteDb = "\"";
  //convert single quotes to double quotes
  strValue = strValue.replaceAll( strQuoteSg, strQuoteDb);
  
  //trim
  strValue = strValue.trim();
  //collapse consecutive whitespace into a single space
  strValue = strValue.replaceAll("\\s\\s+", " ");
  
  //toLowerCase, the search is not case sensitive
  strValue = strValue.toLowerCase();
  
  if( strValue.length()>0 )
  {
   int nFirstIndex = strValue.indexOf(strQuoteDb);   
   
   if( nFirstIndex<0 )
   { 
    //no quotes: add them, splitting the terms at the operators and, or, not
    strValue = replaceComponent( strValue );        
   }   
   return splitString( strValue );
   
  }
  else
   return strValue;
 }
 // " " --> " and "
 private String replaceComponent( String strValue )
 { 
  String strQuoteDb = "\"";
  String strWhitespace = " ";
  
  //add quotes to head and tail
  strValue = strQuoteDb +strValue+ strQuoteDb;
  
  //find the whitespace index
  int nIndex = strValue.indexOf( strWhitespace );  
  if( nIndex<0 )
   return strValue;
  else
  {
   //quote the operators: and, or, not
   strValue = checkKeyword( strValue );
   
   //if no operator is included, just use "and" as the default.
   if( strValue.contains( OPER_AND ) || strValue.contains( OPER_OR )
     || strValue.contains( OPER_NOT ))
   {
    return strValue;
   }
   else
   {
    return strValue.replace( strWhitespace,
      strQuoteDb+ strWhitespace+"and"+strWhitespace+strQuoteDb);
   }
  }
 }
 
 private String checkKeyword(String strValue)
 {
  strValue = checkKeywordComponent( strValue, OPER_AND );
  strValue = checkKeywordComponent( strValue, OPER_OR );
  strValue = checkKeywordComponent( strValue, OPER_NOT );
  return strValue;
 }
  
 //add a quote before and after each occurrence of the keyword
 //the strValue and keyword must be lowercase
 private String checkKeywordComponent(String strValue, String keyword)
 { 
  StringBuffer sb = new StringBuffer();
  sb.append( strValue );
  
  int nIndex = sb.indexOf( keyword );
  int offset = keyword.length();
  
  String strQuoteDb = "\"";
  while( nIndex >=0 )
  {
   //check the quote   
   if( !sb.substring( nIndex-1, nIndex).equals( strQuoteDb ))
   {
    sb.insert( nIndex, strQuoteDb );
    offset++;
   }   
   if( !sb.substring( nIndex+offset, nIndex+offset+1).equals( strQuoteDb ))
   {
    sb.insert( nIndex+offset, strQuoteDb );
   }
   nIndex = sb.indexOf(keyword, nIndex+2 );
   offset = keyword.length();
  }
  
  return sb.toString();
 }
 
 private String splitString(String strValue)
 { 
  //trim leading and trailing whitespace
  strValue = strValue.trim();
  //collapse consecutive whitespace into a single space
  strValue = strValue.replaceAll("\\s\\s+", " ");
  //add quotes around the operators: and, or, not
  strValue = checkKeyword( strValue );
  
  String strQuoteDb = "\"";
    
  StringBuffer sb = new StringBuffer();
  int nStartIndex = 0;
  int nFirstIndex = strValue.indexOf( strQuoteDb );
  
  while( nFirstIndex>=0 )
  { 
   sb.append( strValue.substring( nStartIndex, nFirstIndex+1 ) );
   
   int nSecondQuote = strValue.indexOf(strQuoteDb, nFirstIndex+1 );
   
   nStartIndex = (nFirstIndex<strValue.length()-1)? nFirstIndex+1 : strValue.length();
   if( nSecondQuote<0 )  //no closing quote: treat the rest of the string as the phrase
   {
    String strLast = strValue.substring(nStartIndex, strValue.length() );
    strLast = MainUtil.splitWord( strLast );
    sb.append( strLast );
    sb.append( strQuoteDb );
    nStartIndex = strValue.length(); //everything has been consumed
    break;
   }
   else
   {
    String strLast = strValue.substring( nStartIndex, nSecondQuote );
    strLast = MainUtil.splitWord( strLast );
    sb.append( strLast );
    sb.append( strQuoteDb );
    nStartIndex = nSecondQuote+1;
   }
   //find the next opening quote
   nFirstIndex = strValue.indexOf( strQuoteDb, nStartIndex );
  }
  
  //append whatever follows the last closing quote, keeping its leading whitespace
  if( nStartIndex < strValue.length() )
  {
   sb.append( strValue.substring(nStartIndex));
  }  
  return sb.toString();
 } 
 Please give it a try.
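 For simple unquoted queries, the intended result of the quoting step can be sketched in a few lines. This is a token-based illustration, not the posted code: quoteUnquotedQuery and flush are hypothetical names, edge cases such as a leading operator may behave differently, and the analyzer call (MainUtil.splitWord on each phrase) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the quoting step; not the code from Search.java.
final class QuoteOperatorsSketch {

    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("and", "or", "not"));

    // Groups the terms between operators into quoted phrases.
    // If the query contains no operator at all, every term becomes its own
    // phrase and the phrases are joined with the default operator "and".
    static String quoteUnquotedQuery(String query) {
        String[] tokens = query.trim().toLowerCase().split("\\s+");
        boolean hasOperator = false;
        for (String token : tokens) {
            if (OPERATORS.contains(token)) {
                hasOperator = true;
            }
        }
        List<String> parts = new ArrayList<>();
        List<String> phrase = new ArrayList<>();
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;
            }
            if (OPERATORS.contains(token)) {
                flush(phrase, parts);   // close the phrase before the operator
                parts.add(token);       // operators stay outside the quotes
            } else {
                phrase.add(token);
                if (!hasOperator) {
                    flush(phrase, parts); // one phrase per term
                }
            }
        }
        flush(phrase, parts);
        return String.join(hasOperator ? " " : " and ", parts);
    }

    // Moves the collected terms into parts as one quoted phrase.
    private static void flush(List<String> phrase, List<String> parts) {
        if (!phrase.isEmpty()) {
            parts.add("\"" + String.join(" ", phrase) + "\"");
            phrase.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(quoteUnquotedQuery("land use or forest"));
        System.out.println(quoteUnquotedQuery("roads rivers"));
    }
}
```

 Here land use or forest becomes "land use" or "forest", and roads rivers becomes "roads" and "rivers".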
 
 4. COMMIT?
 Who can commit this to the GN source?
 Or how can I commit it myself?


-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
GeoNetwork-devel mailing list
GeoNetwork-devel@...
https://lists.sourceforge.net/lists/listinfo/geonetwork-devel
GeoNetwork OpenSource is maintained at http://sourceforge.net/projects/geonetwork

Re: The solution to search related problems: operators, quotes, phrase, chinese.

by heikki doeleman

hi Zhuhua,

As for your problems 1 - 3, you're right: even though the documentation states that "or", "not" and "phrase" operators in queries are supported, they are not. If we want these operators to work that way (i.e. within a query from a single search field), we should use a Lucene QueryParser, and we don't.

However some time ago I alternatively implemented "or", "without" and "phrase" queries by adding extra search fields for each of these -- kind of like Google's advanced search page where you also have separate input fields for this. These fields are normally "hidden" (invisible) in the advanced search section; if you set their "display" property to either "inline" or "block", you should be able to use them straight away.

As for your 4th problem, I'm not entirely sure what you mean.

Kind regards
Heikki Doeleman




On Tue, Sep 9, 2008 at 6:19 AM, zhuhua zha <zhuhuazha2004@...> wrote:

Re: The solution to search related problems: operators, quotes, phrase, chinese.

by zhuhuazha

The 4th problem is specific to Chinese.
In English, each word is one term: "hello body" is two words. When we run it
through StandardAnalyzer, as MainUtil.splitWord does, the result is " hello
body ": "hello" is still "hello" and "body" is still "body".
But take the Chinese query "XX YY", where each "X" and "Y" stands for one
character: the result is " X X Y Y ", so "XX" becomes " X X ". What we
actually need is the phrase "XX", so we need a phrase query in Lucene.
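
A small sketch of why the phrase matters, for illustration only. It does not use Lucene: tokens, matchesAllTerms and matchesPhrase are hypothetical helpers that mimic per-character tokenization, an "and" of terms, and a phrase match. The sample query is 地图 ("map"), tested against 图书馆所在地 (both characters present, but far apart) and 世界地图 (contains the phrase).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for per-character tokenization; this is NOT Lucene code.
final class CjkTokenDemo {

    // Keeps Latin words whole and emits every Han ideograph as its own token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (han || Character.isWhitespace(cp)) {
                if (word.length() > 0) {          // finish the pending Latin word
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (han) {
                    out.add(new String(Character.toChars(cp)));
                }
            } else {
                word.appendCodePoint(cp);
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    // True if every query token occurs somewhere in the document ("and" of terms).
    static boolean matchesAllTerms(String doc, String query) {
        return tokens(doc).containsAll(tokens(query));
    }

    // True if the query tokens occur consecutively in the document (phrase match).
    static boolean matchesPhrase(String doc, String query) {
        return Collections.indexOfSubList(tokens(doc), tokens(query)) >= 0;
    }

    public static void main(String[] args) {
        String query = "\u5730\u56FE";                                    // "map", two ideographs
        String worldMap = "\u4E16\u754C\u5730\u56FE";                     // contains the phrase
        String libraryLocation = "\u56FE\u4E66\u9986\u6240\u5728\u5730";  // both characters, far apart
        System.out.println(matchesAllTerms(libraryLocation, query));
        System.out.println(matchesPhrase(libraryLocation, query));
        System.out.println(matchesPhrase(worldMap, query));
    }
}
```

The "and" of single characters accepts the unrelated document, while the phrase match accepts only the one that really contains the two characters side by side.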



Re: The solution to search related problems: operators, quotes, phrase, chinese.

by heikki doeleman

Okay,

If this is the case, I think we had better start using a Lucene QueryParser, which should probably handle multilingual cases better than the MainUtil.splitWord approach we have now, which is somewhat unfortunate anyway.

Do you, or anyone on this list, have any experience using Lucene QueryParsers with non-Western scripts such as Chinese?

Certainly this issue must have come up among users of the very popular Lucene search library... Let's look for a solution that already exists, is my opinion.

Did my answer to your problems 1-3 address those issues, or is everything blocked by your problem 4?

Are there no other implementations of GeoNetwork in Chinese? And if anyone has one, how did you solve this problem?

Kind regards
Heikki Doeleman



On Tue, Sep 9, 2008 at 5:30 PM, zhuhua zha <zhuhuazha2004@...> wrote: