The wildcard-filling problem

View: New views
8 Messages — Rating Filter:   Alert me  

The wildcard-filling problem

by Helio Perroni Filho :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

One of the major improvements in the last ChatterBean
release was a sounder strategy to normalize sentences,
keeping information that is later used to determine
the value of matched pattern wildcards. In some ways
this remains an open problem; so, as I wipe the dust
and get the project running again, it seems a good
idea to bring it under discussion. The
"wildcard-filling problem", as I call it, can be
enunciated like this:

When a sentence is normalized, a great deal of
information is lost: type case, accentuation,
punctuation and other non-alphanumeric characters are
all wiped away. While this makes pattern-matching a
lot easier, it also presents a problem when it comes
to pattern wildcards. We want wildcards to be filled
with fragments from the original, unformatted
sentence, and we want those fragments to be
"equivalent" to the fragments of the normalized
version that were actually matched. So, for example,
if we have a category like this:

<category>
  <pattern>I AM *</pattern>
  <template>Nice to meet you, <star/>.</template>
</category>

We want the bot to behave like this:

User: I am Jean-Marie.
Bot: Nice to meet you, Jean-Marie.

Instead of this:

User: I am Jean-Marie.
Bot: Nice to meet you, JEAN MARIE.

Our problem, then, is how to relate a fragment of a
normalized sentence to its original form. In some way
or another, we need to keep track of the
transformations a sentence goes through, and use this
information to find our way back from the normalized
sentence to its originator.

So far, my solution has been to transform the original
sentence in an object with three attributes:

* A quasi-original sentence, with no extra spaces or
termination punctuation;
* The normalized sentence;
* A list of "mappings", integer indexes relating
portions of the normalized sentence to the equivalents
on its original form.

So, if we have a sentence string:

"D'you know wher   my um brella is?"

And a set of substitutions:

((" D'you ", " Do you "),
 (" wher ", " where "),
 (" um brella ", " umbrella "))

After applying the substitutions and normalizations to
the sentence, I would have an object like this:

(" D'you know wher my um brella is ",
 (1, NULL, 7, 12, 17, 20, 30, 33)
 " DO YOU KNOW WHERE MY UMBRELLA IS ")
 
Now what does that integer list mean, and how did I
get it?

The mapping list is created after the quasi-original
sentence, containing the indexes of the space
characters in the senence string. During
normalization, the list is modified according to three
rules:

* If a substitution removes a space character from the
sentence string, its corresponding index is also
removed from the list;
* If a substitution adds a space character to the
sentence string, the special value NULL is inserted
between the indexes of the space characters that
surrounded the original, unseparated string fragment;
* Otherwise, no change is made to the list.

Note that, in the end, the mapping list will have as
many entries as there are spaces in the normalized
sentence.

Later, when a wildcard is matched by a fragment of a
normalized sentence, the equivalent unformatted
fragment is recovered as follows:

1. Let m be the number of spaces from the beginning of
the normalized sentence to the beginning of the
matched fragment, and n be the number of spaces from
the beginning of the normalized sentence to the end of
the matched fragment;
2. If mappings[m] is non-NULL, then make i =
mappings[m], where i is the beginning of the matched
fragment in the quasi-original sentence. Otherwise,
subtract one from the value of m and check condition 2
again;
3. If mappings[n] is non-NULL, then make j =
mappings[n], where j is the end of the matched
fragment in the quasi-original sentence. Otherwise,
add one to the value of m and check condition 3 again;
4. Return substring(quasi_original, i, j).

Let's say our example original sentence is matched by
a pattern like this:

DO * KNOW WHERE MY * IS

The wildcards match the fragments " YOU " and "
UMBRELLA ".

* Following the algorithm above for the fragment " YOU
", we get:

1. m = 2 and n = 3;
2. mappings[2] = NULL; mappings[2 - 1] = 1;
3. mappings[3] = 7;
4. substring(quasi_original, 1, 7) = " D'you ".

* Following the algorithm above for the fragment "
UMBRELLA ", we get:

1. m = 6 and n = 7;
2. mappings[6] = 20;
3. mappings[7] = 30;
4. substring(quasi_original, 20, 30) = " um brella ".

The main problem with this solution is how to get to
the object from the original sentence string. Keeping
the mapping list in sync with the normalization
process can be a real pain, not to mention
performance-costly; besides, I have found no way to
correctly update the list when a substitution adds
more than one space to the sentence (on the up side, I
could not think on a good example of this, so it must
be a pretty rare occurrence). Finally, the creation
rules are promptly defeated by space groups on the
input string, thus creating the need of a
pre-processing step -- and special care must be taken
not to add extra spaces with each substitution.

Despite these lower-level problems, I sill think that
on a higher abstract level my solution is relatively
simple and ellegant, and that in time I will be able
to iron out any implementation shortcomings. But of
course, if anyone knows of a simpler and/or more
efficient alternative, I would be glad to know.

--
Ja mata ne.
Helio Perroni Filho


       



       
               
_______________________________________________________
Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora!
www.yahoo.com.br/messenger/
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Gary Poster :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Oct 4, 2005, at 7:49 AM, Helio Perroni Filho wrote:
...

> The main problem with this solution is how to get to
> the object from the original sentence string. Keeping
> the mapping list in sync with the normalization
> process can be a real pain, not to mention
> performance-costly; besides, I have found no way to
> correctly update the list when a substitution adds
> more than one space to the sentence (on the up side, I
> could not think on a good example of this, so it must
> be a pretty rare occurrence). Finally, the creation
> rules are promptly defeated by space groups on the
> input string, thus creating the need of a
> pre-processing step -- and special care must be taken
> not to add extra spaces with each substitution.
>
> Despite these lower-level problems, I sill think that
> on a higher abstract level my solution is relatively
> simple and ellegant, and that in time I will be able
> to iron out any implementation shortcomings. But of
> course, if anyone knows of a simpler and/or more
> efficient alternative, I would be glad to know.

I have some unreleased Python code that tackled this problem, among  
many others.  I tackled so many things that I might have been too  
complex, especially for the simplicity that AIML tries to promote.  
In any case, a quick breezy summary of my approach was this.

I tokenized--so I stopped caring about whitespace, which I think you  
are saying you care about also--into objects, without the mapping you  
are talking about.  Then each substitution token also knows about the  
tokens that comprised it.  Then wildcards that didn't use substituted  
tokens got the original tokens.  It seemed to work rather well, at  
least on the basis of my unit tests.

I still haven't gone the last mile to make it all into a full-fledged  
demo well and hook it up, though, a year and a half later.

Gary
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Helio Perroni Filho :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

--- Gary Poster <gary@...> escreveu:

> I have some unreleased Python code that tackled this
> problem, among many others. (...)
>
> I tokenized -- so I stopped caring about whitespace,
> which I think you are saying you care about also --
> into objects, without the mapping you are talking
> about.

That's interesting, but how do you deal with the case
where an element in the original sentence is split in
two -- for example, changing "he's" to "he is"?

--
Ja mata ne.
Helio Perroni Filho



       



       
               
_______________________________________________________
Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora!
www.yahoo.com.br/messenger/
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Gary Poster :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Oct 4, 2005, at 6:30 PM, Helio Perroni Filho wrote:

> --- Gary Poster <gary@...> escreveu:
>
>
>> I have some unreleased Python code that tackled this
>> problem, among many others. (...)
>>
>> I tokenized -- so I stopped caring about whitespace,
>> which I think you are saying you care about also --
>> into objects, without the mapping you are talking
>> about.
>>
>
> That's interesting, but how do you deal with the case
> where an element in the original sentence is split in
> two -- for example, changing "he's" to "he is"?

Heh, looks like my memory was a bit off.  Here's an excerpt from the  
pertinent part of the docs:

----8<----

Sentence
========
a tokenized "sentence" is a tuple subclass of tuples.  Each
composite tuple is called a "token" in the AIMLes docs and is  
comprised of
four elements:

- the tokenized value;
- the tokenized type;
- the source that generated this value, if it is the first token
   generated from the source; and
- the total number of tokens generated from this particular source.

Sentences also have two special attributes: memory and  
nextSentences.  Memory
is a dictionary of predicate memory values that should be applied if  
this
sentence is used.  nextSentences is a read-only (and lazily calculated)
attribute that  returns an iterable of the next sentence  
possibilities after
this one, or None.

----8<----

Thus, given this input:

   >>> source = ("howdy pardner! Whaddya think of this "
   ... "picture <img src='http://example.com/my_teddy_bear.png' />?")


The tests come up with two competing (scored) substitutions (one that  
tries a typed 'image' match,  generated from an XML parse, and one  
that elides it); an expansion of "whaddya" to "what do you"; and a  
compression of 'howdy pardner' to 'hi'.  Note that white space in the  
source value (the third element of each tuple) in fact is not an  
object and does honor *normalized* whitespace--both contrary to what  
I said before.  Also note that case and normalized white space is  
intentionally maintained in the tokenized value (the first element of  
each tuple), although some of the substitutions were not as careful  
as they should have been.  That is normalized only if necessary later  
on.  My general approach was that I wanted to allow accurate matches  
if they were available, at the potential cost of possibly non-trivial  
additional work for the system.

My doctests come out with this output, then:

   >>> pp = pprint.PrettyPrinter(width=65).pprint
   >>> pp([tuple(sentence) for sentence in sentences])
   [((u'hi', None, u'howdy pardner', 1),
     (u'! ', None, u'! ', 1),
     (u'what ', None, u'Whaddya ', 3),
     (u'do ', None, None, 3),
     (u'you', None, None, 3),
     (u'think ', None, u'think ', 1),
     (u'of ', None, u'of ', 1),
     (u'this ', None, u'this ', 1),
     (u'picture ', None, u'picture ', 1),
     (u'http://example.com/my_teddy_bear.png', u'image', u'', 1),
     (u'?', None, u'?', 1)),
    ((u'hi', None, u'howdy pardner', 1),
     (u'! ', None, u'! ', 1),
     (u'what ', None, u'Whaddya ', 3),
     (u'do ', None, None, 3),
     (u'you', None, None, 3),
     (u'think ', None, u'think ', 1),
     (u'of ', None, u'of ', 1),
     (u'this ', None, u'this ', 1),
     (u'picture ', None, u'picture ', 1),
     (u'?', None, u'?', 1))]

I'm not showing other aspects of the sentence, but that's a start.  
There's a *lot* more to it.  I was a tad ambitious.  ;-)  All that's  
left to give it a whirl is to hook up some valid, well-formed AIML to  
it.  The GPL license of Dr. Wallace's files meant I could never  
really use my code for work, so I put it aside for now.  Maybe I'll  
pick it up later.

Gary
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Helio Perroni Filho :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

--- Gary Poster <gary@...> escreveu:

>> That's interesting, but how do you deal with the
>> case where an element in the original sentence is
>> split in two -- for example, changing "he's" to
>> "he is"?
>
> Heh, looks like my memory was a bit off. Here's an
> excerpt from the pertinent part of the docs:

Now I get it... I think. ^_^' I presume a given
sentence token is never going to be "substituted" more
than once, right? Because I was looking for a solution
that allowed the user to line up substitutions, so he
could create a configuration like:

((" waht", " what"),
 (" what's ", " what is "))

An the bot would correctly normalize a sentence

"Waht's the matter?"

into

"WHAT IS THE MATTER"

> The GPL license of Dr. Wallace's files meant I
> could never really use my code for work, so I put it
> aside for now.

That's certainly true if you used GPL'ed code on your
work, but if you simply created a "clean room"
implementation of the AIML standard, I'm not sure the
GPL applies. Actually the AIML working draft makes no
mention to licences at all, so I guess it is an open
standard, which does not rule out closed-source implementations.


       



       
               
_______________________________________________________
Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora!
www.yahoo.com.br/messenger/
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Helio Perroni Filho :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

--- Gary Poster <gary@...> escreveu:

>> That's interesting, but how do you deal with the
>> case where an element in the original sentence is
>> split in two -- for example, changing "he's" to
>> "he is"?
>
> Heh, looks like my memory was a bit off. Here's an
> excerpt from the pertinent part of the docs:

Now I get it... I think. ^_^' I presume a given
sentence token is never going to be "substituted" more
than once, right? Because I was looking for a solution
that allowed the user to line up substitutions, so he
could create a configuration like:

((" waht", " what"),
 (" what's ", " what is "))

An the bot would correctly normalize a sentence

"Waht's the matter?"

into

"WHAT IS THE MATTER"

> The GPL license of Dr. Wallace's files meant I
> could never really use my code for work, so I put it
> aside for now.

That's certainly true if you used GPL'ed code on your
work, but if you simply created a "clean room"
implementation of the AIML standard, I'm not sure the
GPL applies. Actually the AIML working draft makes no
mention to licences at all, so I guess it is an open
standard, which does not rule out closed-source
implementations.

--
Ja mata ne.
Helio Perroni Filho



       


       
               
_______________________________________________________
Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora!
www.yahoo.com.br/messenger/
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Dr. Rich Wallace :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message

There are several proprietary AIML interpreters.  AFAIK know there is no
inconsistency between distributing GPL AIML knowledge bases along with
proprietary AIML interpreters.  Seems like a good business model to me.

> --- Gary Poster <gary@...> escreveu:
>
>>> That's interesting, but how do you deal with the
>>> case where an element in the original sentence is
>>> split in two -- for example, changing "he's" to
>>> "he is"?
>>
>> Heh, looks like my memory was a bit off. Here's an
>> excerpt from the pertinent part of the docs:
>
> Now I get it... I think. ^_^' I presume a given
> sentence token is never going to be "substituted" more
> than once, right? Because I was looking for a solution
> that allowed the user to line up substitutions, so he
> could create a configuration like:
>
> ((" waht", " what"),
>  (" what's ", " what is "))
>
> An the bot would correctly normalize a sentence
>
> "Waht's the matter?"
>
> into
>
> "WHAT IS THE MATTER"
>
>> The GPL license of Dr. Wallace's files meant I
>> could never really use my code for work, so I put it
>> aside for now.
>
> That's certainly true if you used GPL'ed code on your
> work, but if you simply created a "clean room"
> implementation of the AIML standard, I'm not sure the
> GPL applies. Actually the AIML working draft makes no
> mention to licences at all, so I guess it is an open
> standard, which does not rule out closed-source implementations.
>
>
>
>
>
>
>
>
> _______________________________________________________
> Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons
> e muito mais. Instale agora!  www.yahoo.com.br/messenger/
> _______________________________________________
> alicebot-developer mailing list
> alicebot-developer@...
> http://list.alicebot.org/mailman/listinfo/alicebot-developer


--
Dr. Rich
W A L L A C E
ALICE A.I. Foundation
drwallace@...
Winner, Loebner Prize 2000, 2001, 2004


_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer

Re: The wildcard-filling problem

by Gary Poster :: Rate this Message:

Reply to Author | View Threaded | Show Only this Message


On Oct 6, 2005, at 8:18 AM, Helio Perroni Filho wrote:

> --- Gary Poster <gary@...> escreveu:
>
>
>>> That's interesting, but how do you deal with the
>>> case where an element in the original sentence is
>>> split in two -- for example, changing "he's" to
>>> "he is"?
>>>
>>
>> Heh, looks like my memory was a bit off. Here's an
>> excerpt from the pertinent part of the docs:
>>
>
> Now I get it... I think. ^_^' I presume a given
> sentence token is never going to be "substituted" more
> than once, right? Because I was looking for a solution
> that allowed the user to line up substitutions, so he
> could create a configuration like:
>
> ((" waht", " what"),
>  (" what's ", " what is "))
>
> An the bot would correctly normalize a sentence
>
> "Waht's the matter?"
>
> into
>
> "WHAT IS THE MATTER"

Ah, yes, good idea.  No, I didn't contemplate that.

>> The GPL license of Dr. Wallace's files meant I
>> could never really use my code for work, so I put it
>> aside for now.
>>
>
> That's certainly true if you used GPL'ed code on your
> work, but if you simply created a "clean room"
> implementation of the AIML standard, I'm not sure the
> GPL applies. Actually the AIML working draft makes no
> mention to licences at all, so I guess it is an open
> standard, which does not rule out closed-source
> implementations.

True (and thanks also to Dr. Wallace for his similar reply).  Mine  
was definitely a clean room implementation of an interpreter, but the  
main *practical* value of AIML, in my opinion, is Dr. Wallace's very  
large database.  I could conceive of other designs for the basic AIML  
ideas, but this is a very reasonable one, and it has data.  If I  
didn't want to use Dr. Wallace's database, I would only use AIML for  
inspiration, and not care much if I actually could call the project  
an "AIML engine".

So, AIML is a gift from Dr. Wallace that I appreciate, but I can't  
use it at the moment the ways I wanted to.  I'm not sure I could  
honestly claim to have written a "clean room" version of the basic  
normalizing substitutions, since I've looked at the GPL set; I'm also  
not sure if I really want to spend the time duplicating effort in a  
more friendly license.

Gary
_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer