|
View:
New views
8 Messages
—
Rating Filter:
Alert me
|
|
|
The wildcard-filling problemOne of the major improvements in the last ChatterBean
release was a sounder strategy to normalize sentences, keeping information that is later used to determine the value of matched pattern wildcards. In some ways this remains an open problem; so, as I wipe the dust and get the project running again, it seems a good idea to bring it under discussion. The "wildcard-filling problem", as I call it, can be enunciated like this: When a sentence is normalized, a great deal of information is lost: type case, accentuation, punctuation and other non-alphanumeric characters are all wiped away. While this makes pattern-matching a lot easier, it also presents a problem when it comes to pattern wildcards. We want wildcards to be filled with fragments from the original, unformatted sentence, and we want those fragments to be "equivalent" to the fragments of the normalized version that were actually matched. So, for example, if we have a category like this: <category> <pattern>I AM *</pattern> <template>Nice to meet you, <star/>.</template> </category> We want the bot to behave like this: User: I am Jean-Marie. Bot: Nice to meet you, Jean-Marie. Instead of this: User: I am Jean-Marie. Bot: Nice to meet you, JEAN MARIE. Our problem, then, is how to relate a fragment of a normalized sentence to its original form. In some way or another, we need to keep track of the transformations a sentence goes through, and use this information to find our way back from the normalized sentence to its originator. So far, my solution has been to transform the original sentence in an object with three attributes: * A quasi-original sentence, with no extra spaces or termination punctuation; * The normalized sentence; * A list of "mappings", integer indexes relating portions of the normalized sentence to the equivalents on its original form. So, if we have a sentence string: "D'you know wher my um brella is?" And a set of substitutions: ((" D'you ", " Do you "), (" wher ", " where "), (" um brella ", " umbrella ")) After applying the substitutions and normalizations to the sentence, I would have an object like this: (" D'you know wher my um brella is ", (1, NULL, 7, 12, 17, 20, 30, 33) " DO YOU KNOW WHERE MY UMBRELLA IS ") Now what does that integer list mean, and how did I get it? The mapping list is created after the quasi-original sentence, containing the indexes of the space characters in the senence string. During normalization, the list is modified according to three rules: * If a substitution removes a space character from the sentence string, its corresponding index is also removed from the list; * If a substitution adds a space character to the sentence string, the special value NULL is inserted between the indexes of the space characters that surrounded the original, unseparated string fragment; * Otherwise, no change is made to the list. Note that, in the end, the mapping list will have as many entries as there are spaces in the normalized sentence. Later, when a wildcard is matched by a fragment of a normalized sentence, the equivalent unformatted fragment is recovered as follows: 1. Let m be the number of spaces from the beginning of the normalized sentence to the beginning of the matched fragment, and n be the number of spaces from the beginning of the normalized sentence to the end of the matched fragment; 2. If mappings[m] is non-NULL, then make i = mappings[m], where i is the beginning of the matched fragment in the quasi-original sentence. Otherwise, subtract one from the value of m and check condition 2 again; 3. If mappings[n] is non-NULL, then make j = mappings[n], where j is the end of the matched fragment in the quasi-original sentence. Otherwise, add one to the value of m and check condition 3 again; 4. Return substring(quasi_original, i, j). Let's say our example original sentence is matched by a pattern like this: DO * KNOW WHERE MY * IS The wildcards match the fragments " YOU " and " UMBRELLA ". * Following the algorithm above for the fragment " YOU ", we get: 1. m = 2 and n = 3; 2. mappings[2] = NULL; mappings[2 - 1] = 1; 3. mappings[3] = 7; 4. substring(quasi_original, 1, 7) = " D'you ". * Following the algorithm above for the fragment " UMBRELLA ", we get: 1. m = 6 and n = 7; 2. mappings[6] = 20; 3. mappings[7] = 30; 4. substring(quasi_original, 20, 30) = " um brella ". The main problem with this solution is how to get to the object from the original sentence string. Keeping the mapping list in sync with the normalization process can be a real pain, not to mention performance-costly; besides, I have found no way to correctly update the list when a substitution adds more than one space to the sentence (on the up side, I could not think on a good example of this, so it must be a pretty rare occurrence). Finally, the creation rules are promptly defeated by space groups on the input string, thus creating the need of a pre-processing step -- and special care must be taken not to add extra spaces with each substitution. Despite these lower-level problems, I sill think that on a higher abstract level my solution is relatively simple and ellegant, and that in time I will be able to iron out any implementation shortcomings. But of course, if anyone knows of a simpler and/or more efficient alternative, I would be glad to know. -- Ja mata ne. Helio Perroni Filho _______________________________________________________ Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora! www.yahoo.com.br/messenger/ _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problemOn Oct 4, 2005, at 7:49 AM, Helio Perroni Filho wrote: ... > The main problem with this solution is how to get to > the object from the original sentence string. Keeping > the mapping list in sync with the normalization > process can be a real pain, not to mention > performance-costly; besides, I have found no way to > correctly update the list when a substitution adds > more than one space to the sentence (on the up side, I > could not think on a good example of this, so it must > be a pretty rare occurrence). Finally, the creation > rules are promptly defeated by space groups on the > input string, thus creating the need of a > pre-processing step -- and special care must be taken > not to add extra spaces with each substitution. > > Despite these lower-level problems, I sill think that > on a higher abstract level my solution is relatively > simple and ellegant, and that in time I will be able > to iron out any implementation shortcomings. But of > course, if anyone knows of a simpler and/or more > efficient alternative, I would be glad to know. I have some unreleased Python code that tackled this problem, among many others. I tackled so many things that I might have been too complex, especially for the simplicity that AIML tries to promote. In any case, a quick breezy summary of my approach was this. I tokenized--so I stopped caring about whitespace, which I think you are saying you care about also--into objects, without the mapping you are talking about. Then each substitution token also knows about the tokens that comprised it. Then wildcards that didn't use substituted tokens got the original tokens. It seemed to work rather well, at least on the basis of my unit tests. I still haven't gone the last mile to make it all into a full-fledged demo well and hook it up, though, a year and a half later. Gary _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problem--- Gary Poster <gary@...> escreveu:
> I have some unreleased Python code that tackled this > problem, among many others. (...) > > I tokenized -- so I stopped caring about whitespace, > which I think you are saying you care about also -- > into objects, without the mapping you are talking > about. That's interesting, but how do you deal with the case where an element in the original sentence is split in two -- for example, changing "he's" to "he is"? -- Ja mata ne. Helio Perroni Filho _______________________________________________________ Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora! www.yahoo.com.br/messenger/ _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problemOn Oct 4, 2005, at 6:30 PM, Helio Perroni Filho wrote: > --- Gary Poster <gary@...> escreveu: > > >> I have some unreleased Python code that tackled this >> problem, among many others. (...) >> >> I tokenized -- so I stopped caring about whitespace, >> which I think you are saying you care about also -- >> into objects, without the mapping you are talking >> about. >> > > That's interesting, but how do you deal with the case > where an element in the original sentence is split in > two -- for example, changing "he's" to "he is"? Heh, looks like my memory was a bit off. Here's an excerpt from the pertinent part of the docs: ----8<---- Sentence ======== a tokenized "sentence" is a tuple subclass of tuples. Each composite tuple is called a "token" in the AIMLes docs and is comprised of four elements: - the tokenized value; - the tokenized type; - the source that generated this value, if it is the first token generated from the source; and - the total number of tokens generated from this particular source. Sentences also have two special attributes: memory and nextSentences. Memory is a dictionary of predicate memory values that should be applied if this sentence is used. nextSentences is a read-only (and lazily calculated) attribute that returns an iterable of the next sentence possibilities after this one, or None. ----8<---- Thus, given this input: >>> source = ("howdy pardner! Whaddya think of this " ... "picture <img src='http://example.com/my_teddy_bear.png' />?") The tests come up with two competing (scored) substitutions (one that tries a typed 'image' match, generated from an XML parse, and one that elides it); an expansion of "whaddya" to "what do you"; and a compression of 'howdy pardner' to 'hi'. Note that white space in the source value (the third element of each tuple) in fact is not an object and does honor *normalized* whitespace--both contrary to what I said before. Also note that case and normalized white space is intentionally maintained in the tokenized value (the first element of each tuple), although some of the substitutions were not as careful as they should have been. That is normalized only if necessary later on. My general approach was that I wanted to allow accurate matches if they were available, at the potential cost of possibly non-trivial additional work for the system. My doctests come out with this output, then: >>> pp = pprint.PrettyPrinter(width=65).pprint >>> pp([tuple(sentence) for sentence in sentences]) [((u'hi', None, u'howdy pardner', 1), (u'! ', None, u'! ', 1), (u'what ', None, u'Whaddya ', 3), (u'do ', None, None, 3), (u'you', None, None, 3), (u'think ', None, u'think ', 1), (u'of ', None, u'of ', 1), (u'this ', None, u'this ', 1), (u'picture ', None, u'picture ', 1), (u'http://example.com/my_teddy_bear.png', u'image', u'', 1), (u'?', None, u'?', 1)), ((u'hi', None, u'howdy pardner', 1), (u'! ', None, u'! ', 1), (u'what ', None, u'Whaddya ', 3), (u'do ', None, None, 3), (u'you', None, None, 3), (u'think ', None, u'think ', 1), (u'of ', None, u'of ', 1), (u'this ', None, u'this ', 1), (u'picture ', None, u'picture ', 1), (u'?', None, u'?', 1))] I'm not showing other aspects of the sentence, but that's a start. There's a *lot* more to it. I was a tad ambitious. ;-) All that's left to give it a whirl is to hook up some valid, well-formed AIML to it. The GPL license of Dr. Wallace's files meant I could never really use my code for work, so I put it aside for now. Maybe I'll pick it up later. Gary _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problem--- Gary Poster <gary@...> escreveu:
>> That's interesting, but how do you deal with the >> case where an element in the original sentence is >> split in two -- for example, changing "he's" to >> "he is"? > > Heh, looks like my memory was a bit off. Here's an > excerpt from the pertinent part of the docs: Now I get it... I think. ^_^' I presume a given sentence token is never going to be "substituted" more than once, right? Because I was looking for a solution that allowed the user to line up substitutions, so he could create a configuration like: ((" waht", " what"), (" what's ", " what is ")) An the bot would correctly normalize a sentence "Waht's the matter?" into "WHAT IS THE MATTER" > The GPL license of Dr. Wallace's files meant I > could never really use my code for work, so I put it > aside for now. That's certainly true if you used GPL'ed code on your work, but if you simply created a "clean room" implementation of the AIML standard, I'm not sure the GPL applies. Actually the AIML working draft makes no mention to licences at all, so I guess it is an open standard, which does not rule out closed-source implementations. _______________________________________________________ Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora! www.yahoo.com.br/messenger/ _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problem--- Gary Poster <gary@...> escreveu:
>> That's interesting, but how do you deal with the >> case where an element in the original sentence is >> split in two -- for example, changing "he's" to >> "he is"? > > Heh, looks like my memory was a bit off. Here's an > excerpt from the pertinent part of the docs: Now I get it... I think. ^_^' I presume a given sentence token is never going to be "substituted" more than once, right? Because I was looking for a solution that allowed the user to line up substitutions, so he could create a configuration like: ((" waht", " what"), (" what's ", " what is ")) An the bot would correctly normalize a sentence "Waht's the matter?" into "WHAT IS THE MATTER" > The GPL license of Dr. Wallace's files meant I > could never really use my code for work, so I put it > aside for now. That's certainly true if you used GPL'ed code on your work, but if you simply created a "clean room" implementation of the AIML standard, I'm not sure the GPL applies. Actually the AIML working draft makes no mention to licences at all, so I guess it is an open standard, which does not rule out closed-source implementations. -- Ja mata ne. Helio Perroni Filho _______________________________________________________ Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons e muito mais. Instale agora! www.yahoo.com.br/messenger/ _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problemThere are several proprietary AIML interpreters. AFAIK know there is no
inconsistency between distributing GPL AIML knowledge bases along with proprietary AIML interpreters. Seems like a good business model to me. > --- Gary Poster <gary@...> escreveu: > >>> That's interesting, but how do you deal with the >>> case where an element in the original sentence is >>> split in two -- for example, changing "he's" to >>> "he is"? >> >> Heh, looks like my memory was a bit off. Here's an >> excerpt from the pertinent part of the docs: > > Now I get it... I think. ^_^' I presume a given > sentence token is never going to be "substituted" more > than once, right? Because I was looking for a solution > that allowed the user to line up substitutions, so he > could create a configuration like: > > ((" waht", " what"), > (" what's ", " what is ")) > > An the bot would correctly normalize a sentence > > "Waht's the matter?" > > into > > "WHAT IS THE MATTER" > >> The GPL license of Dr. Wallace's files meant I >> could never really use my code for work, so I put it >> aside for now. > > That's certainly true if you used GPL'ed code on your > work, but if you simply created a "clean room" > implementation of the AIML standard, I'm not sure the > GPL applies. Actually the AIML working draft makes no > mention to licences at all, so I guess it is an open > standard, which does not rule out closed-source implementations. > > > > > > > > > _______________________________________________________ > Novo Yahoo! Messenger com voz: ligações, Yahoo! Avatars, novos emoticons > e muito mais. Instale agora! www.yahoo.com.br/messenger/ > _______________________________________________ > alicebot-developer mailing list > alicebot-developer@... > http://list.alicebot.org/mailman/listinfo/alicebot-developer -- Dr. Rich W A L L A C E ALICE A.I. Foundation drwallace@... Winner, Loebner Prize 2000, 2001, 2004 _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
|
|
Re: The wildcard-filling problemOn Oct 6, 2005, at 8:18 AM, Helio Perroni Filho wrote: > --- Gary Poster <gary@...> escreveu: > > >>> That's interesting, but how do you deal with the >>> case where an element in the original sentence is >>> split in two -- for example, changing "he's" to >>> "he is"? >>> >> >> Heh, looks like my memory was a bit off. Here's an >> excerpt from the pertinent part of the docs: >> > > Now I get it... I think. ^_^' I presume a given > sentence token is never going to be "substituted" more > than once, right? Because I was looking for a solution > that allowed the user to line up substitutions, so he > could create a configuration like: > > ((" waht", " what"), > (" what's ", " what is ")) > > An the bot would correctly normalize a sentence > > "Waht's the matter?" > > into > > "WHAT IS THE MATTER" Ah, yes, good idea. No, I didn't contemplate that. >> The GPL license of Dr. Wallace's files meant I >> could never really use my code for work, so I put it >> aside for now. >> > > That's certainly true if you used GPL'ed code on your > work, but if you simply created a "clean room" > implementation of the AIML standard, I'm not sure the > GPL applies. Actually the AIML working draft makes no > mention to licences at all, so I guess it is an open > standard, which does not rule out closed-source > implementations. True (and thanks also to Dr. Wallace for his similar reply). Mine was definitely a clean room implementation of an interpreter, but the main *practical* value of AIML, in my opinion, is Dr. Wallace's very large database. I could conceive of other designs for the basic AIML ideas, but this is a very reasonable one, and it has data. If I didn't want to use Dr. Wallace's database, I would only use AIML for inspiration, and not care much if I actually could call the project an "AIML engine". So, AIML is a gift from Dr. Wallace that I appreciate, but I can't use it at the moment the ways I wanted to. I'm not sure I could honestly claim to have written a "clean room" version of the basic normalizing substitutions, since I've looked at the GPL set; I'm also not sure if I really want to spend the time duplicating effort in a more friendly license. Gary _______________________________________________ alicebot-developer mailing list alicebot-developer@... http://list.alicebot.org/mailman/listinfo/alicebot-developer |
| Free Forum Powered by Nabble | Forum Help |