Check out OpenNLP. I use its sentence splitter in my bot, and it is the
best I have seen. You can find more information at
http://opennlp.sourceforge.net/. If you want to start with HTML, then
use an XSLT transform, something like:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="text()">
<xsl:value-of select="."/>
</xsl:template>
</xsl:stylesheet>
(Sorry, I just freehanded the XSLT and didn't test it, but it should
be close.) XSLT is the way to go to strip the HTML out, as long as the
page is well-formed XML (XHTML, for example). I recommend using
OpenNLP to get the sentences after that.
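
If it helps, here is a rough sketch of how a text-only stylesheet like this
could be run from Java using the javax.xml.transform classes that ship with
the JDK. The class name StripMarkup, the method stripMarkup, and the sample
page are made up for illustration, and it assumes the input is already
well-formed XML (XHTML); ordinary tag-soup HTML would have to be tidied first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of OpenNLP or any other library.
public class StripMarkup {

    // Text-only stylesheet: copy every text node, drop all markup.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\"><xsl:apply-templates/></xsl:template>"
      + "<xsl:template match=\"text()\"><xsl:value-of select=\".\"/></xsl:template>"
      + "</xsl:stylesheet>";

    // Assumes xhtml is well-formed XML; a parse error is thrown otherwise.
    static String stripMarkup(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input.
        String page =
            "<html><body><p>Hello there. <b>How</b> are you?</p></body></html>";
        System.out.println(stripMarkup(page));
    }
}
```

The string that comes back is plain text, which is what you would then hand
to the sentence splitter.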