HTML-stripping program

by Leonard H. Chalk

Check out OpenNLP. I use its sentence splitter in my bot, and it is the
best I have seen.  You can find more information at
http://opennlp.sourceforge.net/.  If you want to start from HTML, then
first strip the markup with an XSLT transform, something like:
<?xml version="1.0" encoding="UTF-8" ?>

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()">
  <xsl:value-of select="."/>
</xsl:template>

</xsl:stylesheet>

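(Sorry, I wrote the XSLT freehand and didn't test it, but it should be
close.)  XSLT is the way to go for stripping out HTML.  Note that an XSLT
processor needs well-formed input, so this works on XHTML, or on HTML that
has been tidied into XML first.  I recommend using OpenNLP to split the
resulting text into sentences after that.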

_______________________________________________
alicebot-developer mailing list
alicebot-developer@...
http://list.alicebot.org/mailman/listinfo/alicebot-developer
