Web scraping with eXist: http vs. httpclient module?

by Joe Wicentowski

Hi all,

I have a question about the 'http' and 'httpclient' modules.

I have been looking for a way to scrape (non-X)HTML with XQuery, and I
found a useful article at
http://xquery.typepad.com/xquery/2006/10/the_best_scrape.html
explaining how to approach this problem.  Adam Retter added an
eXist-specific comment:

> This can also be done in XQuery using the eXist Open Source Native XML Database.
>
> Instead of xdmp:http-get() you can use html:doc() and you do not need xdmp:tidy() as html:doc() will tidy the HTML into a suitable XML form automagically.

I looked at the eXist extension module documentation at
http://exist-db.org/extensions.html#module_http and went to conf.xml
to enable this module (my first time enabling a module).  What I found
instead of the 'http' module was an 'httpclient' module.  Did the
module's name change?  Are these two modules actually the same?

Suspecting that these might be the same module, I enabled
'httpclient', but when I looked at its function documentation, I
didn't see an httpclient:doc() function.

Is there still a function that provides the combined, automagical
scraping/tidying abilities that Adam described?

Thanks in advance,
Joe

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
Exist-open mailing list
Exist-open@...
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Web scraping with eXist: http vs. httpclient module?

by Adam Retter-3

Hi Joe,

Yes, this is possible. Sorry, that post is a little out of date: the
httpclient module grew out of the html module. It provides all the
functionality of the original module and much more :-)

The function you now want is httpclient:get(). Part of the function
description gives it away: "HTML body content will be tidied into an
XML compatible form" ;-)
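
A minimal sketch of such a call, assuming the httpclient module is
enabled in conf.xml. The URL is a placeholder, and the three arguments
(URL, a boolean flag, and an empty sequence in place of request
headers) simply mirror the call shown elsewhere in this thread; check
the function documentation of your eXist version for the exact
signature.

```xquery
xquery version "1.0";

(: Assumes the httpclient extension module is enabled in conf.xml. :)
declare namespace httpclient = "http://exist-db.org/xquery/httpclient";

(: Placeholder URL; substitute the page you want to scrape. :)
let $uri := xs:anyURI("http://www.example.com/")

(: Arguments mirror the call used elsewhere in this thread: the URL,
   a boolean flag, and () in place of optional request headers. :)
let $response := httpclient:get($uri, false(), ())

(: Match on local-name() so the query works whether or not the tidied
   HTML ends up in a namespace. :)
for $link in $response/httpclient:body//*[local-name() = "a"][@href]
return string($link/@href)
```

The descendant step (//) is deliberately loose; a page-specific path
is faster but breaks as soon as the page layout changes.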

Cheers Adam.

2008/6/26 Joe Wicentowski <joewiz@...>:

> Is there still a function that provides the combined, automagical
> scraping/tidying abilities that Adam described?

--
Adam Retter

eXist Developer
{ England }
adam@...
irc://irc.freenode.net/existdb

Re: Web scraping with eXist: http vs. httpclient module?

by Joe Wicentowski

Hi Adam,

Great, now I understand.  I now have httpclient:get() scraping pages.  Many thanks.

I knew that the post was dated 2006 and that the module might've
changed!  However, perhaps the documentation on
http://exist-db.org/extensions.html#module_http should be updated so
that the class and namespace are 'httpclient' instead of 'http'?  (I'm
not a patch-submitting-level programmer, but I could certainly help
with documentation when I can...  would that be helpful?  Please let
me know how I can help.)

I noticed also that when I use the exist-db.org sandbox and invoke
httpclient, the sandbox returns an error, even though the httpclient
module appears in the function library documentation (and thus must be
enabled in conf.xml).  For example, the following XQuery works on my
local installation:

declare namespace httpclient="http://exist-db.org/xquery/httpclient";
let $uri := xs:anyURI("http://www.state.gov/r/pa/ei/bgn/index.htm")
let $tidied := httpclient:get($uri, false(), ())
for $countries in
    $tidied/httpclient:body/html/body/table[1]/tbody[1]/tr[2]/td[3]/p[4]/a
return $countries

But on exist-db.org's sandbox, this error appears:
  org.exist.xquery.XPathException: Call to undeclared function: httpclient:get [at line 3, column 16]

I know the sandbox isn't intended to have full functionality (as has
been discussed here recently), but I thought I'd mention this just in
case.

Again, thanks for your help!
- Joe


On Thu, Jun 26, 2008 at 6:01 AM, Adam Retter <adam@...> wrote:

> The function you now want is: httpclient:get()
> part of the function description gives it away - "HTML body content
> will be tidied into an XML compatible form" ;-)

Re: Web scraping with eXist: http vs. httpclient module?

by Adam Retter-3

It's actually the page on the eXist website that is out of date; I'm not
sure why it wasn't updated when the last release went out...

Help with documentation is always welcome. Personally, I would love to
see a cookbook: a set of pages about achieving different tasks
with eXist, for example screen-scraping some HTML ;-) However, there
may be more pressing documentation needs that perhaps Wolfgang or one
of the other devs could suggest, if you're up for it?

Cheers Adam.

2008/6/27 Joe Wicentowski <joewiz@...>:

> I knew that the post was dated 2006 and that the module might've
> changed!  However, perhaps the documentation on
> http://exist-db.org/extensions.html#module_http should be updated so
> that the class and namespace are 'httpclient' instead of 'http'?  (I'm
> not a patch-submitting-level programmer, but I could certainly help
> with documentation when I can...  would that be helpful?  Please let
> me know how I can help.)

Re: Web scraping with eXist: http vs. httpclient module?

by Joe Wicentowski

Sure, I'm up for helping out with documentation.  Wolfgang - do you
have any needs?

Also, I've thought of at some point documenting my TEI-based website,
something like: building a TEI-based site with eXist.  James Cummings
wrote a nice paper for a conference in Kyoto, but I think the
discussion of Apache got in the way a bit of the elegance of a pure
eXist/XQuery/XSLT approach.

- Joe

On Mon, Jun 30, 2008 at 6:51 AM, Adam Retter <adam@...> wrote:

> Help with documentation is always welcome. Personally, I would love to
> see a cookbook: a set of pages about achieving different tasks
> with eXist, for example screen-scraping some HTML ;-) However, there
> may be more pressing documentation needs that perhaps Wolfgang or one
> of the other devs could suggest, if you're up for it?

Re: Web scraping with eXist: http vs. httpclient module?

by Chris Wallace

There is an example of using httpclient for page scraping in the
XQuery Wikibook, amongst other examples which use doc() on well-formed
XML pages like those from Wikipedia.

I've been working on this book for a year or so now, and would welcome
critique, suggestions for examples and contributions of examples.
Scripts in the book are (mainly) executable on a server at my
university. The book lacks document-centric examples, such as TEI, and
also needs some attention, which I hope to give it this summer. It
would be great to have extended articles on design and development
such as the one you suggest.

Chris

Joe Wicentowski wrote:
> Also, I've thought of at some point documenting my TEI-based website,
> something like: building a TEI-based site with eXist.

Re: Web scraping with eXist: http vs. httpclient module?

by Wolfgang Meier-2

> Sure, I'm up for helping out with documentation.  Wolfgang - do you
> have any needs?

Indeed, what eXist lacks most right now are some introductory articles,
which demonstrate from start to end how you develop an eXist-based
website. The existing documentation just describes all the various
interfaces, function modules, indexes and so on, but it does not really
explain how to put everything together.

The XQuery book already fills this gap by providing many well-explained
examples. It should be prominently linked from the eXist main site.
However, there should at least be one start-to-end tutorial which can be
shipped with eXist and which uses one example to explain the very basics
like: how to deploy an XQuery, how to import and use modules, how to
create the necessary indexes and load data, how to generate your HTML
from XQuery, and so on. But most important, it should help people to
understand the big picture, i.e. the "pure eXist/XQuery/XSLT approach".
Certainly, the article could be mirrored on the wiki book or vice versa.

> Also, I've thought of at some point documenting my TEI-based website,
> something like: building a TEI-based site with eXist.  James Cummings
> wrote a nice paper for a conference in Kyoto, but I think the
> discussion of Apache got in the way a bit of the elegance of a pure
> eXist/XQuery/XSLT approach.

Yes, TEI-based data could be a good starting point. We would also need
some data-centric queries to explain index configurations, but I think
this could be done with TEI as well.

Wolfgang

Re: Web scraping with eXist: http vs. httpclient module?

by alex-448

Wolfgang wrote:
> Indeed, what eXist lacks most right now are some introductory articles,
> which demonstrate from start to end how you develop an eXist-based
> website. The existing documentation just describes all the various
> interfaces, function modules, indexes and so on, but it does not really
> explain how to put everything together.

I'd like to suggest the Sandbox for this purpose. In that it loads modules,
executes queries against the db, and transforms the results, it encapsulates
the basics of what app-builders would want to do with eXist.

So never mind the buggy content completion, "slots", animations, resizings
and so on: strip all that out to make it as simple and understandable as
possible; make it as bulletproof as you can so that A) it is indeed a "good
development/debug interface" and B) XQuery newbies can be confident that any
anomalous results are their own fault, not the sandbox's; and comment the hell
out of it.

That would make a good first step IMHO, without the work of creating something
entirely new.

--alex.


Re: Web scraping with eXist: http vs. httpclient module?

by Wolfgang Meier-2

> I'd like to suggest the Sandbox for this purpose. In that it loads modules,
> executes queries against the db, and transforms the results, it encapsulates
> the basics of what app-builders would want to do with eXist.

The sandbox might serve as an example for AJAX, but not so much for
XQuery. Most of the code is really just JavaScript, while the XQuery
part is relatively simple.

Anyway, the main problem with the sandbox is that it is itself
implemented in XQuery. It thus executes XQuery code within XQuery. This
has side effects in some cases, e.g. if the XQuery entered by the user
itself imports modules or modifies the HTTP response.

To make the sandbox more reliable, we should probably stop using
util:eval here and instead post the query to some servlet which runs
outside of the current context.
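
To illustrate what "XQuery within XQuery" means here, a minimal sketch,
assuming eXist's util module is available. The query string stands in
for whatever a user types into the sandbox; it is an illustration, not
the sandbox's actual code.

```xquery
xquery version "1.0";

(: Assumes eXist's util extension module is available. :)
declare namespace util = "http://exist-db.org/xquery/util";

(: Stand-in for a query typed in by a user; hypothetical example. :)
let $user-query := "for $i in 1 to 3 return $i * 2"

(: The string is compiled and run from inside this outer query, so it
   runs in the context of the query that calls util:eval. :)
return util:eval($user-query)
```

For this input the inner query should yield the sequence 2, 4, 6.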

Wolfgang

Re: Web scraping with eXist: http vs. httpclient module?

by Joe Wicentowski

The sandbox was instrumental for me in understanding the power of
XQuery and native XML databases.  Similarly, the ability to execute
XQueries in the XQuery wikibook (kudos to you, Chris, for your
efforts, and anyone else who has contributed there!) is really
helpful.  They're great resources for explaining and illustrating, and
I'll certainly make use of them in the tutorial I hope to write.  As
Wolfgang suggests, I'll concentrate on introducing the big picture
about the "pure XQuery-eXist" approach as a platform for web
development.

In terms of structuring the tutorial, I'm thinking of a brief
introduction which lures the reader into the power of XQuery with some
of the cool things you can do (as shown by the sandbox and the
excellent examples in the wikibook), and a body which will give the
reader what Wolfgang calls "the very basics": the
module/function/xquery model of an .xq file, how to generate your HTML
from XQuery, and how to get your data into eXist.  (If the examples
don't require custom index configuration, I may leave that to an
advanced article.)  Given other things I'm working on I won't be able
to work on a draft right away or complete it particularly quickly, but
I would like to shoot out some questions to the list that would be
really helpful as I prepare the tutorial:

1. Since eXist is Java-based and runs either on its own with Jetty or
in servlet mode with Tomcat, are there any webhosting solutions that
we can recommend?  Wordpress, for example, provides a list of
recommended hosts (see http://wordpress.org/hosting/ ).

2. Can anyone suggest a resource for understanding how to do the
following from scratch: get and install Apache and Tomcat, install
eXist as a servlet, and configure Apache/eXist/Tomcat so that instead
of a URL such as: http://localhost:8080/exist/rest/db/mysite/index.xq
you can simply go to http://localhost/ ?  This might be a part of a
more advanced tutorial, but I think it's something that anyone who
wants to build a simple site would be quite likely to want to do.

3. Building on question 2, can anyone suggest a resource for
configuring support for the kind of URI so in style these days --
Rails-style SEO-friendly URIs such as
site.com/blog/20080212/Hipster_PDA.  eXist now seems to deal best with
URIs in the form site.com/blog.xq?date=20080212&article=Hipster_PDA.
As you can tell, I'm an Apache and Tomcat novice; any suggestions for
setting them up to pass eXist the URI parameters it needs would be
much appreciated.

4. One topic I'll cover is templating commonly used snippets of code
such as headers and footers.  Currently I use XIncludes for headers
and footers (i.e. <xi:include href="/db/mysite/includes/header.xq" />
).  Another approach is to use functions and imported modules for
this.  Are there any disadvantages to the XInclude approach, other
than the fact that the log displays the xincluded content each time
it's processed?  I'd like to be able to show how XQuery can let you do
pretty radical modularization and separation of code and content.
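
For comparison with the XInclude approach in question 4, the
function-based alternative could be sketched roughly as follows. All
function names and markup here are hypothetical; in a real site the
helpers would more likely live in a library module that each page
imports, rather than as local functions in one file.

```xquery
xquery version "1.0";

(: Hypothetical helpers: names and markup are illustrative only. :)
declare function local:header($title as xs:string) as element(div) {
    <div id="header"><h1>{ $title }</h1></div>
};

declare function local:footer() as element(div) {
    <div id="footer"><p>Footer text goes here</p></div>
};

(: Wraps arbitrary page content in the shared header and footer. :)
declare function local:page($title as xs:string, $content as node()*)
        as element(html) {
    <html>
        <head><title>{ $title }</title></head>
        <body>
            { local:header($title) }
            <div id="content">{ $content }</div>
            { local:footer() }
        </body>
    </html>
};

(: Example call: one page built from the shared pieces. :)
local:page("My Site", <p>Hello from XQuery.</p>)
```

Unlike an included fragment, a function can take parameters, so the
same header can vary with the page title.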

I'll start with these questions, and as I think through my approach
and start to draft the tutorial, I'll certainly ask more.  Thanks in
advance for your advice.

- Joe


On Tue, Jul 8, 2008 at 2:08 PM, Wolfgang <wolfgang@...> wrote:

> The sandbox might serve as an example for AJAX, but not so much for
> XQuery. Most of the code is really just JavaScript, while the XQuery
> part is relatively simple.

Re: Web scraping with eXist: http vs. httpclient module?

by Wolfgang Meier-2

> 3. Building on question 2, can anyone suggest a resource for
> configuring support for the kind of URI so in style these days --
> Rails-style SEO-friendly URIs such as
> site.com/blog/20080212/Hipster_PDA.  eXist now seems to deal best with
> URIs in the form  site.com/blog.xq?date=20080212&article=Hipster_PDA .
>  As you can tell, I'm an Apache and Tomcat novice; any suggestions for
> setting them up to pass eXist the URI parameters it needs would be
> much appreciated.

I wrote a component (XQueryURLRewrite) for the AtomicWiki project, which
deals with URL rewriting/forwarding. In the wiki, most URLs translate to
one main XQuery script. XQueryURLRewrite handles that and many other
cases. The code already ships with the latest eXist distributions,
but it still needs a bit of tuning (to minimize the overhead) and
documentation. I planned to finish this soon.

The component works similarly to the urlrewrite Java package
(http://tuckey.org/urlrewrite/), but does all the URL parsing and
rewriting in XQuery. It basically implements a filter which passes all
HTTP requests to an XQuery and the return value of that XQuery
determines how the HTTP request is handled. This way, I don't need to
struggle with another complex XML configuration language, but just use
eXist's HTTP XQuery modules.
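
The URL-parsing part of that idea can be sketched in plain XQuery. This
is not XQueryURLRewrite's actual code or API: local:rewrite is a
hypothetical helper that only shows how a path like the one in the
question could be mapped to the script-plus-parameters form, using
standard functions.

```xquery
xquery version "1.0";

(: Hypothetical helper, not part of XQueryURLRewrite. Maps a path such
   as /blog/20080212/Hipster_PDA to a script URL with parameters. :)
declare function local:rewrite($path as xs:string) as xs:string {
    (: Split on "/" and drop the empty steps that a leading or doubled
       slash produces. :)
    let $steps := tokenize($path, "/")[. ne ""]
    return
        if (count($steps) eq 3 and matches($steps[2], "^[0-9]{8}$")) then
            concat($steps[1], ".xq?date=", $steps[2],
                   "&amp;article=", encode-for-uri($steps[3]))
        else
            (: Anything that does not fit the pattern passes through. :)
            $path
};

local:rewrite("/blog/20080212/Hipster_PDA")
```

For the example path this evaluates to
blog.xq?date=20080212&article=Hipster_PDA. In a real filter the path
would come from the incoming HTTP request rather than a literal.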

Wolfgang

Re: Web scraping with eXist: http vs. httpclient module?

by Joe Wicentowski

Sorry for the slow reply, Wolfgang - I got caught up in reading the
AtomicWiki XQuery source code, and then got back to thinking about a
tutorial...  but back to your reply:

On Wed, Jul 9, 2008 at 5:31 AM, Wolfgang <wolfgang@...> wrote:

> The component works similar to the urlrewrite Java package
> (http://tuckey.org/urlrewrite/), but does all the URL parsing and rewriting
> in XQuery. It basically implements a filter which passes all HTTP requests
> to an XQuery and the return value of that XQuery determines how the HTTP
> request is handled. This way, I don't need to struggle with another complex
> XML configuration language, but just use eXist's HTTP XQuery modules.

That sounds perfect.  Even more pure XQuery!  I'll be following the
development of XQueryURLRewrite and its documentation with
anticipation.

- Joe
