Wikipedia rendering engine


Wikipedia rendering engine

by Joe Armstrong

Hi,

I was at the Erlang eXchange and heard the *magnificent* talk

"Building a transactional distributed data store with Erlang", by
Alexander Reinefeld.

I'll be blogging about this as soon as I have the URL of the video of the talk.

(In advance of this there was a talk at the Google conference on scalability:

http://video.google.com/videoplay?docid=-6526287646296437003&q=erlang+scalable&ei=cZ9oSLiDNIiCiwLL9fGwCA&hl=en

Oh, and they also seem to have won the SCALE 2008 prize at the
CCGrid conference in Lyon, but there is zero publicity about this AFAICS.)

We (collectively) promised to help Alexander. I promised to provide him with a
rendering engine (in Erlang) for the Wikipedia markup language.

Before I start hacking, has anybody done this before?

/Joe Armstrong
_______________________________________________
erlang-questions mailing list
erlang-questions@...
http://www.erlang.org/mailman/listinfo/erlang-questions

Re: Wikipedia rendering engine

by Andre Engels

On Mon, Jun 30, 2008 at 11:39 AM, Joe Armstrong <erlang@...> wrote:

> We (collectively) promised to help Alexander - I promised to provide him with a
> rendering engine (in Erlang) for the wikipedia markup language.
>
> Before I start hacking has anybody done this before?

What exactly do you mean by a 'rendering engine'? Translating the
markup language (its name is MediaWiki, by the way) to something else?

It's not a trivial task you have set yourself. There are some elements
that are quite complex, for example the fact that '' is italics and
''' is bold. Notice the difference between:

'''this is bold'''

'''this is italic, starting with a ' ''

'''this is bold '' and this part italic as well '''''
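The three lines above can be run through a small experiment. The following is only a rough sketch in Python, not the MediaWiki code: it treats runs of 2, 3 and 5 apostrophes as italic, bold and bold-italic toggles, and when a line has an odd number of both, it turns one ''' into a literal apostrophe plus ''. Picking the *first* ''' for that is a simplifying assumption; the function names are made up for the illustration.

```python
import re


def toggle(stack, tag, out):
    """Open `tag` if it is closed, otherwise close it, keeping tags well nested."""
    if tag not in stack:
        stack.append(tag)
        out.append(f"<{tag}>")
        return
    reopen = []
    while True:
        top = stack.pop()
        out.append(f"</{top}>")
        if top == tag:
            break
        reopen.append(top)          # closed only to keep nesting valid
    for other in reversed(reopen):
        stack.append(other)
        out.append(f"<{other}>")


def render_quotes(line):
    """Render the apostrophe markup of a single line as <i>/<b> HTML."""
    tokens = []                     # str = literal text, int = apostrophe run
    for index, part in enumerate(re.split(r"('{2,})", line)):
        if index % 2 == 0:
            tokens.append(part)
            continue
        n = len(part)
        if n == 4:                  # one literal apostrophe, then bold
            tokens.append("'")
            n = 3
        elif n > 5:                 # extra apostrophes stay literal
            tokens.append("'" * (n - 5))
            n = 5
        tokens.append(n)

    italics = sum(1 for t in tokens if t in (2, 5))
    bolds = sum(1 for t in tokens if t in (3, 5))
    if italics % 2 == 1 and bolds % 2 == 1 and 3 in tokens:
        # Unbalanced line: demote one bold marker to "'" + italic.
        # Assumption: the first ''' is chosen.
        where = tokens.index(3)
        tokens[where:where + 1] = ["'", 2]

    out, stack = [], []
    for t in tokens:
        if isinstance(t, str):
            out.append(t)
        elif t == 2:
            toggle(stack, "i", out)
        elif t == 3:
            toggle(stack, "b", out)
        else:                       # 5: toggle both, innermost open tag first
            order = stack[::-1] + [x for x in ("b", "i") if x not in stack]
            for tag in order:
                toggle(stack, tag, out)
    while stack:                    # close whatever is still open at end of line
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

With this sketch the second example comes out as a literal apostrophe followed by an italic span, which matches the description "italic, starting with a '".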

Also, deciding at what point in the analysis to expand {{templates}}
can lead the same code to give very different results.

--
Andre Engels, andreengels@...
ICQ: 6260644 -- Skype: a_engels

Re: Wikipedia rendering engine

by Joe Armstrong

On Mon, Jun 30, 2008 at 12:38 PM, Andre Engels <andreengels@...> wrote:
> On Mon, Jun 30, 2008 at 11:39 AM, Joe Armstrong <erlang@...> wrote:
>
>> We (collectively) promised to help Alexander - I promised to provide him with a
>> rendering engine (in Erlang) for the wikipedia markup language.
>>
>> Before I start hacking has anybody done this before?
>
> What exactly do you mean by a 'rendering engine'? Translating the
> markup language (its name is Mediawiki, by the way) to something else?

I want a number of functions

     mediaWiki_to_rtf(bin()) -> rtf().
     rtf_to_html(rtf()) -> html().
     rtf_to_pdf(rtf()) -> pdf().

etc., where rtf(), html() and pdf() are abstract data types representing
(abstracted) rich text, HTML and PDF.

The rendering engine is a wrapper round these routines to display the
result in a browser, generate PDF, etc.
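One way to picture the split between an abstract rich-text value and its renderers is sketched below in Python. The node types (`Text`, `Styled`) and the function names are invented for this illustration and are not taken from any existing library or from the thread; the point is only that several back ends can consume the same intermediate value.

```python
from dataclasses import dataclass
from html import escape
from typing import List, Union


@dataclass
class Text:
    """A run of plain characters."""
    value: str


@dataclass
class Styled:
    """A bold or italic span containing further nodes."""
    style: str                      # "bold" or "italic"
    children: List["Node"]


Node = Union[Text, Styled]

HTML_TAGS = {"bold": "b", "italic": "i"}


def rich_to_html(nodes):
    """Render the abstract rich text as an HTML fragment."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(escape(node.value))
        else:
            tag = HTML_TAGS[node.style]
            parts.append(f"<{tag}>{rich_to_html(node.children)}</{tag}>")
    return "".join(parts)


def rich_to_plain(nodes):
    """A second back end over the same value: drop all styling."""
    parts = []
    for node in nodes:
        if isinstance(node, Text):
            parts.append(node.value)
        else:
            parts.append(rich_to_plain(node.children))
    return "".join(parts)
```

A PDF back end would be a third function over the same node types, which is presumably the reason for going through an intermediate form rather than translating the markup straight to HTML.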

>
> It's not a trivial task you have set yourself. There are some elements
> that are quite complex, for example the fact that '' is italics and
> ''' is bold. Notice the difference between:
>
> '''this is bold'''
>
> '''this is italic, starting with a ' ''
>
> '''this is bold '' and this part italic as well '''''
>

This is almost trivial :-)

> Also deciding on what point of the analysis to expand {{templates}}
> can lead the same code to get very different results.

Glurk - you mean it's *undefined* - wow - I guess I'll discover this

 /Joe

Re: Wikipedia rendering engine

by Joe Armstrong

On Mon, Jun 30, 2008 at 12:38 PM, Andre Engels <andreengels@...> wrote:
> On Mon, Jun 30, 2008 at 11:39 AM, Joe Armstrong <erlang@...> wrote:
>
>> We (collectively) promised to help Alexander - I promised to provide him with a
>> rendering engine (in Erlang) for the wikipedia markup language.
>>

Thanks - I didn't know the name of the format. It seems like the
process of parsing is reasonably easy -- see

http://www.mediawiki.org/wiki/Markup_spec#The_Markup_Language

It seems pretty amazing that there is no formal specification of the
grammar of the markup language, and that this
is being decided *after* there are a few quadzillion pages of markup text :-)

/J


--
fra@...; ingvar.akesson@...

[A copy of this message is being sent to FRA for monitoring purposes.
They want to read my e-mail anyway.]

[A copy of this mail has been sent to
FRA for monitoring purposes. FRA wants to read all my e-mail and has
been allowed to do so by the Swedish parliament - in violation of article
12 of the UN Universal Declaration of Human Rights]

Re: Wikipedia rendering engine

by Andre Engels

On Mon, Jun 30, 2008 at 12:53 PM, Joe Armstrong <erlang@...> wrote:

>> Also deciding on what point of the analysis to expand {{templates}}
>> can lead the same code to get very different results.
>
> Glurk - you mean it's *undefined* - wow - I guess I'll discover this

Might well be undefined in the sense that the only definition is "what
the (PHP) code does is right"; I'm not sure. The code seems to be in
http://svn.wikimedia.org/viewvc/mediawiki/branches/stable/phase3/includes/OutputPage.php?view=log


Re: Wikipedia rendering engine

by Hynek Vychodil



On Mon, Jun 30, 2008 at 12:58 PM, Joe Armstrong <erlang@...> wrote:

> It seems pretty amazing that there is no formal specification of the
> grammar of the markup language and that this
> is decided *after* there are a few quadzillion pages of markup text :-)

{joke, "It's not amazing, it's wikiworld!"}.
 


--
--Hynek (Pichi) Vychodil

Re: Wikipedia rendering engine

by Joe Armstrong

Followup to myself.

I guess Wikipedia is stored internally in the format that is
presented to the users for editing,
i.e. in the MediaWiki markup language.

Is there a REST interface so that I can retrieve the latest version of
the MediaWiki markup for a specific page with, for example,
a wget command?

Has anybody made an Erlang interface to scrape individual pages from
Wikipedia - or to bulk convert the entire
Wikipedia to Erlang terms :-)

/Joe




Re: Wikipedia rendering engine

by Andre Engels

On Mon, Jun 30, 2008 at 1:23 PM, Joe Armstrong <erlang@...> wrote:

> Is there a REST interface so that I can retrieve the latest version of
> the MediaWiki markup for a specific page with, for example,
> a wget command.

What's a REST interface? There are several ways to get the MediaWiki
markup of a specific page:
* Go to the edit page; it contains the latest version of the markup
* Go to [[Special:Export]], where you can get either the current
version or all versions of a number of pages, in XML
* At http://download.wikimedia.org/backup-index.html are the complete
database dumps of the various wikis; the content of the page is in one
of the tables
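For the [[Special:Export]] route in the list above, a minimal sketch in Python might look like the following. It assumes the export XML has the usual `<page>`, `<title>` and `<revision><text>` elements and ignores the XML namespace, since its version string varies. `export_url` and `pages_from_export` are names invented for this example, and the sample document is hand-written rather than a real download.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote


def export_url(title, host="en.wikipedia.org"):
    """Build a Special:Export URL for one page title (the host is an assumption)."""
    return f"https://{host}/wiki/Special:Export/{quote(title.replace(' ', '_'))}"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def pages_from_export(xml_text):
    """Return {title: wikitext} for every <page> in an export document."""
    pages = {}
    for page in ET.fromstring(xml_text).iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, None
        for element in page.iter():
            name = local_name(element.tag)
            if name == "title":
                title = element.text
            elif name == "text":
                text = element.text or ""   # the last revision listed wins
        pages[title] = text
    return pages


# Hand-written sample in the shape of an export file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Erlang</title>
    <revision>
      <text>'''Erlang''' is a [[programming language]].</text>
    </revision>
  </page>
</mediawiki>"""

print(pages_from_export(SAMPLE))
```

The same parsing function should also work on a small slice of a database dump, since the dumps use the same page/revision layout, although a full dump is far too large to load with `ET.fromstring` and would need streaming instead.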



Re: Wikipedia rendering engine

by Jan Lehnardt

On Jun 30, 2008, at 13:23, Joe Armstrong wrote:
> Is there a REST interface so that I can retrieve the latest version of
> the MediaWiki markup for a specific page with, for example,
> a wget command.

You can get bulk dumps http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get 
...

Why would you do individual scraping? In order to keep up to date with
changes that happened between the last dump and now()?

Cheers
Jan

Re: Wikipedia rendering engine

by Joe Armstrong

On Mon, Jun 30, 2008 at 1:33 PM, Andre Engels <andreengels@...> wrote:
> On Mon, Jun 30, 2008 at 1:23 PM, Joe Armstrong <erlang@...> wrote:
>
>> Is there a REST interface so that I can retrieve the latest version of
>> the MediaWiki markup for a specific page with, for example,
>> a wget command.
>
> What's a REST interface?

http://en.wikipedia.org/wiki/Representational_State_Transfer

/J


Re: Wikipedia rendering engine

by Joe Armstrong

On Mon, Jun 30, 2008 at 1:36 PM, Jan Lehnardt <jan@...> wrote:

> On Jun 30, 2008, at 13:23, Joe Armstrong wrote:
>>
>> Is there a REST interface so that I can retrieve the latest version of
>> the MediaWiki markup for a specific page with, for example,
>> a wget command.
>
> You can get bulk dumps
> http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_do_I_get...
>
> Why would you do individual scraping? In order to keep up to date with
> changes that happened between the last dump and now()?
>

To get a few test cases to test my parser on *before* downloading the entire thing.

Also I suspect the dumps are in MySQL format with XML junk - so it might not be
a trivial job to extract the raw data. I (presumably) will have to
install MySQL and
turn some XML stuff into the raw data (just guessing here) - I thought
that could be a job for a
volunteer :-)

/Joe



Re: Wikipedia rendering engine

by Thorsten Schuett

Hi all,

As I am partially to blame for the noise around the wikirenderer, I will add
my two cents.

For our experiments, we used the XML dumps available at
http://download.wikimedia.org. We have a small Java program which converts
the XML dump to Erlang terms (http://www.zib.de/schuett/dumpreader.tgz). E.g.
converting the Bavarian dump:

java -jar dumpreader.jar /home/schuett/barwiki-20080225-pages-meta-history.xml

But you still have to parse the MediaWiki text and convert it to HTML.
For the last step we currently have two solutions:

1. Early experiments used flexbisonparse
(http://svn.wikimedia.org/viewvc/mediawiki/trunk/flexbisonparse/) to convert
the MediaWiki text to XML, and XSLT to convert the XML to HTML.

2. The current code is based on plog4u/bliki (see
http://matheclipse.org/en/Java_Wikipedia_API).

Thorsten


Re: Wikipedia rendering engine

by Joe Armstrong

On Mon, Jun 30, 2008 at 1:58 PM, Alain O'Dea <alain.odea@...> wrote:

> There is a REST interface, but it is not exactly machine-friendly. If
> you request http://en.wikipedia.org/w/index.php?title=<TOPIC
> NAME>&action=edit with a topic name put in you will get an editor
> page. For example
> http://en.wikipedia.org/w/index.php?title=Erlang%20(programming%20language)&action=edit
> brings up the editor page for the Erlang programming language.
>
> The raw MediaWiki markup is in a textarea with id "wpTextbox1", but
> unfortunately I have been unable to get xmerl to extract it due to the
> fact that the page is HTML and not well-formed XML.

Seems to work - it should be easy to extract the content.

Why bother with xmerl? Just scan the text for a constant string ...

<textarea tabindex='1' accesskey="," name="wpTextbox1" id="wpTextbox1"

This is really easy.

I wonder if this is what is in the database, or has this been generated
from something else?

/Joe




>
> I imagine a simple parser which looks for '<textarea', then
> 'id="wpTextbox1"', then '>', then gathers text until '</textarea'
> would work pretty well. I'll take a look at this when I get home this
> evening.
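The scan described above (find '<textarea', check for the id, skip to '>', collect until '</textarea') can be sketched in a few lines of Python. This is a plain string search and not an HTML parser, so it assumes no '>' occurs inside the attribute values of the tag. It also unescapes HTML entities, on the assumption that the editor page escapes characters such as '&' and '<' inside the textarea. The function name is made up for the example.

```python
from html import unescape


def extract_wikitext(page_html, marker='id="wpTextbox1"'):
    """Return the text inside the <textarea> whose tag contains `marker`, or None."""
    start = page_html.find("<textarea")
    while start != -1:
        tag_end = page_html.find(">", start)
        if tag_end == -1:
            return None                     # truncated page, no end of tag
        if marker in page_html[start:tag_end]:
            body_end = page_html.find("</textarea", tag_end)
            if body_end == -1:
                return None                 # truncated page, no closing tag
            return unescape(page_html[tag_end + 1:body_end])
        start = page_html.find("<textarea", tag_end)
    return None
```

Feeding it the edit page fetched with wget would then give the raw markup of that page, ready to be used as a parser test case.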
>




Re: Wikipedia rendering engine

by Joe Armstrong

Rock and roll....

Can you be more explicit than http://download.wikimedia.org? Can you
point me to a specific file
that I can download that works with your dump reader?

Thanks

/Joe

