Wiktionary parsers

by Andrew Dunbar

I'm writing a new Wiktionary parser and I'm wondering if anybody else
who has made or is making or wants to make a Wiktionary parser would
like to share some thoughts.

My main aim is to mine translation data to use with my other project,
Linguaphile, a language translator.

At the moment I'm parsing the XML dump file but I also want an
interface to fetch wikitext from the live Wiktionary.
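
For anyone following along, here is a minimal sketch of streaming pages out of a dump. It is written in Python rather than the Perl mentioned in this thread, and it is not the code under discussion. It assumes the usual MediaWiki export layout of <page> elements containing a <title> and a <revision>/<text>, with one revision per page. The names iter_pages and SAMPLE, and the namespace URI in the sample, are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumes one <revision> per <page>, as in a current-pages dump.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


# Tiny hand-written stand-in for a dump; the export version in the
# namespace URI is an assumption and is ignored by local().
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>chien</title>
    <revision><text>==French==
===Noun===</text></revision>
  </page>
  <page>
    <title>Template:fr</title>
    <revision><text>French</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, page_text in pages:
    print(page_title, len(page_text))
```

iterparse reads the file incrementally, which matters for a multi-gigabyte dump. For a real file, pass an opened binary file object instead of the BytesIO sample.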

I'm focusing on the English Wiktionary first because I know its
format, but I'd also like to target the other bigger Wiktionaries.

Another thing I'm thinking about is a central repository for
Wiktionary parser source code. The code I'm making now is in Perl but
I've also done smaller amounts of parsing in PHP and JavaScript and
I'm sure others have code in Python.

I know several people have parsed the English Wiktionary - has anybody
made parsers for other Wiktionaries yet?

Let's hear what you are working on.

Andrew Dunbar (hippietrail)

--
http://wiktionarydev.leuksman.com http://linguaphile.sf.net

_______________________________________________
Wiktionary-l mailing list
Wiktionary-l@...
http://lists.wikimedia.org/mailman/listinfo/wiktionary-l

Re: Wiktionary parsers

by Minh Nguyen-2

Andrew Dunbar wrote:

> I'm writing a new Wiktionary parser and I'm wondering if anybody else
> who has made or is making or wants to make a Wiktionary parser would
> like to share some thoughts.
>
> My main aim is to mine translation data to use with my other project,
> Linguaphile, a language translator.
>
> At the moment I'm parsing the XML dump file but I also want an
> interface to fetch wikitext from the live Wiktionary.
>
> I'm focusing on the English Wiktionary first because I know its
> format, but I'd also like to target the other bigger Wiktionaries.
>
> Another thing I'm thinking about is a central repository for
> Wiktionary parser source code. The code I'm making now is in Perl but
> I've also done smaller amounts of parsing in PHP and JavaScript and
> I'm sure others have code in Python.
>
> I know several people have parsed the English Wiktionary - has anybody
> made parsers for other Wiktionaries yet?
>
> Let's hear what you are working on.
>
> Andrew Dunbar (hippietrail)

The wiktionary.py class of pywikipediabot [1] has "alpha" support for
the English and Dutch Wiktionaries. Since several large Wiktionaries use
the Dutch "templatized" format, it should be simple to extend support to
those wikis as well.

[1] http://pywikipediabot.sourceforge.net/

--
Minh Nguyen <mxn@...>
[[en:User:Mxn]] [[vi:User:Mxn]] [[m:User:Mxn]]
AIM: trycom2000; Jabber: mxn@...; Blog: http://notes.1ec5.org/


Re: Wiktionary parsers

by Andrew Dunbar

My Wiktionary parser is now available via svn on the toolserver:
http://fisheye.ts.wikimedia.org/browse/hippietrail/wiktparser

It's not a full parser yet. I'm developing several reusable libraries
and a couple of small apps which use them.

Libraries:

* DumpParser.pm knows about the XML dump file format, including namespaces.
* WiktParser.pm knows about parts of how the English Wiktionary
articles are formatted.
* WiktLang.pm relates language names and synonyms and alternative
spellings to language codes.
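
To illustrate the kind of lookup a module like WiktLang.pm might provide, here is a rough Python sketch. It is not the actual module's interface. The table contents and the names LANGUAGE_NAMES, build_index and codes_for are hypothetical, and a real table would be far larger.

```python
# Hypothetical hand-written table: language code -> known names and synonyms.
LANGUAGE_NAMES = {
    "fr": ["French"],
    "nl": ["Dutch", "Flemish"],
    "el": ["Greek", "Modern Greek"],
    "vi": ["Vietnamese"],
}


def normalize(name):
    """Fold case and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.casefold().split())


def build_index(table):
    """Invert the table into normalized name -> set of language codes."""
    index = {}
    for code, names in table.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(code)
    return index


def codes_for(name, index):
    """Return the sorted codes known for a name, or an empty list."""
    return sorted(index.get(normalize(name), ()))


INDEX = build_index(LANGUAGE_NAMES)
print(codes_for("  modern   GREEK ", INDEX))
print(codes_for("Klingon", INDEX))
```

Each name maps to a set of codes rather than a single code, because one name can be claimed by several codes once the table is mined from templates and articles.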

Apps:

* wiktparser.pl extracts nouns of a given language along with their
gender and homonym and sense numbers. It also produces a log file of
entries which it could not parse.
* extractlangcodes.pl looks for all templates and articles which
contain information relating language codes to language names or vice
versa and outputs a table of which sets of language names relate to
which set of language codes.
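
As a rough picture of what noun extraction with a log of unparsable entries can look like, here is a simplified Python sketch. It is not wiktparser.pl and it ignores homonym and sense numbers. It assumes an entry layout of "==Language==" and "===Noun===" headings followed by a headword line such as '''chien''' {{m}}. The gender template names and the function name extract_nouns are assumptions for illustration.

```python
import re

HEADING = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")
HEADWORD = re.compile(r"^'''(.+?)'''(.*)$")
GENDER = re.compile(r"\{\{([mfnc])\}\}")  # assumed gender templates


def extract_nouns(title, wikitext, language):
    """Return (nouns, problems) for one entry.

    nouns is a list of (headword, gender) pairs found under
    ==language== / ===Noun===; problems is a list of log messages
    for noun sections that could not be parsed.
    """
    nouns, problems = [], []
    current_lang = None
    awaiting_headword = False
    for line in wikitext.splitlines():
        line = line.strip()
        m = HEADING.match(line)
        if m and len(m.group(1)) == len(m.group(3)):
            if awaiting_headword:
                problems.append(f"{title}: noun section without headword line")
            level, name = len(m.group(1)), m.group(2)
            if level == 2:
                current_lang = name
            awaiting_headword = (
                level > 2 and current_lang == language and name == "Noun"
            )
            continue
        if awaiting_headword and line:
            awaiting_headword = False
            hw = HEADWORD.match(line)
            gender = GENDER.search(hw.group(2)) if hw else None
            if hw and gender:
                nouns.append((hw.group(1), gender.group(1)))
            else:
                problems.append(f"{title}: cannot parse headword line {line!r}")
    if awaiting_headword:
        problems.append(f"{title}: noun section without headword line")
    return nouns, problems


GOOD = "==French==\n===Noun===\n'''chien''' {{m}}\n\n# [[dog]]\n"
BAD = "==French==\n===Noun===\nchat {{m}}\n"

print(extract_nouns("chien", GOOD, "French"))
print(extract_nouns("chat", BAD, "French"))
```

Collecting problems per entry instead of raising an exception means one run over the whole dump yields both the extracted nouns and a list of entries whose formatting needs attention.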

Please try out these tools and comment here. I'm actively refactoring
and generalizing the code now rather than trying to extract other
parts of speech or parse more variants of headword/inflection lines or
definition lines.

Andrew Dunbar (hippietrail)
