
FeforParCorp

FrancisBond edited this page Oct 24, 2007 · 31 revisions

Parallel Corpora for Delph-In


Collections/Samples of available parallel corpora

Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/
- Samples of Europarl Corpus: http://www.dfki.de/~frank/Europarl_sample
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit, by basename of file and relative position in raw sentence-separated ASCII files
- Todo: complete cross-lingual alignment (currently only pair-wise implicit alignment). Possibly we can get something along these lines from Andreas Eisele.
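The implicit alignment described above (same file basename, same line position) can be made explicit with a few lines of code. The following is only a sketch under assumptions not stated on this page: each language is assumed to live in its own directory, files are assumed to be named `<basename>.<lang>` with one sentence per line, and the function name `align_by_position` and the UTF-8 default encoding are hypothetical choices.

```python
from pathlib import Path
from typing import Iterator, Tuple


def align_by_position(
    dir_a: Path, dir_b: Path, encoding: str = "utf-8"
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (basename, line_index, sentence_a, sentence_b).

    Assumes (hypothetically) one directory per language, files named
    <basename>.<lang>, and one sentence per line.
    """
    # Index the files of each language by basename (name without last suffix).
    files_a = {p.stem: p for p in dir_a.iterdir() if p.is_file()}
    files_b = {p.stem: p for p in dir_b.iterdir() if p.is_file()}

    # Only basenames present in both languages can be paired.
    for base in sorted(files_a.keys() & files_b.keys()):
        lines_a = files_a[base].read_text(encoding=encoding).splitlines()
        lines_b = files_b[base].read_text(encoding=encoding).splitlines()
        if len(lines_a) != len(lines_b):
            # Positional alignment is unreliable if the counts differ: skip.
            continue
        for index, (sent_a, sent_b) in enumerate(zip(lines_a, lines_b)):
            yield base, index, sent_a, sent_b
```

Files whose sentence counts differ are skipped rather than guessed at, since a single missing line would silently shift every later pair. Running the function over each language pair only gives pair-wise alignment; the full cross-lingual alignment listed under Todo would still need a separate step.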

OPUS: Technical Documentation (plus Europarl and European Constitution)

- URL: http://logos.uio.no/opus/

The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part. Information about status, availability, formats and analyses can be found at the URL below.
- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

This is not redistributable:

  • "Permission to use the corpus can be given to those signing an agreement that they will only use the corpus for research, development and teaching. A web-form will be available soon, in the meantime, contact Lars Nygaard. If you already have got a permission, click here to use the corpus."

Translations in other languages exist (including Japanese), which we may be able to get permission for.

The JRC-Acquis Multilingual Parallel Corpus

Cathedral and the Bazaar

This is an early essay on Open Source. It is a little difficult (a lot of parentheticals and run-on sentences), but quite fun to read. It is about 800 sentences, which is small, but there are more essays if we want more data. There are several good translations, not all of them linked from the main page (e.g. a Spanish translation at http://es.tldp.org/Otros/catedral-bazar/cathedral-es-paper-00.html). It is freely available, but I (FCB) checked with the author anyway as a matter of courtesy and he was enthusiastic about us using it. There may be some clean-up work involved in getting the translations aligned, as there are several versions of the essay.

Wikipedia has a number of links to different translations: http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar (see the language links on the left) (AF)

Thai version: http://linux.thai.net/~thep/catb/cathedral-bazaar/index.html

| Language | Participant Group | URL | Version |
|---|---|---|---|
| Catalan (ca) | Barcelona | http://www.danielclemente.com/apuntes/asai/recensio/catb.html | ? |
| Chinese (zh) traditional | Saarbrücken | http://www.linux.org.tw/CLDP/OLD/doc/Cathedral-Bazaar.html (big5) | 1.42 |
| Chinese (zh) simplified | Saarbrücken | http://www.angeloliu.org/read-37.html | ? |
| English (en) | Stanford/Oslo | http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ | 1.57 |
| French (fr) | Toulouse | http://www.linux-france.org/article/these/cathedrale-bazar/cathedrale-bazar.html | 1.4 |
| German (de) | Saarbrücken | http://gnuwin.epfl.ch/articles/de/Kathedrale/ | 1.45 |
| Greek, Modern (el) | Saarbrücken/Athens | http://howto.hellug.gr/howto/pub/html/cathedral-bazaar.html | ? |
| Japanese (ja) | Kyoto | http://cruel.org/freeware/cathedral.html | 1.40 |
| Korean (ko) | Seoul | http://wiki.kldp.org/wiki.php/DocbookSgml/Cathedral-Bazaar-TRANS | 1.32 |
| Norwegian (no) | Trondheim NTNU | TBA | |
| Portuguese (pt) | Lisbon | http://www.geocities.com/CollegePark/Union/3590/pt-cathedral-bazaar.html | 1.42 |
| Spanish (es) | Barcelona | http://es.tldp.org/Otros/catedral-bazar/cathedral-es-paper-13.html | 1.28 |
| Swedish (sv) | Linköping | http://home.swipnet.se/swi/KatB-se.html | 1.51 |

At NiCT we also have a 201-sentence aligned subset covering en, ko, zh, de, pt, it and fr, which we use for MT testing.

Treebanking this text raises several interesting text-cleansing issues (italics, embedded quotations, list numbers and so forth) that it would be good to discuss more generally:

  1. Treating your users as co-developers is your least-hassle route to rapid code improvement and effective debugging.

  2. When I expressed this opinion in his presence once, he smiled and quietly repeated something he has often said: "I'm basically a very lazy person who likes to get credit for things other people actually do."

  3. And loosely-coupled collaborations enabled by the Internet, a la Linux, were frequent.
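One possible way to handle such cases is to strip the markup before parsing while recording what was removed, so that the information is not lost. The sketch below is an illustration only, not an agreed procedure: it assumes the source text is HTML with `<i>`/`<em>` tags for italics and leading enumerators such as `1.`, and the function name `clean_sentence` and the shape of the returned notes are hypothetical. Embedded quotations are deliberately left untouched.

```python
import re
from typing import Dict, Tuple

# Inline formatting tags that are removed from the sentence text.
_INLINE_TAG = re.compile(r"</?(?:i|em|b|strong|tt)\b[^>]*>", re.IGNORECASE)
# Emphasised spans, captured before the tags are stripped.
_EMPHASIS = re.compile(
    r"<(?:i|em)\b[^>]*>(.*?)</(?:i|em)>", re.IGNORECASE | re.DOTALL
)
# A leading enumerator such as "1." or "2)" followed by whitespace.
_LIST_NUMBER = re.compile(r"^\s*(\d+)[.)]\s+")


def clean_sentence(raw: str) -> Tuple[str, Dict[str, object]]:
    """Return (plain_text, notes) where notes records the removed markup."""
    notes: Dict[str, object] = {"list_number": None, "emphasis": []}

    # Detach the list number so the parser only sees the sentence itself.
    match = _LIST_NUMBER.match(raw)
    if match:
        notes["list_number"] = int(match.group(1))
        raw = raw[match.end():]

    # Remember which spans were italicised, then drop the tags.
    notes["emphasis"] = _EMPHASIS.findall(raw)
    text = _INLINE_TAG.sub("", raw)

    # Collapse whitespace left behind by markup and line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    return text, notes


text, notes = clean_sentence(
    "3. And loosely-coupled collaborations enabled by the Internet, "
    "<i>a la</i> Linux, were frequent."
)
print(text)
print(notes)
```

Keeping the removed items in a side record means the cleaned string can be parsed as is, while the italics and numbering remain available if a later treebank annotation wants them.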

Universal Declaration of Human Rights

The preamble (a single multi-paragraph sentence) is impossible to parse, but apart from that the text isn't too difficult, and it has some nice universal quantifiers and modals. It is a little short, but there are many other declarations. There are 365 different translations, most of excellent quality --- the multilinguality is the main selling point. It is freely available. There is a little synergy, as it is the de facto standard for testing Unicode fonts --- it should print nicely.

Some criteria for choosing a corpus

  1. difficulty --- we need to have some hope of parsing it
  2. size --- to build statistical models it has to be a certain size
  3. quality --- the language should be natural (often a problem for translations)
  4. availability --- we need to be able to share the data
  5. multilinguality --- it would be nice to have existing translations
  6. relevance --- the genre should be one you are interested in
  7. synergy --- it is nice to reuse/complement existing markup
  8. diversity --- it can be interesting to experiment with a mixture of corpora, of different text types