
FeforParCorp

EmilyBender edited this page Aug 4, 2006 · 31 revisions

Parallel Corpora for Delph-In

Collections/Samples of available parallel corpora

Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/
- [http://www.dfki.de/~frank/Europarl_sample Samples of Europarl Corpus]
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit by basename of file and relative position in raw sentence-separated ASCII files
- Todo: complete cross-lingual alignment (currently only pair-wise implicit alignment). Possibly we can get something along these lines from Andreas Eisele.
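The implicit alignment described above (same file basename, same line position) could be read out roughly as follows. This is only a sketch under assumptions that are not confirmed by the distribution itself: it supposes one directory per language, one sentence per line, and matching file stems across languages. The function names `read_sentences` and `align_by_position` are made up for illustration.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_sentences(path: Path) -> List[str]:
    """Read a sentence-separated text file, one sentence per line."""
    # errors="replace" keeps the sketch robust to stray non-UTF-8 bytes.
    return path.read_text(encoding="utf-8", errors="replace").splitlines()


def align_by_position(src_dir, tgt_dir) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (basename, line_index, source_sentence, target_sentence).

    Assumed (hypothetical) layout: src_dir and tgt_dir each hold the files
    for one language, and files that belong together share a basename.
    """
    src_dir, tgt_dir = Path(src_dir), Path(tgt_dir)
    # Index target files by basename (stem), ignoring the extension.
    targets = {p.stem: p for p in tgt_dir.iterdir() if p.is_file()}
    for src_file in sorted(p for p in src_dir.iterdir() if p.is_file()):
        tgt_file = targets.get(src_file.stem)
        if tgt_file is None:
            continue  # no counterpart in the target language
        src = read_sentences(src_file)
        tgt = read_sentences(tgt_file)
        if len(src) != len(tgt):
            # Position-based alignment is unreliable when line counts differ.
            continue
        for i, (s, t) in enumerate(zip(src, tgt)):
            yield src_file.stem, i, s, t
```

Called as `align_by_position("en", "de")`, this yields one tuple per sentence pair. File pairs whose line counts differ are skipped rather than guessed at, since relative position is the only alignment signal available.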

OPUS: Technical Documentation (plus Europarl and European Constitution)

- URL: http://logos.uio.no/opus/

The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part.
- Information about status, availability, formats and analyses can be found at the URL below.
- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

The JRC-Acquis Multilingual Parallel Corpus

Cathedral and the Bazaar

Universal Declaration of Human Rights

Some criteria for choosing a corpus

  1. difficulty --- we need to have some hope of parsing it
  2. size --- to build statistical models it has to be a certain size
  3. quality --- the language should be natural (often a problem for translations)
  4. availability --- we need to be able to share the data
  5. multilinguality --- it would be nice to have existing translations
  6. relevance --- the genre should be one you are interested in
  7. synergy --- it is nice to reuse/complement existing markup