FeforParCorp

Parallel Corpora for Delph-In

Collections/Samples of available parallel corpora

Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/

- [http://www.dfki.de/~frank/Europarl_sample Samples of Europarl Corpus]
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit, by file basename and relative position in raw, sentence-separated ASCII files
- Todo: complete the cross-lingual alignment (currently only pair-wise implicit alignment). Possibly we can get something along these lines from Andreas Eisele.
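The implicit alignment described for the Europarl samples (same basename, same line position) can be read off with a few lines of code. The following is only a minimal sketch: it assumes one directory per language, one sentence per line, and files readable as UTF-8 (which covers plain ASCII). The function name `aligned_pairs` and the directory layout are hypothetical, not part of the distributed samples.

```python
from pathlib import Path
from typing import Iterator, Tuple


def aligned_pairs(src_dir: Path, tgt_dir: Path) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (basename, line_no, src_sentence, tgt_sentence).

    Assumption: line i of a source file corresponds to line i of the
    target file that has the same basename.
    """
    tgt_names = {p.name for p in tgt_dir.iterdir() if p.is_file()}
    for src_file in sorted(src_dir.iterdir()):
        # Only files present in both languages can be paired.
        if not src_file.is_file() or src_file.name not in tgt_names:
            continue
        src_lines = src_file.read_text(encoding="utf-8").splitlines()
        tgt_lines = (tgt_dir / src_file.name).read_text(encoding="utf-8").splitlines()
        # A differing line count means positional alignment cannot be trusted.
        if len(src_lines) != len(tgt_lines):
            continue
        for line_no, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
            yield src_file.name, line_no, src, tgt
```

Running this pair-wise against a single pivot language (for example `en`) and joining the results on `(basename, line_no)` would be one way to approach the cross-lingual alignment listed under Todo, provided the pivot files line up with every other language.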

OPUS: Technical Documentation (plus Europarl and European Constitution)

The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part.
- Information about status, availability, formats and analyses can be found on the project page.
- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

Some criteria for choosing a corpus

difficulty --- we need to have some hope of parsing it
size --- to build statistical models it has to be a certain size
quality --- the language should be natural (often a problem for translations)
availability --- we need to be able to share the data
multilinguality --- it would be nice to have existing translations
relevance --- the genre should be one you are interested in
synergy --- it is nice to reuse/complement existing markup
