FeforParCorp

AnetteFrank edited this page Jun 19, 2006 · 31 revisions

Parallel Corpora for Delph-In

Collections/Samples of available parallel corpora

* Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/

- Samples of Europarl Corpus: http://www.dfki.de/~frank/Europarl_sample

- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv

- Size per language: 600-700k sentences

- Format: currently distributed over approx. 400 files

- Alignment: implicit, given by the basename of each file and the relative position of each sentence in the raw, sentence-separated ASCII files

- Todo: reformatting, preferably in XML:

  • a <sent> element with embedded elements for the different languages (da, ..., sv), with attributes for the length (in tokens) and a reference to the original filename
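
The proposed reformatting could be sketched roughly as follows. This is only an illustration under several assumptions, not an agreed format: the root element name `corpus`, the attribute names `id`, `file` and `length`, the placement of `length` on each language element, and whitespace-based token counting are all hypothetical choices. The function name `build_aligned_xml` and the sample data are invented for the example. Alignment is taken from sentence position, as described above.

```python
import xml.etree.ElementTree as ET


def build_aligned_xml(basename, sentences_by_lang):
    """Build one <sent> element per aligned sentence position.

    basename          -- shared basename of the source files (hypothetical value)
    sentences_by_lang -- dict mapping a language code to its list of sentences,
                         where position i in every list is the same sentence
    """
    counts = {len(sents) for sents in sentences_by_lang.values()}
    if len(counts) > 1:
        # Positional alignment only works if all languages have equally many lines.
        raise ValueError("languages differ in number of sentences")
    n_sents = counts.pop() if counts else 0

    root = ET.Element("corpus")  # assumed root element name
    for i in range(n_sents):
        # 'file' points back to the original file; 'id' is the 1-based position.
        sent = ET.SubElement(root, "sent", {"id": str(i + 1), "file": basename})
        for lang in sorted(sentences_by_lang):
            text = sentences_by_lang[lang][i]
            # Assumption: tokens are whitespace-separated.
            elem = ET.SubElement(sent, lang, {"length": str(len(text.split()))})
            elem.text = text
    return root


# Invented sample data, not taken from the corpus.
sample = {
    "de": ["Das ist ein Test .", "Vielen Dank ."],
    "en": ["This is a test .", "Thank you ."],
}
root = build_aligned_xml("sample-basename", sample)
print(ET.tostring(root, encoding="unicode"))
```

Keeping the filename and position on each `<sent>` element means the original files can still be located after the roughly 400 source files are merged.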