
FeforParCorp

MontserratMarimon edited this page Jul 5, 2006 · 31 revisions

Parallel Corpora for DELPH-IN

Collections/Samples of available parallel corpora

* Europarl Corpus

- URL: http://people.csail.mit.edu/koehn/publications/europarl/

- Samples of Europarl Corpus: http://www.dfki.de/~frank/Europarl_sample

- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv

- Size per language: 600-700k sentences

- Format: currently distributed over approx. 400 files

- Alignment: implicit, given by the basename of each file and the relative position of each sentence in the raw, sentence-separated ASCII files

- Todo: complete cross-lingual alignment (currently only pair-wise implicit alignment). Possibly we can get something along these lines from Andreas Eisele.
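The implicit alignment described above (same file basename, same line position) can be read off with a short script. The following is only a minimal sketch under assumptions that are not stated on this page: that each language lives in its own directory, that corresponding files carry exactly the same file name, and that each file holds one sentence per line. The function name `aligned_pairs` and the directory layout are hypothetical, and the default encoding is a guess that may need adjusting for the actual distribution.

```python
from pathlib import Path
from typing import Iterator, Tuple


def aligned_pairs(
    src_dir: Path, tgt_dir: Path, encoding: str = "utf-8"
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (file name, line number, source sentence, target sentence).

    Assumption: files with the same name in src_dir and tgt_dir correspond,
    and line i of one file is aligned with line i of the other.
    """
    # Index the target-language files by file name for lookup.
    tgt_files = {p.name: p for p in tgt_dir.iterdir() if p.is_file()}

    for src_file in sorted(p for p in src_dir.iterdir() if p.is_file()):
        tgt_file = tgt_files.get(src_file.name)
        if tgt_file is None:
            # No counterpart with the same name: nothing to align.
            continue

        src_lines = src_file.read_text(encoding=encoding).splitlines()
        tgt_lines = tgt_file.read_text(encoding=encoding).splitlines()

        # Positional alignment is only meaningful if the counts agree.
        if len(src_lines) != len(tgt_lines):
            raise ValueError(
                f"{src_file.name}: {len(src_lines)} source lines "
                f"vs {len(tgt_lines)} target lines"
            )

        for line_no, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
            yield src_file.name, line_no, src, tgt
```

Calling `aligned_pairs(Path("europarl/en"), Path("europarl/de"))` (hypothetical paths) would then iterate over sentence pairs for one language pair. Raising on a line-count mismatch, rather than silently truncating, is a deliberate choice here: with purely positional alignment, a single missing line would shift every following pair.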

* OPUS: Technical Documentation (plus Europarl and European Constitution)

- URL: http://logos.uio.no/opus/

* The Sofie Treebank

- The treebank was developed by the participants of the Nordic Treebank Network, in which academic institutions from Denmark, Estonia, Finland, Iceland, Norway, and Sweden took part. Information about status, availability, formats, and analyses can be found at the project page.

- URL: http://www.hf.uio.no/tekstlab/prosjekter/SOFIE.html

* The JRC-Acquis Multilingual Parallel Corpus

- URL: http://langtech.jrc.it/JRC-Acquis.html#Introduction
