FeforParCorp
AnetteFrank edited this page Jun 19, 2006
·
31 revisions
* Europarl Corpus
- URL: http://people.csail.mit.edu/koehn/publications/europarl/
- [http://www.dfki.de/~frank/Europarl_sample Samples of Europarl Corpus]
- Languages: da, de, en, el, es, fi, fr, it, nl, pt, sv
- Size per language: 600-700k sentences
- Format: currently distributed over approx. 400 files
- Alignment: implicit, given by the basename of each file and the relative position of each sentence in the raw, sentence-separated ASCII files
- Todo: reformatting, preferably in XML:
- <sent> element with embedded elements for the different languages (da, .. sv), with attributes for length (in tokens) and a reference to the original filename
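The proposed reformatting could look roughly like the following Python sketch. It is only an illustration under several assumptions that are not fixed on this page: the `<corpus>` root element, the attribute names `length` and `file`, the placement of `file` on `<sent>`, whitespace-based token counting, and the sample basename `ep-00-01-17` are all hypothetical. Only the `<sent>` element with one embedded element per language comes from the todo item above.

```python
import xml.etree.ElementTree as ET


def build_corpus(basename, sentences_by_lang):
    """Build an XML tree from position-aligned sentence lists.

    sentences_by_lang maps a language code (e.g. "de") to the list of
    sentences read from the file with the given basename. Alignment is
    assumed to be purely positional: sentence i in one language
    corresponds to sentence i in every other language.
    """
    counts = {len(sents) for sents in sentences_by_lang.values()}
    if len(counts) > 1:
        raise ValueError("sentence counts differ across languages; "
                         "positional alignment is not possible")

    root = ET.Element("corpus")  # hypothetical root element name
    langs = sorted(sentences_by_lang)
    for aligned in zip(*(sentences_by_lang[lang] for lang in langs)):
        # 'file' is an assumed attribute name for the original filename.
        sent = ET.SubElement(root, "sent", file=basename)
        for lang, text in zip(langs, aligned):
            # 'length' is an assumed attribute name; tokens are counted
            # by splitting on whitespace.
            elem = ET.SubElement(sent, lang, length=str(len(text.split())))
            elem.text = text
    return root


# Made-up two-sentence sample for two of the eleven languages.
sample = {
    "en": ["Resumption of the session", "Thank you ."],
    "de": ["Wiederaufnahme der Sitzungsperiode", "Vielen Dank ."],
}
root = build_corpus("ep-00-01-17", sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Raising an error on mismatched sentence counts is a deliberate choice here: since the alignment is only implicit in sentence position, a silent truncation would misalign everything after the first discrepancy.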