Multilingual corpora
OPUS - an open source parallel corpus
Sunday, 03 November 2013 11:50

OPUS is a growing collection of parallel corpora covering many languages and various domains. The collection has become quite large and includes a variety of data sets and tools that are useful beyond statistical machine translation. OPUS has been extended considerably since its first appearance in 2003. The best birthday present would be for someone to start a mirror of OPUS. Let me know if you are interested.

Here are some of the highlights:

- over 150 languages and language variants
- over 5 billion aligned translation units
- downloads in XML/XCES, plain text (Moses/SMT) and TMX
- raw, tokenized and machine-annotated data
- monolingual data sets (for language modeling)
- search interfaces
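
As an illustration of the plain text (Moses/SMT) download format mentioned above, here is a minimal sketch in Python. It assumes the usual Moses convention of two line-aligned files, one sentence per line, where line i of the source file is the translation of line i of the target file. The function name `read_moses_pairs` and the sample sentences are hypothetical and not part of OPUS; in-memory streams stand in for the downloaded files.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO, Tuple


def read_moses_pairs(src: TextIO, tgt: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams."""
    for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
        # zip_longest pads the shorter stream with None, which signals
        # that the two files do not have the same number of lines.
        if s is None or t is None:
            raise ValueError(f"files differ in length at line {line_no}")
        yield s.rstrip("\r\n"), t.rstrip("\r\n")


# Hypothetical in-memory stand-ins for a German file and an English file.
demo_de = io.StringIO("Guten Morgen.\nWie geht es dir?\n")
demo_en = io.StringIO("Good morning.\nHow are you?\n")

pairs = list(read_moses_pairs(demo_de, demo_en))
for de, en in pairs:
    print(f"{de}\t{en}")
```

With real downloads, the two `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`. The length check is a design choice: a silent truncation at the shorter file would hide alignment problems.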

Some recent news and data sets:

- EUbookshop: a large but noisy corpus (converted from PDF)
- Tatoeba: a small but clean corpus with many languages
- OpenSubtitles2012: an improved version of the 2011 release
- coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
- UN, MultiUN, Europarl v7: aligned for all language combinations
- word alignments and phrase tables for the majority of bitexts

The Web Site: http://opus.lingfil.uu.se
More information: http://opus.lingfil.uu.se/trac/wiki


[from Corpora-list]

PatTR: Patent Translation Resource
Monday, 15 April 2013 21:43

A parallel corpus of patent text for the German-English language pair.

The corpus has been constructed from EPO, WIPO and USPTO patent documents extracted from the MAREC collection and contains 23 million sentence pairs from all patent text sections.

All sentences are labeled with metadata: patent document id, patent family, patent classification and publication date.

The corpus is distributed under a Creative Commons License. For more information and download, please see

TRACTOR
Tuesday, 12 May 2009 17:31

TRACTOR (TELRI Research Archive of Computational Tools and Resources) is a project maintained by the Centre for Corpus Linguistics at the University of Birmingham. It is an archive of materials and software for corpus analysis. The languages covered include the major European languages as well as Bulgarian, Czech, the Baltic languages, Romanian, Russian and others.


TRIPTIC (TRIlingual Parallel Text Information Corpus)
Tuesday, 12 May 2009 17:30

TRIPTIC (TRIlingual Parallel Text Information Corpus) is a corpus of English, French and Dutch comprising about 2 million words of aligned parallel texts.


REAL Parallel Corpus
Tuesday, 12 May 2009 17:29

REAL Parallel Corpus (German-English Translation Corpus) collects parallel texts in English (American and British) and German.

