

Department of Computational Linguistics

Domain-specific Statistical Machine Translation

The Department of Computational Linguistics investigates the use of small domain-specific corpora for Statistical Machine Translation (SMT). This research is motivated by our experience with industry partners who wish to build translation systems for specific application areas but have only a small amount of domain-specific training data. We have a small parallel corpus of Alpine texts (5 million tokens) at our disposal: the publications of the Swiss Alpine Club (SAC) were digitized in the project Text+Berg digital, and parts of the resulting corpus are parallel (DE-FR). We investigated the combination of the Text+Berg corpus with other resources, for instance additional monolingual, parallel or comparable corpora, or other machine translation systems.

Focus of the research project

  • Use of domain-specific parallel corpora for SMT: corpus creation, sentence alignment and cost-benefit analysis.
  • Extraction of domain-specific translations from comparable corpora.
  • Combination of domain-specific and out-of-domain parallel corpora.
  • Combination of domain-specific and general-purpose machine translation systems.
  • Use and improvement of NLP resources (name classifiers, PoS taggers, parsers) for English, French and German in order to improve SMT.
  • Building tools for multilingual terminology visualisation.
  • Building a parallel treebank DE-FR for evaluation purposes.

Project head:


The project was funded by the Swiss National Science Foundation and ran from 2010 to 2013.

Project results


ZORA Publication List
