SMULTRON German lemmatisation guidelines

These guidelines were used for selecting German lemmas in SMULTRON, the Stockholm MULtilingual TReebank.


Each token gets exactly one lemma, as suggested by the morphology system Gertwol.


The following tokens do not get a lemma:

  • XML tags
  • Punctuation symbols
  • Numbers that consist only of digits (and related symbols)
      Examples: 37  4,9  30'261   +4.3
  • Roman numerals
      Examples:  CX  IV
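
The exclusion list above can be sketched as a simple filter. This is only an illustration: the regular expressions and the function name needs_lemma are assumptions, since the guidelines list the categories and a few examples but no exact patterns.

```python
import re
import unicodedata

# Assumed patterns; the guidelines give only categories and examples.
XML_TAG = re.compile(r"<[^<>]+>")
# Digits with an optional sign and separators such as , . ' (4,9 or 30'261).
NUMBER = re.compile(r"[+-]?\d+(?:[.,']\d+)*")
# Standard Roman numerals; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def is_punctuation(token):
    """True if every character is Unicode punctuation.

    '%' is excluded because the guidelines lemmatise it as 'Prozent'.
    """
    if not token or token == "%":
        return False
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def needs_lemma(token):
    """Return False for tokens that the guidelines leave without a lemma."""
    if XML_TAG.fullmatch(token) or NUMBER.fullmatch(token):
        return False
    if ROMAN.fullmatch(token) or is_punctuation(token):
        return False
    return True
```

A purely form-based test like this would also reject ordinary upper-case words that happen to look like Roman numerals (for example "MIX"), so in practice the part-of-speech tag would have to be consulted as well.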

Case 1: Multiple Gertwol Lemmas

If Gertwol provides more than one lemma for the given token (and the given part of speech), the lemma most appropriate in the given context is chosen. For nouns, adjectives and verbs this selection is done automatically, as specified in the following paper:

Martin Volk: Choosing the right lemma when analysing German nouns. In: Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der GLDV. Frankfurt. 1999. 304-310.

This refers in particular to the disambiguation between different possible segmentations.


	Zweifelsfall	--> Zweifel\s#fall  (correct)
			--> Zwei#fels#fall  (hilarious but unlikely ;-)
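
As a rough illustration of automatic disambiguation between segmentations, the sketch below prefers the candidate with the fewest compound boundaries ('#'). This heuristic is an assumption made for the example and is not necessarily the procedure described in the cited paper; the function name pick_segmentation is likewise hypothetical.

```python
def pick_segmentation(candidates):
    """Pick the candidate lemma with the fewest '#' compound boundaries.

    Assumed heuristic for illustration only. Ties are resolved in favour
    of the candidate that comes first in the list.
    """
    return min(candidates, key=lambda lemma: lemma.count("#"))


candidates = ["Zwei#fels#fall", "Zweifel\\s#fall"]
best = pick_segmentation(candidates)  # the one-boundary reading
```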

The disambiguation between multiple lemmas for other word classes is done manually.

Known problems:

Perfect participle forms that can belong to two different verbs need to be checked manually.


	gehört  --> hören vs. gehören
	geraten --> raten vs. geraten

Case 2: No Gertwol Lemma

If Gertwol does not provide a lemma, then the human annotator chooses the correct lemma.

If the word is a proper name (PoS=NE), then the lemma is identical to the word form, unless the name is in the genitive. In that case the genitive suffix -s is removed.


	IBMs      -->  IBM
	Schröders -->  Schröder

If the word is a foreign word (PoS=FM), then the lemma is identical to the word form, unless it is an English word in the plural. In that case the plural suffix -s is removed.


	Directors  -->  Director

If a foreign word is identical to a German (loan) word, Gertwol might provide a lemma for it (including segmentations). In this case the Gertwol lemma is not used.


	Process Automation  -->  Process Automation

The word form is kept as the lemma here, even though the Gertwol lemma 'Automat~ion' exists.
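
The rules for names and foreign words can be summarised in a small sketch. The function name fallback_lemma and its two flags are hypothetical; whether a name is in the genitive or a foreign word is an English plural is assumed to be decided by the annotator, because the final -s alone does not show it.

```python
def fallback_lemma(form, pos, is_genitive=False, is_english_plural=False):
    """Lemma for names (NE) and foreign words (FM) without a usable Gertwol lemma.

    The flags are supplied by the annotator. Only the plain suffix -s is
    stripped, as in the guidelines; other English plural patterns are not
    handled here.
    """
    if pos == "NE" and is_genitive and form.endswith("s"):
        return form[:-1]
    if pos == "FM" and is_english_plural and form.endswith("s"):
        return form[:-1]
    # In all other cases the lemma is the word form itself.
    return form
```

For a foreign word such as "Automation" the function simply returns the word form, which matches the rule that an existing Gertwol lemma is ignored for tokens tagged FM.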

If the token is an abbreviation, then the full word is taken as the basis for lemmatisation.


	Mio.  -->  Million
	%     -->  Prozent

Acronyms are not spelled out.

	SEB   -->  SEB
	USA   -->  USA
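
One way to implement the distinction between abbreviations and acronyms is a lookup table that lists only the abbreviations to be expanded. The table below contains just the two examples from these guidelines; the names ABBREVIATIONS and lemmatisation_basis are assumptions.

```python
# Abbreviations and their full forms; only the examples given above.
ABBREVIATIONS = {
    "Mio.": "Million",
    "%": "Prozent",
}


def lemmatisation_basis(token):
    """Return the word that lemmatisation should start from.

    Listed abbreviations are replaced by their full word. Everything
    else, including acronyms such as SEB or USA, is returned unchanged.
    """
    return ABBREVIATIONS.get(token, token)
```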

Deviations from the Gertwol Suggestions

If the token is an elliptical compound, then the full compound is taken as the basis for lemmatisation.


	Nord- und Südamerika		-->  Nord#amerika und Süd#amerika
	Energie- und Automationstechnik	-->  Energ~ie#techn~ik und Automat~ion\s#techn~ik
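
For the pattern shown in these two examples, the full lemma of the truncated part can be built from the lemma of the complete conjunct. The sketch assumes that the shared head is the last '#'-separated segment of the complete compound; the function name complete_elliptical is hypothetical, and other ellipsis patterns are not covered.

```python
def complete_elliptical(truncated_lemma, full_compound_lemma):
    """Attach the head of the full compound to the truncated first part.

    Assumes the shared head is the segment after the last '#' boundary
    of the complete compound's lemma.
    """
    head = full_compound_lemma.rsplit("#", 1)[-1]
    return f"{truncated_lemma}#{head}"


north = complete_elliptical("Nord", "Süd#amerika")
energy = complete_elliptical("Energ~ie", "Automat~ion\\s#techn~ik")
```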

If the token is a determiner or an attributive pronoun, then we choose as the lemma the form of the determiner in the correct gender (the nominative singular form in the examples below).


	der Mann        -->  der Mann
	der Frau        -->  die Frau
	meiner Zeitung  -->  meine Zeitung 
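
A table-driven sketch of this rule follows. It assumes that the lemma is the nominative singular form matching the gender of the noun, which is what the examples suggest but which the guidelines do not state explicitly. The table covers only the two paradigms from the examples, and the names NOMINATIVE and determiner_lemma are hypothetical. The gender labels are those of the Gender Information section.

```python
# Nominative singular forms per gender label (assumed reading of the rule).
NOMINATIVE = {
    "der": {"MASK": "der", "FEM": "die", "NEUTR": "das"},
    "mein": {"MASK": "mein", "FEM": "meine", "NEUTR": "mein"},
}


def determiner_lemma(paradigm, noun_gender):
    """Return the determiner lemma for a paradigm and the noun's gender."""
    return NOMINATIVE[paradigm][noun_gender]
```

With this table, "der" before "Frau" (FEM) is lemmatised as "die", and "meiner" before "Zeitung" (FEM) as "meine".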

Type Information

In order to subclassify foreign words and names we assign the following labels.

Each foreign word (PoS=FM) gets a label specifying its language. The label is the two-letter ISO 639-1 language code, written in upper case.


	Board  -->  Board   EN
	Crédit -->  Crédit  FR

If a foreign word is considered a loan word and is no longer PoS-tagged as FM (but rather as noun or verb or something else), then it does not get a language label.

If a foreign word is part of a compound (including hyphenated compounds) with a German word, then it will get the PoS tag of the German word, and it will get a combined language label reflecting the origin of the parts.


	Compliance-Aktivität		-->  PoS=NN and Language-label=EN-DE
	Corporate-Kosten		-->  PoS=NN and Language-label=EN-DE
	Upstream-Geschäft		-->  PoS=NN and Language-label=EN-DE
	Building-Systems-Geschäft	-->  PoS=NN and Language-label=EN-EN-DE
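
The labelling of mixed compounds can be sketched as follows. The input format (a list of parts with language code and PoS tag) and the function name label_compound are assumptions made for the example; at least one German part is assumed to be present.

```python
def label_compound(parts):
    """Return (PoS tag, language label) for a mixed compound.

    parts is a list of (surface part, language code, PoS tag) tuples in
    compound order. The PoS tag is taken from the last German part, and
    the language label joins the codes of all parts with hyphens.
    """
    label = "-".join(lang for _, lang, _ in parts)
    german_tags = [pos for _, lang, pos in parts if lang == "DE"]
    return german_tags[-1], label


pos, label = label_compound([
    ("Building", "EN", "FM"),
    ("Systems", "EN", "FM"),
    ("Geschäft", "DE", "NN"),
])
```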

Each name (PoS=NE) gets a label specifying its name type. We use the following labels.

  • GEO - (Part of a) Geographical Name (town, country, continent, river, ...)
  • ORG - (Part of an) Organisation Name (company, institution, ...)
  • PERS - (Part of a) Person Name
  • WEB - Internet Address (web address or email address)
  • MISC - All other Names


	Mittelamerika	-->  Mittel#amerika   GEO
	ABB		-->  ABB              ORG
	Karlsson	-->  Karlsson         PERS

Gender Information

Each noun (PoS=NN) and each name (PoS=NE) gets a label specifying its grammatical gender. We use the following labels.

  • FEM - feminine
  • MASK - masculine
  • NEUTR - neuter
  • NONE - none (e.g. family names like 'Amundsen' have no gender)


	Tür      -->  Tür      FEM
	Mädchen  -->  Mädchen  NEUTR
	Sofie    -->  Sofie    FEM
	Müller   -->  Müller   NONE  - if it is used as a family name

Foreign words (PoS=FM) do not get a gender label.

Nouns that occur only in the plural (pluralia tantum) do not get a gender label.


	Leute   -->	Leute ___
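
The label sets from the Type Information and Gender Information sections can be checked mechanically. The record layout and the method name problems below are assumptions for illustration, not the SMULTRON file format. A missing gender label (as for pluralia tantum) is represented by None.

```python
from dataclasses import dataclass
from typing import Optional

NAME_TYPES = {"GEO", "ORG", "PERS", "WEB", "MISC"}
GENDERS = {"FEM", "MASK", "NEUTR", "NONE"}


@dataclass
class Annotation:
    form: str
    pos: str
    lemma: str
    name_type: Optional[str] = None
    gender: Optional[str] = None  # None, e.g. for pluralia tantum

    def problems(self):
        """Return a list of violations of the labelling rules."""
        issues = []
        if self.pos == "NE" and self.name_type not in NAME_TYPES:
            issues.append("name needs a type label")
        if self.pos != "NE" and self.name_type is not None:
            issues.append("only names get a type label")
        if self.gender is not None and self.gender not in GENDERS:
            issues.append("unknown gender label")
        if self.pos not in ("NN", "NE") and self.gender is not None:
            issues.append("only nouns and names get a gender label")
        return issues
```

A record such as Annotation("Leute", "NN", "Leute") passes the check without a gender label, whereas a foreign word (PoS=FM) that carries a gender label is reported.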

Known problems:

Nouns that are derived from verbs and have changed gender often get the wrong gender assignment and must be corrected manually. For example:

	der Packen  vs. das Packen
	der Rasen   vs. das Rasen
	den Rätseln vs. das Rätseln

Nouns that have different genders in different readings need to be manually checked.

	der Teil   vs. das Teil
	der Moment vs. das Moment
	der Flur   vs. die Flur

Misspelled Words

If a word is misspelled, the word form itself is left uncorrected, in keeping with the principle of faithfulness to the original text. However, the (imagined) corrected word is taken as the basis for lemmatisation.


	abgeschaft     -->  ab|schaff~en
	Personlkosten  -->  Person~al#kosten