These guidelines were used for selecting German lemmas in SMULTRON, the Stockholm MULtilingual TReebank.
Each token gets exactly one lemma.
Each token gets one lemma as suggested by the morphology system Gertwol.
The following tokens do not get a lemma:
- XML tags
- Punctuation symbols
- Numbers that consist only of digits (and related symbols)
Examples: 37 4,9 30'261 +4.3
- Roman numerals
Examples: CX IV
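The exclusion list above can be sketched as a simple form-based filter. This is a minimal sketch only: the function name needs_lemma, the regular expressions and the punctuation set are assumptions made for illustration and are not part of the guidelines.

```python
import re

# All patterns below are illustrative assumptions, not part of the guidelines.
XML_TAG = re.compile(r"</?[^<>]+>")
# Digits with an optional sign and the separators seen in the examples.
NUMBER = re.compile(r"[+-]?\d+(?:[.,']\d+)*")
# Upper-case Roman numerals up to 3999; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)
# "%" is deliberately absent: the guidelines lemmatise it as "Prozent".
PUNCTUATION_CHARS = set(".,;:!?()[]{}\"'-/")


def needs_lemma(token: str) -> bool:
    """Return False for tokens that the guidelines exclude from lemmatisation."""
    if any(p.fullmatch(token) for p in (XML_TAG, NUMBER, ROMAN)):
        return False
    if token and all(ch in PUNCTUATION_CHARS for ch in token):
        return False
    return True
```

A purely form-based check like this cannot tell the Roman numeral "I" from a one-letter word or initial, so in practice the part-of-speech tag would probably have to decide such cases.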
Case 1: Multiple Gertwol Lemmas
If Gertwol provides more than one lemma for the given token (and the given part of speech), then the lemma that is most appropriate in the given context is chosen. For nouns, adjectives and verbs this selection is done automatically as specified in the following paper:
Martin Volk: Choosing the right lemma when analysing German nouns. In: Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der GLDV. Frankfurt. 1999. 304-310.
This refers in particular to the disambiguation between different possible segmentations.
Zweifelsfall --> Zweifel\s#fall (correct) --> Zwei#fels#fall (hilarious but unlikely ;-)
The disambiguation between multiple lemmas for other word classes is done manually.
Perfect participle forms that can belong to two different verbs need to be checked manually.
gehört --> hören vs. gehören
geraten --> raten vs. geraten
Case 2: No Gertwol Lemma
If Gertwol does not provide a lemma, then the human annotator chooses the correct lemma.
If the word is a proper name (PoS=NE), then the lemma is identical with the word form, unless the name is in the genitive. In that case the genitive suffix -s is removed.
IBMs --> IBM
Schröders --> Schröder
If the word is a foreign word (PoS=FM), then the lemma is identical with the word form, unless it is an English word in the plural. In that case the suffix -s is removed.
Directors --> Director
If a foreign word is identical with a German (loan) word, there might be a Gertwol lemma for it (including segmentations). In this case the Gertwol lemma is not used.
Process Automation --> Process Automation
This holds even though the Gertwol lemma 'Automat~ion' exists.
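The rules for names and foreign words can be summarised in a small helper. This is a sketch under assumptions: the function name fallback_lemma and its two flags are hypothetical, and the flags are meant to be set by the human annotator, because genitive case and English plural cannot be read reliably from the word form alone (compare the name 'Hans').

```python
def fallback_lemma(
    word: str,
    pos: str,
    *,
    is_genitive: bool = False,
    is_english_plural: bool = False,
) -> str:
    """Lemma for a name (NE) or foreign word (FM); hypothetical helper.

    Any Gertwol lemma for a foreign word is ignored here, as the guidelines
    require, so no Gertwol analysis is passed in at all.
    """
    if pos == "NE" and is_genitive and word.endswith("s"):
        return word[:-1]  # strip the genitive suffix -s
    if pos == "FM" and is_english_plural and word.endswith("s"):
        return word[:-1]  # strip the English plural suffix -s
    return word  # otherwise the lemma is identical with the word form
```

For example, fallback_lemma("IBMs", "NE", is_genitive=True) yields "IBM", while fallback_lemma("Automation", "FM") leaves the word form unchanged.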
If the token is an abbreviation, then the full word is taken as the basis for lemmatisation.
Mio. --> Million
% --> Prozent
Acronyms are not spelled out.
SEB --> SEB
USA --> USA
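One way to picture the difference between abbreviations and acronyms is a lookup table that lists only the abbreviations. The table and the function name lemmatisation_basis are hypothetical; only the two entries shown come from the guidelines.

```python
# Hypothetical expansion table; only these two entries appear in the guidelines.
ABBREVIATION_EXPANSIONS = {
    "Mio.": "Million",
    "%": "Prozent",
}


def lemmatisation_basis(token: str) -> str:
    """Return the full word for a known abbreviation, else the token itself.

    Acronyms such as 'SEB' or 'USA' are simply not listed in the table,
    so they pass through unchanged and are not spelled out.
    """
    return ABBREVIATION_EXPANSIONS.get(token, token)
```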
Deviations from the Gertwol Suggestions
If the token is an elliptical compound, then the full compound is taken as the basis for lemmatisation.
Nord- und Südamerika --> Nord#amerika und Süd#amerika
Energie- und Automationstechnik --> Energ~ie#techn~ik und Automat~ion\s#techn~ik
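The completion of an elliptical compound can be sketched as follows. The sketch assumes, judging from the examples above, that '#' marks the compound boundary in the lemma notation and that the elided part is the last segment of the full compound. The function name complete_elliptical is hypothetical, and cases where the elided part is not the last '#' segment are not covered.

```python
def complete_elliptical(truncated_lemma: str, full_compound_lemma: str) -> str:
    """Attach the head of the full compound to the truncated conjunct.

    truncated_lemma:     lemma of the truncated part, e.g. 'Nord' for 'Nord-'
    full_compound_lemma: segmented lemma of the full compound, e.g. 'Süd#amerika'
    """
    # Everything after the last '#' is assumed to be the elided head.
    _modifier, boundary, head = full_compound_lemma.rpartition("#")
    if not boundary:
        raise ValueError(f"no compound boundary in {full_compound_lemma!r}")
    return f"{truncated_lemma}#{head}"
```

With the examples above, complete_elliptical("Nord", "Süd#amerika") gives "Nord#amerika", and the truncated lemma "Energ~ie" combined with "Automat~ion\s#techn~ik" gives "Energ~ie#techn~ik".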
If the token is a determiner or an attributive pronoun, then we choose as the lemma the form of the determiner that matches the gender of the noun it accompanies.
der Mann --> der Mann
der Frau --> die Frau
meiner Zeitung --> meine Zeitung
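The gender-dependent determiner lemma can be illustrated with a small table keyed by the gender labels FEM, MASK and NEUTR used elsewhere in these guidelines. The table, its paradigm keys and the function name determiner_lemma are assumptions; the forms 'das' and 'mein' are standard German nominative singular forms that do not appear in the examples above.

```python
# Hypothetical table: paradigm name -> gender label -> lemma.
DETERMINER_LEMMAS = {
    "der": {"MASK": "der", "FEM": "die", "NEUTR": "das"},
    "mein": {"MASK": "mein", "FEM": "meine", "NEUTR": "mein"},
}


def determiner_lemma(paradigm: str, noun_gender: str) -> str:
    """Pick the determiner lemma that agrees with the gender of the noun."""
    return DETERMINER_LEMMAS[paradigm][noun_gender]
```

For 'der Frau', the noun is feminine, so determiner_lemma("der", "FEM") returns "die".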
In order to subclassify foreign words and names we assign the following labels.
Each foreign word (PoS=FM) gets a label specifying its language. The label is the two-character ISO language code.
Board --> Board EN
Crédit --> Crédit FR
If a foreign word is considered a loan word and is no longer PoS-tagged as FM (but rather as noun or verb or something else), then it does not get a language label.
If a foreign word is part of a compound (including hyphenated compounds) with a German word, then it will get the PoS tag of the German word, and it will get a combined language label reflecting the origin of the parts.
Compliance-Aktivität --> PoS=NN and Language-label=EN-DE
Corporate-Kosten --> PoS=NN and Language-label=EN-DE
Upstream-Geschäft --> PoS=NN and Language-label=EN-DE
Building-Systems-Geschäft --> PoS=NN and Language-label=EN-EN-DE
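The labelling of mixed compounds can be sketched like this. The Part record and the function name label_mixed_compound are hypothetical, and the sketch assumes that the compound has already been split into parts with a known language and PoS tag for each part.

```python
from typing import NamedTuple


class Part(NamedTuple):
    form: str
    language: str  # two-character language code, e.g. "EN" or "DE"
    pos: str       # PoS tag the part would get on its own


def label_mixed_compound(parts: list[Part]) -> tuple[str, str]:
    """Return (PoS tag, combined language label) for a mixed compound."""
    german_parts = [p for p in parts if p.language == "DE"]
    if not german_parts:
        raise ValueError("expected at least one German part")
    # The compound takes the PoS tag of the German word (the last one here).
    pos = german_parts[-1].pos
    # The label lists the origin of every part in order.
    label = "-".join(p.language for p in parts)
    return pos, label
```

For 'Building-Systems-Geschäft', the parts Building (EN), Systems (EN) and Geschäft (DE, NN) give the pair ("NN", "EN-EN-DE").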
Each name (PoS=NE) gets a label specifying its name type. We use the following labels.
- GEO - (Part of a) Geographical Name (town, country, continent, river, ...)
- ORG - (Part of an) Organisation Name (company, institution, ...)
- PERS - (Part of a) Person Name
- WEB - Internet Address (web address or email address)
- MISC - All other Names
Mittelamerika --> Mittel#amerika GEO
ABB --> ABB ORG
Karlsson --> Karlsson PERS
Each noun (PoS=NN) and each name (PoS=NE) gets a label specifying its grammatical gender. We use the following labels.
- FEM - feminine
- MASK - masculine
- NEUTR - neuter
- NONE - none (e.g. family names like 'Amundsen' have no gender)
Tür --> Tür FEM
Mädchen --> Mädchen NEUTR
Sofie --> Sofie FEM
Müller --> Müller NONE (if it is used as a family name)
Foreign words (PoS=FM) do not get a gender label.
Nouns that occur only in the plural (pluralia tantum) do not get a gender label.
Leute --> Leute ___
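The conditions under which a token receives a gender label can be summarised in a short function. This is a sketch: the function name gender_label and the plural_only flag are hypothetical, and the gender itself is assumed to come from the morphological analysis or from the annotator.

```python
from typing import Optional


def gender_label(
    pos: str, gender: Optional[str] = None, plural_only: bool = False
) -> Optional[str]:
    """Return FEM, MASK, NEUTR or NONE, or None if no label is assigned."""
    if pos not in ("NN", "NE"):
        return None  # e.g. foreign words (FM) get no gender label
    if plural_only:
        return None  # nouns that occur only in the plural get no gender label
    if gender is None:
        if pos == "NE":
            return "NONE"  # e.g. family names have no gender
        raise ValueError("a noun (NN) needs a gender")
    if gender not in ("FEM", "MASK", "NEUTR"):
        raise ValueError(f"unknown gender {gender!r}")
    return gender
```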
Nouns that are derived from verbs and have changed gender often get the wrong gender assignment and must be corrected manually. For example:
der Packen vs. das Packen
der Rasen vs. das Rasen
den Rätseln vs. das Rätseln
Nouns that have different genders in different readings need to be manually checked.
der Teil vs. das Teil
der Moment vs. das Moment
der Flur vs. die Flur
If a word is misspelled, it is not corrected, in keeping with the principle of faithfulness to the original text. However, the (imagined) corrected word is taken as the basis for lemmatisation.
abgeschaft --> ab|schaff~en
Personlkosten --> Person~al#kosten
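The treatment of misspellings amounts to keeping the token and its lemma apart. The sketch below assumes a hypothetical Annotation record and a stand-in lemmatiser passed in as a function; the corrected spellings 'abgeschafft' and 'Personalkosten' are the imagined corrections and are supplied by the annotator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Annotation:
    token: str  # kept exactly as it appears in the original text
    lemma: str  # derived from the imagined corrected word


def annotate_misspelled(
    token: str, corrected: str, lemmatise: Callable[[str], str]
) -> Annotation:
    """Leave the token untouched and lemmatise the corrected form instead."""
    return Annotation(token=token, lemma=lemmatise(corrected))


# Toy lemmatiser standing in for the real morphological analysis.
TOY_LEMMAS = {
    "abgeschafft": "ab|schaff~en",
    "Personalkosten": "Person~al#kosten",
}
example = annotate_misspelled("abgeschaft", "abgeschafft", TOY_LEMMAS.__getitem__)
```

Here example.token is still the misspelled 'abgeschaft', while example.lemma is 'ab|schaff~en'.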