
Upstream text processing


Many practical applications in Natural Language Processing (NLP), such as machine translation and speech recognition, benefit from text preprocessing steps that reduce data sparsity. For example, morphological processing reduces sparsity by segmenting words into morphemes (morphological segmentation) or by mapping inflected word forms to their lemmas (lemmatization). Another example is normalization of writing: mapping surface word forms to their canonical forms by reducing dialectal variation or correcting spelling errors. In many cases, such upstream tasks can be formulated as sequence transformation tasks and solved with the same neural sequence-to-sequence technology that is used in neural machine translation (NMT) and speech processing. In this project, we develop systems for a range of upstream tasks by enriching character-level sequence-to-sequence models with structural signals derived from multiple levels of text organization: characters, morphemes, words and sentences.
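To make the sequence transformation view concrete, the following minimal Python sketch shows how segmentation, lemmatization and normalization can all be cast as pairs of character sequences, which is the input format a character-level sequence-to-sequence model would typically consume. The example words, the `|` boundary symbol and the helper names `to_chars` and `make_pair` are illustrative assumptions, not taken from the project's data or code.

```python
from typing import List, Tuple


def to_chars(text: str) -> List[str]:
    """Split a string into a list of single characters."""
    return list(text)


def make_pair(source: str, target: str) -> Tuple[List[str], List[str]]:
    """Build one (source, target) training pair at the character level."""
    return to_chars(source), to_chars(target)


# Hypothetical examples: (task, input form, desired output form).
# The "|" morpheme boundary marker is an assumption made for illustration.
examples = [
    ("segmentation", "unhappiness", "un|happi|ness"),
    ("lemmatization", "running", "run"),
    ("normalization", "recieve", "receive"),
]

pairs = [make_pair(src, tgt) for _task, src, tgt in examples]

for (task, _src, _tgt), (src_seq, tgt_seq) in zip(examples, pairs):
    # Print each pair as space-separated characters.
    print(f"{task}: {' '.join(src_seq)} -> {' '.join(tgt_seq)}")
```

All three tasks end up in the same shape, a source character sequence and a target character sequence, which is why a single encoder-decoder architecture can in principle be applied to each of them.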

Project members: Tatiana Ruzsics (PhD student) and Tanja Samardžić (PI).

Funding: URPP "Language and Space" (UZH internal)

NMT System with target context encoding via Higher-Level Language Model: Synchronized decoding

NMT System with source context encoding via Hierarchical biLSTM and PoS tags

NMT System with Hard Attention and Copy Mechanism

Further information



Ruzsics, T., O. Sozinova, X. Gutierrez-Vasques and T. Samardžić (2021). "Interpretability for Morphological Inflection: from Character-level Predictions to Subword-level Rules". In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online.


Ruzsics, T. and T. Samardžić (Draft). "Multilevel text normalization with sequence-to-sequence networks and multisource learning". arXiv.


Ruzsics, T., M. Lusetti, A. Göhring, T. Samardžić and E. Stark (2019). "Neural text normalization with adapted decoding and PoS features". Natural Language Engineering, pp. 585-605. Cambridge University Press. Pre-print.


Lusetti, M., T. Ruzsics, A. Göhring, T. Samardžić and E. Stark (2018). "Encoder-Decoder Methods for Text Normalization". In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (COLING 2018). Santa Fe, New Mexico, USA.


Makarov, P., T. Ruzsics and S. Clematide (2017). "Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection". In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection. Vancouver, Canada. Overall winner of Task 1.


Ruzsics, T. and T. Samardžić (2017). "Neural Sequence-to-sequence Learning of Internal Word Structure". In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Vancouver, Canada.