
Institut für Computerlinguistik

Text Technology/Digital Linguistics colloquium FS 2024

Time & Location: every 2-3 weeks on Tuesdays from 10:15 am to 12:00 pm in room BIN-2-A.10.
Please note that the room has changed from the previous semester.

Online participation is also possible via the MS Teams team "CL Colloquium".

Responsible: Marius Huber

Colloquium Schedule

20 Feb 2024

Andrianos Michail: Robustness of Multilingual Embedding Models in Historical News

Embedding models are fundamental components of semantic search engines and other Natural Language Processing (NLP) systems, as they provide powerful vectorized representations of text ("embeddings"). But how can we judge whether one embedding model is better than another, or identify avenues for improvement? While for English and even English-X language pairs the situation appears mostly clear thanks to large-scale benchmarks, we still know little about the robustness of embeddings to the extremely heterogeneous texts we encounter "in the wild": texts that may come from a different language or a different time, contain transcription errors, and/or code-mixing, to name just a few common phenomena. To test such an open setting, we plan to build a testbed for embedding models from the IMPRESSO corpus, which contains millions of digitized, multilingual, and temporally and spatially distributed news texts spanning more than two centuries. Are current embedding models up to the challenge?
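As a minimal illustration of the role embeddings play in semantic search (a generic sketch, not the IMPRESSO testbed; the vectors and document titles below are invented for illustration): documents and queries are mapped to vectors, and retrieval ranks documents by cosine similarity to the query.

```python
import numpy as np

# Hypothetical document embeddings (in practice produced by a multilingual
# embedding model); rows are documents, columns are embedding dimensions.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g. a German railway report, 1867
    [0.1, 0.8, 0.3],   # e.g. a French election article, 1919
    [0.2, 0.2, 0.9],   # e.g. an English telegraph report, 1851
])
query_embedding = np.array([0.85, 0.15, 0.05])

def cosine_similarity(query, docs):
    """Cosine similarity between a query vector and each row of docs."""
    return (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

scores = cosine_similarity(query_embedding, doc_embeddings)
best = int(np.argmax(scores))  # index of the most similar document
```

A robustness testbed then asks whether this ranking survives when the query and documents differ in language, period, or transcription quality.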

Zifan Jiang: Recent Developments in Sign Language Processing: Towards Realistic Sign Language Machine Translation

Applying NLP techniques to sign languages is challenging, primarily due to data scarcity and the absence of a well-established methodology. While it is still unclear whether an end-to-end or a pipeline approach will take the lead, we see more basic problems to solve first in sign language processing, including segmentation, alignment, and representation. On the one hand, we are working on releasing more, and better-quality, publicly available data. On the other hand, we draw inspiration from recent advances in LLMs and deep pretrained models to guide our research in tackling the basic problems mentioned above.

5 Mar 2024

Bryan Eikema: Why Are Modes of Natural Language Generation Models Inadequate?

The highest-probability sequences of most neural language generation models tend to be degenerate in some way, a problem known as the inadequacy of the mode. While many approaches exist for tackling particular aspects of the problem, such as overly short sequences or excessive repetition, explanations of why it occurs in the first place are rarer and do not agree with each other. In this talk we will discuss current attempts at explaining this phenomenon and why we believe they do not paint a full picture. We will also propose an alternative hypothesis that links the inadequacy of the mode to the desire for our models to generalise to previously unseen contexts.
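A toy illustration of how the mode can be degenerate (the numbers are invented, not from the talk): if one degenerate candidate such as the empty string holds the single largest probability, while the remaining mass is spread thinly over many plausible sequences, argmax decoding returns the degenerate candidate even though sampling almost never would.

```python
# Toy distribution over output sequences of a hypothetical generator.
# The empty string has the single largest probability (5%), but 999
# plausible sequences jointly hold the other 95% of the mass.
probs = {"": 0.05}
for i in range(999):
    probs[f"plausible sentence {i}"] = 0.95 / 999  # ~0.00095 each

mode = max(probs, key=probs.get)  # argmax ("mode-seeking") decoding
p_nonempty = sum(p for seq, p in probs.items() if seq)
# The mode is the empty string, yet a sample is non-empty 95% of the time.
```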

Mario Giulianelli: Measuring utterance uncertainty and predictability via simulation of contextually plausible alternatives

Viewing linguistic communication as information transmission between cognitive agents, successful language production can be understood as an act of reducing the uncertainty over future states that a comprehender may be anticipating. When an individual utters a sentence, they narrow down the comprehender's expectations, and they do so by an amount proportional to the contextual predictability of the utterance. I will discuss two recent studies that demonstrate how we can empirically estimate utterance uncertainty and predictability by simulating potential upcoming linguistic contributions using neural text generators. The first study introduces a statistical framework to quantify utterance uncertainty as production variability, and evaluates the alignment of language generators to the production variability observed in humans. We find that different types of production tasks exhibit distinct levels of lexical, syntactic, and semantic variability, and neural text generators generally achieve satisfactory calibration of uncertainty. In the second study, we use the previously introduced statistical framework to define a novel measure of utterance predictability, which we term information value. Information value quantifies predictability by measuring the distance from contextually plausible alternatives and offers advantages over traditional measures by disentangling various dimensions of uncertainty and being less influenced by surface form competition. Psycholinguistic experiments demonstrate that information value is a superior predictor of utterance acceptability in written and spoken dialogue compared to token-level surprisal aggregates, and that it complements surprisal in predicting eye-tracked reading times.
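A schematic sketch of the information-value idea described above (the sampler, encoder, and distance below are hypothetical stand-ins; the actual studies use neural text generators and learned representations): the predictability of an utterance is estimated via its expected distance from alternatives sampled in the same context.

```python
import random

def toy_embed(text):
    """Hypothetical stand-in for a sentence encoder: bag-of-characters counts."""
    vec = {}
    for ch in text.lower():
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def distance(a, b):
    """L1 distance between count vectors (a toy proxy for semantic distance)."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def information_value(utterance, sample_alternative, n=100, seed=0):
    """Mean distance of the utterance from n sampled contextual alternatives."""
    random.seed(seed)
    alternatives = [sample_alternative() for _ in range(n)]
    u = toy_embed(utterance)
    return sum(distance(u, toy_embed(a)) for a in alternatives) / n

# Hypothetical context: continuations a comprehender might anticipate
# (in the real framework, these are sampled from a text generator).
def sample_alternative():
    return random.choice(["yes", "sure", "of course", "absolutely"])

predictable = information_value("sure", sample_alternative)
surprising = information_value("the train leaves at noon", sample_alternative)
```

An utterance close to the anticipated alternatives has low information value (high predictability); one far from them has high information value.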

19 Mar 2024

Janis Goldzycher: TBA


Juri Opitz: TBA


16 Apr 2024

Sina Ahmadi: TBA


Masoumeh Chapariniya: TBA


30 Apr 2024

Chiara Tschirner: TBA


Pius von Däniken: TBA


14 May 2024

Alessia Battisti: TBA


Iuliia Thorbecke: TBA


28 May 2024

Lena Bolliger: TBA


Ann-Sophie Gnehm: TBA