Text Technology/Digital Linguistics Colloquium, Autumn Semester 2022

Colloquium, Autumn Semester 2022: reports on current research at the institute, Bachelor's and Master's theses, programming projects, guest lectures

Time & place: every two weeks on Tuesdays from 10:15 to 12:00, BIN-2.A.10 (map)

Online participation via the MS Teams team CL Colloquium is also possible.

Organizer: Dr. Mathias Müller

Colloquium Schedule

Date | Speaker | Topic
Tuesday, 20.09.2022 | Marek Kostrzewa | A graph neural network approach for simple-complex sentence alignment
Tuesday, 20.09.2022 | Anastassia Shaitarova | The impact of machine-generated language on natural language: A corpus linguistic exploration of commercial MT systems
Tuesday, 04.10.2022 | Prof. Dr. Mascha Kurpicz-Briki (Bern University of Applied Sciences) | Natural Language Processing for Clinical Burnout Detection
Tuesday, 18.10.2022 | David Reich | Language models as eye openers
Tuesday, 18.10.2022 | Patrick Haller | Measurement Reliability of Individual Differences in Sentence Processing
Thursday, 27.10.2022 (17:15) | Dr. Sepideh Alassi (University of Basel) | Invited talk in the IFI Kolloquium, hosted by the Department of Computational Linguistics
Tuesday, 01.11.2022 (16:15, Zoom link) | Dr. Jesse Dodge (Allen Institute for AI) | Improving Transparency in the Science of NLP
Tuesday, 15.11.2022 | Dr. Mathias Müller | Recent developments in sign language machine translation
Tuesday, 15.11.2022 | Tannon Kew | A hidden consequence of seq2seq pre-training objectives
Tuesday, 29.11.2022 | Jason Armitage | A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues
Tuesday, 29.11.2022 | Jan Brasser | Enhancing Zero-Anaphor Resolution Models with Eye-Movement Data
Tuesday, 13.12.2022 | Dr. Pedro Ortiz Suarez (University of Mannheim) | The OSCAR Project: On Language Classification and Document Filtering for Multilingual Heterogeneous Web-Based Corpora

Abstracts

 

Marek Kostrzewa: A graph neural network approach for simple-complex sentence alignment

Language has an intrinsically compositional and hierarchical structure, and existing embedding approaches cannot fully exploit this property while learning embedding vectors using deep neural networks. Graphs are universal data structures that represent complex systems and their interrelations, allowing the modelling of both the properties of individual objects and the relationships between them. Graph-structured representation of linguistic documents is a promising way to overcome some limitations of the vector space.
In order to fully capture and exploit a text's compositional and hierarchical structure, we propose to convert documents into heterogeneous multiplex graphs to introduce sequential, syntactic and semantic relations explicitly. In this presentation, I will show the results of applying graph-structured representations to the monolingual alignment task and their effectiveness in simple-complex sentence alignment.

 

Anastassia Shaitarova: The impact of machine-generated language on natural language: A corpus linguistic exploration of commercial MT systems

Machine Translation (MT) has become an integral part of daily life for millions of people, and the scope of human exposure to MT output is growing. Since MT output can be fluent, users often remain unaware that they are exposed to machine-produced text, and there is concern about the effects of increased MT exposure on human language. To address this problem, it is necessary to study the particularities of MT-produced texts.
Commercial MT technology advances continuously, necessitating regular updates in the evaluation of MT engines. We work with three publicly available systems (DeepL, Microsoft Azure, PONS) on several corpora from different domains and investigate their output using a number of established metrics. Additionally, we look at one specific lexical feature, namely the distribution of anglicisms in the German translations. We find that MT output still falls short of human translation in terms of lexical and syntactic diversity, though sometimes only marginally. PONS showed the highest lexical variability among the investigated commercial systems. DeepL uses the fewest anglicisms, fewer even than human translation.
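As a rough illustration of what a lexical diversity measure can look like, the sketch below computes a type-token ratio (TTR). This is an assumption for illustration only: the abstract does not name the metrics used in the study, the function name `type_token_ratio` is hypothetical, and the two sentences are invented toy data, not corpus material.

```python
def type_token_ratio(text: str) -> float:
    """Return the number of distinct word forms divided by the number of tokens.

    Tokenisation is a naive lowercase whitespace split (an assumption made
    to keep the sketch short; real studies use proper tokenisers).
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented toy sentences, used only to show how the measure behaves.
varied = "the cat sat on the mat while a dog slept nearby"
repetitive = "the cat sat on the mat and the dog sat on the mat"

print(type_token_ratio(varied))
print(type_token_ratio(repetitive))
```

A higher ratio indicates more varied vocabulary. Plain TTR is sensitive to text length, so comparisons are only meaningful between samples of similar size.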

 

Prof. Dr. Mascha Kurpicz-Briki: Natural Language Processing for Clinical Burnout Detection

To identify burnout in clinical intervention, so-called inventories are used. Inventories are psychological tests in which the person concerned fills out a questionnaire. This metric, currently used both in practice and in most studies, has some limitations. Due to the overhead of manual evaluation, the state of the art does not use free-text questions or interview transcripts, even though there have been promising approaches in the literature. In our research, we investigate how methods from natural language processing (NLP) can be applied to enable new directions in clinical psychology/psychiatry.

 

David Reich: Language models as eye openers

Human visual perception is an active process. How we move our eyes is highly informative about the (often unconscious) processes that unfold in our minds. Moreover, eye movements reflect a complex interplay of perception, attention, and oculomotor control, which is why they are frequently studied in cognitive psychology. Eye tracking devices with high sampling frequency and precision are expensive, and the applicability of eye movement-based applications depends heavily on the cost of the recording devices. Further, data acquisition is very labor- and expertise-intensive. In this talk, we will explore how language models can help us tackle one part of this problem.

 

Patrick Haller: Measurement Reliability of Individual Differences in Sentence Processing

In recent years, several researchers have pointed out the neglect of individual differences (IDs) in theories explaining human cognition and sentence processing in particular. The first step for a principled investigation of IDs in sentence processing is to establish test-retest reliability (TRR) of theoretically relevant psycholinguistic effects (e.g., the stability of the participant-specific surprisal effects across several experimental sessions). TRR cannot be taken as a given due to the so-called reliability paradox. However, it is likely that precisely those effects with high group-level replicability constitute the set of well-established psycholinguistic phenomena. We will present a multi-session, multi-method (eye-tracking and self-paced reading) study investigating the TRR of well-established effects in sentence processing, including lower-level lexical effects (lexical frequency, word length) and higher-level effects involving syntactic processing (surprisal, dependency length, number of left dependents). Our results suggest that participants’ sensitivity to word-length effects is stable across experimental sessions, in particular if assessed via eye-tracking measures. Individual differences in higher-level effects in sentence processing are generally less stable and vary with respect to measure and method.
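One common way to quantify test-retest reliability is to correlate per-participant effect estimates from two sessions. The sketch below shows this with a Pearson correlation. It is an illustration under stated assumptions, not the study's method: the abstract does not say how TRR was estimated, the function name `pearson` is hypothetical, and the session values are invented.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented per-participant effect estimates (e.g. a word-length slope)
# from two hypothetical sessions; one value per participant.
session_1 = [1.0, 2.0, 3.0, 4.0]
session_2 = [1.1, 1.9, 3.2, 3.8]

print(pearson(session_1, session_2))
```

A value close to 1 means that participants keep their relative ordering across sessions, which is what a stable individual difference would look like under this simple measure.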

 

Dr. Jesse Dodge: Improving Transparency in the Science of NLP

Natural language processing and machine learning have grown tremendously in recent years, and researchers hold myriad opinions on what to report in their papers. In this talk I will present a high-level overview of the NLP Reproducibility Checklist and the Responsible NLP Checklist, which provide general recommendations for what information to report in NLP papers. Then, I will dive into some efforts on improving transparency for the contents of web-scale datasets, including C4, a massive unlabeled text corpus built from web-crawled data. Given time, I will also discuss recent work on transparency around CO2 emissions of AI systems.

 

Dr. Mathias Müller: Recent developments in sign language machine translation

In this talk I will summarize recent efforts in sign language machine translation. We recently worked on a new shared task on sign language translation, better evaluation methodology for gloss translation and SignWriting translation systems.

I will conclude with a more general view, discussing the extent to which sign languages are in fact included in NLP research and what remains to be done.

 

Tannon Kew: A hidden consequence of seq2seq pre-training objectives

Comparisons of self-supervised denoising objectives for pre-training encoder-decoder language models have found only negligible differences in performance on downstream tasks. While this may be true in terms of performance with standard evaluation metrics, the design of these pre-training objectives has a significant influence on the flexibility of a model after fine-tuning.
In this work, we compare different denoising objectives employed by popular seq2seq pre-trained LMs in a controlled experimental setting. Our findings show that, with all other factors being equal, pre-training the model to reconstruct the input in full is a crucial prerequisite for achieving zero-shot control with context augmentation.
This suggests that pre-training objectives that optimise for efficiency can preclude desirable properties in a pre-trained LM and limit its applicability in some downstream settings.

 

Jason Armitage: A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues

Navigation in the world depends on attending to relevant cues at the right time. A road user in an urban environment is presented with billboards, moving traffic, and other people - but at an intersection will pinpoint a single light to check if it contains the colour red. At a neurophysiological level, this process is mediated by a priority map - a neural mechanism that guides attention by matching low-level signals on salient objects with high-level signals on task goals. Artificial agents in outdoor Vision-and-Language Navigation (VLN) are also challenged with detecting relevant cues amid a stream of linguistic and visual cues. Our cross-modal priority map module (PM-VLN) takes inspiration from prioritisation in humans to guide transformer-based systems to relevant information for action selections in VLN. Individual components of the PM-VLN are pretrained on auxiliary tasks with low-sample datasets to tackle the core challenge of aligning and localising information over linguistic instructions and visual inputs on the surrounding environment. The module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers in the Touchdown benchmark for VLN.
This presentation will detail our contributions that aim to tackle the core challenge of aligning and localising relevant information in the linguistic and visual temporal sequences presented to agents. The performance of our systems will be demonstrated by a comparison of results to existing benchmarks. Assessments of the contribution of specific operations in our frameworks and the role of training data will also be presented.

 

Jan Brasser: Enhancing Zero-Anaphor Resolution Models with Eye-Movement Data

Anaphor resolution remains a difficult problem in various NLP tasks. The problem is even more complicated in languages that allow for the omission of certain phrases instead of using, for example, pronouns to refer to antecedents. This phenomenon is commonly referred to as a zero-anaphor. Examples of languages using zero-anaphors, sometimes called "pro-drop" languages, are Italian and Spanish, which allow the omission of subjects, and Japanese and Mandarin, in which a wide variety of phrases can be omitted if they are clear from the context. In my talk, I will present the state-of-the-art methods for zero-pronoun detection and zero-anaphor resolution and provide an outlook on future developments in the field. In particular, I will outline a possible approach to integrating eye-movement data into models designed for this task.

 

Dr. Pedro Ortiz Suarez: The OSCAR Project: On Language Classification and Document Filtering for Multilingual Heterogeneous Web-Based Corpora

As demand for large corpora increases with the size of current state-of-the-art language models, using web data as the main part of the pre-training corpus for these models has become a ubiquitous practice. However, as many studies have reported, web data is highly heterogeneous and noisy, so common NLP methods for tasks such as language classification and topic classification quickly break down with web data.

In this talk we will present the current efforts of the OSCAR project to overcome the difficulties posed by the heterogeneity, noisiness and size of these web resources, in order to produce higher-quality textual data for as many languages as possible. We will also discuss recent developments in the project to annotate and classify said data at scale, as well as our first efforts to become a fully open-source project and to manage our thriving community.