Text Technology Colloquium, Autumn Semester 2021

Colloquium, Autumn Semester 2021: Reports on current research at the institute, Bachelor's and Master's theses, programming projects, guest talks

Time & place: every two weeks on Tuesdays, 10:15 to 12:00, BIN-2.A.01 (map)

(Online on 02.11. and 14.12.)

Responsible: Dr. Tilia Ellendorff

Colloquium Schedule

Date

Speaker & Topic

21.09.21 CANCELLED
05.10.21 CANCELLED
19.10.21

Tannon Kew: 

Getting More with Less: Improving Specificity in Hospitality Review Response Generation through Data-Driven Data Curation

Janis Goldzycher (short presentation):

Adjusting the Word Embedding Association Test for Austrian, German and Swiss Demographics

02.11.21

Eva Vanmassenhove [ONLINE SESSION]:

Gender Bias in Machine Translation

16.11.21

Anastassia Shaitarova: 

IMAGINE: The IMpact of Automatic Language Generation on Linguistic INtuition and Language Evolution.

Olga Sozinova: 

How do you tokenize? Humans vs. algorithms

30.11.21

Jason Armitage: 

Learning to Navigate Virtual Environments with Linguistic, Visual, and Location-based Data

Marek Kostrzewa: 

Monolingual simple-complex sentence alignment through a learning to rank (LTR) approach

14.12.21

Patrick Haller: 

Eye-tracking based classification of Chinese readers with and without Dyslexia

Noemi Aepli: 

Transfer learning between similar languages

Abstracts

 

19.10.2021 

Getting More with Less: Improving Specificity in Hospitality Review Response Generation through Data-Driven Data Curation
Neural network-based approaches to conditional text generation have been shown to deliver highly fluent and natural-looking texts in a wide variety of tasks. However, in open-ended tasks such as response generation or dialogue modelling, models tend to learn a strong, undesirable bias towards generating overly generic outputs. This can be attributed, at least in part, to characteristics of the underlying training data, suggesting that finding ways to improve the quality of the training data at scale is crucial. In this talk, I will present results from experiments aimed at improving thematic specificity in review response generation for the hospitality domain. These experiments focus primarily on data-driven approaches to quantify ‘genericness’ in the training corpus and subsequently filter undesirable and uninformative examples. Using both automatic metrics and human evaluation, we show that such targeted data filtering, despite reducing the amount of training data to 40% of its original size, improves specificity in the resulting generated responses considerably.

 

Adjusting the Word Embedding Association Test for Austrian, German and Swiss Demographics

Introducing the Word Embedding Association Test (WEAT), Caliskan et al. (2017) showed that English word embeddings often contain human-like biases, such as racial and gender bias. Lauscher and Glavaš (2019) extended these tests to other languages, including German. However, these bias tests are still tailored to American demographics. I will argue that for this reason the tests provide only an imprecise measure with multiple distorting factors for the biases in question. Subsequently, I will present an experiment setup that considers the specific demographic circumstances of Austria, Germany and Switzerland and embedding tests for racial bias, anti-immigrant bias, gender bias and antisemitic bias. The results reveal stronger and more biases than previously found for German Wikipedia-based embeddings and weaker and fewer biases for embeddings based on other corpora.
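As a rough illustration of the test named above, the WEAT effect size of Caliskan et al. (2017) compares how strongly two sets of target words associate with two sets of attribute words. The following is a minimal pure-Python sketch, not the setup used in the talk: the function names are hypothetical, word vectors are assumed to be plain sequences of floats, and the sample standard deviation is an assumption about the normalisation.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to attribute set A minus mean similarity to B."""
    return (mean(cosine(word, a) for a in attrs_a)
            - mean(cosine(word, b) for b in attrs_b))


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Difference of mean associations of X and Y, normalised by the
    standard deviation of the associations over all target words.
    Assumption: sample standard deviation (ddof = 1)."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)
```

A positive value indicates that the target set X is more strongly associated with attribute set A than the target set Y is; swapping X and Y flips the sign.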

02.11.2021

Gender Bias in Machine Translation

Natural Language Processing (NLP) tools are increasingly popular, making it vital for researchers to identify the potential role they play in shaping societal biases. Recent work showed how NLP technology can not only propagate but also exacerbate bias encountered in training corpora. During this talk, we aim to explore the sources of bias, how bias can affect our technology and what we can possibly do to mitigate gender bias.

A special focus is given to the field of Machine Translation (MT), where, due to contrastive linguistic differences, gender bias can become very apparent. When translating from one language into another, original author traits are partially lost, both in human and machine translations. However, in the field of MT, one of the most observable consequences of this missing information is morphologically incorrect variants due to a lack of agreement in number and gender with the subject. Such errors harm the overall fluency and adequacy of the translated sentence.

Human translators rely on contextual information to infer the gender of the speaker in order to make an informed decision and pick the correct morphological agreement. However, most current MT systems do not; they simply exploit statistical dependencies on the sentence level that have been learned from large amounts of parallel data. Furthermore, sentences are translated in isolation. As a consequence, pieces of information necessary to determine the gender of the speakers might get lost. The MT system will, in such cases, opt for the statistically most likely variant, which depending on the training data can be either the male or the female form. This approach has several shortcomings: (a) lack of an in-depth analysis of the data used for training, (b) training data contains bias (usually on many levels: racial, gender...), (c) neural networks do not only learn the existing bias but exacerbate it.

Gender information can be integrated into the training process of machine translation systems by using techniques similar to the one used for zero-shot translation, leading to more controllability when it comes to translating ambiguous English sentences. However, a more in-depth human analysis reveals side-effects of the current approach(es) as well as a lack of consistency in terms of controllability. Aside from highlighting some of the issues related to gender and Machine Translation, in this talk, we would also like to touch upon a more fundamental related problem: the loss of lexical/linguistic richness in current MT systems.

16.11.2021

IMAGINE: The IMpact of Automatic Language Generation on Linguistic INtuition and Language Evolution

The IMAGINE project runs within the framework of the NCCR Evolving Language and is meant to investigate future developments in natural language. In order to understand how generated texts might be influencing the way a language evolves, we pursue three research directions. First, we conduct a corpus linguistics investigation and examine how commercially available machine translation systems handle a phenomenon that plays a crucial role in language evolution, namely borrowings. More specifically, we look at the use of anglicisms in German. Additionally, we inspect human-written and machine-produced texts in terms of lexical richness and syntactic equivalence. Second, we organize psycholinguistic experiments to determine how much the output of machine translation influences syntactic and lexical processing of language learners. Third, we intend to work on automatic detection of short generated texts.

 

How do you tokenize? Humans vs. algorithms
Words can be split into segments by human annotators or by algorithms. The resulting segmentations vary due to different linguistic intuitions in humans, and due to particular designs of algorithms. The main difference between those methods lies in the plausibility of the resulting segments. But how is the plausibility of segmentations determined? There are several ways to make comparisons, such as measuring the overlap between sets of segments and comparing their sizes. However, both of these methods are indirect and do not reveal much about the decisions taken while segmenting. In this study, we provide a new method to assess the properties of segmentations relying on an analysis of the subwords' lengths. Our experiments on English, Finnish and Turkish data show that BPE finds more regularities in longer words, Morfessor tends to identify bigger, less regular chunks, and human annotators optimize segments in longer words so that they are neither too short nor too long.
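To make the idea of analysing subword lengths concrete, the following is a minimal sketch of one possible statistic: the mean subword length grouped by word length. This is an assumption for illustration only and is not the method presented in the talk; the function name and the input format (each word given as a list of its segments) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_subword_length_by_word_length(segmentations):
    """Map each word length to the mean length of the subwords
    found in words of that length.

    `segmentations` is an iterable of segmented words, each given
    as a list of segment strings, e.g. ["walk", "ed"].
    """
    lengths = defaultdict(list)
    for segments in segmentations:
        # The word length is the total number of characters in its segments.
        word_length = sum(len(segment) for segment in segments)
        lengths[word_length].extend(len(segment) for segment in segments)
    return {wl: mean(ls) for wl, ls in sorted(lengths.items())}
```

Running the same function over the output of different segmenters (for example BPE, Morfessor and human annotation) on the same word list would allow their length profiles to be compared side by side.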

30.11.2021

Learning to Navigate Virtual Environments with Linguistic, Visual, and Location-based Data

Navigation in the world relies on knowing where you are and what cues to attend to. Vision and Language Navigation (VLN) is a machine learning task where an artificial agent is trained to complete trajectories - presented as language instructions - to reach a destination in a visual environment. In one formulation of VLN proposed by Chen et al. (2019), the environment incorporates visual features from the real world with the aim of evaluating performance when navigating external locations. In this presentation, we explore the contribution of an auxiliary localisation step in completing multi-step navigation routes. Agents are trained to estimate position from text and image pairs depicting real-world places. Location estimation and VLN depend on aligning cues and identifying spatial references in linguistic and visual inputs. We review results from tests on navigating a large-scale urban environment with transfer from localisation and compare this approach to methods developed in prior work on VLN. 

 

Monolingual simple-complex sentence alignment through a learning to rank (LTR) approach

Monolingual alignment is the task of finding a set of correspondence relations between comparable corpora on a document, sentence or token level. The most recent approaches to the task leverage advances in contextual word embeddings and can overcome the limitations of earlier approaches based on surface-level similarity metrics. Additionally, the newer approaches are capable of capturing paraphrases and the context of surrounding sentences. However, applying monolingual alignment on the level of simple and complex sentences reveals the need for an additional mechanism capable of dealing with the asymmetric nature of simple and complex documents. A learning to rank (LTR) approach optimises a list of candidates by ranking them by the relative preference given by a learnable scoring function. In this presentation, I will show the results of experiments aimed at refining monolingual sentence alignment based on cosine similarity or word mover’s distance (WMD) through the use of contextual embeddings and LTR.
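The notion of a learnable scoring function that orders candidates by relative preference can be sketched with a simple pairwise ranker. The example below is a generic illustration, not the model from the talk: it assumes a linear scorer trained with perceptron-style updates on (preferred, dispreferred) pairs, and the feature vectors (for instance a cosine similarity and a length ratio) as well as all function names are hypothetical.

```python
def score(weights, features):
    """Linear scoring function: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, n_features, epochs=20, lr=0.1, margin=1.0):
    """Learn weights from (better, worse) feature-vector pairs.

    Whenever the preferred candidate does not outscore the other one
    by at least `margin`, the weights are moved towards the preferred
    candidate's features and away from the other candidate's features.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) - score(weights, worse) < margin:
                weights = [w + lr * (b - c)
                           for w, b, c in zip(weights, better, worse)]
    return weights


def rank(weights, candidates):
    """Return the candidates sorted from highest to lowest score."""
    return sorted(candidates, key=lambda feats: score(weights, feats), reverse=True)
```

In an alignment setting, each candidate would be a complex-side sentence paired with a given simple sentence, and the top-ranked candidate would be taken as its alignment.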

14.12.2021

Eye-tracking based classification of Chinese readers with and without Dyslexia

As a “window on mind and brain” (Van Gompel, 2007), eye movements reflect cognitive processes in reading. Psychological reading research has shown that eye gaze patterns differ between readers with and without dyslexia. In recent years, researchers have attempted to classify readers with dyslexia (for Spanish and Swedish) based on their eye movements using Support Vector Machines (SVMs). However, these approaches (i) were based on highly aggregated features averaged over the words of the stimulus sentence and the different trials of a subject (sentences), thus disregarding the sequential nature of the eye movements and eliminating all temporal information present in the input, and (ii) did not consider the linguistic stimulus and its interaction with the reader’s eye movements. In the present work, we propose a series of models (CNNs and LSTMs) that make sequence-based predictions, processing the entire sentence without the necessity of aggregating features over the sentence. Additionally, we incorporate the linguistic stimulus into the model, represented by contextualized word embeddings. The models are evaluated on a Mandarin Chinese dataset containing eye movements from children with and without dyslexia. The results show that even for a logographic script such as Chinese, the sequence models are able to classify scanpaths on sentences read by children with dyslexia well. As this is work in progress, however, these models are still outperformed by our reimplementation of the SVM baseline, presumably because of the very small amounts of training data. During the talk, I will discuss ideas on how to resolve this major challenge.

Transfer learning between similar languages

Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity. However, current approaches that operate in the embedding space do not take surface similarity into account. In this work, we present a simple yet effective strategy to improve cross-lingual transfer between closely related varieties. We propose to augment the data of the high-resource parent language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements across several languages and tasks: zero-shot transfer of POS tagging and topic identification between language varieties from the Finnic, West and North Germanic, and Western Romance language branches. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.
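Character-level noise augmentation of the kind described above could look roughly like the following sketch. The abstract does not specify the noise operations, so the choice of random insertion, deletion and replacement, the per-token noise rate, the Latin alphabet and the function name are all assumptions made for illustration.

```python
import random


def add_char_noise(sentence, noise_rate=0.1,
                   alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Return a copy of `sentence` in which each whitespace-separated
    token is, with probability `noise_rate`, altered by one random
    character insertion, deletion or replacement (assumed operations)."""
    rng = random.Random(seed)
    noisy_tokens = []
    for token in sentence.split():
        if rng.random() < noise_rate:
            pos = rng.randrange(len(token))
            op = rng.choice(("insert", "delete", "replace"))
            if op == "insert":
                token = token[:pos] + rng.choice(alphabet) + token[pos:]
            elif op == "delete" and len(token) > 1:
                token = token[:pos] + token[pos + 1:]
            else:
                # Replacement; also used instead of deleting a
                # single-character token, so no token becomes empty.
                token = token[:pos] + rng.choice(alphabet) + token[pos + 1:]
        noisy_tokens.append(token)
    return " ".join(noisy_tokens)
```

Because the number of tokens is preserved, token-level labels such as POS tags from the parent-language training data can be reused unchanged for the noised sentences.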