Colloquium Schedule, Spring Semester 2019

Colloquium, Spring Semester 2019: Reports on current research at the institute, bachelor's and master's theses, programming projects, guest talks

Time/Location: Roughly every 14 days on Tuesdays from 10:15 to 12:00, BIN - 2.A.10 (Binzmühlestrasse 14, ZH-Oerlikon)

Responsible: Martin Volk

Contact: Martin Volk

[Subject to change]

Date	Speakers / Topic
19. Feb. 2019	Martin Volk: The Ideal Colloquium in the Age of Divided Attention / Binomial Adverbs in Parsing and Machine Translation
5. March 2019	Raphael Balimann (CL, UZH): NMT for Subtitles
	Johannes Graën & Gerold Schneider (CL, UZH): From Parallel Corpora to Language Learning Applications
19. March 2019	Matthias Baumgartner (IfI, UZH): Integrating Structured and Unstructured Data Sources via Representation Learning
	Tanja Samardžić (CorpusLab, UZH): Spatial Prominence and Text Frequency
2. April 2019	Phillip Ströbel (CL, UZH): The Impact of OCR on Topic Models
	Mathias Müller (CL, UZH): Domain Robustness in Neural Machine Translation
16. April 2019	Susie Xi Rao (ETH): Understanding the Web of Science Using Deep Learning
	Nora Hollenstein (ETH): Improving NLP with Human Data
30. April 2019	Olga Sozinova (CorpusLab, UZH): Measuring Inflectional and Derivational Complexity on the Universal Dependencies Dataset
	Samuel Läubli (CL, UZH): Neural Machine Translation: Text Presentation Affects Human Quality Assessment and Translation Performance
14. May 2019	Canceled because of ...
13. May 2019 (16:15 in the AI lecture series, "Ringvorlesung KI")	Simon Clematide (CL, UZH): Künstliche Intelligenz. Sprache. Empirie.
16. May 2019 (17:15 in the IfI colloquium)	Tobias Kuhn (Free University of Amsterdam): Semantic Publishing with Nanopublications
28. May 2019	Ensieh Davoodijam (Isfahan University, Iran; CL, UZH): Summarization of Biomedical Texts
	Michi Amsler (CL, UZH): An Introduction to Black Magic Voodoo for Embeddings

Abstracts

19. February 2019

Martin Volk

Title: The ideal colloquium in the age of divided attention

Abstract: Some thoughts about the goals of the colloquium plus discussion.

Title: Binomial Adverbs in Parsing and Machine Translation

Abstract: Binomial adverbs are a subclass of multiword adverbs. They pose special problems for NLP since they sometimes consist of particles that are mostly used as prepositions (e.g. EN: by and large, DE: ab und zu). We present our study on how to identify idiomatic binomial adverbs and on how well a selection of EN and DE adverbs is handled by a state-of-the-art dependency parser and an online MT system.

5. March 2019

Raphael Balimann: Domain Adaptation of Neural Translation Models (original title: Domänenadaption neuronaler Übersetzungsmodelle)

Abstract: This bachelor thesis examines how a neural translation model that has been trained on out-of-domain data can be improved by adding further text data. A comparison between a small set of high-quality in-domain data and a larger set of out-of-domain data shows that in-domain data trumps out-of-domain data both for training a translation model from scratch and for improving the domain relevance of an existing translation model.
The self-trained translation models and commercial translation systems were evaluated using BLEU on three different corpora. To show the difference between system outputs, a manual comparison was made, showing that the best translation system doesn’t always create the best translation.
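
The BLEU evaluation mentioned above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a single reference per hypothesis, and no smoothing; the thesis may well have used a standard toolkit with different settings, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing.

    Assumption: sentences are plain strings tokenized on whitespace.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    print(corpus_bleu(["the cat sat on a mat"], ["the cat sat on the mat"]))
```

Because the score only counts n-gram overlap with a reference, a system with the higher corpus-level BLEU can still produce the worse translation for an individual sentence, which is consistent with the manual comparison described above.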

Johannes Graën + Gerold Schneider: From Parallel Corpora to Language Learning Applications

Abstract: The use of corpus examples has proven beneficial to language learning. Parallel corpora provide a rich resource for the learner to contrastively explore a foreign language on the basis of examples from authentic language use.

In this talk, we outline the fields of learner corpora, parallel corpus research, and computer-assisted language learning (CALL) applications, and present three different approaches we used to benefit language learners.

19. March 2019

Matthias Baumgartner: Integrating heterogeneous data sources via representation learning

Abstract: There is an abundance of data, and a steep increase is expected over the next years. However, there is also an abundance of data sources. While the availability of data is becoming a major driver for innovation, its dispersal across many sources is an obstacle to such initiatives. Data integration aims at solving this problem by combining multiple sources into one. So far, research in this area has mostly focused on structured databases. However, the vast majority of data is produced in other formats, like text documents or multimedia content. The question is: how can sources in heterogeneous formats be integrated? The challenge of this task is that each format has its own way of expressing information about the same real-world object, which makes the formats difficult to compare. In this talk, I will show how we approached this problem by learning abstract object representations that can be associated across sources.

Tanja Samardžić: Spatial prominence and text frequency

Abstract: In this talk, we quantify and thus evaluate the relation between text frequency and properties of the text-external geographic setting by comparing the text frequencies of mountain names to the respective geomorphometric characteristics. We focus on some 2000 unique mountain names that appear some 50,000 times in a large compilation of texts on Swiss alpine history. The results on the full data set suggest only a weak relation: only 5–10% of the variation in text frequency is explained by the respective geomorphometric characteristics. However, an analysis at multiple scales allows us to identify a Simpson’s paradox. What appears to be ‘noise’ in the analysis of all mountains in the whole of Switzerland shows significant local signals. Small spatial extents, found all over Switzerland, can show considerably strong correlations between text frequency and spatial prominence, with up to 90% of the total variation explained.
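
How a weak pooled relation can coexist with strong local ones is easiest to see on a toy example. The sketch below uses purely synthetic numbers (not the mountain data from the talk), three hypothetical regions, and the squared Pearson correlation as the share of variation explained; all names and values are assumptions made for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation: share of variance in ys explained by a linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Synthetic toy data: within each region, y rises with x (plus a small wiggle),
# but the regions sit at different baselines, which masks the trend when pooled.
WIGGLE = [1, -1, 0, 1, -1]
REGIONS = {  # hypothetical region name -> (x offset, y offset)
    "region_a": (0, 0),
    "region_b": (10, 20),
    "region_c": (20, 0),
}


def make_region(x_offset, y_offset):
    """Build five (prominence, frequency) points for one synthetic region."""
    xs = [x_offset + i for i in range(5)]
    ys = [y_offset + 2 * i + WIGGLE[i] for i in range(5)]
    return xs, ys


data = {name: make_region(*offsets) for name, offsets in REGIONS.items()}

# Local analysis: one R^2 per region.
local = {name: r_squared(xs, ys) for name, (xs, ys) in data.items()}

# Pooled analysis: all regions thrown together.
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
pooled = r_squared(all_x, all_y)

if __name__ == "__main__":
    print("pooled R^2:", round(pooled, 4))
    for name, value in local.items():
        print(name, "R^2:", round(value, 4))
```

In this construction the pooled fit explains almost none of the variation, while each region taken alone shows a strong linear relation, which mirrors the scale effect described in the abstract.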

2. April 2019

Mathias Müller: Domain Robustness in Neural Machine Translation

Abstract: Translating text that diverges from the training domain is a key challenge for neural machine translation (NMT). Domain robustness (the generalization of models to unseen test domains) is low compared to statistical machine translation (SMT). We therefore analyze the behaviour of NMT models on out-of-domain test sets and empirically evaluate ways to improve domain robustness.

Our analysis of baseline systems shows that hallucination (translations that are fluent but unrelated to the source) is more pronounced in out-of-domain settings. We expect methods that alleviate the problem of hallucinated translations to indirectly improve domain robustness.

We compare several approaches that 1) directly increase domain robustness (subword regularization) or 2) address closely related problems such as hallucination or undertranslation (coverage models, reconstruction). As a novel contribution, we borrow ideas from defensive distillation to test their potential for increasing domain robustness.

Phillip Ströbel: The Influence of OCR Errors on Topic Modeling

Abstract: Topic modeling has turned out to be a popular means for historians and other researchers in the social sciences to study phenomena of interest to them. The data they work with is rarely free from errors, especially if the texts have been digitised. Texts that were printed in blackletter and poorly OCRed pose a special challenge.

A researcher quickly faces the question: do I need to invest a lot of time in cleaning the data, or will topic modeling algorithms be insensitive to the somewhat dirty input? Since the influence of OCR errors has not yet been studied systematically, I will present an experiment which investigates the susceptibility of topic modeling algorithms to OCR errors in historical newspapers.
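
One conceivable way to probe such sensitivity is to inject character errors at a controlled rate and observe how the input to a topic model changes. The sketch below is an assumption for illustration only: the abstract does not say whether the experiment uses synthetic or real OCR noise, and the confusion pairs and function names are hypothetical.

```python
import random

# Hypothetical confusion pairs, loosely inspired by common blackletter
# misreadings (e.g. long s read as f). They are not taken from the talk.
CONFUSIONS = {"s": "f", "u": "n", "e": "c", "i": "l", "b": "h"}


def add_ocr_noise(text, error_rate, seed=0):
    """Replace confusable characters with probability error_rate (seeded, reproducible)."""
    rng = random.Random(seed)
    noisy = []
    for char in text:
        if char in CONFUSIONS and rng.random() < error_rate:
            noisy.append(CONFUSIONS[char])
        else:
            noisy.append(char)
    return "".join(noisy)


def vocabulary_size(text):
    """Number of distinct whitespace-separated word forms."""
    return len(set(text.split()))


if __name__ == "__main__":
    sample = "the news is the news and the issue is the issue " * 20
    for rate in (0.0, 0.1, 0.3):
        noisy = add_ocr_noise(sample, rate, seed=42)
        print(rate, vocabulary_size(noisy))
```

Noise of this kind tends to split one word form into several spurious variants, so the vocabulary a topic model sees grows and the counts per form shrink; whether that actually degrades the topics is the empirical question the talk addresses.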

16. April 2019

Susie Xi Rao (ETH): Understanding the Web of Science Using Deep Learning

The research goal of this project is to understand the interconnectedness of scientific disciplines using deep learning networks, where we model the dissimilarity and similarity amongst disciplines and their mutual influences (interdisciplinarity) in a hierarchical network architecture. We obtained data from Microsoft Academic Graph with 160 million scholarly publications and have devised a three-layer classification system inspired by Hierarchical Deep Learning for Text Classification (HDLTex) to classify the publications across various disciplines with the granularity defined in the standard (hierarchical) classification system in each discipline, e.g., JEL for Economics, ACM for Computer Science (CS), etc.

We have implemented a two-level classification system that achieves high accuracy for each sub-model (on average 90% after five epochs using a feedforward neural network (FNN) + recurrent neural network (RNN)) and captures the interconnectedness of sub-domains (e.g., database, hardware) within a discipline (e.g., Computer Science). The architecture will be extended to a three-level system.

Nora Hollenstein (ETH): Improving NLP with Human Data

When we read, our brain processes language and generates cognitive processing data such as gaze patterns and brain activity. These signals can be recorded while reading. Cognitive language processing data such as eye-tracking features have shown improvements on various NLP tasks. We analyzed whether using such human features can show consistent improvement across tasks and data sources. In this talk, I present an extensive investigation of the benefits and limitations of using cognitive processing data for NLP, ranging from data collection to past and current projects in which we use human data for information extraction, word embedding evaluation and language modelling.

30. April 2019

Olga Sozinova: Measuring Inflectional and Derivational Complexity on the Universal Dependencies Dataset

In this talk, I present ongoing research on measuring morphological complexity using Shannon entropy. By comparing the unigram entropy for original, lemmatized and segmented texts, we measure how much information is contained in inflections versus other processes of word formation, i.e. derivation and compounding. The Universal Dependencies dataset gives both the original word tokens and lemmas (where inflection is neutralized) of running texts for a range of languages. To further neutralize other word formation processes, we plan to apply state-of-the-art segmentation algorithms taking lemmatized text as input. So far, we have applied the unsupervised package Morfessor and calculated unigram entropy for the three versions of the texts. I will present the procedure and discuss the results of my experiments.
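
The comparison of unigram entropies can be sketched as follows. This is a minimal illustration that assumes the simple plug-in (relative frequency) estimate of Shannon entropy and a hand-made toy sentence with hand-assigned lemmas; the actual study may use other estimators and works on Universal Dependencies texts.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Plug-in Shannon entropy (in bits) of the unigram distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example (an assumption for illustration): the same sentence as
# original word forms and as lemmas, where inflection is neutralized.
original = "the dogs chased the cats and the dog chases a cat".split()
lemmas = "the dog chase the cat and the dog chase a cat".split()

h_original = unigram_entropy(original)
h_lemmas = unigram_entropy(lemmas)

# Entropy lost through lemmatization, read here as information carried by inflection.
inflection_information = h_original - h_lemmas

if __name__ == "__main__":
    print("original:", h_original)
    print("lemmas:  ", h_lemmas)
    print("difference:", inflection_information)
```

Lemmatization merges word forms into fewer types, so the entropy of the lemmatized text cannot exceed that of the original; applying the same comparison to a segmented version of the lemmatized text would isolate the contribution of derivation and compounding in the same way.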

Samuel Läubli: Neural Machine Translation: Text Presentation Affects Human Quality Assessment and Translation Performance

The use of neural networks has led to astounding progress in machine translation (MT). Quality expectations have surged, not least because companies claim to have “cracked” Russian to English MT (SDL) and that “quality is at human parity when compared to professional human translations” (Microsoft; Hassan et al., 2018). Our reassessment of Microsoft's evaluation shows that their finding of human–machine parity is owed to the evaluation design: human judges do not prefer professional translation over MT when rating isolated sentences, but have a strong preference for human translations when evaluating full documents.

The ways in which machine-translated text is presented to people are primarily influenced by technical considerations rather than human factors. In quality evaluation, texts are segmented because most MT systems do not take inter-sentential context into account. This neither incentivises developers to build document-level MT systems nor allows quality raters to reward textual cohesion and coherence. Similarly, the presentation of MT in post-editing interfaces for human translators is influenced by the needs of computer aids, namely sentence-oriented translation memory and MT systems. In a controlled experiment with 20 professional translators, we found that text presentation significantly affects translation speed and quality. Our results challenge the design choices in the most widespread commercial post-editing interfaces.

28. May 2019

Ensieh Davoodijam: Summarization of Biomedical Texts

Recently, with the rapid growth of the scientific literature in the biomedical domain, it has become very important to provide improved mechanisms to extract relevant information quickly and efficiently. Text summarization is the process of identifying the most important meaningful information in a single document or a set of related documents. We use a graph-based model that has three steps: 1) graph creation, 2) graph clustering and 3) sentence selection. First, our algorithm creates a multi-layer graph based on semantic similarity, word similarity, and co-reference similarity. We use the Unified Medical Language System (UMLS) for semantic grounding, and extract concepts and relationships from the biomedical text using three different tools: MetaMap, OGER, and SemRep. Then clusters of sentences are created based on the Leiden algorithm. We select sentences based on three different heuristics and perform an evaluation over summaries of 100 papers using the ROUGE metrics.
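
For readers unfamiliar with the evaluation, the following is a minimal sketch of ROUGE-N as n-gram overlap between a candidate summary and a reference summary. It assumes lowercased whitespace tokenization, a single reference, and no stemming or stopword removal; the function name is hypothetical, and the evaluation in the talk presumably relies on a standard ROUGE implementation.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return ROUGE-N precision, recall and F1 for one candidate/reference pair."""

    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    # Counter intersection keeps, per n-gram, the smaller of the two counts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(rouge_n("the cat sat", "the cat sat on the mat"))
```

Recall rewards summaries that cover the reference content, precision penalizes padding, and F1 balances the two.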

Institut für Computerlinguistik