For many of the systems we develop, one prerequisite is a module that computes the syntactic structure of sentences. This gives rise to a dilemma. Hand-coding large machine-usable grammars is exceedingly time-consuming. One alternative approach is to induce grammars from syntactically analysed training corpora. Unfortunately, the induced grammars often produce linguistically counterintuitive analyses (or no analyses at all). Combining a hand-coded core grammar with lexical values derived from corpora by statistical methods proved a good way out of this dilemma. A large grammar of English, developed along these lines as part of a PhD project, turned out to be a very powerful tool for the analysis of large volumes of text (e.g. the entire British National Corpus).
One pervasive problem in computational linguistics is ambiguity. As with grammars, hand-coding disambiguation rules turns out to be a practical impossibility. A reliable and efficient solution again relies on the statistical analysis of manually disambiguated texts. This is the topic of several research projects.
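To make the idea of statistical disambiguation concrete, here is a minimal sketch in Python. It is not the method used in the projects mentioned above. It assumes, purely for illustration, that the ambiguity is prepositional-phrase attachment, that the manually disambiguated texts have been reduced to (verb, preposition, attachment) triples, and that the most frequent attachment wins. The names TRAINING, train and choose_attachment and all example data are hypothetical.

```python
from collections import Counter

# Hypothetical manually disambiguated examples: (verb, preposition, attachment).
# "verb" means the PP modifies the verb, "noun" that it modifies the object noun.
TRAINING = [
    ("eat", "with", "verb"),
    ("eat", "with", "verb"),
    ("eat", "with", "noun"),
    ("see", "with", "verb"),
    ("buy", "of", "noun"),
    ("read", "of", "noun"),
]


def train(examples):
    """Count attachments per (verb, preposition) pair and per preposition."""
    by_pair = Counter()
    by_prep = Counter()
    for verb, prep, attachment in examples:
        by_pair[(verb, prep, attachment)] += 1
        by_prep[(prep, attachment)] += 1
    return by_pair, by_prep


def choose_attachment(model, verb, prep, labels=("verb", "noun")):
    """Pick the most frequent attachment seen in the training data.

    Uses verb-specific counts when the (verb, preposition) pair was observed,
    otherwise backs off to counts for the preposition alone. Ties (including
    the case of no evidence at all) go to the first label in `labels`.
    """
    by_pair, by_prep = model
    if any(by_pair[(verb, prep, label)] for label in labels):
        scores = {label: by_pair[(verb, prep, label)] for label in labels}
    else:
        scores = {label: by_prep[(prep, label)] for label in labels}
    return max(labels, key=lambda label: scores[label])


model = train(TRAINING)
print(choose_attachment(model, "eat", "with"))
print(choose_attachment(model, "sell", "of"))
```

The back-off step is one simple way of handling verbs that never occur in the training material; a realistic system would need far more data and a smoothed probability model.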
Resolution of anaphoric references
Definite noun phrases, above all definite pronouns, can on formal grounds almost always refer to several possible textual elements ("antecedents"), although only one such relationship is actually intended. Combining rule-based and statistical approaches turns out to give the best results in identifying the intended relationship. This is the topic of an SNF project.
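The combination of rule-based and statistical approaches can be illustrated with a small Python sketch. It is an assumption-laden toy, not the project's actual system: a hard rule (gender and number agreement) first removes formally impossible antecedents, and the remaining candidates are ranked by a score built from two probability tables. The tables ROLE_PROB and DISTANCE_PROB hold invented values standing in for estimates from an annotated corpus, and the names Mention, agrees, score and resolve are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    gender: str    # "m", "f" or "n"
    number: str    # "sg" or "pl"
    role: str      # "subj", "obj" or "other"
    sentence: int  # index of the sentence containing the mention


# Invented values, standing in for probabilities estimated from annotated text.
ROLE_PROB = {"subj": 0.6, "obj": 0.3, "other": 0.1}
DISTANCE_PROB = {0: 0.5, 1: 0.35, 2: 0.15}


def agrees(pronoun, candidate):
    """Rule-based filter: antecedent must match the pronoun in gender and number."""
    return pronoun.gender == candidate.gender and pronoun.number == candidate.number


def score(pronoun, candidate):
    """Statistical ranking: product of role and sentence-distance probabilities."""
    distance = pronoun.sentence - candidate.sentence
    return ROLE_PROB.get(candidate.role, 0.05) * DISTANCE_PROB.get(distance, 0.01)


def resolve(pronoun, candidates):
    """Return the best-scoring admissible antecedent, or None if none agrees."""
    admissible = [c for c in candidates if agrees(pronoun, c)]
    if not admissible:
        return None
    return max(admissible, key=lambda c: score(pronoun, c))


candidates = [
    Mention("Mary", "f", "sg", "subj", 0),
    Mention("the report", "n", "sg", "obj", 0),
    Mention("Anna", "f", "sg", "other", 1),
]
she = Mention("she", "f", "sg", "subj", 1)
print(resolve(she, candidates).text)
```

In this example the agreement rule discards "the report", and the ranking then prefers the subject "Mary" from the previous sentence over the non-argument "Anna" in the same sentence. The division of labour is the point of the sketch: rules encode what is formally possible, statistics choose what is likely.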