Automatic Classification of Large Document Collections

Student: N/A

Supervisor: Gerold Schneider

Introduction

This practical programming project or bachelor thesis aims to classify large quantities of newspaper articles automatically. The project is run in cooperation with a Swiss media database. Several interfaces deliver the documents in XML or JSON format. The tools to be investigated are open source packages such as WEKA, RapidMiner, Date or others. Besides testing various algorithms and methods, the central focus is on evaluating their performance; the media database provides a comprehensive gold standard for this purpose. Depending on the time available, the standard solutions may also be adapted to the task.
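As an illustration of the first step, the sketch below reads one article from JSON and one from XML into a common form. The real interfaces of the media database are not described here, so the field names ("id", "text"), the XML layout and the two sample documents are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def article_from_json(payload: str) -> dict:
    """Parse one article from a JSON string (field names are assumed)."""
    record = json.loads(payload)
    return {"id": str(record["id"]), "text": record["text"].strip()}


def article_from_xml(payload: str) -> dict:
    """Parse one article from an XML string (element layout is assumed)."""
    root = ET.fromstring(payload)
    return {"id": root.get("id"), "text": (root.findtext("text") or "").strip()}


# Hypothetical sample documents, not taken from the actual database.
json_doc = '{"id": 1, "text": "Parliament debates the new budget."}'
xml_doc = '<article id="2"><text>The local team wins the cup final.</text></article>'

articles = [article_from_json(json_doc), article_from_xml(xml_doc)]
```

Mapping both formats to the same dictionary shape keeps the later classification step independent of the delivery interface.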

Aim and purpose

Testing of open source tools for document classification
Evaluation of their performance against the gold standard
Adjustments to the standard solutions where necessary
If the task is successfully completed, it could lead to the construction of a new solution for our industrial partner.
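The evaluation step could look roughly like the following sketch, which compares predicted categories with gold-standard categories and reports precision, recall and F1 per class. The category labels and the function name per_class_scores are hypothetical; tools such as WEKA offer their own evaluation facilities, so this only illustrates the idea.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for single-label classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the correct label but was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical labels, for illustration only.
gold = ["politics", "sports", "politics", "economy"]
predicted = ["politics", "sports", "economy", "economy"]
scores = per_class_scores(gold, predicted)
```

Reporting scores per class rather than a single accuracy figure shows which categories a given tool handles poorly, which is where adjustments would be most useful.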

Requirements

Programming knowledge in Python, Perl or Java.
Experience with XML and tools like WEKA is an advantage but not mandatory.
Interest in automatic media content analysis.

Literature

N/A

Department of Computational Linguistics
