Automatic Classification of Large Document Collections

Student: N/A

Supervisor: Gerold Schneider


The aim of this programming project or bachelor's thesis is the classification of large quantities of newspaper articles. The project is run in cooperation with a Swiss media database. There are several interfaces that deliver documents in XML or JSON format. Open-source tools such as WEKA, RapidMiner, Date, or others are to be investigated. In addition to testing various algorithms and methods, a central focus is the evaluation of their performance; the media database provides us with a comprehensive gold standard for this purpose. Depending on the time available, the standard solutions may also be adapted to the specific task.
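
As a rough illustration of what evaluating against a gold standard could look like, the following minimal sketch computes per-class precision, recall and F1 from two JSON exports. It is not the partner's actual interface: the record fields `id` and `category`, the helper names `load_labels` and `evaluate`, and the sample data are all hypothetical assumptions made for this example.

```python
import json
from collections import Counter


def load_labels(json_text, label_field="category"):
    """Map document id -> label from a JSON array of records.

    The fields "id" and "category" are an assumed schema, not the
    media database's real format.
    """
    return {doc["id"]: doc[label_field] for doc in json.loads(json_text)}


def evaluate(gold, predicted):
    """Return per-class precision, recall and F1 over all gold documents."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for doc_id, gold_label in gold.items():
        pred_label = predicted.get(doc_id)  # None if the document was not classified
        if pred_label == gold_label:
            tp[gold_label] += 1
        else:
            fn[gold_label] += 1
            if pred_label is not None:
                fp[pred_label] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented sample data: four gold-standard articles and a classifier's output.
gold_json = """[
  {"id": 1, "category": "sports"},
  {"id": 2, "category": "politics"},
  {"id": 3, "category": "sports"},
  {"id": 4, "category": "economy"}
]"""
predicted_json = """[
  {"id": 1, "category": "sports"},
  {"id": 2, "category": "sports"},
  {"id": 3, "category": "sports"},
  {"id": 4, "category": "economy"}
]"""

scores = evaluate(load_labels(gold_json), load_labels(predicted_json))
for label in sorted(scores):
    print(label, scores[label])
```

Reporting scores per class rather than a single accuracy figure is one possible design choice; it makes it visible when a classifier does well on frequent categories but fails on rare ones.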

Aim and purpose

  • Testing of open source tools for document classification
  • Evaluation of their performance against the gold standard
  • Adjustments to the standard solutions where necessary
  • If the task is completed successfully, it could lead to the development of a new solution for our industrial partner.

Requirements


  • Programming knowledge in Python, Perl or Java.
  • Experience with XML and tools like WEKA is an advantage but not mandatory.
  • Interest in automatic media content analysis.