Algorithmic Quality Assurance: Theory and Practice

Dr. Thomas Christian van Dijk
Julius-Maximilians-Universität Würzburg, Institut für Informatik, Lehrstuhl für Informatik I, Am Hubland, 97074 Würzburg

Crowdsourcing offers a major opportunity for many sciences, including geographic information science and the digital humanities. One application that has seen a surge of crowd projects is the extraction of information from historical documents, since this involves many problems that continue to elude fully automatic solutions. Numerous projects have sprung up, some with notable success (such as GB1900, Transkribus and Maps‐in‐the‐Crowd). However, the use of volunteered information raises major concerns about data quality.

In this project we investigate novel methods for quality assessment in crowdsourcing projects: not just checking the quality at the end, but viewing the entire crowdsourcing process as a computation in which data quality must be tracked throughout. This view opens up a range of mathematical and algorithmic techniques, including Bayesian modelling, optimal task selection based on active learning, and sensitivity analysis. Our prior experience indicates that this is not only of theoretical interest, but will also result in practically useful systems that outperform ad‐hoc approaches. Building on our current project about smart crowdsourcing for information extraction from old maps, we again use historical material as the main source of experimental data. However, this time we also consider more modern data, which serves two purposes. Firstly, this brings us closer to the other projects in the priority programme. Secondly, the methodology we have developed so far (and will continue to develop) is generally applicable: we will validate this by applying it more broadly. Fruitful applications include image recognition tasks (such as those encountered in the ENAP and COVMAP projects) and information extraction from microblogs (as in the EVA‐VGI project).
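To make the idea of tracking data quality throughout the process more concrete, the following is a minimal sketch and not the project's actual method. It assumes binary yes/no tasks, independent workers, and a known accuracy per worker; the names `posterior_update`, `entropy`, `next_task` and the example task labels are hypothetical. Each incoming answer updates a Bayesian belief about the task, and the next task handed out is the one whose belief is currently most uncertain, a simple uncertainty-sampling heuristic standing in for optimal task selection.

```python
import math


def posterior_update(prior, answer, accuracy):
    """Bayes update of P(true label is 'yes') after one worker answer.

    Assumption: the worker reports the true label with probability
    `accuracy`, independently of other workers.
    """
    like_yes = accuracy if answer else 1.0 - accuracy
    like_no = 1.0 - accuracy if answer else accuracy
    evidence = prior * like_yes + (1.0 - prior) * like_no
    return prior * like_yes / evidence


def entropy(p):
    """Binary entropy in bits; 0 when the belief is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def next_task(beliefs):
    """Select the task whose current belief is most uncertain."""
    return max(beliefs, key=lambda task: entropy(beliefs[task]))


# Hypothetical tasks, all starting from an uninformative prior.
beliefs = {"label_A": 0.5, "label_B": 0.5, "label_C": 0.5}

# Hypothetical answers: (task, worker said 'yes'?, assumed worker accuracy).
answers = [
    ("label_A", True, 0.8),
    ("label_A", True, 0.8),
    ("label_B", False, 0.7),
]
for task, answer, accuracy in answers:
    beliefs[task] = posterior_update(beliefs[task], answer, accuracy)

print(beliefs)
print("ask next about:", next_task(beliefs))
```

In this toy run the belief for each task is available at every step, so quality is visible during the process and not only at the end; a realistic system would also have to estimate worker accuracy instead of assuming it.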

Through this project, we will arrive at a more refined understanding of how the power of crowdsourcing can be harnessed – by paying specific attention to the quality of data as it moves through a project, and designing algorithmically‐supported processes to guide it.