Active sampling, scaling and dataset merging for large-scale image quality assessment
The field of subjective assessment is concerned with eliciting human judgements about a set of stimuli. Collecting such data is costly and time-consuming, especially when the subjective study is to be conducted in a controlled environment and using a specialized equipment. Thus, data from these studies are usually scarce. One of the areas, for which obtaining subjective measurements is difficult is image quality assessment. The results from these studies are used to develop and train automated or objective image quality metrics, which, with the advent of deep learning, require large amounts of versatile and heterogeneous data.
I present three main contributions in this dissertation. First, I propose a new active sampling method for efficient collection of pairwise comparisons in subjective assessment experiments. In these experiments observers are asked to express a preference between two conditions. However, many pairwise comparison protocols require a large number of comparisons to infer accurate scores, which may be unfeasible when each comparison is time-consuming (e.g. videos) or expensive (e.g. medical imaging). This motivates the use of an active sampling algorithm that chooses only the most informative pairs for comparison. I demonstrate, with real and synthetic data, that my algorithm offers the highest accuracy of inferred scores given a fixed number of measurements compared to the existing methods. Second, I propose a probabilistic framework to fuse the outcomes of different psychophysical experimental protocols, namely rating and pairwise comparisons experiments. Such a method can be used for merging existing datasets of subjective nature and for experiments in which both measurements are collected. Third, with a new dataset merging technique and by collecting additional cross-dataset quality comparisons I create a Unified Photometric Image Quality (UPIQ) dataset with over 4,000 images by realigning and merging existing high-dynamic-range (HDR) and standard-dynamic-range (SDR) datasets. The realigned quality scores share the same unified quality scale across all datasets. I then use the new dataset to retrain existing HDR metrics and show that the dataset is sufficiently large for training deep architectures. I show the utility of the dataset and metrics in an application to image compression that accounts for viewing conditions, including screen brightness and the viewing distance.