Multi-label classification is a common supervised machine learning problem in which each instance is associated with multiple classes. The key challenge is learning the correlations between the classes. An additional challenge arises when the labels of the training instances are provided by noisy, heterogeneous crowd-workers of unknown quality. We first assume that the labels come from a perfect source and propose a novel topic model (ML-PA-LDA) in which both the classes that are present and the classes that are absent generate the latent topics and hence the words. Extensive experiments on real-world datasets show the superior performance of the proposed model. We then extend our topic model, non-trivially, to the scenario where the labels are provided by noisy crowd-workers, and refer to this model as ML-PA-LDA-C. In experiments with a simulated crowd, the proposed model learns the qualities of the annotators well, even with minimal training data.
Ximing Li, Jihong Ouyang, Xiaotang Zhou
Gang Chen, Yue Peng, Chongjun Wang