© 2016 ACM. In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision: audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.
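The loop described above can be illustrated with a minimal, single-person sketch. Everything here is an assumption made for illustration and not the authors' implementation: the features are synthetic, VAD is reduced to an energy threshold, and both the video and voice models are plain logistic regressions (the abstract does not say which classifiers are used). The names `fit_logistic`, `predict`, and `co_train` are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, steps=500):
    """Logistic regression by batch gradient descent; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted speaking probability
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def predict(model, X):
    """Hard 0/1 decision from a (weights, bias) pair."""
    w, b = model
    return (X @ w + b > 0).astype(int)


def co_train(audio_energy, video_feats, voice_feats, vad_threshold=0.5):
    # Step 1 (assumed toy VAD): threshold the audio energy to get weak labels.
    vad_labels = (audio_energy > vad_threshold).astype(int)
    # Step 2: the weak audio labels supervise the video-based classifier.
    video_model = fit_logistic(video_feats, vad_labels)
    # Step 3: the video classifier's decisions supervise the voice model.
    video_labels = predict(video_model, video_feats)
    voice_model = fit_logistic(voice_feats, video_labels)
    return video_model, voice_model


# Synthetic data for one person (assumed): 'speaking' is the hidden ground
# truth and is used only to generate features and to score the result.
rng = np.random.default_rng(0)
n = 400
speaking = rng.integers(0, 2, n)
audio_energy = speaking + 0.3 * rng.standard_normal(n)
video_feats = speaking[:, None] * np.array([2.0, -1.0]) + 0.5 * rng.standard_normal((n, 2))
voice_feats = speaking[:, None] * np.array([1.5, 1.5]) + 0.5 * rng.standard_normal((n, 2))

video_model, voice_model = co_train(audio_energy, video_feats, voice_feats)
voice_acc = (predict(voice_model, voice_feats) == speaking).mean()
print(f"voice-model agreement with hidden ground truth: {voice_acc:.2f}")
```

No manual labels enter `co_train`: the only supervision is the thresholded audio energy, which trains the video model, whose outputs in turn train the voice model. A multi-person version would repeat steps 2 and 3 per person.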
Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li
Jatin Kheradiya, C Sandeep Reddy, Rajesh M. Hegde
Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang
Min Huang, Wen Wang, Zheyuan Lin, Fiseha B. Tesema, Shanshan Ji, Jason Gu, Minhong Wan, Wei Song, Te Li, Shiqiang Zhu