With the number of videos growing rapidly in modern society, automatically recognizing objects from video input has become increasingly pressing. Videos contain abundant yet noisy information, and video-level labels are easy to obtain. This paper addresses video-based object recognition while exploiting these advantages. We propose a novel algorithm that uses only weak video-level labels during training, alternating between updating the classifier and inferring the object location in each video frame. At test time, inferring the location of the object in the scene yields more accurate recognition results. Background and temporal information are also incorporated into the model to improve the discriminability and consistency of recognition in video. We introduce a novel and challenging YouTube dataset to demonstrate the benefits of our method over baseline methods.
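The alternation described above, between inferring a per-frame object location and updating the classifier from video-level labels only, can be illustrated with a minimal sketch. This is not the paper's model: it assumes each frame comes with a fixed set of candidate-region feature vectors, uses a plain logistic classifier, and omits the background and temporal terms. All function names and parameters are hypothetical.

```python
import numpy as np

def infer_locations(videos, w):
    """For every frame, pick the index of the highest-scoring candidate region."""
    # Each video is an array of shape (frames, regions, dim); v @ w is (frames, regions).
    return [np.argmax(v @ w, axis=-1) for v in videos]

def fit_classifier(X, y, w, lr=0.1, steps=200):
    """Logistic regression by gradient descent (an assumed stand-in classifier)."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

def train_weakly_supervised(videos, labels, rounds=5, seed=0):
    """Alternate between latent location inference and classifier updates."""
    rng = np.random.default_rng(seed)
    dim = videos[0].shape[-1]
    w = rng.normal(scale=0.01, size=dim)
    for _ in range(rounds):
        locs = infer_locations(videos, w)
        # Gather the feature of the selected region in every frame.
        X = np.concatenate([v[np.arange(len(v)), l] for v, l in zip(videos, locs)])
        # Every frame inherits its video's weak label.
        y = np.concatenate([np.full(len(v), lab, dtype=float)
                            for v, lab in zip(videos, labels)])
        w = fit_classifier(X, y, w)
    return w, infer_locations(videos, w)
```

At test time, the same `infer_locations` step would select a region per frame before scoring, mirroring the idea that localizing the object improves recognition.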
Mingui Wang, Di Cui, Lifang Wu, Meng Jian, Yukun Chen, Dong Wang, Xu Liu
Dingwen Zhang, Guangyu Guo, Wenyuan Zeng, Lei Li, Junwei Han
Yufei Wang, Yongjiang Hu, Alan Wee‐Chung Liew, Junhu Wang