Hyodong Lee, Joonseok Lee, Joe Yue-Hei Ng, Paul Natsev
Representation learning is widely applied to various tasks on multimedia data, e.g., retrieval and search. One approach to learning useful representations is to exploit the relationships or similarities between examples. In this work, we explore two promising scalable representation learning approaches in the video domain. Using hierarchical graph clusters built upon video-to-video similarities, we propose: 1) a smart negative sampling strategy that significantly boosts training efficiency with the triplet loss, and 2) a pseudo-classification approach that uses the clusters as pseudo-labels. The embeddings trained with the proposed methods are competitive on multiple video understanding tasks, including related video retrieval and video annotation. Both proposed methods are highly scalable, as verified by experiments on large-scale datasets.
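The cluster-based negative sampling described above can be illustrated with a minimal sketch. This is a hypothetical toy example, not the paper's implementation: it assumes two levels of hierarchical clusters (a fine level grouping very similar videos and a coarse level grouping broader topics) and draws a hard negative from the anchor's coarse cluster but a different fine cluster, so the negative is semantically close yet not a true positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 6 videos with 4-d embeddings. "fine" clusters
# group near-duplicate videos; "coarse" clusters group broader topics.
emb = {v: rng.normal(size=4) for v in range(6)}
fine = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
coarse = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1}  # videos 0-3 share a topic

def triplet_loss(a, p, n, margin=0.2):
    """Hinge triplet loss on squared Euclidean distances:
    pull anchor toward positive, push it away from negative."""
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)

def sample_hard_negative(anchor, rng):
    """Smart negative: same coarse cluster (semantically close, hence
    a hard negative) but a different fine cluster (not a positive)."""
    cands = [v for v in emb
             if coarse[v] == coarse[anchor] and fine[v] != fine[anchor]]
    return int(rng.choice(cands))

anchor, positive = 0, 1                 # same fine cluster -> positive pair
negative = sample_hard_negative(anchor, rng)
loss = triplet_loss(emb[anchor], emb[positive], emb[negative])
```

Restricting candidates to the anchor's coarse cluster is what makes sampling efficient at scale: only a small neighborhood of each video needs to be considered, rather than the full corpus.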