In this paper, we propose an end-to-end perceptual robust hashing scheme for video copy detection based on unsupervised learning. First, the spatio-temporal information in a video is fused and condensed into high-dimensional features by a multi-scale feature fusion model built on a 3D-CNN, into which the Inception block and a 3D self-attention mechanism are integrated. Then, we calculate the correlation distances between the extracted features to differentiate perceptual contents. Based on the resulting similarity relationships, we dynamically generate pseudo-labels and use them to guide model training for video hash generation. In addition, we design dual constraints so that the hash codes achieve satisfactory robustness and discrimination. Extensive experiments demonstrate that the proposed scheme achieves better copy detection performance than existing schemes and performs well even under manipulations not seen during training.
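The pseudo-labeling step can be illustrated with a minimal sketch. It assumes the correlation distance is one minus the Pearson correlation between two feature vectors, and that pairs are labeled by comparing this distance with two thresholds. The function names and the thresholds `t_sim` and `t_dis` are hypothetical and do not come from the paper, which may define both the distance and the labeling rule differently.

```python
import math


def correlation_distance(u, v):
    """Return 1 - Pearson correlation of two equal-length feature vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cu = [a - mean_u for a in u]
    cv = [b - mean_v for b in v]
    num = sum(a * b for a, b in zip(cu, cv))
    den = math.sqrt(sum(a * a for a in cu)) * math.sqrt(sum(b * b for b in cv))
    if den == 0:
        # Assumption: a constant vector is treated as uncorrelated.
        return 1.0
    return 1.0 - num / den


def pseudo_labels(features, t_sim=0.3, t_dis=0.7):
    """Label each pair 1 (similar), 0 (dissimilar), or None (undecided).

    t_sim and t_dis are illustrative thresholds, not values from the paper.
    """
    n = len(features)
    labels = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = correlation_distance(features[i], features[j])
            if d <= t_sim:
                labels[i][j] = 1
            elif d >= t_dis:
                labels[i][j] = 0
    return labels


# Toy features: the first two vectors are nearly proportional,
# the third is the reverse of the first.
features = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1]]
labels = pseudo_labels(features)
```

In this sketch, pairs whose distance falls between the two thresholds are left unlabeled so that ambiguous pairs do not contribute to training. This is one common design choice and is not confirmed by the source.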
Zixuan Yu, Xiaoping Liang, Lv Chen, Xianquan Zhang, Zhenjun Tang
Mengzhu Yu, Zhenjun Tang, H. L. Zhuang, Xiaoping Liang, Zhixin Li, Xianquan Zhang
Zhao Yu-xin, Guangjie Liu, Yuewei Dai, Zhiquan Wang
Zixuan Yu, Xiaoping Liang, Chen Lv, Xianquan Zhang, Zhenjun Tang