Unsupervised domain adaptation (UDA) in videos is a challenging task that remains largely unexplored compared to image-based UDA techniques. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use transformer layers as a feature encoder and to incorporate spatial and temporal transferability relationships into the attention mechanism. We then develop the Transferable-guided Attention (TransferAttn) framework, which exploits the capacity of the transformer to adapt cross-domain knowledge across different backbones. To improve the transferability of ViT, we introduce a novel and effective module, the Domain Transferable-guided Attention Block (DTAB), which compels ViT to focus on the spatio-temporal transferability relationships among video frames by replacing the self-attention mechanism with a transferability attention mechanism. Experiments conducted on the UCF-HMDB and Kinetics-NEC Drone datasets, with different backbones such as I3D and STAM, show that TransferAttn outperforms state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other ViT-based methods for video UDA.
André Sacilotti, Samuel Felipe dos Santos, Nicu Sebe, Jurandy Almeida
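The abstract states that DTAB replaces standard self-attention with a transferability-guided attention over video frames, but it does not give the exact formulation. The sketch below is a minimal, illustrative interpretation, assuming (as in prior transferable-attention work) that per-frame transferability is estimated from the entropy of a small domain discriminator and used to re-weight the attention output; the module and variable names (`FrameDomainDiscriminator`, `TransferabilityGuidedAttention`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class FrameDomainDiscriminator(nn.Module):
    """Predicts, per frame token, the probability of belonging to the source domain."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        return torch.sigmoid(self.net(x)).squeeze(-1)    # (B, T)


class TransferabilityGuidedAttention(nn.Module):
    """Self-attention over frame tokens, re-weighted by per-frame transferability.

    Assumption: frames whose domain is hard to discriminate (high discriminator
    entropy) are more transferable and therefore receive larger weights.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.discriminator = FrameDomainDiscriminator(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Standard self-attention across the temporal dimension.
        attn_out, _ = self.attn(x, x, x)

        # Per-frame transferability: entropy of the domain prediction,
        # normalized to [0, 1] (Bernoulli entropy is at most log 2).
        p = self.discriminator(x)                                   # (B, T)
        entropy = -(p * (p + 1e-8).log() + (1 - p) * (1 - p + 1e-8).log())
        w = (entropy / torch.log(torch.tensor(2.0))).unsqueeze(-1)  # (B, T, 1)

        # Residual connection with transferability-weighted attention output.
        return self.norm(x + w * attn_out)


if __name__ == "__main__":
    frames = torch.randn(2, 16, 512)   # (batch, frames, embedding dim)
    block = TransferabilityGuidedAttention(dim=512)
    print(block(frames).shape)         # torch.Size([2, 16, 512])
```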