Few-shot action recognition has recently received increasing attention and achieved remarkable progress. However, previous methods mainly rely on limited unimodal data (e.g., RGB frames), while multimodal information remains relatively underexplored. In this paper, we propose a novel Active Multimodal Few-shot Action Recognition (AMFAR) framework, which actively finds the reliable modality for each sample based on task-dependent context information to improve the few-shot reasoning procedure. In meta-training, we design an Active Sample Selection (ASS) module that uses modality-specific posterior distributions to organize query samples with large differences in modality reliability into different groups. In addition, we design an Active Mutual Distillation (AMD) module that captures discriminative task-specific knowledge from the reliable modality to improve the representation learning of the unreliable modality through bidirectional knowledge distillation. In meta-testing, we adopt Adaptive Multimodal Inference (AMI) to adaptively fuse the modality-specific posterior distributions, placing a larger weight on the reliable modality. Extensive experimental results on four public benchmarks demonstrate that our model achieves significant improvements over existing unimodal and multimodal methods.
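As an illustration of the adaptive fusion idea described for AMI, the following is a minimal sketch and not the paper's actual formulation. It assumes that the reliability of a modality can be approximated by the maximum probability of its posterior distribution, and that the fusion weights are those reliabilities normalized to sum to one. The function names `softmax` and `adaptive_fuse` and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fuse(scores_a, scores_b):
    """Fuse two modality-specific posteriors for one query sample.

    Assumption (not taken from the paper): reliability is approximated by
    the maximum posterior probability of each modality, and the fusion
    weights are the reliabilities normalized to sum to one.
    """
    p_a = softmax(scores_a)
    p_b = softmax(scores_b)
    conf_a, conf_b = max(p_a), max(p_b)
    w_a = conf_a / (conf_a + conf_b)
    w_b = 1.0 - w_a
    # Convex combination of two distributions is itself a distribution.
    fused = [w_a * x + w_b * y for x, y in zip(p_a, p_b)]
    return fused, (w_a, w_b)


# Hypothetical 5-way task: modality A is confident, modality B is nearly uniform.
scores_a = [2.0, 0.1, 0.0, -0.5, 0.2]
scores_b = [0.3, 0.2, 0.4, 0.1, 0.3]
fused, (w_a, w_b) = adaptive_fuse(scores_a, scores_b)
predicted_class = max(range(len(fused)), key=lambda i: fused[i])
```

In this toy example the more confident modality receives the larger weight, so the fused prediction follows it. Other reliability measures (for instance, entropy of the posterior) would fit the same structure.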