Egocentric gaze estimation is a challenging task with promising applications in areas such as human-computer interaction and AR/VR. In this work, we propose a novel model based on the Video Swin Transformer architecture. By introducing a localized inductive bias, our model extracts essential local features from first-person videos during windowed self-attention, and it approximates global context modeling within the gaze region through the shifted-window mechanism. We evaluate our approach on EGTEA Gaze+, a publicly available dataset of egocentric activity videos. Experimental results demonstrate that our model achieves state-of-the-art performance.
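To make the locality prior concrete, the following is a minimal sketch (not the authors' released code) of windowed self-attention with a learned relative-position bias, the standard way Swin-style models inject a local inductive bias into the attention logits. The class name `WindowAttention` and all dimensions (embedding size 96, window size 7, 3 heads) are illustrative assumptions, not values taken from the paper.

```python
# Sketch: windowed self-attention with a learned relative-position bias.
# The bias table assigns one learnable logit per relative offset inside a
# window, which is the locality prior added to the attention scores.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim=96, window_size=7, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per head for every possible relative offset.
        self.rel_bias = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads))
        # Precompute, for every token pair in a window, the index into the table.
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size),
            indexing="ij")).flatten(1)                    # (2, W*W)
        rel = coords[:, :, None] - coords[:, None, :]     # (2, W*W, W*W)
        rel = rel.permute(1, 2, 0) + (window_size - 1)    # shift offsets to >= 0
        self.register_buffer(
            "rel_index", rel[..., 0] * (2 * window_size - 1) + rel[..., 1])

    def forward(self, x):  # x: (num_windows * batch, W*W, dim)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each (B_, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B_, heads, N, N)
        bias = self.rel_bias[self.rel_index.view(-1)].view(N, N, -1)
        attn = (attn + bias.permute(2, 0, 1)).softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(x)

# Usage: 8 windows of 7x7 = 49 tokens each, embedding dim 96.
out = WindowAttention()(torch.randn(8, 49, 96))
print(out.shape)  # torch.Size([8, 49, 96])
```

In the full model, consecutive blocks would alternate regular and shifted window partitions (e.g., cyclically shifting the feature map with `torch.roll` before partitioning) so that information propagates across window boundaries, which is how the shifted-window mechanism approximates global context.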