JOURNAL ARTICLE

SwinGaze: Egocentric Gaze Estimation with Video Swin Transformer

Abstract

Egocentric gaze estimation is a challenging and significant task with promising applications in areas such as human-computer interaction and AR/VR. In this work, we propose a novel model based on the Video Swin Transformer architecture. By introducing a localized inductive bias, our model extracts essential local features from first-person videos during windowed self-attention computation, and it approximates global context modeling within the gaze region using a shifted-window approach. We evaluate our approach on EGTEA Gaze+, a publicly available dataset of egocentric activity videos. Experimental results demonstrate that our model achieves state-of-the-art performance.

Keywords:
Gaze, Computer science, Transformer, Artificial intelligence, Computer vision, Computation, Human–computer interaction, Algorithm, Engineering

Topics

Gaze Tracking and Assistive Technology
Physical Sciences →  Computer Science →  Human-Computer Interaction
Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Retinal Imaging and Analysis
Health Sciences →  Medicine →  Radiology, Nuclear Medicine and Imaging
