JOURNAL ARTICLE

Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos

Mert Cokelek, Halit Ozsoy, Nevrez İmamoğlu, Cagri Ozcinar, İnci Ayhan, Erkut Erdem, Aykut Erdem

Year: 2025, Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol: 48 (1), Pages: 329-345, Publisher: IEEE Computer Society

Abstract

Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.

Keywords:
Computer vision, Artificial intelligence, Computer science, Audio visual, Transformer, Computer graphics (images), Pattern recognition (psychology), Engineering, Multimedia

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 63
Citation Normalized Percentile: 0.32

Topics

Multisensory perception and integration
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Color perception and design
Social Sciences →  Psychology →  Social Psychology

Related Documents

JOURNAL ARTICLE

Saliency Prediction Network for 360° Videos

Youqiang Zhang, Feng Dai, Yike Ma, Hongliang Li, Qiang Zhao, Yongdong Zhang

Journal: IEEE Journal of Selected Topics in Signal Processing, Year: 2019, Vol: 14 (1), Pages: 27-37

BOOK-CHAPTER

Saliency Detection in 360° Videos

Ziheng Zhang, Yanyu Xu, Jingyi Yu, Shenghua Gao

Lecture Notes in Computer Science, Year: 2018, Pages: 504-520

BOOK-CHAPTER

Panoramic Vision Transformer for Saliency Detection in 360° Videos

Heeseung Yun, Sehun Lee, Gunhee Kim

Lecture Notes in Computer Science, Year: 2022, Pages: 422-439

BOOK-CHAPTER

Visual Quality Assessment for 360° Videos

Ashutosh Singla

T-Labs Series in Telecommunication Services, Year: 2023, Pages: 65-84