JOURNAL ARTICLE

SwinVid: Enhancing Video Object Detection Using Swin Transformer

Abstract

Why is object detection less accurate in video than in still images? Because the appearance of some video frames is degraded by fast motion, out-of-focus camera shots, and changes in posture. These challenges have made video object detection (VID) a growing area of research in recent years. Video object detection can be used in various healthcare applications, such as detecting and tracking tumors in medical imaging, monitoring the movement of patients in hospitals and long-term care facilities, and analyzing videos of surgeries to improve technique and training. It can also be used in telemedicine to help diagnose and monitor patients remotely. Existing VID techniques rely on recurrent neural networks or optical flow for feature aggregation to produce reliable features for detection. Some of these methods aggregate features at the full-sequence level or from nearby frames. To create feature maps, existing VID techniques frequently use Convolutional Neural Networks (CNNs) as the backbone network. Vision Transformers, however, have outperformed CNNs in various vision tasks, including image classification and object detection in still images. In this work, we propose using Swin Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based backbone networks for object detection in videos. The proposed architecture improves the accuracy of existing VID methods. We evaluate the proposed method on the ImageNet VID and EPIC KITCHENS datasets. We demonstrate that the method is efficient: it achieves 84.3% mean average precision (mAP) on ImageNet VID while using less memory than other leading VID techniques. The source code is available at https://github.com/amaharek/SwinVid.
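The abstract mentions aggregating features from nearby frames to obtain more reliable features for degraded frames. As a rough illustration only, the sketch below blends a frame's feature vector with those of its temporal neighbours, weighting each neighbour by a softmax over cosine similarity. This is a generic similarity-weighted scheme, not necessarily the aggregation used in SwinVid; the function names, the `window` parameter, and the toy feature vectors are all assumptions made for the example, and real backbone features would be multi-dimensional maps rather than short lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregate_features(frames, t, window=1):
    """Blend frame t's features with frames in [t - window, t + window].

    Hypothetical helper: weights are a softmax over cosine similarity to
    frame t, so neighbours that look like the current frame count more.
    """
    lo = max(0, t - window)
    hi = min(len(frames), t + window + 1)
    neighbours = frames[lo:hi]

    sims = [cosine(frames[t], f) for f in neighbours]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[t])
    return [
        sum(w * f[d] for w, f in zip(weights, neighbours))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Toy per-frame features; the third frame differs strongly from the others.
    frames = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8]]
    print(aggregate_features(frames, t=1, window=1))
```

With `window=0` the function returns the frame's own features unchanged, which makes it easy to compare a detector with and without temporal aggregation.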

Keywords:
Computer science, Transformer, Computer vision, Artificial intelligence, Engineering, Electrical engineering, Voltage

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 2.12
Refs: 56
Citation Normalized Percentile: 0.78

Topics

Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
IoT-based Smart Home Systems
Physical Sciences →  Engineering →  Electrical and Electronic Engineering

Related Documents

JOURNAL ARTICLE

SwinSOD: Salient object detection using swin-transformer

Shuang Wu, Guangjian Zhang, Xuefeng Liu

Journal:   Image and Vision Computing Year: 2024 Vol: 146 Pages: 105039-105039
JOURNAL ARTICLE

Two-Stage Underwater Object Detection Network Using Swin Transformer

Jia Liu, Shuang Liu, Shujuan Xu, Changjun Zhou

Journal:   IEEE Access Year: 2022 Vol: 10 Pages: 117235-117247
JOURNAL ARTICLE

Video Swin Transformer

Ze Liu, Ning Jia, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu

Journal:   2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Year: 2022 Pages: 3192-3201