End-to-End Video Object Detection with Spatial-Temporal Transformers

Lu H; Qianyu Zhou; Xiangtai Li; Li Niu; Guangliang Cheng; Xiao Li; Wenxuan Liu; Yunhai Tong; Lizhuang Ma; Liqing Zhang

doi:10.1145/3474085.3475285

JOURNAL ARTICLE

Year: 2021 Pages: 1507-1516

Abstract

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while achieving performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature aggregation components, e.g., optical flow, recurrent neural networks, and relation networks. Moreover, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the spatial details of multiple frames, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope TransVOD can provide a new perspective for video object detection.
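
The abstract describes fusing object queries across frames. As a rough illustration only, the sketch below shows one way such fusion could look: each current-frame query attends over the pooled queries of the reference frames and adds the result back as a residual. This is not the paper's TQE. The function name `fuse_queries`, the toy vectors, and the single-head, projection-free attention are all assumptions made for brevity; a real Transformer layer would normally include learned projections, multiple heads, and normalization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current, references):
    """Hypothetical query fusion (an assumption, not the paper's TQE).

    current:    list of query vectors for the current frame
    references: list of frames, each a list of query vectors
    Returns one fused vector per current-frame query.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    # Pool the queries of all reference frames into one key/value set.
    pool = [q for frame in references for q in frame]
    fused = []
    for query in current:
        # Scaled dot-product scores against every pooled reference query.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in pool]
        weights = softmax(scores)
        # Attention-weighted average of the reference queries.
        context = [sum(w * key[i] for w, key in zip(weights, pool))
                   for i in range(dim)]
        # Residual connection keeps the current-frame information.
        fused.append([a + c for a, c in zip(query, context)])
    return fused


# Toy usage: one current-frame query, two reference frames with one query each.
current_queries = [[1.0, 0.0]]
reference_queries = [[[0.0, 2.0]], [[0.0, 2.0]]]
fused_queries = fuse_queries(current_queries, reference_queries)
```

The number of fused queries always equals the number of current-frame queries, so a downstream decoder for the current frame would see the same query count regardless of how many reference frames are supplied.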

Keywords:

Computer science; Encoder; Transformer; Object detection; Artificial intelligence; ENCODE; Computer vision; Pattern recognition (psychology); Engineering

Metrics

FWCI (Field Weighted Citation Impact): 8.28

Citation Normalized Percentile: 0.98

Topics

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Advanced Image and Video Retrieval Techniques

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

End-to-End Video Gaze Target Detection with Spatial-Temporal Transformers

End‐To‐End Multiple Object Detection and Tracking With Spatio‐Temporal Transformers

End-to-End Referring Video Object Segmentation with Multimodal Transformers

End-to-End Spatio-Temporal Action Localisation with Video Transformers