JOURNAL ARTICLE

FPDT: a multi-scale feature pyramidal object detection transformer

Kailai Huang, Mi Wen, Chen Wang, Lina Ling

Year: 2023 | Journal: Journal of Applied Remote Sensing | Vol: 17 (02) | Publisher: SPIE

Abstract

Object detection is a fundamental component of autonomous driving algorithms, and with the rise of transformers in recent years, many computer vision tasks have integrated transformers into object detectors to achieve better generalization. Building a pure transformer-based detector seems an attractive choice; however, transformers are not omnipotent, and they come with painful drawbacks. Their fundamental operator, multi-head self-attention (MHSA), is computationally expensive due to its quadratic complexity, demanding unreasonably high memory usage and yielding critically low throughput. To address this issue, we use a convolution operation to simulate MHSA, referencing its philosophy and principles and migrating them to convolutional neural networks (CNNs). This yields a detector that is both powerful and fast. Furthermore, a multi-scale pyramidal feature extractor gives the detector a better view across scales. Overall, our proposed object detector follows the philosophy of the attention mechanism: a multi-scale feature pyramidal CNN encoder simulates the transformer, and a real transformer query neck extracts all of the objects at once and feeds them to the output heads. After training on the COCO2017 dataset, by combining the construction philosophy of the object detector with the philosophy and characteristics of the transformer, our FPDT-Tiny achieves an average precision (AP) of up to 34.1 in only 150 epochs, which is 16.0 and 10.8 points higher than the CNN-based YOLOv3-Base and SSD-300, respectively. Under the same number of epochs, our FPDT-Small achieves an AP of up to 37.7, which is 10.4 and 7.9 points higher than the transformer-based detectors YOLOS-Small and DETR-ResNet-152, respectively, demonstrating comparable performance.

Keywords:
Computer science, Detector, Transformer, Convolutional neural network, Encoder, Artificial intelligence, Object detection, Computer vision, Pattern recognition (psychology), Electrical engineering, Voltage, Engineering

Metrics

Cited By: 4
FWCI (Field Weighted Citation Impact): 0.73
Refs: 0
Citation Normalized Percentile: 0.65

Topics

Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Domain Adaptation and Few-Shot Learning
Physical Sciences →  Computer Science →  Artificial Intelligence