The rapid advancement of unmanned aerial vehicle (UAV) technology has increased the accessibility and capabilities of drones, enabling various applications but also raising serious cybersecurity and public safety concerns. Drones can be misused for surveillance, smuggling, or sabotage—especially in environments where visual or radio frequency (RF)-based detection is limited. In such cases, acoustic detection offers a promising alternative by leveraging the unique sound signatures emitted by drones. This paper proposes a hybrid deep learning approach for drone detection and classification based on acoustic signals. The method combines Conformer-based architectures, which integrate self-attention mechanisms from Transformer models, with traditional deep learning components—Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Convolutional Recurrent Neural Networks (CRNN). These combinations aim to improve both spatial feature extraction and sequential modeling under real-world conditions. We developed and tested three hybrid models: (1) CNN + Conformer, which captures spatial features and long-term dependencies; (2) CRNN + Conformer, combining convolutional, recurrent, and attention-based features for enhanced contextual understanding; and (3) RNN + Conformer, focused on improving sequence learning in noisy environments. A custom dataset of 1,500 drone audio clips was collected using a parabolic microphone under varying environmental conditions. The recordings, which include two commercial drone models (DJI Phantom 4 Pro and DJI Mavic 2 Pro), were augmented to simulate background noise and improve generalization. Experimental results show that the CNN-Conformer model achieved the highest accuracy (98%), outperforming CRNN-Conformer (97%) and RNN-Conformer (88%). The inclusion of self-attention significantly improved detection robustness in noisy settings. 
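The abstract states that the recordings were augmented with background noise to improve generalization. As an illustration only (the paper does not specify its augmentation procedure), a common approach is to mix a noise clip into each recording at a controlled signal-to-noise ratio. The following minimal stdlib sketch shows that idea; the function name and the synthetic 100 Hz "drone tone" are hypothetical, not from the paper.

```python
import math
import random

def mix_at_snr(signal, noise, snr_db):
    """Mix a background-noise clip into a recording at a target
    signal-to-noise ratio in dB. Both inputs are equal-length lists
    of float samples; the noise is rescaled so the mixture reaches
    the requested SNR."""
    sig_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Noise power required for the desired ratio: P_sig / 10^(SNR/10).
    target_noise_power = sig_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]

# Example: one second of a synthetic 100 Hz tone (a stand-in for a
# drone's rotor hum) mixed with uniform white noise at 10 dB SNR.
sr = 8000
tone = [math.sin(2 * math.pi * 100 * t / sr) for t in range(sr)]
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(sr)]
augmented = mix_at_snr(tone, noise, snr_db=10)
```

Sweeping `snr_db` over a range of values (e.g. 0 to 20 dB) during training is one standard way to expose a model to the varying environmental conditions the abstract describes.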
This study demonstrates the effectiveness of acoustic-based drone detection using hybrid deep learning models and offers a viable solution in scenarios where RF- and visual-based methods fall short.
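The abstract attributes the models' robustness in noisy settings to the Conformer's self-attention mechanism, which lets each audio frame weight information from every other frame in the sequence. As a minimal stdlib sketch of the underlying operation (scaled dot-product attention, not the paper's full Conformer block), consider:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention over a sequence of
    d-dimensional frame embeddings (lists of lists of floats).
    Each output frame is a weighted average of all value frames,
    with weights derived from query-key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # non-negative, sums to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: three 2-dimensional "frames" attending to each other.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frames, frames, frames)
```

Because every output frame pools over the whole sequence, a frame corrupted by a burst of noise can still be contextualized by cleaner neighboring frames, which is one intuition for the robustness gain the abstract reports.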
Manjia Wu, Weige Xie, Xiufang Shi, Panyu Shao, Zhiguo Shi
Amna Mazen, Ashraf Saleem, Kamyab Yazdipaz, Ana Dyreson
Cengizhan Yapıcıoğlu, Mehmet Demirci, M. Ali Akcayol
Mahdjoubi Issame, Abdelhafid Benyounes