JOURNAL ARTICLE

Audio-based multimedia event detection using deep recurrent neural networks

Abstract

Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the best features for this task, nonspeech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds. MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information or bag-of-words models. In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes"; the sequence of frame-level confidence distributions is then used as a variable-length clip-level representation. Such confidence vector sequences are fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.
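The second stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TinyLSTMClassifier`, the helper `bag_of_frames`, the layer sizes (8 noisemes, 16 hidden units, 3 events), and the random, untrained weights are all assumptions made for illustration. The sketch only shows how a variable-length sequence of frame-level confidence vectors can be consumed by an LSTM, and why such a model is sensitive to frame order while an averaged "bag of frames" representation is not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyLSTMClassifier:
    """Single-layer LSTM followed by a softmax over events (hypothetical, untrained)."""

    def __init__(self, n_noisemes, n_hidden, n_events, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked weight matrix for the input, forget, cell, and output gates.
        self.W = rng.normal(0.0, 0.1, size=(n_noisemes + n_hidden, 4 * n_hidden))
        self.b = np.zeros(4 * n_hidden)
        self.W_out = rng.normal(0.0, 0.1, size=(n_hidden, n_events))
        self.b_out = np.zeros(n_events)
        self.n_hidden = n_hidden

    def predict_proba(self, confidences):
        """confidences: (n_frames, n_noisemes) array; n_frames may differ per clip."""
        h = np.zeros(self.n_hidden)  # hidden state
        c = np.zeros(self.n_hidden)  # cell state
        for x_t in confidences:
            gates = np.concatenate([x_t, h]) @ self.W + self.b
            i, f, g, o = np.split(gates, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        # Classify the clip from the final hidden state.
        return softmax(h @ self.W_out + self.b_out)

def bag_of_frames(confidences):
    """Order-insensitive clip representation: the mean confidence vector."""
    return confidences.mean(axis=0)

# Synthetic stand-in for stage-one output: 50 frames, 8 assumed noiseme classes.
rng = np.random.default_rng(1)
clip = softmax(rng.normal(size=(50, 8)))
reversed_clip = clip[::-1]

model = TinyLSTMClassifier(n_noisemes=8, n_hidden=16, n_events=3)
p_forward = model.predict_proba(clip)
p_reversed = model.predict_proba(reversed_clip)
```

Reversing the clip leaves `bag_of_frames` unchanged but alters the LSTM output, which is the kind of longer-range temporal information the abstract refers to. In practice the weights would be trained on labeled clips, typically with a deep learning framework rather than hand-written NumPy.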

Keywords:
Computer science; Frame (networking); Artificial intelligence; Event (particle physics); Artificial neural network; Recurrent neural network; Task (project management); Set (abstract data type); Speech recognition; Pattern recognition (psychology)

Metrics

Cited By: 74
FWCI (Field Weighted Citation Impact): 7.62
Refs: 32
Citation Normalized Percentile: 0.98

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Audio Event Detection Using Deep Neural Networks

Minkyu Lim, Dong‐Hyun Lee, Ho-Sung Park, Ji‐Hwan Kim

Journal: Journal of Digital Contents Society Year: 2017 Vol: 18 (1) Pages: 183-190

JOURNAL ARTICLE

Audio Event Classification Using Deep Neural Networks

Minkyu Lim, Dong‐Hyun Lee, Kwang Ho Kim, Ji‐Hwan Kim

Journal: Phonetics and Speech Sciences Year: 2015 Vol: 7 (4) Pages: 27-33

JOURNAL ARTICLE

Audio-based snore detection using deep neural networks

Jiali Xie, Xavier Aubert, Xi Long, Johannes van Dijk, Bruno Arsenali, Pedro Fonseca, Sebastiaan Overeem

Journal: Computer Methods and Programs in Biomedicine Year: 2020 Vol: 200 Pages: 105917-105917