JOURNAL ARTICLE

Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

Abstract

Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical, Transformer-based feature representation. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing on this observation, this paper presents a hierarchical speech Transformer with shifted windows, called Speech Swin-Transformer, that aggregates multi-scale emotion features for speech emotion recognition (SER). Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, each composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer explores local inter-frame emotional information across the frame patches of each segment patch. We also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of the Transformer from frame level to segment level. Experimental results demonstrate that the proposed Speech Swin-Transformer outperforms state-of-the-art methods.
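The three structural operations named in the abstract (window partitioning, window shifting, and patch merging) can be illustrated along the time axis alone. The following is a minimal sketch and not the authors' implementation: the function names, the one-dimensional frame layout, and the cyclic half-window shift (borrowed from the original vision Swin Transformer) are all assumptions. The attention computation inside each window and the linear projection that normally follows patch merging are omitted.

```python
from typing import List, Sequence

Frame = List[float]  # one frame patch as a feature vector (hypothetical layout)


def partition_windows(frames: Sequence[Frame], window: int) -> List[List[Frame]]:
    """Split a frame sequence into non-overlapping windows (segment patches)."""
    if window <= 0 or len(frames) % window != 0:
        raise ValueError("sequence length must be a positive multiple of window")
    return [list(frames[i:i + window]) for i in range(0, len(frames), window)]


def cyclic_shift(frames: Sequence[Frame], shift: int) -> List[Frame]:
    """Roll the sequence left by `shift` frames.

    Partitioning the rolled sequence yields windows that straddle the
    boundaries of the unshifted windows (assumed shifting scheme).
    """
    if not frames:
        return []
    shift %= len(frames)
    return list(frames[shift:]) + list(frames[:shift])


def merge_patches(frames: Sequence[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` neighbouring frames into one coarser patch."""
    if factor <= 0 or len(frames) % factor != 0:
        raise ValueError("sequence length must be a positive multiple of factor")
    merged = []
    for i in range(0, len(frames), factor):
        patch: Frame = []
        for frame in frames[i:i + factor]:
            patch.extend(frame)  # stack neighbouring features channel-wise
        merged.append(patch)
    return merged


# Eight frame patches with a single feature each, labelled by time index.
frames = [[float(t)] for t in range(8)]

regular = partition_windows(frames, window=4)
shifted = partition_windows(cyclic_shift(frames, shift=2), window=4)
coarse = merge_patches(frames, factor=2)
```

With these toy inputs, the regular windows cover frames 0-3 and 4-7, while the first shifted window covers frames 2-5 and therefore spans the boundary between the two regular windows. The second shifted window wraps around (frames 6, 7, 0, 1); the vision Swin Transformer masks attention between such non-adjacent positions, and whether the speech model does the same is not stated in the abstract. Merging halves the sequence length and doubles the feature dimension, which is how later stages would see a wider temporal context.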

Keywords:
Computer science, Transformer, Speech recognition, Spectrogram, Phrase, Utterance, Artificial intelligence, Natural language processing, Pattern recognition (psychology), Engineering

Metrics

Cited By: 29
FWCI (Field Weighted Citation Impact): 31.80
Refs: 24
Citation Normalized Percentile: 1.00
Is in top 1%
Is in top 10%

Topics

Emotion and Mood Recognition
Social Sciences →  Psychology →  Experimental and Cognitive Psychology
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

Journal: 2021 IEEE/CVF International Conference on Computer Vision (ICCV) Year: 2021 Pages: 9992-10002

JOURNAL ARTICLE

Speech Emotion Recognition Based on Swin-Transformer

Zirou Liao, Shaoping Shen

Journal: Journal of Physics Conference Series Year: 2023 Vol: 2508 (1) Pages: 012056-012056

JOURNAL ARTICLE

Temporal-frequency joint hierarchical transformer with dynamic windows for speech emotion recognition

Yonghong Fan, Heming Huang, Huiyun Zhang, Ziqi Zhou

Journal: Engineering Applications of Artificial Intelligence Year: 2025 Vol: 161 Pages: 112152-112152