JOURNAL ARTICLE

Visual-Enhanced End-to-End Neural Diarization

Abstract

Speaker diarization, the task of labeling "who spoke when" in a recording, has long been treated as a fundamental component supporting audio-related intelligent services. End-to-end neural diarization (EEND) was recently proposed and outperforms traditional clustering-based approaches in some scenarios. However, EEND processes each block of a recording independently, so it faces the inter-block permutation problem, i.e., an ambiguity in how speaker labels are assigned across blocks. To address this challenge, we propose a novel visual-enhanced end-to-end speaker diarization solution named AV-EEND. Specifically, by integrating encoders for both audio and visual embeddings, the model makes joint speech activity available with either an ambiguous or a predetermined speaker order. Visual cues then guide the audio information from the ambiguous speaker order to the predetermined one, thereby solving the inter-block permutation problem. To further strengthen the correlation between visual and audio information, an audio-visual classifier with visual augmentation, cross-attention, and self-attention mechanisms is employed. Extensive experiments on the public AMI dataset demonstrate that AV-EEND outperforms both EEND and the state-of-the-art audio-visual system.
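The inter-block permutation problem mentioned in the abstract can be illustrated with a small sketch. This is not the AV-EEND model: it is a brute-force alignment under the assumption that a fixed-order reference activity (here a hypothetical, hand-written stand-in for visually derived speech activity) is available for each block. The function name `align_block` and all data values are illustrative only.

```python
from itertools import permutations


def align_block(pred, ref):
    """Reorder the speaker rows of `pred` to best match `ref`.

    pred, ref: lists of per-speaker binary activity sequences
    (one row per speaker, same number of frames in every row).
    Returns (best_perm, reordered), where reordered[i] = pred[best_perm[i]].
    """
    n = len(ref)

    def mismatches(perm):
        # Count frames where the permuted prediction disagrees with the reference.
        return sum(
            p != r
            for i, j in enumerate(perm)
            for p, r in zip(pred[j], ref[i])
        )

    best = min(permutations(range(n)), key=mismatches)
    return best, [pred[j] for j in best]


# Audio-only output for two blocks of one recording. Each block is
# processed independently, so the speaker row order is arbitrary.
audio_block1 = [[1, 1, 0, 0], [0, 0, 1, 1]]  # rows happen to be: A, B
audio_block2 = [[0, 1, 1, 0], [1, 0, 0, 1]]  # rows happen to be: B, A

# Hypothetical fixed-order reference (rows always A, B), standing in
# for activity estimated from visual cues; it may be slightly noisy.
visual_block1 = [[1, 1, 0, 0], [0, 0, 1, 0]]
visual_block2 = [[1, 0, 0, 1], [0, 1, 1, 0]]

perm1, aligned1 = align_block(audio_block1, visual_block1)
perm2, aligned2 = align_block(audio_block2, visual_block2)
print(perm1, perm2)
```

In this toy case the second block's rows are swapped back into the reference order, so speaker labels stay consistent across blocks. Exhaustive search over permutations is only practical for a small number of speakers.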

Keywords:
Speaker diarisation; Computer science; Speech recognition; End-to-end principle; Ambiguity; Speaker recognition; Audio visual; Block (permutation group theory); Classifier (UML); Embedding; Visualization; Artificial intelligence; Multimedia

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.19
Refs: 24
Citation Normalized Percentile: 0.37

Topics

Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence

Related Documents

JOURNAL ARTICLE

End-to-End Audio-Visual Neural Speaker Diarization

Maokui He, Jun Du, Chin‐Hui Lee

Journal: Interspeech 2022 Year: 2022
JOURNAL ARTICLE

ASR-Aware End-to-End Neural Diarization

Aparna Khare, Eun‐Jung Han, Yuguang Yang, Andreas Stolcke

Journal: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) Year: 2022 Pages: 8092-8096
JOURNAL ARTICLE

End-to-End Neural Speaker Diarization with Self-Attention

Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

Journal: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Year: 2019