Speaker diarization, which aims to determine "who spoke when," has long been regarded as a fundamental building block for audio-based intelligent services. Recently, end-to-end neural diarization (EEND) has been proposed, and it outperforms traditional clustering-based approaches in some scenarios. However, EEND processes each block of a recording independently and thus suffers from the inter-block permutation problem, i.e., an ambiguity in speaker label assignments across blocks. To address this challenge, we propose a novel visual-enhanced end-to-end speaker diarization solution, named AV-EEND. Specifically, by integrating encoders for both audio and visual embeddings, joint speech activities become available under both an ambiguous speaker order (from audio) and a predetermined one (from vision). In this way, visual cues guide the audio information from the ambiguous speaker order to the predetermined one, thereby resolving the inter-block permutation problem. To this end, an audio-visual classifier with visual augmentation, cross-attention, and self-attention mechanisms is employed to further strengthen the correlation between the visual and audio information. Extensive experiments on the public AMI dataset demonstrate that AV-EEND outperforms EEND and a state-of-the-art audio-visual system.
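To make the fusion mechanism concrete, below is a minimal PyTorch-style sketch of an audio-visual classifier of the kind the abstract describes: audio frames attend to visual frames via cross-attention, the fused sequence is refined with self-attention, and a linear head predicts per-frame, per-speaker activity. All module names, dimensions, and the residual/normalization layout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of cross-attention + self-attention audio-visual
# fusion for diarization; dimensions and structure are assumptions.
import torch
import torch.nn as nn


class AudioVisualClassifier(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_speakers=4):
        super().__init__()
        # Cross-attention: audio frames (queries) attend to visual
        # frames (keys/values), injecting the speaker order fixed by
        # the visual stream into the audio representation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention refines the fused representation across time.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Per-frame speech-activity logits, one per speaker slot.
        self.head = nn.Linear(dim, num_speakers)

    def forward(self, audio_emb, visual_emb):
        # audio_emb:  (batch, T_audio, dim) from the audio encoder
        # visual_emb: (batch, T_video, dim) from the visual encoder
        fused, _ = self.cross_attn(audio_emb, visual_emb, visual_emb)
        x = self.norm1(audio_emb + fused)
        refined, _ = self.self_attn(x, x, x)
        x = self.norm2(x + refined)
        # Sigmoid per speaker: overlapping speech is allowed.
        return torch.sigmoid(self.head(x))  # (batch, T_audio, num_speakers)


# Usage with random stand-in embeddings:
audio = torch.randn(2, 100, 256)   # 100 audio frames
video = torch.randn(2, 25, 256)    # 25 video frames
activity = AudioVisualClassifier()(audio, video)
print(activity.shape)  # torch.Size([2, 100, 4])
```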