Speaker diarization, which aims to determine "who spoke when," has long been regarded as a fundamental building block for audio-based intelligent services. Recently, end-to-end neural diarization (EEND) has been proposed, and it outperforms traditional clustering-based approaches in some scenarios. However, EEND processes each block of a recording independently and thus suffers from the inter-block permutation problem, i.e., an ambiguity in speaker label assignments across blocks. To address this challenge, we propose a novel visual-enhanced end-to-end speaker diarization solution, named AV-EEND. Specifically, by integrating encoders for both audio and visual embeddings, joint speech activities become available under both an ambiguous speaker order (from audio) and a predetermined one (from vision). In this way, visual cues guide the audio information from the ambiguous speaker order to the predetermined one, thereby resolving the inter-block permutation problem. To this end, an audio-visual classifier with visual augmentation, cross-attention, and self-attention mechanisms is employed to further strengthen the correlation between the visual and audio information. Extensive experiments on the public AMI dataset demonstrate that AV-EEND outperforms EEND and a state-of-the-art audio-visual system.
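To make the fusion mechanism concrete, below is a minimal PyTorch-style sketch of an audio-visual classifier of the kind the abstract describes: audio frames attend to visual frames via cross-attention, the fused sequence is refined with self-attention, and a linear head predicts per-frame, per-speaker activity. All module names, dimensions, and the residual/normalization layout are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of cross-attention + self-attention audio-visual
# fusion for diarization; dimensions and structure are assumptions.
import torch
import torch.nn as nn


class AudioVisualClassifier(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_speakers=4):
        super().__init__()
        # Cross-attention: audio frames (queries) attend to visual
        # frames (keys/values), injecting the speaker order fixed by
        # the visual stream into the audio representation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Self-attention refines the fused representation across time.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Per-frame speech-activity logits, one per speaker slot.
        self.head = nn.Linear(dim, num_speakers)

    def forward(self, audio_emb, visual_emb):
        # audio_emb:  (batch, T_audio, dim) from the audio encoder
        # visual_emb: (batch, T_video, dim) from the visual encoder
        fused, _ = self.cross_attn(audio_emb, visual_emb, visual_emb)
        x = self.norm1(audio_emb + fused)
        refined, _ = self.self_attn(x, x, x)
        x = self.norm2(x + refined)
        # Sigmoid per speaker: overlapping speech is allowed.
        return torch.sigmoid(self.head(x))  # (batch, T_audio, num_speakers)


# Usage with random stand-in embeddings:
audio = torch.randn(2, 100, 256)   # 100 audio frames
video = torch.randn(2, 25, 256)    # 25 video frames
activity = AudioVisualClassifier()(audio, video)
print(activity.shape)  # torch.Size([2, 100, 4])
```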