JOURNAL ARTICLE

On the Use of Cross-module Attention Statistics Pooling for Speaker Verification

Abstract

In deep learning-based speaker verification frameworks, the extraction of a speaker embedding vector plays a key role. In this contribution, we propose a hybrid neural network that employs a cross-module attention pooling mechanism to extract speaker-discriminant utterance-level embeddings. In particular, the proposed system places a 2D convolutional neural network (CNN)-based feature extraction module in cascade with a frame-level network, which is composed of a fully time delay neural network (TDNN) module and a TDNN-long short-term memory (TDNN-LSTM) hybrid module connected in parallel. The proposed system also employs cross-module attention statistics pooling, which aggregates speaker information within an utterance-level context by capturing the complementarity between the two parallel modules. We conduct a set of experiments on the VoxCeleb corpus to evaluate the proposed system; the proposed hybrid network provides better results than conventional approaches trained on the same dataset.
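The abstract does not give the pooling equations, so the following is only a minimal NumPy sketch of one way cross-module attention statistics pooling could work, not the authors' formulation. It assumes that both parallel modules emit frame-level features of shape (T, D), that the attention weights applied to one module are scored from the other module's frames, and that the utterance-level vector is the concatenation of the weighted mean and standard deviation of each module. All function names, parameter names, and shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(frames, W, b, v):
    """Score each frame with a one-hidden-layer attention network.

    frames: (T, D) frame-level features; W: (D, A); b: (A,); v: (A,).
    Returns (T,) non-negative weights that sum to 1 over time.
    """
    scores = np.tanh(frames @ W + b) @ v
    return softmax(scores, axis=0)


def weighted_stats(frames, weights, eps=1e-8):
    """Attention-weighted mean and standard deviation over time, shape (2 * D,)."""
    mean = weights @ frames
    var = weights @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, eps, None))  # clip guards against tiny negatives
    return np.concatenate([mean, std])


def cross_module_attentive_stats_pooling(h_tdnn, h_lstm, params):
    """Pool two parallel frame-level streams into one utterance-level vector.

    Assumption (not stated in the abstract): each module is weighted by
    attention scores computed from the *other* module's frames.
    """
    w_for_tdnn = attention_weights(h_lstm, *params["from_lstm"])
    w_for_lstm = attention_weights(h_tdnn, *params["from_tdnn"])
    return np.concatenate([
        weighted_stats(h_tdnn, w_for_tdnn),
        weighted_stats(h_lstm, w_for_lstm),
    ])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, A = 200, 16, 8  # frames, feature dim, attention dim (all illustrative)
    params = {
        name: (rng.normal(size=(D, A)), np.zeros(A), rng.normal(size=A))
        for name in ("from_tdnn", "from_lstm")
    }
    embedding = cross_module_attentive_stats_pooling(
        rng.normal(size=(T, D)), rng.normal(size=(T, D)), params
    )
    print(embedding.shape)
```

In this sketch the pooled vector has 4 * D dimensions (mean and standard deviation for each of the two modules). In a trained system the attention parameters would be learned jointly with the rest of the network rather than drawn at random.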

Keywords:
Computer science, Pooling, Artificial neural network, Speech recognition, Artificial intelligence, Time delay neural network, Feature extraction, Pattern recognition (psychology), Convolutional neural network, Speaker recognition

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.26
Refs: 31
Citation Normalized Percentile: 0.54

Topics

Speech Recognition and Synthesis
Physical Sciences →  Computer Science →  Artificial Intelligence
Speech and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing