JFA for speaker recognition with random digit strings

Themos Stafylakis; Patrick Kenny; Md. Jahangir Alam; Marcel Kockmann

doi:10.21437/interspeech.2015-82

JOURNAL ARTICLE

Year: 2015 Pages: 190-194

Abstract

In this paper, we examine the use of Joint Factor Analysis (JFA) methods on RSR2015 Part III (digits) [1]. A tied-mixture HMM is used to segment the utterances into digits, while Joint Factor Analysis and a trainable backend are deployed for feature extraction and log-likelihood ratio (LLR) calculation, respectively. A novel approach for digit-dependent fusion of UBM-component log-likelihood ratios is introduced, yielding the best results so far. The fusion of 5 different JFA features gives an equal-error rate of 3.6%, compared to 6.3% attained by a baseline GMM-UBM model with score normalization.

JFA for feature extraction

JFA vs. i-vectors

• The text-independent i-vector/PLDA paradigm has not been successful in text-dependent speaker recognition: the speaker-phrase variability is hard to confine to a low-dimensional subspace.
• JFA offers the flexibility of confining the channel effects to a subspace while allowing the speaker-phrase factors to lie in the full supervector space [2].

Main JFA equation

S = m + Ux + Vy + Dz    (1)

• The hidden variable x varies from one recording to another and is intended to model channel effects.
• In text-independent speaker recognition, the term Dz is usually dropped and speakers are characterized by the low-dimensional vector y. Here, we extract either z or y features [3].

JFA on utterances segmented into digits

• JFA can be extended to utterances that are segmented into HMM states (digits).
• Features can be global (digit-independent) or local (digit-dependent), and supervector-sized (z-vectors) or subspace (y-vectors).

Segmentation and Baum-Welch stats

Tied-Mixture HMM

• Train a UBM and use its means and covariance matrices as the codebook for a Tied-Mixture HMM (TMM).
• The TMM has a single Gaussian codebook and digit-dependent weights.
• It is very efficient to train and evaluate (Viterbi algorithm).
• We also use it, instead of the UBM, to extract Baum-Welch statistics for local features.
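The idea of extracting local (digit-dependent) Baum-Welch statistics from a tied-mixture model can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes diagonal covariances, assumes the per-frame digit labels have already been produced by a segmentation step such as Viterbi, and all names (`log_gauss_diag`, `tmm_baum_welch_stats`, `digit_weights`) and the synthetic data are hypothetical.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log-density of each frame under each codebook Gaussian.
    Assumption: diagonal covariances. X: (T, F); means, variances: (C, F)."""
    diff = X[:, None, :] - means[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (C,)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)    # (T, C)
    return -0.5 * (log_norm[None, :] + mahal)

def tmm_baum_welch_stats(X, digit_labels, means, variances, digit_weights):
    """Zero- and first-order statistics per digit.
    The Gaussians are shared (single codebook); only the mixture
    weights depend on the digit assigned to each frame.
    digit_labels: (T,) integer digit index per frame.
    digit_weights: (D, C), one row of mixture weights per digit.
    Returns N: (D, C) and F: (D, C, F_dim)."""
    D, C = digit_weights.shape
    log_post = log_gauss_diag(X, means, variances) + np.log(digit_weights[digit_labels])
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                                      # (T, C), rows sum to 1
    N = np.zeros((D, C))
    F = np.zeros((D, C, X.shape[1]))
    for d in range(D):
        mask = digit_labels == d
        N[d] = post[mask].sum(axis=0)        # soft frame counts per component
        F[d] = post[mask].T @ X[mask]        # posterior-weighted frame sums
    return N, F

# Synthetic example (hypothetical sizes, random data).
rng = np.random.default_rng(0)
T, F_dim, C, D = 200, 4, 8, 10
X = rng.normal(size=(T, F_dim))
means = rng.normal(size=(C, F_dim))
variances = np.ones((C, F_dim))
digit_weights = rng.dirichlet(np.ones(C), size=D)
digit_labels = rng.integers(0, D, size=T)
N, F = tmm_baum_welch_stats(X, digit_labels, means, variances, digit_weights)
```

Summing `N` and `F` over the digit axis gives statistics pooled over the whole utterance, which is one way to relate local statistics to global ones in this sketch.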
Training and evaluating the system

Training the JFA and backend

• Train a JFA model using both local and global features, z- or y-vectors. (Several combinations are possible.)
• Extract z- or y-vectors and project them onto the unit sphere.
• Train a Joint-Density Backend (JDB) per feature.

Evaluating the model

• Apply Viterbi segmentation, extract z- or y-vectors and use the JDB to calculate LLRs for each trial.
• Apply score normalization and fuse the score-normalized LLRs coming from multiple features.

Joint-Density Backend

An alternative to PLDA

• We model the joint distribution of pairs of enrollment and test vectors under the same-speaker hypothesis [4].
• We use "target" trials from the training set, t = [y_e^T, y_t^T]^T.
• We estimate the mean and the covariance matrix (C). Assuming zero mean, C is as follows:
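The expression for C is not reproduced in this text, and the sketch below does not attempt to restore it. It only illustrates, under stated assumptions, how a backend of this general kind could score a trial: vectors are projected onto the unit sphere, a zero-mean Gaussian is fitted to stacked target pairs t = [y_e^T, y_t^T]^T, and the different-speaker hypothesis is assumed to keep the same marginal blocks with the cross-covariance set to zero. That block-diagonal assumption, the small ridge term, the function names and the synthetic data are all hypothetical choices, not taken from the paper.

```python
import numpy as np

def length_normalize(Y):
    """Project each row vector onto the unit sphere."""
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def train_jdb(Y_enroll, Y_test, ridge=1e-6):
    """Second-moment (zero-mean) covariance of stacked target pairs.
    Rows of Y_enroll and Y_test are paired and share a speaker."""
    T = np.hstack([Y_enroll, Y_test])                 # (n, 2d)
    return T.T @ T / T.shape[0] + ridge * np.eye(T.shape[1])

def gauss_logpdf_zero_mean(t, C):
    """Log-density of t under N(0, C)."""
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (t.size * np.log(2.0 * np.pi) + logdet + quad)

def jdb_llr(y_e, y_t, C):
    """LLR of same-speaker vs. different-speaker for one trial.
    Assumption: under the different-speaker hypothesis the two vectors
    are independent, i.e. C with its off-diagonal blocks zeroed."""
    d = y_e.size
    t = np.concatenate([y_e, y_t])
    C_diff = np.zeros_like(C)
    C_diff[:d, :d] = C[:d, :d]
    C_diff[d:, d:] = C[d:, d:]
    return gauss_logpdf_zero_mean(t, C) - gauss_logpdf_zero_mean(t, C_diff)

# Synthetic data: a shared speaker vector plus independent noise per recording.
rng = np.random.default_rng(0)
d = 5

def make_pairs(n):
    s = rng.normal(size=(n, d))
    e = length_normalize(s + 0.5 * rng.normal(size=(n, d)))
    t = length_normalize(s + 0.5 * rng.normal(size=(n, d)))
    return e, t

E_tr, T_tr = make_pairs(2000)
C = train_jdb(E_tr, T_tr)

E_ev, T_ev = make_pairs(500)
target_llrs = np.array([jdb_llr(e, t, C) for e, t in zip(E_ev, T_ev)])
# Non-target trials: pair each enrollment vector with another speaker's test vector.
nontarget_llrs = np.array(
    [jdb_llr(e, t, C) for e, t in zip(E_ev, np.roll(T_ev, 1, axis=0))]
)
```

In the system described above, one such backend would be trained per feature, and the resulting LLRs would then go through score normalization and fusion, which this sketch leaves out.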

Keywords:

Speech recognition; Computer science; Segmentation; Subspace topology; Normalization (sociology); Feature vector; Pattern recognition (psychology); Hidden Markov model; Feature extraction; Artificial intelligence

Metrics

FWCI (Field Weighted Citation Impact): 2.83

Citation Normalized Percentile: 0.95

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Music and Audio Processing

Physical Sciences → Computer Science → Signal Processing
