In this paper, we examine the use of Joint Factor Analysis (JFA) methods on RSR2015 Part III (digits) [1]. A tied-mixture HMM is used to segment the utterances into digits, while JFA and a trainable backend are deployed for feature extraction and LLR calculation, respectively. A novel approach for digit-dependent fusion of UBM-component log-likelihood ratios is introduced, yielding the best results so far. The fusion of 5 different JFA features gives an equal-error rate of 3.6%, compared to 6.3% attained by a baseline GMM-UBM model with score normalization.

JFA for feature extraction

JFA vs. i-vectors
• The text-independent paradigm of i-vector/PLDA has not been successful in text-dependent speaker recognition. Speaker-phrase variability is hard to confine to a low-dimensional subspace.
• JFA offers the flexibility of confining the channel effects to a subspace while allowing the speaker-phrase factors to lie in the supervector space [2].

Main JFA equation
S = m + Ux + Vy + Dz    (1)
• The hidden variable x varies from one recording to another and is intended to model channel effects.
• In text-independent speaker recognition, the term Dz is usually dropped and speakers are characterized by the low-dimensional vector y. Here, we extract either z or y features [3].

JFA on utterances segmented into digits
• JFA can be extended to utterances that are segmented into HMM states (digits).
• Features can be global (digit-independent) or local (digit-dependent), and supervector-sized (z-vectors) or subspace (y-vectors).

Segmentation and Baum-Welch stats

Tied-Mixture HMM
• Train a UBM and use its means and covariance matrices as the codebook for a Tied-Mixture HMM (TMM).
• The TMM has a single Gaussian codebook and digit-dependent weights.
• The TMM is very efficient to train and evaluate (Viterbi algorithm).
• We also use it, instead of the UBM, to extract Baum-Welch stats for local features.
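The tied-mixture segmentation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes diagonal covariances, one HMM state per digit, a known left-to-right digit sequence, and uniform transition scores. The function names (`tmm_emissions`, `viterbi_segment`) are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def logsumexp(vals):
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))

def tmm_emissions(frames, means, variances, digit_weights):
    """Per-frame, per-digit log-likelihoods.

    The Gaussian codebook (means, variances) is shared by all digits;
    only the mixture weights differ. Weights are assumed strictly positive.
    """
    out = []
    for x in frames:
        # Codebook likelihoods are computed once per frame and reused.
        comp = [log_gauss_diag(x, m, v) for m, v in zip(means, variances)]
        out.append([logsumexp([math.log(w) + c for w, c in zip(ws, comp)])
                    for ws in digit_weights])
    return out

def viterbi_segment(emissions, digit_sequence):
    """Align frames to a known digit sequence; returns one digit per frame."""
    n_frames, n_states = len(emissions), len(digit_sequence)
    if n_frames < n_states:
        raise ValueError("need at least one frame per digit")
    neg = float("-inf")
    score = [[neg] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = emissions[0][digit_sequence[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg
            best, prev = (stay, s) if stay >= move else (move, s - 1)
            if best > neg:
                score[t][s] = best + emissions[t][digit_sequence[s]]
                back[t][s] = prev
    # Backtrace from the last digit at the last frame.
    s = n_states - 1
    path = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    path.reverse()
    return [digit_sequence[s] for s in path]
```

Because the codebook is shared, the per-component likelihoods are evaluated once per frame regardless of the number of digits, which is one reason such a model is cheap to evaluate. The per-frame digit labels returned by `viterbi_segment` could then be used to accumulate digit-dependent Baum-Welch statistics.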
Training and evaluating the system

Training the JFA and backend
• Train a JFA model using both local and global features, z- or y-vectors. (Several combinations are possible.)
• Extract z- or y-vectors and project them onto the unit sphere.
• Train a Joint-Density Backend (JDB) per feature.

Evaluating the model
• Apply Viterbi segmentation, extract z- or y-vectors and use the JDB to calculate LLRs for each trial.
• Apply score normalization and fuse the score-normalized LLRs coming from multiple features.

Joint-Density Backend

An Alternative to PLDA
• We model the joint distribution of pairs of enrollment and test vectors under the same-speaker hypothesis [4].
• We use "target" trials from the training set, stacking each pair as t = [y_e^T, y_t^T]^T.
• We estimate the mean and covariance matrix (C) of t. Assuming zero mean, C is the covariance matrix of the stacked vector t.
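One way to read the backend above, offered as a sketch rather than the authors' implementation: writing C in blocks as [[C_ee, C_et], [C_te, C_tt]], the same-speaker hypothesis scores the stacked pair under N(0, C), while the different-speaker hypothesis is assumed here to treat the two vectors as independent, with marginal covariances C_ee and C_tt. The helper names (`estimate_joint_cov`, `jdb_llr`) are hypothetical, and unit-sphere projection and score normalization are left out.

```python
import math

def estimate_joint_cov(pairs):
    """Zero-mean covariance of stacked vectors t = [y_e; y_t] over target trials."""
    stacked = [list(ye) + list(yt) for ye, yt in pairs]
    n, d = len(stacked), len(stacked[0])
    return [[sum(t[i] * t[j] for t in stacked) / n for j in range(d)]
            for i in range(d)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low

def log_gauss_zero_mean(x, cov):
    """Log density of N(0, cov) at x, via forward substitution on the Cholesky factor."""
    low = cholesky(cov)
    z = []
    for i in range(len(x)):
        z.append((x[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(len(x)))
    return -0.5 * (len(x) * math.log(2 * math.pi) + log_det
                   + sum(v * v for v in z))

def jdb_llr(ye, yt, cov):
    """LLR of same-speaker (joint) vs. different-speaker (independent) hypotheses."""
    d = len(ye)
    c_ee = [row[:d] for row in cov[:d]]   # enrollment block
    c_tt = [row[d:] for row in cov[d:]]   # test block
    joint = log_gauss_zero_mean(list(ye) + list(yt), cov)
    return joint - log_gauss_zero_mean(ye, c_ee) - log_gauss_zero_mean(yt, c_tt)
```

With a positive cross-covariance block, pairs of vectors that point the same way receive a positive LLR and opposing pairs a negative one; when the cross-covariance is zero the LLR is zero for every trial, since the joint density then factorizes into the two marginals.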
Themos Stafylakis, Md. Jahangir Alam, Patrick Kenny