Unsupervised speaker adaptation for speaker independent acoustic to articulatory speech inversion

Ganesh Sivaraman; Vikramjit Mitra; Hosung Nam; Mark Tiede; Carol Espy-Wilson

doi:10.1121/1.5116130


JOURNAL ARTICLE


Year: 2019

Journal: The Journal of the Acoustical Society of America

Vol: 146 (1)

Pages: 316-329

Publisher: Acoustical Society of America



Abstract

Speech inversion is a well-known ill-posed problem, and the addition of speaker differences typically makes it even harder. Normalizing the speaker differences is essential to effectively using multi-speaker articulatory data for training a speaker-independent speech inversion system. This paper explores a vocal tract length normalization (VTLN) technique to transform the acoustic features of different speakers to a target speaker's acoustic space such that speaker-specific details are minimized. The speaker-normalized features are then used to train a deep feed-forward neural network based speech inversion system. The acoustic features are parameterized as time-contextualized mel-frequency cepstral coefficients. The articulatory features are represented by six tract-variable (TV) trajectories, which are relatively speaker invariant compared to flesh-point data. Experiments are performed with ten speakers from the University of Wisconsin X-ray microbeam database. Results show that the proposed speaker normalization approach provides an 8.15% relative improvement in correlation between actual and estimated TVs compared to a system without speaker normalization. To determine the efficacy of the method across datasets, cross-speaker evaluations were performed across speakers from the Multichannel Articulatory-TIMIT and EMA-IEEE datasets. Results indicate that the VTLN approach improves performance even across datasets.
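As a rough illustration of two ideas mentioned in the abstract, the sketch below shows a piecewise-linear frequency warp of the kind commonly used for VTLN, and how a relative improvement in correlation can be computed. This is not the authors' implementation: the warp shape, the function names (`warp_frequency`, `pearson`, `relative_improvement`), and the parameters `alpha`, `f_max`, and `f_cut_ratio` are all assumptions chosen for illustration.

```python
import math


def warp_frequency(f, alpha, f_max=8000.0, f_cut_ratio=0.85):
    """Piecewise-linear VTLN-style warp (illustrative assumption, not the paper's).

    Frequencies up to f_cut are scaled by alpha; above f_cut a second linear
    segment maps f_max onto itself so the warped axis keeps the same range.
    """
    f_cut = f_cut_ratio * f_max
    if alpha * f_cut >= f_max:
        raise ValueError("alpha too large for this cut-off; warp would not be monotonic")
    if f <= f_cut:
        return alpha * f
    # Line through (f_cut, alpha * f_cut) and (f_max, f_max).
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f - f_cut)


def pearson(x, y):
    """Pearson correlation between two equal-length trajectories."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def relative_improvement(baseline, improved):
    """Relative improvement in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline


if __name__ == "__main__":
    # Made-up toy trajectories, only to exercise the functions.
    actual = [0.0, 0.5, 1.0, 0.5, 0.0]
    estimated = [0.1, 0.4, 0.9, 0.6, 0.1]
    print("correlation:", pearson(actual, estimated))
    print("warped 1 kHz with alpha=1.1:", warp_frequency(1000.0, 1.1))
```

With `alpha = 1.0` the warp is the identity, which is a quick sanity check on any warping function of this form. A relative improvement figure such as the 8.15% reported above compares the correlation with normalization against the correlation without it, divided by the latter.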

Keywords:

Vocal tract; Normalization (sociology); Speech recognition; Computer science; Speaker recognition; Speaker diarisation; Mel-frequency cepstrum; Inversion (geology); Artificial intelligence; Pattern recognition (psychology); Feature extraction

Metrics

FWCI (Field Weighted Citation Impact): 3.07

Citation Normalized Percentile: 0.93

Topics

Speech Recognition and Synthesis

Physical Sciences → Computer Science → Artificial Intelligence

Speech and Audio Processing

Physical Sciences → Computer Science → Signal Processing

Phonetics and Phonology Research

Social Sciences → Psychology → Experimental and Cognitive Psychology


Related Documents

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

Vocal Tract Length Normalization for Speaker Independent Acoustic-to-Articulatory Speech Inversion

Autoregressive Articulatory WaveNet Flow for Speaker-Independent Acoustic-to-Articulatory Inversion

An Investigation on Speaker Specific Articulatory Synthesis with Speaker Independent Articulatory Inversion

Reference speaker selection for kinematic-independent acoustic-to-articulatory-inversion