Joel Shor, Aren Jansen, Wei Han, Daniel Park, Yu Zhang
Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
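The evaluation protocol described above (time-averaging frame-level embeddings, then training a simple linear classifier on the pooled vectors) can be sketched as follows. This is a minimal illustration with synthetic data; the embedding dimension, number of utterances, and the use of a least-squares linear probe are all assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame-level embeddings: 40 utterances of varying length,
# each of shape (num_frames, embedding_dim). A real pipeline would get
# these from the pretrained Conformer.
embedding_dim = 8
utterances = [rng.normal(size=(rng.integers(50, 200), embedding_dim))
              for _ in range(40)]
labels = rng.integers(0, 2, size=40).astype(float)  # e.g. binary emotion label

# Time-average each utterance into a single fixed-size vector.
pooled = np.stack([u.mean(axis=0) for u in utterances])  # shape (40, 8)

# Linear probe fit by least squares (a stand-in for whatever linear
# classifier the authors actually used, e.g. logistic regression).
X = np.hstack([pooled, np.ones((len(pooled), 1))])  # append bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)
preds = (X @ w > 0.5).astype(float)
accuracy = (preds == labels).mean()
```

The key point is that the probe sees only one vector per utterance, so all task performance must come from the quality of the pretrained representation.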