Siyuan Shen, Yu Gao, Feng Liu, Hanyang Wang, Aimin Zhou
The mainstream paradigm of speech emotion recognition (SER) is to identify a single emotion label for an entire utterance. This line of work neglects emotion dynamics at fine temporal granularity and mostly fails to explicitly leverage the linguistic information in the speech signal. In this paper, we propose the Emotion Neural Transducer (ENT) for fine-grained speech emotion recognition with joint automatic speech recognition (ASR) training. We first extend the typical neural transducer with an emotion joint network to construct an emotion lattice for fine-grained SER. We then propose lattice max pooling on the alignment lattice to facilitate distinguishing emotional from non-emotional frames. To adapt fine-grained SER to the transducer inference manner, we further make blank, the special symbol of ASR, serve as an underlying emotion indicator as well, yielding the Factorized Emotion Neural Transducer. For typical utterance-level SER, our ENT models outperform state-of-the-art methods on IEMOCAP while maintaining a low word error rate. Experiments on IEMOCAP and the latest speech emotion diarization dataset ZED also demonstrate the superiority of fine-grained emotion modeling. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ENT.