Ricong Huang, Weizhi Zhong, Guanbin Li
In talking head generation, it is hard to learn a direct mapping from the input audio signal to the generated head images. To tackle this challenge, we first learn the mapping from the input audio signal to the parameters of a three-dimensional morphable face model (3DMM), which is easier to learn, and then use the 3DMM parameters to guide the generation of high-quality talking head images. Prior works mostly encode audio features from short audio windows, whose limited context can reduce the accuracy of lip movements. In this paper, we propose a transformer-based audio encoder that makes full use of the long-term audio context to predict a sequence of 3DMM parameters accurately. Unlike prior works that use only the expression, rotation and translation parameters of the 3DMM, we also include the identity parameters. Since the locations of the 3D facial mesh points are determined by both the expression and identity parameters, taking the identity parameters into account allows more subtle control of lip movement. Experimental results show that our method ranks first on 4 of the 11 evaluation metrics and ranks first in the talking head generation track.
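As an illustration of why identity parameters matter for mesh point locations, the following is a minimal sketch of the commonly used linear 3DMM formulation, in which vertex coordinates are the mean shape plus an identity offset plus an expression offset. The abstract does not state which 3DMM variant is used, so the formulation, the toy dimensions, the random bases and the function names `make_basis` and `reconstruct_vertices` are all assumptions made for illustration.

```python
import random


def make_basis(num_coords, num_params, rng):
    """Return a hypothetical random basis with num_coords rows and num_params columns."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(num_params)]
            for _ in range(num_coords)]


def reconstruct_vertices(mean_shape, id_basis, exp_basis, id_params, exp_params):
    """Linear 3DMM sketch: S = mean + B_id * alpha + B_exp * beta.

    All shapes are flattened coordinate lists (x0, y0, z0, x1, ...).
    """
    vertices = []
    for i, mean_coord in enumerate(mean_shape):
        id_offset = sum(b * a for b, a in zip(id_basis[i], id_params))
        exp_offset = sum(b * e for b, e in zip(exp_basis[i], exp_params))
        vertices.append(mean_coord + id_offset + exp_offset)
    return vertices


# Toy sizes (assumed): 4 vertices, 3 identity and 2 expression parameters.
rng = random.Random(0)
NUM_VERTICES, ID_DIM, EXP_DIM = 4, 3, 2
mean_shape = [0.0] * (3 * NUM_VERTICES)
id_basis = make_basis(3 * NUM_VERTICES, ID_DIM, rng)
exp_basis = make_basis(3 * NUM_VERTICES, EXP_DIM, rng)

# Same expression parameters, two different identities.
exp_params = [0.5, -0.2]
face_a = reconstruct_vertices(mean_shape, id_basis, exp_basis,
                              [0.0, 0.0, 0.0], exp_params)
face_b = reconstruct_vertices(mean_shape, id_basis, exp_basis,
                              [1.0, -0.5, 0.3], exp_params)
```

In this sketch, `face_a` and `face_b` share the same expression parameters but end up with different vertex positions, which is the dependence on identity that the abstract refers to.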
Yizhe Zhu, Chunhui Zhang, Qiong Liu, Xi Zhou
Peng Tang, Huihuang Zhao, Weiliang Meng, Yaonan Wang
Zhijun Xu, Mingkun Zhang, Dongyu Zhang
Yahui Li, Liejun Wang, Yingfeng Yu, Shengjie Shen