People talk with diverse styles. Given the same speech content, different talking styles exhibit significant differences in facial and head-pose movements. For example, an "excited" style usually talks with the mouth wide open, while a "solemn" style is more restrained and seldom exhibits exaggerated motions. Given such large differences between styles, it is necessary to incorporate talking style into the audio-driven talking face synthesis framework. In this paper, we propose to inject style into the talking face synthesis framework by imitating the arbitrary talking style of a particular reference video. Specifically, we systematically investigate talking styles with our collected \textit{Ted-HD} dataset and construct style codes as several statistics of 3D morphable model~(3DMM) parameters. Afterwards, we devise a latent-style-fusion~(LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes. We emphasize the following novel characteristics of our framework: (1) It does not require any style annotation; the talking style is learned in an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary videos, and the style codes can also be interpolated to generate new styles. Extensive experiments demonstrate that the proposed framework synthesizes more natural and expressive talking styles than baseline methods.
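The idea of "style codes as several statistics of 3DMM parameters" and of interpolating codes to obtain new styles could be sketched as follows. This is a minimal illustration only: the choice of mean and standard deviation as the statistics, and the function names `style_code` and `interpolate_styles`, are assumptions for exposition, not the paper's exact specification.

```python
import numpy as np

def style_code(params_3dmm):
    """Reduce a (num_frames, dim) sequence of per-frame 3DMM parameters
    to a fixed-length style code by stacking per-dimension statistics.
    Mean and standard deviation are used here as an assumed choice of
    statistics; the actual framework may use others."""
    params = np.asarray(params_3dmm, dtype=float)
    return np.concatenate([params.mean(axis=0), params.std(axis=0)])

def interpolate_styles(code_a, code_b, alpha):
    """Linearly blend two style codes to form a new style:
    alpha=0 returns code_a, alpha=1 returns code_b."""
    code_a = np.asarray(code_a, dtype=float)
    code_b = np.asarray(code_b, dtype=float)
    return (1.0 - alpha) * code_a + alpha * code_b

# Toy usage with random stand-in parameter sequences (real inputs would
# be 3DMM expression/pose coefficients extracted from a reference video).
rng = np.random.default_rng(0)
excited = style_code(rng.normal(loc=0.0, scale=2.0, size=(120, 64)))
solemn = style_code(rng.normal(loc=0.0, scale=0.3, size=(120, 64)))
blended = interpolate_styles(excited, solemn, 0.5)
```

A fixed-length code of this kind is convenient because it is independent of the reference video's length, and linear interpolation between two codes stays in the same space, which is what makes "generating new styles" by blending well-defined.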