Alleviating One-to-many Mapping in Talking Head Synthesis with Dynamic Adaptation Context and Style Adapter

Zhaojie Chu1, Kailing Guo1, Xiaofen Xing1, Bolun Cai2, Shan He3, Xiangmin Xu1,
1South China University of Technology, 2ByteDance Inc, 3iFlytek Research.

Abstract

Speech-driven talking head synthesis technology has made remarkable progress, but it still faces the challenge of one-to-many pathological mapping. This challenge results in inaccurate lip movements, ambiguous facial expressions, and a lack of coherence during transitions between facial movements. The phenomenon arises primarily because: (1) for a single speaker, the same phoneme corresponds to a wide range of mouth shapes and facial expressions due to contextual variation, and (2) for the same spoken content, different speakers exhibit diverse facial expressions as a result of their unique speaking styles.

In this work, we propose a novel framework, called AllTalk, to alleviate one-to-many pathological mapping and enable more vivid and natural talking heads. Specifically, considering the asymmetric and dynamic nature of mouth shapes' dependence on phoneme context, we propose a Dynamic Adaptive Context encoder to accurately capture the context around each phoneme and its dynamics, thereby reducing the ambiguity in mapping speech to facial movements. Moreover, to alleviate the uncertainty caused by individual stylistic differences, we propose a Style Adapter that expands a generic discrete motion space for the target speaker. The Style Adapter not only effectively represents general facial motions but also captures the personalized nuances of facial movements. To further enhance the fidelity of the output, we introduce a Dynamic Gaussian Renderer based on 3D Gaussian Splatting, capable of producing stable and realistic rendered videos.

Extensive qualitative and quantitative experiments demonstrate that AllTalk surpasses existing state-of-the-art methods, providing an effective solution to the challenge of one-to-many mapping.
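As a rough illustration of how the components described above might fit together, below is a minimal PyTorch-style sketch. All module names, dimensions, and internals are hypothetical simplifications (for instance, the Style Adapter is sketched as a plain vector-quantization lookup followed by style conditioning), not the authors' actual implementation; the Dynamic Gaussian Renderer is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical modules; the real AllTalk architecture is not reproduced here.
class DynamicAdaptiveContextEncoder(nn.Module):
    """Encodes each audio frame together with its surrounding phoneme context."""
    def __init__(self, audio_dim=80, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim, hidden_dim)
        # Self-attention over neighboring frames approximates an asymmetric,
        # dynamically sized context window around each phoneme.
        self.context_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)

    def forward(self, audio_feats):          # (B, T, audio_dim)
        h = self.proj(audio_feats)           # (B, T, hidden_dim)
        ctx, _ = self.context_attn(h, h, h)  # each frame attends to its context
        return ctx

class StyleAdapter(nn.Module):
    """Quantizes context features into a generic motion codebook, then adapts them to a speaker style."""
    def __init__(self, hidden_dim=256, codebook_size=512, style_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, hidden_dim)  # generic discrete motion space
        self.style_mlp = nn.Sequential(
            nn.Linear(hidden_dim + style_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))

    def forward(self, ctx, style_emb):       # ctx: (B, T, H), style_emb: (B, style_dim)
        # Nearest-codeword lookup (simplified VQ, without straight-through gradients).
        book = self.codebook.weight.unsqueeze(0).expand(ctx.size(0), -1, -1)
        codes = self.codebook(torch.cdist(ctx, book).argmin(dim=-1))       # (B, T, H)
        style = style_emb.unsqueeze(1).expand(-1, ctx.size(1), -1)
        return self.style_mlp(torch.cat([codes, style], dim=-1))           # personalized motion features

# Usage: audio features in, per-frame motion features out (a renderer would consume these).
encoder, adapter = DynamicAdaptiveContextEncoder(), StyleAdapter()
audio = torch.randn(2, 100, 80)              # 2 clips, 100 frames of 80-dim mel features
style = torch.randn(2, 64)                   # per-speaker style embeddings
motion = adapter(encoder(audio), style)      # (2, 100, 256)
```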

Examples of One-to-many Mapping

(a) For one speaker, one phoneme corresponds to multiple alternative mouth shapes due to complex contexts.

(b) For the same phoneme within the same context, different speakers exhibit distinct mouth shapes, reflecting their unique speaking styles.


Comparison

Comparative visualization of facial movements synthesized by different models for the same phoneme in different contexts. “/æ/” exhibits distinct mouth shapes when pronouncing the words “sadness” and “and”, and “/m/” displays different facial movements when articulating the words “tomorrow” and “meet”.

Comparative visualization of facial movements synthesized by different models for the same spoken content across different speakers. For the same word, different speakers display diverse mouth shapes.