Speech-driven 3D facial animation is a challenging cross-modal task that has attracted growing research interest. During speech, the mouth exhibits strong motion, while the other facial regions typically show comparatively weak activity. Existing approaches often simplify the task by directly mapping single-level speech features to the entire facial animation, overlooking these differences in facial activity intensity and producing overly smoothed facial movements.
In this study, we propose a novel framework, CorrTalk, which effectively establishes the temporal correlation between hierarchical speech features and facial activities of different intensities across distinct regions. A facial activity intensity (FAI) metric, computed from the short-time Fourier transform of facial vertex displacements, is defined to distinguish strong from weak facial activity. Based on these differences in facial activity, we propose a dual-branch decoding framework that synchronously synthesizes strong and weak facial activity, enabling animation synthesis over a wider range of intensities. Furthermore, a weighted hierarchical feature encoder is proposed to establish the temporal correlation between hierarchical speech features and facial activity at different intensities, which ensures accurate lip synchronization and plausible facial expressions.
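As a minimal sketch of the weighted hierarchical feature idea, the snippet below fuses four time-aligned speech-feature levels with separate learned importance weights for the strong- and weak-activity branches; the module name, the shared feature dimension, and the softmax weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn

class WeightedHierarchicalFusion(nn.Module):
    """Fuse frame-, phoneme-, word- and utterance-level speech features with
    separate learned importance weights for the strong and weak branches.
    (Hypothetical sketch, not the released CorrTalk implementation.)"""

    def __init__(self, num_levels=4):
        super().__init__()
        # One importance logit per feature level, per branch
        self.strong_logits = nn.Parameter(torch.zeros(num_levels))
        self.weak_logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, levels):
        # levels: list of num_levels tensors, each (B, T, D), already time-aligned
        feats = torch.stack(levels, dim=0)                           # (L, B, T, D)
        w_strong = torch.softmax(self.strong_logits, 0).view(-1, 1, 1, 1)
        w_weak = torch.softmax(self.weak_logits, 0).view(-1, 1, 1, 1)
        strong_feat = (w_strong * feats).sum(dim=0)                  # (B, T, D)
        weak_feat = (w_weak * feats).sum(dim=0)                      # (B, T, D)
        return strong_feat, weak_feat
```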
Extensive qualitative and quantitative experiments, as well as a user study, indicate that CorrTalk outperforms existing state-of-the-art methods.
CorrTalk first analyses differences in facial activity intensity across distinct regions. Facial activity intensity is quantified using the amplitude within the fundamental band of the short-time Fourier transform (STFT). Left: the activity intensity of a vertex in the mouth region and a vertex in the forehead region over a motion sequence is shown in (a) and (c), respectively (top row: \(L_{2}\) distance between vertices in the reference sequence and the neutral topology; bottom row: STFT of the vertex displacements). (b) shows the average facial activity intensity over the training data. Right: dynamics of facial activity intensity within a sequence.
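The following sketch illustrates how such a per-vertex intensity score could be computed with SciPy, assuming per-frame vertex positions and a neutral template; the window length, frame rate, and the choice of the lowest non-DC STFT bin as the "fundamental band" are assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import stft

def facial_activity_intensity(vertices, template, fps=30, win=16):
    """vertices: (T, V, 3) animated sequence; template: (V, 3) neutral face.
    Returns a (V,) array of per-vertex activity-intensity scores."""
    # Per-frame L2 displacement of every vertex from the neutral topology
    disp = np.linalg.norm(vertices - template[None], axis=-1)        # (T, V)
    intensity = np.empty(disp.shape[1])
    for v in range(disp.shape[1]):
        # STFT of this vertex's displacement signal over time
        freqs, _, Z = stft(disp[:, v], fs=fps, nperseg=win)          # Z: (F, W)
        # Amplitude in the lowest non-DC ("fundamental") band, averaged over windows
        intensity[v] = np.abs(Z[1]).mean()
    return intensity
```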
Overview of the proposed CorrTalk, a framework for learning the temporal correlation between hierarchical speech features (HSF) and facial activities of different intensities. It takes raw audio as input and generates a sequence of 3D facial animation. The design of the acoustic feature extractor follows WavLM. The weighted hierarchical speech encoder produces frame-, phoneme-, word- and utterance-level speech features and calculates the importance weight of each feature level for strong and weak facial movements. A dual-branch decoder based on the FAI synchronously generates strong and weak facial movements. After computing the STFT of the vertex displacements from the training data, a learnable mask \(\mathbf{m}_t \in [0, 1]\) is initialized according to the absolute value of the amplitude at the fundamental frequency. Values of \(\mathbf{m}_t(\cdot)\) close to 1 indicate strong facial movements, and values close to 0 indicate weak movements.
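A minimal sketch of the mask-based dual-branch composition is given below, assuming per-vertex initial amplitudes from the training-set STFT and two decoder branches that output vertex offsets; the class name and the sigmoid parameterization that keeps the mask in [0, 1] are illustrative choices, not the released implementation.

```python
import torch
import torch.nn as nn

class MaskedDualBranchCompose(nn.Module):
    """Blend strong- and weak-branch vertex offsets with a learnable per-vertex mask.
    (Hypothetical sketch under the assumptions stated in the text.)"""

    def __init__(self, init_amplitude):
        # init_amplitude: (V,) fundamental-band STFT amplitude per vertex (training data)
        super().__init__()
        a = init_amplitude.abs()
        m0 = (a - a.min()) / (a.max() - a.min() + 1e-8)              # normalise to [0, 1]
        # Parameterise through a sigmoid so the mask stays in [0, 1] during training
        self.mask_logits = nn.Parameter(torch.logit(m0.clamp(1e-4, 1 - 1e-4)))

    def forward(self, strong_offsets, weak_offsets):
        # Both inputs: (B, T, V, 3) vertex offsets from the two decoder branches
        m = torch.sigmoid(self.mask_logits).view(1, 1, -1, 1)
        return m * strong_offsets + (1.0 - m) * weak_offsets
```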
Visual comparison of sampled facial animations generated by different methods on VOCA-Test (left) and BIWI-Test-B (right). The top portion shows facial animations corresponding to distinct speech content. The bottom portion displays the mean error of the synthesized sequences relative to the ground truth.