Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs persist as major obstacles to large-scale commercial adoption. We propose that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can simultaneously resolve both problems. Here we present LokiTalk, a novel framework designed to enhance NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos, while simultaneously extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk achieves higher-fidelity results and greater training efficiency than previous methods. The code will be released upon acceptance.
Figure 1: The driving signals (audio, pose, eye ratio) participate in the two-stage prediction of the face and torso deformation fields, respectively. The mask following each driving signal represents the cross-attention loss between that signal and its corresponding region. A colored cubic grid illustrates the predicted deformation fields, with the internal heat maps indicating the deformation magnitude.
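The cascade described above, where a face deformation field conditioned on audio, eye ratio, and pose feeds a second torso deformation field, can be sketched as follows. This is a minimal illustrative toy with random weights, not the paper's implementation; all dimensions (64-d audio features, 1-d eye ratio, 6-d pose) and the tiny MLP helpers are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Random-weight MLP layers (placeholder for the trained deformation networks)."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)
    W, b = layers[-1]
    return x @ W + b

# Hypothetical driving signals: audio feature (64), eye ratio (1), head pose (6).
audio = rng.standard_normal(64)
eye = rng.standard_normal(1)
pose = rng.standard_normal(6)
pts = rng.standard_normal((1024, 3))  # sampled 3D points in canonical space

# Stage 1: face deformation field conditioned on audio, eye ratio, and pose.
face_field = mlp([3 + 64 + 1 + 6, 128, 3])
face_cond = np.tile(np.concatenate([audio, eye, pose]), (len(pts), 1))
d_face = forward(face_field, np.concatenate([pts, face_cond], axis=1))

# Stage 2: torso deformation field, cascaded on the face stage's output plus pose.
torso_field = mlp([3 + 3 + 6, 128, 3])
torso_cond = np.tile(pose, (len(pts), 1))
d_torso = forward(torso_field, np.concatenate([pts, d_face, torso_cond], axis=1))

# Final per-point offsets combine both region-specific fields.
deformed = pts + d_face + d_torso
```

The two-stage structure mirrors the figure: the torso field only sees the pose and the face stage's output, so each region is driven by the signals relevant to it.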
Figure 2: The blue modules are the correspondences shared among multiple identities, comprising dynamic (light blue) and static (dark blue) correspondences. The colored modules represent the dynamic (facial actions) and static (geometry and appearance) information of individual identities. During pre-training (the entire yellow panel), the upper and lower parts are trained simultaneously on multi-ID data, allowing the model to learn universal information while extracting individual information. During fine-tuning, the lower half continues training from the ID-aware initialization parameters produced by the ID-Encoder.
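The pre-train/fine-tune split in this figure can be sketched as follows: shared correspondence parameters are learned across identities, and for a new identity the ID-specific branch is initialized from an ID-Encoder rather than from scratch. This is a hypothetical toy sketch; the `IDEncoder` projection, the parameter sizes, and the mean-pooled per-frame summaries are all assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

class IDEncoder:
    """Hypothetical encoder mapping reference frames to ID-specific parameters."""
    def __init__(self, feat_dim=32):
        self.W = rng.standard_normal((3, feat_dim)) * 0.1  # toy projection

    def __call__(self, ref_frames):
        # ref_frames: (n, 3) toy per-frame summaries -> mean-pooled ID code.
        return ref_frames.mean(axis=0) @ self.W

# Pre-training (multi-ID): universal dynamic/static correspondences are shared,
# while the ID-Encoder learns to extract per-identity information.
shared_params = rng.standard_normal(128) * 0.1
encoder = IDEncoder()

# Fine-tuning on a new identity: the shared parameters are reused, and the
# ID-specific branch starts from the ID-aware initialization instead of random.
ref_frames = rng.standard_normal((5, 3))
id_code = encoder(ref_frames)  # ID-aware initialization
finetune_params = np.concatenate([shared_params, id_code])
```

Starting the ID-specific branch from an encoder-predicted initialization is what lets fine-tuning converge faster than training each identity from scratch.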
@article{li2024lokitalk,
title={LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis},
author={Li, Tianqi and Zheng, Ruobing and Li, Bonan and Zhang, Zicheng and Wang, Meng and Chen, Jingdong and Yang, Ming},
journal={arXiv preprint arXiv:2411.19525},
year={2024}
}