LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis

Tianqi Li¹, Ruobing Zheng¹†, Bonan Li², Zicheng Zhang²,
Meng Wang¹, Jingdong Chen¹, Ming Yang¹

† Corresponding Author.
¹Ant Group. ²University of Chinese Academy of Sciences.


Abstract

Despite significant progress in talking head synthesis since the introduction of Neural Radiance Fields (NeRF), visual artifacts and high training costs remain major obstacles to large-scale commercial adoption. We posit that identifying and establishing fine-grained and generalizable correspondences between driving signals and generated results can resolve both problems simultaneously. Here we present LokiTalk, a novel framework that enhances NeRF-based talking heads with lifelike facial dynamics and improved training efficiency. To achieve fine-grained correspondences, we introduce Region-Specific Deformation Fields, which decompose the overall portrait motion into lip movements, eye blinking, head pose, and torso movements. By hierarchically modeling the driving signals and their associated regions through two cascaded deformation fields, we significantly improve dynamic accuracy and minimize synthetic artifacts. Furthermore, we propose ID-Aware Knowledge Transfer, a plug-and-play module that learns generalizable dynamic and static correspondences from multi-identity videos while simultaneously extracting ID-specific dynamic and static features to refine the depiction of individual characters. Comprehensive evaluations demonstrate that LokiTalk delivers higher-fidelity results and better training efficiency than previous methods. The code will be released upon acceptance.
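The two cascaded deformation fields described above can be illustrated with a minimal sketch. It assumes that the first (face) field is conditioned on an audio feature and the eye ratio to model lip movement and blinking, and that the second (torso) field is conditioned on head pose and consumes the face stage's output coordinates. This signal-to-field assignment, the soft region masks, the feature sizes, and all names (make_field, deform, AUDIO_DIM, POSE_DIM) are hypothetical, and each "field" is a toy random linear layer rather than a trained MLP.

```python
import math
import random


def make_field(in_dim, out_dim, seed):
    """Toy stand-in for a deformation-field MLP: a fixed random linear layer + tanh."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]

    def field(inputs):
        assert len(inputs) == in_dim, "unexpected input size"
        return [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in weights]

    return field


AUDIO_DIM, POSE_DIM = 4, 6  # hypothetical feature sizes
# Stage 1 (face): 3D point + audio feature + eye ratio -> 3D offset (lips, blinking).
face_field = make_field(3 + AUDIO_DIM + 1, 3, seed=0)
# Stage 2 (torso): stage-1 output point + head pose -> 3D offset (torso motion).
torso_field = make_field(3 + POSE_DIM, 3, seed=1)


def deform(point, audio, eye_ratio, pose, face_mask, torso_mask):
    """Apply the two cascaded fields to one 3D sample point.

    face_mask and torso_mask (in [0, 1]) gate each offset to its region, so points
    outside both regions (e.g. background) are left unchanged.
    """
    face_offset = face_field(list(point) + list(audio) + [eye_ratio])
    p1 = [p + face_mask * d for p, d in zip(point, face_offset)]
    # Cascade: the torso field sees the already face-deformed coordinates.
    torso_offset = torso_field(p1 + list(pose))
    return [p + torso_mask * d for p, d in zip(p1, torso_offset)]
```

For example, deform((0.1, -0.2, 0.3), [0.2, -0.1, 0.4, 0.0], 0.3, [0.0] * POSE_DIM, face_mask=1.0, torso_mask=0.0) moves a face point using only the audio and eye inputs. Gating each stage by its region is one simple way to keep each driving signal from affecting unrelated parts of the portrait.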

Method

Region-Specific Deformation Fields

Figure 1: The driving signals (audio, pose, and eye ratio) participate in the two-stage prediction of the face and torso deformation fields. The mask following each driving signal represents the cross-attention loss between that signal and its corresponding region. A colored cubic grid illustrates the predicted deformation fields, with the internal heat maps indicating the deformation magnitude.
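The region-specific supervision in Figure 1 can be sketched under one plausible reading of the cross-attention loss: a driving signal's query attends over the features of N sampled points, and the loss is the share of attention that lands outside that signal's region mask (for example, audio against a lip mask). The caption does not give the loss formula, so this form is an assumption; a cross-entropy between the attention map and the mask would be an equally reasonable choice. The functions below (cross_attention_map, region_attention_loss) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(signal_query, point_keys):
    """Scaled dot-product attention of one driving-signal query over N point keys."""
    scale = math.sqrt(len(signal_query))
    scores = [sum(q * k for q, k in zip(signal_query, key)) / scale for key in point_keys]
    return softmax(scores)


def region_attention_loss(attn, region_mask):
    """Hypothetical loss: the attention mass that falls outside the signal's region.

    attn sums to 1 over the points; region_mask is 1 inside the region and 0 outside.
    Returns 0 when all attention lies inside the region and 1 when none does.
    """
    inside = sum(a * m for a, m in zip(attn, region_mask))
    return 1.0 - inside


# Toy example: an "audio" query attending over three sampled points.
audio_query = [1.0, 0.0]
point_keys = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
attn = cross_attention_map(audio_query, point_keys)
lip_mask = [1, 0, 0]  # hypothetical: only the first point lies in the lip region
loss = region_attention_loss(attn, lip_mask)
```

Here most of the audio query's attention falls on the first point, so the loss against the lip mask is small; the same function would be applied per signal, e.g. eye ratio against an eye mask and pose against a torso mask.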

ID-Aware Knowledge Transfer
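Based only on the abstract's description, this module can be read as shared (cross-identity) static and dynamic features plus ID-specific refinements. The minimal sketch below assumes additive per-identity residuals for the static features and an elementwise rescaling of the driving signal for the dynamic features; the class IDAwareFeatures and its methods are hypothetical and are not the authors' implementation.

```python
import random


class IDAwareFeatures:
    """Hypothetical sketch of shared + ID-specific static and dynamic features.

    Shared parameters stand in for correspondences learned across many identities;
    per-identity rows stand in for ID-specific refinements.
    """

    def __init__(self, num_ids, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Shared static feature (e.g. a generic canonical-appearance code).
        self.shared_static = [rng.gauss(0.0, 0.1) for _ in range(dim)]
        # Shared dynamic map: driving signal (length dim) -> feature (length dim).
        self.shared_dynamic = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
        # Per-identity residuals, one row per identity.
        self.id_static = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_ids)]
        self.id_dynamic = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(num_ids)]

    def add_identity(self):
        """Register a new identity with zero residuals, so it starts from the shared features."""
        self.id_static.append([0.0] * self.dim)
        self.id_dynamic.append([0.0] * self.dim)
        return len(self.id_static) - 1

    def static(self, identity):
        """Shared static feature plus this identity's additive residual."""
        return [s + r for s, r in zip(self.shared_static, self.id_static[identity])]

    def dynamic(self, identity, signal):
        """Shared signal-to-feature map plus an ID-specific rescaling of the signal."""
        shared = [sum(w * x for w, x in zip(row, signal)) for row in self.shared_dynamic]
        return [s + r * x for s, r, x in zip(shared, self.id_dynamic[identity], signal)]
```

Because a newly added identity starts with zero residuals, its features initially equal the shared ones, and adapting to that person would only require learning its residual rows. This is one possible way such a design could relate to training efficiency, not a detail confirmed by the source.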

BibTeX

@article{li2024lokitalk,
  title={LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis},
  author={Li, Tianqi and Zheng, Ruobing and Li, Bonan and Zhang, Zicheng and Wang, Meng and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2411.19525},
  year={2024}
}