VersaAnimator:
Versatile Multimodal Controls for Expressive Talking Human Animation

Zheng Qin^#¹, Ruobing Zheng^#^*², Yabing Wang¹, Tianqi Li²,
Zixin Zhu³, Sanping Zhou¹, Ming Yang², Le Wang^†¹

^†Corresponding Author.
^*Project Lead.
^#Co-first authors.
¹Xi’an Jiaotong University. ²Ant Group. ³University at Buffalo.

Abstract

In filmmaking, directors typically allow actors to perform freely based on the script before providing specific guidance on how to present key actions. AI-generated content faces similar requirements: users not only need lip synchronization and basic gestures generated automatically from audio input, but also want semantically accurate and expressive body movements that can be "directly guided" through text descriptions. We therefore present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. Specifically, we design a motion generator that produces basic rhythmic movements from audio input and supports text-prompt control for specific actions. The generated whole-body 3D motion tokens can animate portraits of various scales, producing talking heads, half-body gestures, and even leg movements for whole-body images. In addition, we introduce a multimodal controlled video diffusion model that generates photorealistic videos, in which speech signals govern lip synchronization, facial expressions, and head motions, while body movements are guided by 2D poses. Furthermore, we introduce a token2pose translator that smoothly maps 3D motion tokens to 2D pose sequences. This design mitigates the stiffness that results from direct 3D-to-2D conversion and enhances the details of the generated body movements. Extensive experiments show that VersaAnimator synthesizes lip-synced and identity-preserving videos while generating expressive and semantically meaningful whole-body motions.
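The data flow described in the abstract can be sketched as three chained stages. This is only an illustrative outline, not the authors' implementation: every function name, the codebook size, the joint count, and the data shapes below are assumptions made for the example, and each stage is a placeholder that stands in for a learned model.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical constants, not taken from the paper.
CODEBOOK_SIZE = 512   # assumed number of discrete motion tokens
NUM_JOINTS = 17       # assumed number of 2D keypoints per pose

Pose2D = List[Tuple[float, float]]


def generate_motion_tokens(audio: List[List[float]],
                           text_prompt: Optional[str] = None) -> List[int]:
    """Stage 1 placeholder: one whole-body 3D motion token per audio frame.

    A real motion generator would condition on the audio features and,
    optionally, on the text prompt; here the prompt only changes the seed.
    """
    seed = 0 if text_prompt is None else len(text_prompt)
    rng = random.Random(seed)
    return [rng.randrange(CODEBOOK_SIZE) for _ in audio]


def token2pose(tokens: List[int]) -> List[Pose2D]:
    """Stage 2 placeholder: map each motion token to a 2D pose."""
    poses = []
    for token in tokens:
        rng = random.Random(token)  # deterministic dummy pose per token
        poses.append([(rng.random(), rng.random()) for _ in range(NUM_JOINTS)])
    return poses


def render_video(reference_image: str,
                 audio: List[List[float]],
                 poses: List[Pose2D]) -> List[Dict]:
    """Stage 3 placeholder: pair each frame's audio and pose conditions.

    A real video diffusion model would use the audio for lips, face, and
    head, and the 2D poses for the body; here we only record the inputs.
    """
    return [
        {"frame": i, "reference": reference_image, "audio": a, "pose": p}
        for i, (a, p) in enumerate(zip(audio, poses))
    ]


def animate(reference_image: str,
            audio: List[List[float]],
            text_prompt: Optional[str] = None) -> List[Dict]:
    """Chain the three stages: audio/text -> tokens -> 2D poses -> frames."""
    tokens = generate_motion_tokens(audio, text_prompt)
    poses = token2pose(tokens)
    return render_video(reference_image, audio, poses)


# Usage: five dummy audio frames and an optional text prompt.
frames = animate("portrait.png", [[0.0] for _ in range(5)], "wave the right hand")
```

The sketch keeps the text prompt optional to mirror the abstract's description: audio alone drives the basic motion, and text is an additional control for specific actions.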

Gallery

Visual illustration of text control for customizing the character’s motion in the generated video.

Results on Multi-Animate with different audio and reference images ranging from head to whole-body.

BibTeX