Alibaba’s TaoAvatar sets a new standard for photorealistic 3D avatars rendered in real time, bringing AR communication a step closer to everyday use.
The technology combines 3D Gaussian Splatting (3DGS) with a teacher-student network approach to create fully controllable human avatars. These digital representations not only achieve impressive visual quality but also run at 90 frames per second on mobile devices such as the Apple Vision Pro – a crucial factor for practical use in AR applications. The avatars follow a parametric SMPL-X template with consistent topology, allowing precise control over poses, gestures and facial expressions.
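The control mechanism behind a parametric template like SMPL-X is linear blend skinning: points bound to the template follow a weighted mix of per-joint transforms. The sketch below is a minimal, assumption-laden illustration of that idea applied to Gaussian centers – the function name, shapes, and toy data are ours, not TaoAvatar's API.

```python
import numpy as np

def lbs_deform(points, weights, joint_transforms):
    """Deform points (e.g. Gaussian centers) with linear blend skinning.

    points:           (N, 3) rest-pose positions
    weights:          (N, J) skinning weights, each row summing to 1
    joint_transforms: (J, 4, 4) homogeneous per-joint transforms
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])                    # (N, 4)
    # Blend the per-joint transforms using the skinning weights
    blended = np.einsum('nj,jab->nab', weights, joint_transforms)  # (N, 4, 4)
    deformed = np.einsum('nab,nb->na', blended, homo)              # (N, 4)
    return deformed[:, :3]

# Toy example: two points, each fully bound to one of two joints;
# the second joint translates by +1 along x.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
w = np.array([[1.0, 0.0], [0.0, 1.0]])
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 1.0
print(lbs_deform(pts, w, T))  # second point moves from x=1 to x=2
```

Because every Gaussian inherits its motion from the template this way, changing the pose or expression parameters moves the whole avatar consistently.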
Unlike previous technologies, TaoAvatar only requires multi-view camera sequences as input and achieves 2.4 dB better PSNR image quality than comparable systems. At the same time, the technology reduces memory requirements by 70% compared to NeRF-based approaches.
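For context, PSNR (peak signal-to-noise ratio) is a standard image-quality metric derived from the mean squared error between a reconstruction and a reference; a 2.4 dB gain corresponds to roughly a 1.7× reduction in MSE. A minimal implementation, assuming images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 gives MSE = 0.01, i.e. 20 dB
a = np.zeros((4, 4))
b = np.full((4, 4), 0.1)
print(round(psnr(a, b), 2))  # 20.0
```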
Technical innovation on several levels
At the heart of the system is a hybrid representation that combines SMPL-X meshes with 3D Gaussian textures, enabling both precise geometric control and convincing dynamic appearance. Particularly noteworthy is the teacher-student framework:
- The StyleUnet teacher network captures high-frequency details through position-based deformation maps
- The MLP student network is optimized for mobile devices and ensures 90 FPS at 2K resolution
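The distillation principle behind this setup can be sketched in a few lines: the lightweight student is trained to reproduce the outputs of the heavyweight teacher rather than the raw training data. The example below is a deliberately simplified stand-in – both "networks" are linear models fit in closed form – so the idea stays runnable; it is not TaoAvatar's actual StyleUnet/MLP pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))          # pose-like input features
W_teacher = rng.normal(size=(8, 4))
Y_teacher = X @ W_teacher              # teacher's predicted deformations

# Student: fit against the teacher's outputs, not against ground truth.
# (A real distillation would minimize this mismatch by gradient descent.)
W_student, *_ = np.linalg.lstsq(X, Y_teacher, rcond=None)
distill_error = np.abs(X @ W_student - Y_teacher).max()
print(distill_error)  # near zero: the student matches the teacher
```

The payoff is that only the cheap student has to run on-device at 90 FPS, while the expensive teacher is used solely during training.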
To develop the technology, the research team used the TalkBody4D dataset with 59-camera recordings at 20 FPS and 3K×4K resolution. The integration of Audio2BS technology also enables synchronization of lip movements, facial expressions and gestures with spoken language.
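Audio-driven animation systems of this kind typically predict blendshape coefficients from audio features, then apply them as a weighted sum of per-expression vertex offsets. The sketch below shows only that second, geometric step; the function name and toy data are illustrative assumptions, not Alibaba's Audio2BS implementation.

```python
import numpy as np

def apply_blendshapes(neutral, deltas, coeffs):
    """Blend a neutral mesh with expression offsets.

    neutral: (V, 3) rest-pose vertices
    deltas:  (B, V, 3) per-blendshape vertex offsets
    coeffs:  (B,) activation weights, typically in [0, 1]
    """
    return neutral + np.einsum('b,bvd->vd', coeffs, deltas)

# Toy mesh with one "jaw open" blendshape that lowers the chin vertex
neutral = np.zeros((3, 3))
jaw_open = np.zeros((1, 3, 3))
jaw_open[0, 2, 1] = -1.0            # chin vertex moves down when fully open
mesh = apply_blendshapes(neutral, jaw_open, np.array([0.5]))
print(mesh[2, 1])  # -0.5: jaw half open
```

An audio model would emit a new coefficient vector per frame, so lip shapes track the spoken phonemes frame by frame.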
Areas of application and future prospects
The technology developed by Alibaba researchers opens up a wide range of possible applications:
- Life-size AR shopping assistants for 3D product demonstrations
- Holographic meetings with emotional expressiveness
- AI customer service with natural body language
Despite this impressive progress, there are still challenges in modeling extreme facial expressions and the high computational cost of initial template creation (approximately 8 hours per avatar). However, with the planned release of the code and dataset via Hugging Face, the technology should soon find wider application.
Summary:
- TaoAvatar creates photorealistic 3D avatars with consistent topology
- Real-time rendering at 90 FPS on mobile devices and AR headsets
- Hybrid architecture combines 3D Gaussian splatting with parametric models
- 70% memory savings compared to conventional methods
- Applications in e-commerce, AR communication and AI assistance
- Integration of audio-to-facial expression synchronization for natural interactions
Source: TaoAvatar