AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text


Abstract

Existing text-to-avatar methods are either limited to static avatars that cannot be animated or struggle to generate animatable avatars with promising quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, then incorporates SMPL-guided articulation into the explicit mesh representation to support avatar animation and high-resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it supports many applications, e.g., multimodal avatar animation and style-guided avatar creation.
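The Score Distillation Sampling (SDS) supervision mentioned in the abstract can be illustrated with a minimal sketch. It is not the AvatarStudio implementation: the "render" here is a flat list of floats that doubles as the optimized parameters, the timestep is fixed rather than sampled, and `make_toy_predictor` is a hypothetical stand-in for the diffusion model, which in the described setting would also be conditioned on the text prompt and a DensePose map. The sketch only shows the general SDS update, where the gradient is the weighted difference between the predicted and the injected noise.

```python
import math
import random


def sds_gradient(x, predict_noise, alpha_bar_t, weight=1.0, rng=None):
    """Return the SDS gradient with respect to a rendered image x (flat list)."""
    rng = rng or random.Random()
    # Sample Gaussian noise and build the noised render x_t.
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * xi + s * ei for xi, ei in zip(x, eps)]
    # Noise estimate from the (assumed) diffusion model. A real model would
    # also take the timestep, the text prompt, and the DensePose condition.
    eps_hat = predict_noise(x_t)
    # SDS skips the denoiser Jacobian: gradient = w(t) * (eps_hat - eps).
    return [weight * (eh - e) for eh, e in zip(eps_hat, eps)]


def make_toy_predictor(target, alpha_bar_t):
    """Hypothetical denoiser that believes every clean image equals `target`."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)

    def predict_noise(x_t):
        return [(xt - a * ti) / s for xt, ti in zip(x_t, target)]

    return predict_noise


# Toy optimization: the "render" is pulled toward what the predictor prefers.
target = [0.2, 0.8, 0.5]
x = [0.0, 0.0, 0.0]
predictor = make_toy_predictor(target, alpha_bar_t=0.5)
rng = random.Random(0)
for _ in range(200):
    grad = sds_gradient(x, predictor, alpha_bar_t=0.5, rng=rng)
    x = [xi - 0.1 * gi for xi, gi in zip(x, grad)]
```

In a full pipeline the gradient would be backpropagated through the differentiable renderer into the NeRF or mesh parameters instead of being applied to the image directly.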

pipeline

Response videos

We add additional video examples for the rebuttal.

Additional Examples

Comparison between different SDS guidances

We compare avatars generated with different SDS guidance: Stable Diffusion (left), skeleton-based ControlNet (middle), and DensePose-conditional ControlNet (right).

Stable Diffusion
Skeleton-based ControlNet
DensePose-conditional ControlNet
Albus Dumbledore
Wolverine, Marvel Character
Zeus

More animation results

AvatarStudio can achieve plausible animation results on loose clothing to a certain extent, such as a person wearing a ballet costume. By further introducing a non-rigid deformation module into the model, AvatarStudio can animate people wearing more challenging clothes, e.g., skirts and dresses.


Original AvatarStudio

A ballerina

AvatarStudio with non-rigid deformation

A woman wearing a short jean skirt and a cropped top
A pregnant person of color

Comparison between different 3D representations

We compare avatars generated with different 3D representations: DMTet-only (left), NeRF-only (middle), and ours (right). The NeRF-only representation achieves reasonable results but still suffers from coarse faces and noisy geometry with floating artifacts.

DMTet
NeRF
Ours
Albert Einstein
Harry Potter

Avatar creation with more complicated prompts

AvatarStudio also handles more complicated prompts, producing avatars that align with their detailed descriptions.

Elderly woman, dressed in a traditional Native American outfit, holding dream catchers, braided hair
Cute chibi Lara Croft, game, Pixar design, studio lighting, modern Disney style, 3D character
Chibi Thor with Mjolnir, cute, volumetric lighting, reflective textures, game, character
Medieval soldier holding two longswords on hands, fantasy, game, character
Tesla trooper, wearing Mecha suit, scifi, game character, unreal, 3D rendering, fantasy
Chibi, single boy, cute, magician's outfit, top hat, magic wand, curly hair, shiny shoes
Young man, dressed in a futuristic cyberpunk outfit, neon accents, holding a high-tech gadget
Elderly gentleman, dressed in a vintage suit, monocle, holding walking canes on hands
Teenage boy, dressed in a modern hip-hop style, baseball cap tilted, holding basketballs
Chibi, 1boy, cute, knight armor, helmet, holding toy knife on hands, Pixar design
Elderly man, dressed in a traditional samurai outfit, holding katana
Chibi, 1girl, hanfu, cat ears, cat girl, silk robe, wavy hair, wearing traditional sandals
Stealthy ninja holding dual katanas, 3D, game character, unreal
A little girl dressed as Wonder Woman, chibi style, volumetric lighting, Disney style
Strong Slayer, holding machete on hands, game character, 3D rendering, unreal
Cute chibi Son Goku, Sporty style outfit, shoes, nike jacket, little boy, cartoon

Generations with fewer optimization steps

We compare avatars generated with fewer optimization steps against the default setting. Left: results obtained with fewer optimization steps. Right: original results. Even with fewer steps, the model yields results comparable to the original ones.

Fewer steps
Default
Fewer steps
Default
Abraham Lincoln
Messi
Harry Potter
A karate master wearing a black belt

Dual-space training

We compare the animation results of models trained without (left) and with (right) the dual-space training strategy. Without dual-space training, floating artifacts appear around the hip and shoulder regions during animation.

w/o dual-space
w/ dual-space
w/o dual-space
w/ dual-space
Captain America
Iron Man
Hulk, Marvel Character
Policewoman

Example generated avatars

AvatarStudio generates high-quality avatars in a multi-view consistent way.

Messi
Bruce Lee
Donald Trump
Kim Kardashian
Terracotta Warriors
Albert Einstein
A man wearing kilt
Captain America
A chef dressed in white
A man with dreadlocks
Lara Croft in Tomb Raider
A karate master wearing a black belt
A professional boxer
A man with curly hair wearing glasses
An American football player
Wolfgang Amadeus Mozart
Harry Potter
Michael Jackson
A ninja
Abraham Lincoln

Comparison Results

We compare AvatarStudio with other text-guided generation methods.

DreamFusion

Magic3D-Fine

DreamAvatar

DreamWaltz

Ours

Assassin Creed

A standing Captain Jack Sparrow from Pirates of the Caribbean

DreamFusion

Magic3D-Fine

DreamHuman

AvatarVerse

Ours

A man wearing a bomber jacket

A karate master wearing a black belt


Multimodal Avatar Animation

AvatarStudio provides high-quality and easy-to-use animation, allowing users to drive the generated avatars with multimodal signals, such as text or video.

Text-driven animation. We adopt MDM to convert text prompts, like "A person is punching a bag", into SMPL sequences for animation.

Video-driven animation. We use VIBE to estimate SMPL sequences from driving videos for animation.
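Both modes above end in a SMPL pose sequence that drives the avatar mesh. The sketch below shows linear blend skinning, the deformation rule the SMPL body model uses, under the assumption that the generated mesh is posed in a similar way; the page does not spell out the exact deformation. The two-joint rig, the vertex data, and the function names are hypothetical and chosen only for illustration. In practice the per-joint transforms would be computed from each frame's SMPL pose parameters (for example, those produced by MDM or VIBE) along the kinematic chain.

```python
def apply_transform(transform, vertex):
    """Apply a 3x4 rigid transform [R | t] to a 3D point."""
    return [
        sum(transform[r][c] * vertex[c] for c in range(3)) + transform[r][3]
        for r in range(3)
    ]


def linear_blend_skinning(vertices, weights, joint_transforms):
    """Pose each vertex as the weighted sum of its per-joint transformed copies.

    vertices: list of [x, y, z] in the rest pose
    weights: one list of per-joint weights per vertex (each should sum to 1)
    joint_transforms: one 3x4 matrix per joint for the current frame
    """
    posed = []
    for vertex, vertex_weights in zip(vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for w, transform in zip(vertex_weights, joint_transforms):
            if w == 0.0:
                continue  # this joint does not influence the vertex
            moved = apply_transform(transform, vertex)
            blended = [b + w * m for b, m in zip(blended, moved)]
        posed.append(blended)
    return posed


# Hypothetical two-joint rig: joint 0 stays put, joint 1 shifts by +1 along x.
IDENTITY = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
SHIFT_X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

vertices = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
posed = linear_blend_skinning(vertices, weights, [IDENTITY, SHIFT_X])
```

The middle vertex, weighted equally between both joints, moves half as far as the vertex bound entirely to the shifted joint. Animating a sequence amounts to repeating this call with the joint transforms of each frame.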

Stylized avatar creation

AvatarStudio supports stylized avatar creation by simply providing an additional style image.

Style image
A chef
A karate master
A girl wearing skirt
Style image
Kratos
A karate master
Gintoki
Style image
A girl wearing dress
A karate master
A girl wearing skirt
Style image
A chef
A karate master
A ninja

Citation

@article{anon2023avatarstudio,
  author = {Anonymous},
  title = {AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text},
  journal = {OpenReview},
  year = {2023},
}