Overview of our diffusion transformer architecture for 4D human generation. The framework takes a reference image, SMPL conditions, camera poses, and background videos as input. It first tokenizes the 3D SMPL conditions. In parallel, 2D video tokens are optionally composited with background elements and processed by a cascade of disentangled spatial and temporal transformer blocks, enabling efficient modeling of spatio-temporal relationships. These video tokens then interact with the pose tokens via our Interspatial Transformer Block, providing effective 3D-aware conditioning. The resulting features are further combined with Plücker camera embeddings for precise view control and attend to reference image features through cross attention to preserve identity. The entire framework is optimized with a flow-based diffusion formulation, enabling high-quality 4D human generation with controllable pose, viewpoint, and identity.
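To make the two attention stages concrete, below is a minimal PyTorch sketch of disentangled spatial/temporal attention over video tokens, cross attention from video tokens to SMPL pose tokens, and a Plücker ray embedding. The class names (DisentangledSTBlock, InterspatialBlock), the plucker_embedding helper, and all tensor shapes are illustrative assumptions, not the released ISA4D implementation.

import torch
import torch.nn as nn

class DisentangledSTBlock(nn.Module):
    """One disentangled spatial + temporal attention pass (assumed layout).

    Video tokens have shape (B, T, N, C): T frames, N spatial tokens per
    frame, C channels.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        s = x.reshape(b * t, n, c)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q, need_weights=False)[0]
        x = s.reshape(b, t, n, c)
        # Temporal attention: each spatial location attends across frames.
        v = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        q = self.norm2(v)
        v = v + self.temporal_attn(q, q, q, need_weights=False)[0]
        return v.reshape(b, n, t, c).permute(0, 2, 1, 3)

class InterspatialBlock(nn.Module):
    """3D-aware conditioning (sketch): video tokens cross-attend to pose tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, pose_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, L, C) flattened spatio-temporal tokens
        # pose_tokens:  (B, P, C) tokenized 3D SMPL condition
        q = self.norm(video_tokens)
        out = self.cross_attn(q, pose_tokens, pose_tokens, need_weights=False)[0]
        return video_tokens + out

def plucker_embedding(origins: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Per-ray Plücker coordinates (d, o x d): a 6-channel camera embedding."""
    moment = torch.cross(origins, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1)

# Example shapes (hypothetical): 8 frames, 64 tokens/frame, 256 channels, 24 pose tokens.
x = torch.randn(1, 8, 64, 256)
pose = torch.randn(1, 24, 256)
x = DisentangledSTBlock(256)(x)
x = InterspatialBlock(256)(x.flatten(1, 2), pose)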
@article{shao2024isa4d,
  title     = {ISA4D: Interspatial Attention for Efficient 4D Human Video Generation},
  author    = {Shao, Ruizhi and Xu, Yinghao and Shen, Yujun and Yang, Ceyuan and Zheng, Yang and Chen, Changan and Liu, Yebin and Wetzstein, Gordon},
  journal   = {ACM Transactions on Graphics (TOG)},
  year      = {2025},
  publisher = {ACM New York, NY, USA}
}