Overview of our diffusion transformer architecture for 4D human generation. The framework takes a reference image, SMPL conditions, camera poses, and background videos as input. It first tokenizes the 3D SMPL conditions. In parallel, 2D video tokens are optionally composited with background elements and processed by a cascade of disentangled spatial and temporal transformer blocks, enabling efficient modeling of spatio-temporal relationships. These video tokens then interact with the pose tokens via our Interspatial Transformer Block, providing effective 3D-aware conditioning. The resulting features are further combined with Plücker camera embeddings for precise view control and attend to reference image features through cross attention to preserve identity. The entire framework is optimized with a flow-based diffusion formulation, enabling high-quality 4D human generation with controllable pose, viewpoint, and identity.
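To make the two attention stages concrete, below is a minimal PyTorch sketch of disentangled spatial/temporal attention over video tokens, cross attention from video tokens to SMPL pose tokens, and a Plücker ray embedding. The class names (DisentangledSTBlock, InterspatialBlock), the plucker_embedding helper, and all tensor shapes are illustrative assumptions, not the released ISA4D implementation.

import torch
import torch.nn as nn

class DisentangledSTBlock(nn.Module):
    """One disentangled spatial + temporal attention pass (assumed layout).

    Video tokens have shape (B, T, N, C): T frames, N spatial tokens per
    frame, C channels.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, n, c = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        s = x.reshape(b * t, n, c)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q, need_weights=False)[0]
        x = s.reshape(b, t, n, c)
        # Temporal attention: each spatial location attends across frames.
        v = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        q = self.norm2(v)
        v = v + self.temporal_attn(q, q, q, need_weights=False)[0]
        return v.reshape(b, n, t, c).permute(0, 2, 1, 3)

class InterspatialBlock(nn.Module):
    """3D-aware conditioning (sketch): video tokens cross-attend to pose tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, pose_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, L, C) flattened spatio-temporal tokens
        # pose_tokens:  (B, P, C) tokenized 3D SMPL condition
        q = self.norm(video_tokens)
        out = self.cross_attn(q, pose_tokens, pose_tokens, need_weights=False)[0]
        return video_tokens + out

def plucker_embedding(origins: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    """Per-ray Plücker coordinates (d, o x d): a 6-channel camera embedding."""
    moment = torch.cross(origins, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1)

# Example shapes (hypothetical): 8 frames, 64 tokens/frame, 256 channels, 24 pose tokens.
x = torch.randn(1, 8, 64, 256)
pose = torch.randn(1, 24, 256)
x = DisentangledSTBlock(256)(x)
x = InterspatialBlock(256)(x.flatten(1, 2), pose)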
@article{shao2024isa4d,
  title     = {ISA4D: Interspatial Attention for Efficient 4D Human Video Generation},
  author    = {Shao, Ruizhi and Xu, Yinghao and Shen, Yujun and Yang, Ceyuan and Zheng, Yang and Chen, Changan and Liu, Yebin and Wetzstein, Gordon},
  journal   = {ACM Transactions on Graphics (TOG)},
  year      = {2025},
  publisher = {ACM New York, NY, USA}
}