Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Generative AI & LLMs
Published: arXiv:2602.18422v1
Authors

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.

Paper Summary

Problem
The main challenge this paper addresses is that current video world models for extended reality (XR) respond only to coarse control signals such as text or keyboard input, making them unsuitable for embodied interaction. This limits their utility in applications such as healthcare, education, and entertainment, where interactive and immersive experiences are crucial.
Key Innovation
The key innovation of this paper is the introduction of a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. This model enables dexterous hand-object interactions and allows users to control the virtual environment in a more natural and intuitive way. The researchers also propose an effective mechanism for 3D head and hand control, which is a significant improvement over existing diffusion transformer conditioning strategies.
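To make the conditioning idea concrete, here is a minimal, hypothetical sketch of one common way to inject tracked poses into a diffusion transformer: flatten the per-frame head pose and hand joint positions into a vector, project it to the model's token width, and prepend it to the frame's latent tokens. All dimensions, the joint convention, and the simple linear projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper):
HEAD_DIM = 7             # 3D position + unit quaternion
HAND_JOINTS = 21         # joints per hand (MANO-style convention)
POSE_DIM = HEAD_DIM + 2 * HAND_JOINTS * 3   # 7 + 126 = 133
TOKEN_DIM = 64           # assumed width of the transformer token stream

rng = np.random.default_rng(0)
# Stand-in for a learned projection from pose space to token space.
W = rng.standard_normal((POSE_DIM, TOKEN_DIM)) / np.sqrt(POSE_DIM)

def pose_to_token(head_pose, left_hand, right_hand):
    """Flatten one frame's tracked poses into a single conditioning token."""
    pose = np.concatenate([
        head_pose.ravel(),   # (7,)
        left_hand.ravel(),   # (21, 3) -> (63,)
        right_hand.ravel(),  # (21, 3) -> (63,)
    ])
    return pose @ W          # (TOKEN_DIM,)

def condition_frame(latent_tokens, head_pose, left_hand, right_hand):
    """Prepend the pose token to the frame's latent tokens, so the
    transformer can attend to the control signal at every layer."""
    tok = pose_to_token(head_pose, left_hand, right_hand)
    return np.vstack([tok[None, :], latent_tokens])

# Example: condition one frame represented by 16 latent tokens.
latents = rng.standard_normal((16, TOKEN_DIM))
head = rng.standard_normal(HEAD_DIM)
lh = rng.standard_normal((HAND_JOINTS, 3))
rh = rng.standard_normal((HAND_JOINTS, 3))
out = condition_frame(latents, head, lh, rh)
print(out.shape)  # (17, 64)
```

Prepending conditioning tokens is only one of several strategies the paper compares (alternatives include adaptive normalization or cross-attention); the sketch is meant to show the shape of the control signal, not the mechanism the authors ultimately propose.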
Practical Impact
This research has the potential to advance extended reality by enabling interactive, immersive virtual environments that users control in real time with their natural head and hand movements. This could have a significant impact on industries such as healthcare, education, and entertainment, where users could acquire skills and practice complex tasks in a more engaging and effective way.
Analogy / Intuitive Explanation
Imagine playing a video game where you can control the character's movements and actions with your own body. You can wave your hand to pick up an object, or move your head to look around. This is similar to what the researchers have achieved in this paper, where they have developed a system that can generate a virtual environment in real-time based on a user's head and hand movements. This system has the potential to create immersive and interactive experiences that feel more natural and intuitive.
Paper Information
Categories:
cs.CV
Published Date:

arXiv ID:
2602.18422v1
