AnyView: Synthesizing Any Novel View in Dynamic Scenes

Generative AI & LLMs
Published: arXiv 2601.16982v1
Authors

Basile Van Hoorick, Dian Chen, Shun Iwase, Pavel Tokmakov, Muhammad Zubair Irshad, Igor Vasiljevic, Swati Gupta, Fangzhou Cheng, Sergey Zakharov, Vitor Campagnolo Guizilini

Abstract

Modern generative video models excel at producing convincing, high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments. In this work, we introduce AnyView, a diffusion-based video generation framework for dynamic view synthesis with minimal inductive biases or geometric assumptions. We leverage multiple data sources with various levels of supervision, including monocular (2D), multi-view static (3D) and multi-view dynamic (4D) datasets, to train a generalist spatiotemporal implicit representation capable of producing zero-shot novel videos from arbitrary camera locations and trajectories. We evaluate AnyView on standard benchmarks, showing competitive results with the current state of the art, and propose AnyViewBench, a challenging new benchmark tailored towards extreme dynamic view synthesis in diverse real-world scenarios. In this more dramatic setting, we find that most baselines drastically degrade in performance, as they require significant overlap between viewpoints, while AnyView maintains the ability to produce realistic, plausible, and spatiotemporally consistent videos when prompted from any viewpoint. Results, data, code, and models can be viewed at: https://tri-ml.github.io/AnyView/
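
The abstract outlines the approach at a high level: a camera-conditioned video diffusion model trained on a mixture of monocular (2D), multi-view static (3D), and multi-view dynamic (4D) data. The Python sketch below is only a minimal illustration of that mixed-supervision idea, not the authors' implementation; ToyVideoDenoiser, sample_batch, the tensor shapes, and the noise schedule are all hypothetical stand-ins.

```python
# Minimal, illustrative mixed-supervision training loop for a
# camera-conditioned video diffusion model. Hypothetical stand-in code,
# not the AnyView implementation: ToyVideoDenoiser, sample_batch, shapes,
# and the noise schedule are all placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoDenoiser(nn.Module):
    """Stand-in backbone: predicts noise from a noisy video clip,
    a diffusion timestep, and per-frame camera parameters."""
    def __init__(self, channels=3, cam_dim=12):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, channels)   # camera conditioning
        self.t_proj = nn.Linear(1, channels)           # timestep conditioning
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, t, cams):
        # noisy_video: (B, T, C, H, W), t: (B,), cams: (B, T, cam_dim)
        cam = self.cam_proj(cams)[..., None, None]              # (B, T, C, 1, 1)
        tim = self.t_proj(t[:, None])[:, None, :, None, None]   # (B, 1, C, 1, 1)
        x = noisy_video + cam + tim
        # Conv3d expects (B, C, T, H, W), so swap time and channel dims.
        return self.net(x.transpose(1, 2)).transpose(1, 2)


def sample_batch(source):
    """Hypothetical loader. 2D batches carry no usable poses (zeros here),
    3D batches are multi-view but static, 4D batches are multi-view and
    dynamic. Random tensors stand in for real data."""
    video = torch.randn(2, 8, 3, 32, 32)                        # (B, T, C, H, W)
    cams = torch.zeros(2, 8, 12) if source == "2d" else torch.randn(2, 8, 12)
    return video, cams


model = ToyVideoDenoiser()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):
    # Alternate over supervision levels so one model sees all data types.
    source = ["2d", "3d_static", "4d_dynamic"][step % 3]
    video, cams = sample_batch(source)
    noise = torch.randn_like(video)
    t = torch.rand(video.shape[0])                              # timestep in [0, 1)
    alpha = (1.0 - t).view(-1, 1, 1, 1, 1)                      # toy noise schedule
    noisy = alpha.sqrt() * video + (1.0 - alpha).sqrt() * noise
    pred = model(noisy, t, cams)
    loss = F.mse_loss(pred, noise)                              # epsilon-prediction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(source, float(loss))
```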

Paper Summary

Problem
The paper addresses the problem of generating videos from arbitrary camera perspectives in dynamic scenes. Current video generation models produce high-quality outputs, but struggle to maintain multi-view and spatiotemporal consistency in highly dynamic real-world environments.
Key Innovation
The key innovation is AnyView, a diffusion-based video generation framework that synthesizes novel views of dynamic scenes with minimal inductive biases or geometric assumptions. AnyView operates end-to-end, without explicit scene reconstruction or expensive test-time optimization.
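
To make the "end-to-end, no reconstruction, no test-time optimization" claim concrete, the sketch below shows what such an inference interface could look like: the novel-view video comes from a single conditional denoising loop driven by the source video and a target camera trajectory. Everything here (the function name, the denoiser callable, the placeholder update rule) is a hypothetical illustration, not the released AnyView API.

```python
# Illustrative interface only, not the released AnyView API: the novel
# video is produced by a single conditional sampling loop, with no
# explicit scene model and no per-scene optimization. `denoiser` stands
# in for a pretrained camera-conditioned diffusion model; the update
# rule is a placeholder.
import numpy as np


def synthesize_novel_view(source_frames,   # (T, H, W, 3) observed video
                          source_cams,     # (T, 3, 4) observed extrinsics
                          target_cams,     # (T, 3, 4) requested trajectory
                          denoiser,        # callable: (x, t, cond) -> noise estimate
                          num_steps=50):
    """Single feed-forward sampling pass; no reconstruction step."""
    cond = {"frames": source_frames, "src_cams": source_cams, "tgt_cams": target_cams}
    x = np.random.randn(*source_frames.shape)        # start from Gaussian noise
    for i in reversed(range(num_steps)):
        t = (i + 1) / num_steps
        eps = denoiser(x, t, cond)                   # conditional noise prediction
        x = x - (1.0 / num_steps) * eps              # placeholder denoising update
    return x                                         # (T, H, W, 3) novel-view video


if __name__ == "__main__":
    stub = lambda x, t, cond: np.zeros_like(x)       # trivial stub denoiser
    out = synthesize_novel_view(np.zeros((8, 32, 32, 3)),
                                np.zeros((8, 3, 4)),
                                np.zeros((8, 3, 4)),
                                denoiser=stub)
    print(out.shape)                                 # (8, 32, 32, 3)
```

A reconstruction-based pipeline would instead fit an explicit 3D/4D scene model and render it from the target poses, often with per-scene optimization; the end-to-end formulation above avoids both steps.
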
Practical Impact
This research has the potential to impact fields such as dynamic scene reconstruction, world models, robotics, and self-driving. AnyView can generate realistic, temporally stable, and self-consistent videos across large viewpoint changes, making it useful for applications where camera poses shift significantly. This could improve the performance of robots, autonomous vehicles, and virtual reality systems, among others.
Analogy / Intuitive Explanation
Imagine you're watching a movie and suddenly the camera zooms in or out, or changes perspective. AnyView is like a superpower that allows video generation models to "see" the scene from any angle, even if the camera didn't capture it directly. It's like having a mental "re-projection" capability that infers likely layouts, object shapes, and scene completions from limited information, just like humans do.
Paper Information
Categories: cs.CV cs.LG cs.RO
Published Date:
arXiv ID: 2601.16982v1
