360Anything: Geometry-Free Lifting of Images and Videos to 360°

Published: arXiv:2601.16192v1
Authors

Ziyi Wu, Daniel Watson, Andrea Tagliasacchi, David J. Fleet, Marcus A. Brubaker, Saurabh Saxena

Abstract

Lifting perspective images and videos to 360° panoramas enables immersive 3D world generation. Existing approaches often rely on explicit geometric alignment between the perspective and the equirectangular projection (ERP) space. Yet, this requires known camera metadata, hindering application to in-the-wild data where such calibration is typically absent or noisy. We propose 360Anything, a geometry-free framework built upon pre-trained diffusion transformers. By treating the perspective input and the panorama target simply as token sequences, 360Anything learns the perspective-to-equirectangular mapping in a purely data-driven way, eliminating the need for camera information. Our approach achieves state-of-the-art performance on both image and video perspective-to-360° generation, outperforming prior works that use ground-truth camera information. We also trace the root cause of seam artifacts at ERP boundaries to zero-padding in the VAE encoder, and introduce Circular Latent Encoding to facilitate seamless generation. Finally, we show competitive results on zero-shot camera FoV and orientation estimation benchmarks, demonstrating 360Anything's deep geometric understanding and broader utility in computer vision tasks. Additional results are available at https://360anything.github.io/.
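A note on the seam fix mentioned above: the abstract attributes boundary seams to zero-padding in the VAE encoder and proposes Circular Latent Encoding as the remedy. The paper's exact implementation is not reproduced here, but the underlying idea, wrapping the equirectangular image horizontally so the encoder sees continuous content across the left/right boundary rather than zeros, can be sketched roughly as follows. The function name, the pad size, and the vae_encode callable are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def encode_with_circular_padding(vae_encode, erp_image, pad=8):
    # erp_image: (B, C, H, W) equirectangular frame; longitude wraps along W.
    # Circularly pad the width so convolutions near the boundary see content
    # from the opposite side of the panorama instead of zeros.
    wrapped = F.pad(erp_image, (pad, pad, 0, 0), mode="circular")
    latent = vae_encode(wrapped)                       # e.g. (B, c, H/ds, (W + 2*pad)/ds)
    ds = wrapped.shape[-1] // latent.shape[-1]         # VAE spatial downsampling factor
    crop = pad // ds                                   # latent columns added by the padding
    return latent[..., crop:latent.shape[-1] - crop]   # drop wrapped columns, keep W/ds
```

Here pad is assumed to be a multiple of the VAE downsampling factor so that the wrapped latent columns can be cropped cleanly.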

Paper Summary

Problem
Generating photorealistic 3D worlds is a challenging task in computer vision. Current approaches for lifting perspective images and videos to 360° panoramas often rely on explicit geometric alignment between the perspective and equirectangular projection (ERP) space, which requires known camera metadata. This limitation makes it difficult to apply these approaches to in-the-wild data where camera calibration is absent or noisy.
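To see why such geometric alignment is brittle, consider the standard warp between a perspective image and ERP space: every ERP direction must be projected through a pinhole model that depends on the camera's field of view and orientation. The sketch below uses standard projection math (not code from the paper) to make that dependence explicit; without fov_x_deg and the rotation R, the warp is simply undefined.

```python
import numpy as np

def erp_to_perspective_coords(lon, lat, fov_x_deg, width, height, R=np.eye(3)):
    # lon, lat: ERP angles in radians; returns pixel coordinates in the
    # perspective image. Note the explicit dependence on FoV and orientation R,
    # exactly the metadata that geometry-based methods assume is known.
    d = np.stack([np.cos(lat) * np.sin(lon),
                  np.sin(lat),
                  np.cos(lat) * np.cos(lon)], axis=-1)   # unit ray on the sphere
    d = d @ R.T                                          # rotate into the camera frame
    fx = (width / 2) / np.tan(np.radians(fov_x_deg) / 2) # focal length from FoV
    u = fx * d[..., 0] / d[..., 2] + width / 2           # pinhole projection (valid where z > 0)
    v = fx * d[..., 1] / d[..., 2] + height / 2
    return u, v
```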
Key Innovation
The researchers propose a geometry-free framework called 360Anything, which uses pre-trained diffusion transformers to learn the perspective-to-equirectangular mapping in a purely data-driven way. This approach eliminates the need for camera information and achieves state-of-the-art performance on both image and video perspective-to-360° generation.
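A rough picture of what "treating the input and target as token sequences" could look like is sketched below. All module names and dimensions are hypothetical, and timestep conditioning plus the pre-trained backbone are omitted; the point is only that the perspective latent and the noisy panorama latent are patchified and concatenated into one sequence for a transformer, with no explicit perspective-to-ERP warp anywhere.

```python
import torch
import torch.nn as nn

class GeometryFreeLifter(nn.Module):
    """Minimal sketch of geometry-free conditioning for a diffusion transformer."""
    def __init__(self, dim=1024, latent_ch=16, patch=2):
        super().__init__()
        self.embed_cond = nn.Conv2d(latent_ch, dim, patch, stride=patch)    # perspective tokens
        self.embed_target = nn.Conv2d(latent_ch, dim, patch, stride=patch)  # noisy ERP tokens
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True), num_layers=4)
        self.head = nn.Linear(dim, latent_ch * patch * patch)               # per-patch denoising output

    def forward(self, persp_latent, noisy_erp_latent):
        c = self.embed_cond(persp_latent).flatten(2).transpose(1, 2)        # (B, Nc, dim)
        x = self.embed_target(noisy_erp_latent).flatten(2).transpose(1, 2)  # (B, Nx, dim)
        tokens = torch.cat([c, x], dim=1)                                   # one joint sequence
        out = self.blocks(tokens)[:, c.shape[1]:]                           # keep only ERP tokens
        return self.head(out)                                               # predict denoised patches
```

The design choice worth noticing is that the model never receives camera intrinsics or extrinsics; any geometric relationship between the two token sets has to be learned from paired training data.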
Practical Impact
By removing the need for explicit camera calibration, 360Anything makes perspective-to-360° lifting applicable to in-the-wild images and videos, a step toward fully immersive 3D world generation. This has significant applications in robotics, augmented reality, virtual reality, and gaming. The framework can also be applied to zero-shot camera field-of-view and orientation estimation, demonstrating its broader utility in computer vision tasks.
Analogy / Intuitive Explanation
Imagine standing where a photograph was taken and having to paint everything the camera did not capture: the scene behind you, above you, and to either side. 360Anything acts like an artist who has seen enough paired photos and panoramas to fill in those missing regions convincingly, without being told the camera's field of view or orientation. Rather than relying on explicit geometric formulas, it learns the relationship between a perspective view and the full 360° surroundings directly from data.
Paper Information
Categories: cs.CV
Published Date:
arXiv ID: 2601.16192v1
