Scene Grounding In the Wild

Published: arXiv 2603.26584v1
Authors

Tamir Cohen, Leo Segre, Shay Shomer-Chai, Shai Avidan, Hadar Averbuch-Elor

Abstract

Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models. All code and data will be released.
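To make the representation described above more concrete, here is a minimal sketch of what a semantically augmented Gaussian reference model might look like in memory, assuming a flat array-of-attributes layout. The class name, field names, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (assumed layout, not the authors' code) of a reference
# model in which each 3D Gaussian carries a semantic feature vector alongside
# its usual splatting parameters.
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticGaussians:
    means: np.ndarray      # (N, 3)  Gaussian centers in the reference frame
    scales: np.ndarray     # (N, 3)  per-axis extents
    rotations: np.ndarray  # (N, 4)  unit quaternions
    opacities: np.ndarray  # (N,)    alpha values used during splatting
    colors: np.ndarray     # (N, 3)  RGB appearance
    features: np.ndarray   # (N, D)  semantic features shared across the real and pseudo-synthetic domains
```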

Paper Summary

Problem
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision. This is especially true when the input views have little or no overlap, resulting in multiple disconnected partial reconstructions or erroneous geometry.
Key Innovation
This paper proposes a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. The key innovation is the use of pseudo-synthetic renderings, which provide full scene coverage but differ substantially in appearance from real-world photographs. The framework represents the reference model using 3D Gaussian Splatting and formulates alignment as an inverse feature-based optimization scheme.
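As a rough intuition for what estimating "a global 6DoF pose and scale" against a fixed reference could look like, the sketch below fits a similarity transform by gradient descent on a feature-weighted point loss. Everything here is an illustrative assumption: the function names, the soft feature correspondences, and the temperature are invented for this toy example, and the paper's actual scheme optimizes against semantic features rendered from the 3D Gaussian reference model rather than a point-to-point loss.

```python
# Toy sketch (not the authors' method): align a partial reconstruction to a
# fixed reference by optimizing scale, rotation (axis-angle), and translation.
import torch
import torch.nn.functional as F

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.sqrt((r * r).sum() + 1e-12)
    k = r / theta
    zero = torch.zeros((), dtype=r.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    eye = torch.eye(3, dtype=r.dtype)
    return eye + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

def align(partial_xyz, partial_feat, ref_xyz, ref_feat, steps=500, lr=1e-2):
    """Estimate (s, R, t) so that s * R @ partial_xyz + t lands on the reference.

    partial_xyz: (N, 3) points of one partial reconstruction
    partial_feat: (N, D) semantic features of those points
    ref_xyz, ref_feat: (M, 3), (M, D) reference Gaussian means and features
    """
    log_s = torch.zeros(1, requires_grad=True)   # log-scale, so scale stays positive
    r = torch.zeros(3, requires_grad=True)       # axis-angle rotation
    t = torch.zeros(3, requires_grad=True)       # translation
    opt = torch.optim.Adam([log_s, r, t], lr=lr)

    # Fixed soft correspondences from feature similarity (a simplification;
    # a real pipeline would re-estimate these or render feature maps instead).
    sim = F.normalize(partial_feat, dim=-1) @ F.normalize(ref_feat, dim=-1).T  # (N, M)
    targets = torch.softmax(sim / 0.07, dim=-1) @ ref_xyz                      # (N, 3)

    for _ in range(steps):
        opt.zero_grad()
        R = axis_angle_to_matrix(r)
        warped = log_s.exp() * (partial_xyz @ R.T) + t
        loss = ((warped - targets) ** 2).sum(-1).mean()
        loss.backward()
        opt.step()
    return log_s.exp().detach(), axis_angle_to_matrix(r).detach(), t.detach()
```

The key design point the sketch tries to convey is that the reference model is never updated: only the similarity transform of the partial reconstruction is optimized, with semantic features providing the cross-domain signal.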
Practical Impact
This research has significant practical implications for applications such as:

* Improved 3D reconstruction of large-scale scenes from unstructured imagery
* Enhanced global alignment and consistency in 3D models
* The ability to merge partial, disjoint 3D reconstructions into a unified model
* Potential uses in fields like architecture, urban planning, and geographic information systems (GIS)
Analogy / Intuitive Explanation
Imagine assembling a large puzzle in which many pieces are missing, so the assembled fragments never connect to one another. Each fragment represents a partial 3D reconstruction, and the puzzle board represents the complete reference model. The proposed framework places each fragment in its correct position on the board, so all fragments end up globally consistent even though they share no common pieces. This illustrates the core difficulty of in-the-wild 3D reconstruction: achieving global alignment without visual overlap.
Paper Information

Categories: cs.CV
arXiv ID: 2603.26584v1
