Fus3D: Decoding Consolidated 3D Geometry from Feed-forward Geometry Transformer Latents

Published: arXiv: 2603.25827v1
Authors

Laura Fink, Linus Franke, George Kopanas, Marc Stamminger, Peter Hedman

Abstract

We propose a feed-forward method for dense Signed Distance Field (SDF) regression from unstructured image collections in less than three seconds, without camera calibration or post-hoc fusion. Our key insight is that the intermediate feature space of pretrained multi-view feed-forward geometry transformers already encodes a powerful joint world representation; yet, existing pipelines discard it, routing features through per-view prediction heads before assembling 3D geometry post-hoc, which discards valuable completeness information and accumulates inaccuracies. We instead perform 3D extraction directly from geometry transformer features via learned volumetric extraction: voxelized canonical embeddings that progressively absorb multi-view geometry information through interleaved cross- and self-attention into a structured volumetric latent grid. A simple convolutional decoder then maps this grid to a dense SDF. We additionally propose a scalable, validity-aware supervision scheme directly using SDFs derived from depth maps or 3D assets, tackling practical issues like non-watertight meshes. Our approach yields complete and well-defined distance values across sparse- and dense-view settings and demonstrates geometrically plausible completions. Code and further material can be found at https://lorafib.github.io/fus3d.
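The abstract's "learned volumetric extraction" can be pictured as a grid of canonical voxel embeddings that repeatedly cross-attends to frozen multi-view transformer features and self-attends among the voxels. The toy sketch below illustrates only that data flow with random weights and single-head attention; the grid size, channel count, and block structure are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention (no learned projections here)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
C = 16                # latent channels (illustrative)
G = 4                 # voxel grid resolution -> G**3 query voxels
n_tokens = 2 * 100    # e.g. 2 views x 100 patch tokens

# canonical voxel embeddings (learned in the real model; random here)
voxels = rng.normal(size=(G**3, C))
# frozen intermediate features from the pretrained geometry transformer
img_feats = rng.normal(size=(n_tokens, C))

# interleaved cross- and self-attention blocks with residual connections
for _ in range(2):
    voxels = voxels + attention(voxels, img_feats, img_feats)  # cross-attn
    voxels = voxels + attention(voxels, voxels, voxels)        # self-attn

# structured volumetric latent grid, ready for a convolutional SDF decoder
latent_grid = voxels.reshape(G, G, G, C)
```

In the paper, a simple convolutional decoder then maps this latent grid to a dense SDF; here the grid is just reshaped to show the intended layout.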

Paper Summary

Problem
The main problem the paper addresses is how to reconstruct 3D scenes from unstructured image collections, particularly in sparse-view settings where input views are limited or poorly conditioned. This is a fundamental challenge in computer vision with broad implications for tasks like semantic understanding, robotics, and scene interaction.
Key Innovation
The key innovation of this work is Fus3D, a feed-forward pipeline for dense surface reconstruction that directly regresses Signed Distance Functions (SDFs) from the intermediate feature space of a pretrained multi-view geometry transformer. This approach bypasses traditional predict-then-fuse pipelines, which discard valuable completeness information and accumulate inaccuracies as the number of views grows.
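The abstract also mentions a validity-aware supervision scheme built from SDFs derived from depth maps. A common way to obtain such targets is a truncated signed distance along each viewing ray, with a mask marking pixels that have no depth. The sketch below is a minimal, hypothetical version of that idea (the function name, truncation value, and invalid-depth convention are assumptions, not the paper's exact scheme).

```python
import numpy as np

def tsdf_from_depth(points_z, pixel_depth, trunc=0.1):
    """Truncated signed distance of query points along the viewing ray.

    points_z    -- z-depth of query points in the camera frame
    pixel_depth -- observed depth at the pixel each point projects to
                   (0 marks invalid / missing depth)
    Returns (tsdf, valid): positive in front of the observed surface,
    negative behind it; `valid` masks out pixels without a depth reading.
    """
    valid = pixel_depth > 0
    sdf = pixel_depth - points_z            # + in free space, - behind surface
    tsdf = np.clip(sdf, -trunc, trunc)      # truncate far from the surface
    return np.where(valid, tsdf, 0.0), valid

# four samples along one ray; the last pixel has no depth measurement
pts = np.array([0.8, 1.0, 1.2, 0.5])
obs = np.array([1.0, 1.0, 1.0, 0.0])
t, v = tsdf_from_depth(pts, obs)
# t -> [0.1, 0.0, -0.1, 0.0], v -> [True, True, True, False]
```

The validity mask is what lets training skip regions where the depth source is undefined, e.g. holes in depth maps or non-watertight meshes.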
Practical Impact
Fus3D's fast, calibration-free reconstruction has practical applications in several fields:
* **Robotics**: Accurate 3D reconstruction enables robots to navigate and interact with their environment more effectively.
* **Computer-Aided Design (CAD)**: Fus3D can help create detailed 3D models of objects and scenes, facilitating design and simulation.
* **Virtual Reality (VR) and Augmented Reality (AR)**: High-quality 3D reconstructions enable immersive experiences and applications.
Analogy / Intuitive Explanation
Imagine trying to assemble a 3D puzzle from a set of 2D images. Traditional pipelines would attempt to solve each image individually, then try to merge the results, which can lead to gaps and inaccuracies. Fus3D, on the other hand, takes the intermediate features from the 2D images and directly assembles them into a complete 3D model, like a 3D puzzle that fits together seamlessly. This approach preserves the joint multi-view prior, resulting in more accurate and complete reconstructions.
Paper Information
Categories: cs.CV
arXiv ID: 2603.25827v1