GPA-VGGT: Adapting VGGT to Large-Scale Localization by Self-Supervised Learning with Geometry- and Physics-Aware Loss

Published: arXiv:2601.16885v1
Authors

Yangfan Xu, Lilian Zhang, Xiaofeng He, Pengdong Wu, Wenqi Wu, Jun Mao

Abstract

Transformer-based general visual geometry frameworks have shown promising performance in camera pose estimation and 3D scene understanding. In particular, the Visual Geometry Grounded Transformer (VGGT) delivers strong camera pose estimation and 3D reconstruction, but such models typically rely on ground-truth labels for training, which makes adapting them to unlabeled, unseen scenes difficult. In this paper, we propose a self-supervised framework to train VGGT with unlabeled data, thereby enhancing its localization capability in large-scale environments. To achieve this, we extend conventional pair-wise relations to sequence-wise geometric constraints for self-supervised learning. Specifically, within each sequence we sample multiple source frames and geometrically project them onto different target frames, which improves temporal feature consistency. We formulate physics-aware photometric consistency and geometric constraints as a joint optimization loss, circumventing the need for hard labels. Trained this way, not only the local and global cross-view attention layers but also the camera and depth heads effectively capture the underlying multi-view geometry. Experiments demonstrate that the model converges within hundreds of iterations and achieves significant improvements in large-scale localization. Our code will be released at https://github.com/X-yangfan/GPA-VGGT.
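As a rough illustration of the joint optimization described in the abstract, the sketch below implements a minimal photometric-plus-geometric consistency loss in PyTorch: predicted depth and a relative pose warp a source frame into a target view, and the loss mixes the photometric error with a cross-view depth-consistency term. The function names (`photometric_geometric_loss`, `backproject`), the tensor layouts, and this particular warping formulation are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth (B,1,H,W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)
    pix = pix.reshape(1, 3, -1)            # homogeneous pixels, (1,3,HW)
    rays = K_inv @ pix                     # camera rays, (B,3,HW)
    return rays * depth.reshape(B, 1, -1)  # scaled by depth, (B,3,HW)

def photometric_geometric_loss(img_src, img_tgt, depth_src, depth_tgt,
                               T_src_to_tgt, K, K_inv, alpha=0.85):
    """Warp the source frame into the target view with predicted depth and
    relative pose (B,4,4), then penalise photometric error plus cross-view
    depth inconsistency. Out-of-view masking is omitted for brevity."""
    B, _, H, W = img_src.shape
    # 3D points in the source camera, moved into the target camera frame.
    pts_src = backproject(depth_src, K_inv)
    R, t = T_src_to_tgt[:, :3, :3], T_src_to_tgt[:, :3, 3:]
    pts_tgt = R @ pts_src + t
    # Perspective projection into target pixel coordinates.
    proj = K @ pts_tgt
    z = proj[:, 2:3].clamp(min=1e-6)
    uv = proj[:, :2] / z
    # Normalise pixel coordinates to [-1, 1] for grid_sample.
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    # Sample the target image and depth at the projected locations.
    img_warp = F.grid_sample(img_tgt, grid, align_corners=True)
    depth_warp = F.grid_sample(depth_tgt, grid, align_corners=True)
    # Photometric consistency: source frame vs. its warped reconstruction.
    loss_photo = (img_src - img_warp).abs().mean()
    # Geometric consistency: projected depth vs. sampled target depth.
    z_map = z.reshape(B, 1, H, W)
    loss_geo = ((z_map - depth_warp).abs() / (z_map + depth_warp)).mean()
    return alpha * loss_photo + (1.0 - alpha) * loss_geo
```

Because both terms are differentiable in the predicted depths and poses, gradients flow back into the camera and depth heads without any ground-truth labels, which is the point of the self-supervised formulation.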

Paper Summary

Problem
Estimating camera pose and scene geometry is a crucial problem in computer vision, with applications in visual localization, autonomous driving, and large-scale 3D scene understanding. However, existing methods struggle to maintain physically consistent geometry when scaled to long trajectories and complex environments.
Key Innovation
This paper proposes a novel self-supervised framework to train the Visual Geometry Grounded Transformer (VGGT) for large-scale localization using unlabeled data. The framework extends conventional pair-wise relations to sequence-wise geometric constraints, improving temporal feature consistency and enabling the model to capture underlying multi-view geometry.
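The sequence-wise extension can be pictured as drawing many ordered (source, target) pairs from one sequence and summing the joint loss over them, rather than supervising a single adjacent pair. The minimal sketch below reuses the `photometric_geometric_loss` sketch from the abstract section; the random-pair sampling strategy, list-based interfaces, and camera-to-world pose convention are our own assumptions for illustration, not the paper's exact procedure.

```python
import itertools
import torch

def sequence_loss(frames, depths, poses, K, K_inv, num_pairs=8):
    """Aggregate the joint loss over randomly drawn (source, target)
    pairs from a whole sequence.

    frames, depths: lists of (B,3,H,W) images and (B,1,H,W) depths.
    poses: list of (B,4,4) camera-to-world matrices (an assumption).
    """
    S = len(frames)
    pairs = list(itertools.permutations(range(S), 2))  # all ordered pairs
    picks = torch.randperm(len(pairs))[:num_pairs].tolist()
    total = 0.0
    for k in picks:
        i, j = pairs[k]
        # Relative transform taking source-frame points into the target frame.
        T_src_to_tgt = torch.linalg.inv(poses[j]) @ poses[i]
        total = total + photometric_geometric_loss(
            frames[i], frames[j], depths[i], depths[j],
            T_src_to_tgt, K, K_inv)
    return total / num_pairs
```

Spreading supervision across many cross-frame pairs in this way is what enforces temporal feature consistency over the whole sequence instead of only between neighbours.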
Practical Impact
The framework has direct practical value for large-scale camera pose and scene geometry estimation. Because it trains on unlabeled data, the model can be adapted to unseen, in-the-wild environments where ground-truth labels are unavailable. Learning stable, scalable geometric representations in this way improves both accuracy and generalization in large-scale settings.
Analogy / Intuitive Explanation
Imagine trying to build a 3D puzzle with a large number of pieces. Existing methods might focus on pairing individual pieces, but this approach has limitations when dealing with complex scenes and long trajectories. The proposed framework takes a more comprehensive approach, considering the relationships between multiple pieces across the entire puzzle, allowing for more accurate and robust results.
Paper Information
Categories: cs.CV cs.RO
arXiv ID: 2601.16885v1
