DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding

Agentic AI

Published: arXiv: 2511.11552v1

Authors

Dawei Zhu Rui Meng Jiefeng Chen Sujian Li Tomas Pfister Jinsung Yoon

Abstract

Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework's superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.

Paper Summary

Problem

Long visual documents such as financial reports, academic papers, and technical manuals are hard to understand because the information needed to answer a question is spread across many pages of text and visual elements. Even advanced Vision-Language Models (VLMs) struggle with these documents, primarily because they have difficulty localizing the relevant evidence: they fail to retrieve the right pages and overlook fine-grained details within visual elements.

Key Innovation

DocLens is a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It consists of two primary components: the Lens Module and the Reasoning Module. The Lens Module identifies relevant pages and key elements within them, while the Reasoning Module conducts in-depth analysis of this evidence to generate a precise answer.
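The two-module flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the callables is_relevant, find_elements, and propose_answer are hypothetical stand-ins for the VLM agents and tools, num_samples is an assumed parameter, and the adjudication step is approximated here by a simple majority vote, which may differ from the mechanism used in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Evidence:
    page: int            # index of a page judged relevant to the question
    elements: List[str]  # key elements (e.g. charts, tables) found on that page


def lens_module(
    pages: Sequence[str],
    question: str,
    is_relevant: Callable[[str, str], bool],        # hypothetical page retriever
    find_elements: Callable[[str, str], List[str]], # hypothetical element locator
) -> List[Evidence]:
    """Narrow the full document down to key elements on relevant pages."""
    evidence = []
    for idx, page in enumerate(pages):
        if is_relevant(page, question):
            evidence.append(Evidence(idx, find_elements(page, question)))
    return evidence


def reasoning_module(
    evidence: List[Evidence],
    question: str,
    propose_answer: Callable[[List[Evidence], str, int], str],  # hypothetical sampler
    num_samples: int = 3,                                       # assumed default
) -> str:
    """Sample several candidate answers, then settle on a single one."""
    candidates = [propose_answer(evidence, question, i) for i in range(num_samples)]
    # Assumption: adjudication is sketched as a majority vote over the samples.
    answer, _ = Counter(candidates).most_common(1)[0]
    return answer


# Toy usage with string "pages" and rule-based stand-ins for the agents.
pages = [
    "Revenue chart: 2023 revenue was 5M",
    "Team photo",
    "Revenue table: 2022 revenue was 4M",
]
question = "What was the 2023 revenue?"

evidence = lens_module(
    pages,
    question,
    is_relevant=lambda page, q: "revenue" in page.lower(),
    find_elements=lambda page, q: [page.split(":")[0]],
)
answer = reasoning_module(
    evidence,
    question,
    propose_answer=lambda ev, q, i: "4M" if i == 1 else "5M",
)
print([e.page for e in evidence], answer)
```

In this toy run the stand-in retriever keeps only the two revenue pages, and the vote resolves one dissenting sample. In a real system each callable would wrap a VLM or tool call instead of a string rule.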

Practical Impact

By localizing relevant evidence before reasoning over it, DocLens could help professionals, students, and researchers extract accurate answers from long, complex documents more quickly. Paired with Gemini-2.5-Pro, it achieves state-of-the-art results on MMLongBench-Doc and FinRAGBench-V. The gains are most evident on vision-centric and unanswerable queries, which the authors attribute to the framework's enhanced localization capabilities.

Analogy / Intuitive Explanation

Imagine trying to find a specific sentence in a 100-page book. Reading the whole book from cover to cover is slow and labor-intensive. DocLens works like a lens that first narrows in on the relevant pages and then on the key passages and figures within them, so the information you need is found much faster.

Paper Information

Categories:

cs.CV cs.CL

arXiv ID:

2511.11552v1
