Information Router for Mitigating Modality Dominance in Vision-Language Models

Generative AI & LLMs
arXiv: 2604.16264v1
Authors

Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

Abstract

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses; it cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR: Multi-modal Information Router, an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently yields more balanced modality contributions and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.

Paper Summary

Problem
Vision-Language Models (VLMs) achieve impressive performance across a wide range of benchmarks, but they often suffer from "modality dominance": the model relies too heavily on a single modality, such as text or images, and neglects the other. This can cause performance to break down, especially when one modality is corrupted or degraded.
Key Innovation
The researchers propose a new approach called MoIR (Multi-modal Information Router) to mitigate modality dominance. MoIR is an information-level fusion method that explicitly reduces the information disparity between modalities. It identifies less informative tokens and selectively routes complementary information from the stronger modality to construct information-dense token representations before they reach the language model.
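To make the routing idea concrete, below is a minimal PyTorch sketch of what such an information router could look like. This is an illustrative assumption, not the paper's actual architecture: the InfoRouter name, the linear informativeness score, the fixed routing ratio, and the single cross-attention layer are hypothetical stand-ins for whatever scoring and fusion mechanisms MoIR actually uses.

```python
# Hypothetical sketch of an information-routing step in the spirit of MoIR.
# All module names and design choices here are illustrative assumptions.
import torch
import torch.nn as nn


class InfoRouter(nn.Module):
    """Routes information from a 'strong' modality into the low-information
    tokens of a 'weak' modality before both streams reach the LLM."""

    def __init__(self, dim: int, num_heads: int = 8, route_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # per-token informativeness score
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.route_ratio = route_ratio            # fraction of tokens to enrich

    def forward(self, weak_tokens: torch.Tensor, strong_tokens: torch.Tensor):
        # weak_tokens:   (B, N_w, D) tokens from the less informative modality
        # strong_tokens: (B, N_s, D) tokens from the more informative modality
        B, N_w, D = weak_tokens.shape

        # 1) Score each weak-modality token; a lower score means less informative.
        scores = self.score(weak_tokens).squeeze(-1)            # (B, N_w)

        # 2) Select the bottom-k tokens that should receive complementary info.
        k = max(1, int(self.route_ratio * N_w))
        _, low_idx = torch.topk(-scores, k, dim=-1)             # indices of weakest tokens

        # 3) Gather those tokens and let them attend to the strong modality.
        gather_idx = low_idx.unsqueeze(-1).expand(-1, -1, D)    # (B, k, D)
        queries = torch.gather(weak_tokens, 1, gather_idx)
        routed, _ = self.cross_attn(queries, strong_tokens, strong_tokens)

        # 4) Write the enriched tokens back in place (residual update).
        return weak_tokens.scatter(1, gather_idx, queries + routed)


# Usage example: enrich degraded image tokens with text information.
router = InfoRouter(dim=768)
image_tokens = torch.randn(2, 196, 768)   # e.g., noisy, low-information vision tokens
text_tokens = torch.randn(2, 32, 768)     # stronger text modality
fused_image_tokens = router(image_tokens, text_tokens)
print(fused_image_tokens.shape)           # torch.Size([2, 196, 768])
```

In this sketch, only the weakest tokens are updated, so well-informed tokens pass through unchanged; the actual criterion MoIR uses to decide which tokens are information-poor, and how much is routed, is not shown here and would follow the paper.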
Practical Impact
MoIR has the potential to improve the robustness and downstream performance of VLMs, particularly in real-world scenarios where input modalities differ in information density and signal-to-noise ratio. By modifying what information is available to each modality's tokens, MoIR encourages more balanced modality contributions and reduces reliance on a single modality, leading to more reliable and accurate multi-modal systems.
Analogy / Intuitive Explanation
Think of MoIR as a traffic controller that directs information flow between modalities. Just as a traffic controller keeps traffic moving smoothly and safely, MoIR ensures that the flow of information between modalities is balanced and effective. In doing so, it mitigates modality dominance and lets VLMs make better-informed decisions.
Paper Information
Categories:
cs.CV cs.LG
arXiv ID: 2604.16264v1
