Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation

AI in healthcare
Published: arXiv:2508.13068v1
Authors

Tanjim Islam Riju, Shuchismita Anwar, Saman Sarker Joy, Farig Sadeque, Swakkhar Shatabda

Abstract

We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
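
To make the multi-term gaze-attention loss concrete, below is a minimal PyTorch sketch that combines the four terms named in the abstract (MSE, KL divergence, correlation, and center-of-mass alignment). The tensor shapes, normalization choices, and weighting coefficients are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch of a multi-term gaze-attention loss (MSE + KL +
# correlation + center-of-mass alignment). Shapes, normalization, and the
# weights w_* are assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F


def gaze_attention_loss(attn, gaze, w_mse=1.0, w_kl=1.0, w_corr=1.0, w_com=1.0, eps=1e-8):
    """attn, gaze: (B, H, W) model attention maps and gaze fixation heatmaps."""
    B, H, W = attn.shape
    attn_flat = attn.view(B, -1)
    gaze_flat = gaze.view(B, -1)

    # Normalize both maps to probability distributions over pixels.
    p_attn = attn_flat / (attn_flat.sum(dim=1, keepdim=True) + eps)
    p_gaze = gaze_flat / (gaze_flat.sum(dim=1, keepdim=True) + eps)

    # 1) Pixel-wise fidelity: mean squared error.
    l_mse = F.mse_loss(p_attn, p_gaze)

    # 2) Distribution match: KL divergence between gaze and attention.
    l_kl = F.kl_div((p_attn + eps).log(), p_gaze, reduction="batchmean")

    # 3) Pattern-aware similarity: 1 - Pearson correlation per sample.
    a_c = p_attn - p_attn.mean(dim=1, keepdim=True)
    g_c = p_gaze - p_gaze.mean(dim=1, keepdim=True)
    corr = (a_c * g_c).sum(dim=1) / (a_c.norm(dim=1) * g_c.norm(dim=1) + eps)
    l_corr = (1.0 - corr).mean()

    # 4) Center-of-mass alignment: match the expected (y, x) location.
    ys = torch.linspace(0, 1, H, device=attn.device)
    xs = torch.linspace(0, 1, W, device=attn.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([grid_y, grid_x], dim=-1).view(-1, 2)  # (H*W, 2)
    com_attn = p_attn @ coords
    com_gaze = p_gaze @ coords
    l_com = F.mse_loss(com_attn, com_gaze)

    return w_mse * l_mse + w_kl * l_kl + w_corr * l_corr + w_com * l_com
```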

Paper Summary

Problem
Radiology reports are crucial for clinical decision-making, but current report-generation systems struggle to produce accurate and interpretable reports from chest X-rays. A central challenge is integrating multiple data modalities, including visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals.
Key Innovation
The proposed framework addresses this challenge by introducing a two-stage multimodal approach that leverages the MIMIC-Eye dataset. The first stage uses gaze-guided contrastive learning to improve disease classification, while the second stage generates region-aware radiology reports. The key innovation is the use of a novel multi-term gaze-attention loss that unifies different facets of fixation data, such as pixel-wise fidelity and pattern-aware similarity.
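As an intuition for the gaze-guided contrastive stage, the sketch below pairs gaze-weighted visual embeddings with label embeddings under a symmetric InfoNCE-style objective. The pooling, pairing scheme, and temperature are assumptions; the paper's exact contrastive formulation is not specified in this summary.

```python
# Illustrative InfoNCE-style contrastive objective between gaze-weighted
# image embeddings and label embeddings. The pairing scheme and temperature
# are assumptions for intuition only.
import torch
import torch.nn.functional as F


def gaze_guided_contrastive_loss(img_feats, gaze_maps, label_embs, temperature=0.07):
    """img_feats: (B, C, H, W) backbone features; gaze_maps: (B, H, W) fixation
    heatmaps; label_embs: (B, C) label embeddings projected to the visual dim."""
    B, C, H, W = img_feats.shape
    w = gaze_maps.view(B, 1, -1)
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)

    # Pool visual features under the radiologist's gaze.
    pooled = (img_feats.view(B, C, -1) * w).sum(dim=-1)  # (B, C)
    z_img = F.normalize(pooled, dim=-1)
    z_lab = F.normalize(label_embs, dim=-1)

    # Symmetric InfoNCE: matched image/label pairs are positives.
    logits = z_img @ z_lab.t() / temperature  # (B, B)
    targets = torch.arange(B, device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```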
Practical Impact
This research has several practical implications. First, it demonstrates the effectiveness of incorporating eye-tracking data into multimodal learning for disease classification and report generation. Second, it shows how to generate region-aligned sentences via structured prompts, improving report quality. Third, it highlights the potential benefits of using gaze-informed attention supervision in radiology applications.
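The report-generation step described above can be illustrated with a short sketch: confidence-weighted keywords are filtered by a threshold, mapped to anatomical regions through a small dictionary, and assembled into a structured prompt. The dictionary entries, threshold, and prompt template are illustrative placeholders rather than the paper's curated resources.

```python
# Minimal sketch of the stage-2 pipeline: keyword -> region mapping plus a
# structured prompt. Entries, threshold, and template are hypothetical.
KEYWORD_TO_REGION = {
    "cardiomegaly": "cardiac silhouette",
    "pleural effusion": "costophrenic angles",
    "atelectasis": "lung bases",
    "pneumothorax": "pleural space",
}


def build_structured_prompt(keyword_scores, threshold=0.5):
    """keyword_scores: dict of diagnostic keyword -> classifier confidence."""
    findings = []
    for kw, score in sorted(keyword_scores.items(), key=lambda x: -x[1]):
        if score < threshold:
            continue
        region = KEYWORD_TO_REGION.get(kw, "unspecified region")
        findings.append(f"- {kw} (confidence {score:.2f}) localized to the {region}")
    header = "Write one sentence per finding, referencing its anatomical region:\n"
    return header + "\n".join(findings) if findings else "No confident findings."


# Example usage with hypothetical classifier outputs.
print(build_structured_prompt({"cardiomegaly": 0.82, "pneumothorax": 0.31}))
```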
Analogy / Intuitive Explanation
Imagine trying to understand a medical image without knowing what the radiologist is looking at. By incorporating eye-tracking data, this framework helps machines "look" at the same things as humans, improving their ability to classify diseases and generate accurate reports. This is like having a virtual guide that shows you where to focus your attention, making it easier to understand complex medical images. Overall, this research has significant implications for the development of AI-powered radiology report generation systems that can provide accurate and interpretable results.
Paper Information

Categories: cs.CV cs.LG
Published Date:
arXiv ID: 2508.13068v1