Multi-task Cross-modal Learning for Chest X-ray Image Retrieval

AI in healthcare
Published: arXiv:2601.05399v1
Authors

Zhaohui Liang, Sivaramakrishnan Rajaraman, Niccolo Marini, Zhiyun Xue, Sameer Antani

Abstract

Vision-language foundation models such as CLIP and BiomedCLIP offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports from chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements in CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance on both image-to-text and text-to-image retrieval than the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, reflecting the model's enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
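To make the retrieval setting concrete, the sketch below embeds a CXR image and candidate report texts with BiomedCLIP and ranks the reports by cosine similarity. This is a minimal illustration assuming the publicly released BiomedCLIP checkpoint as packaged for open_clip; the file name "cxr.png" and the report strings are hypothetical placeholders, and this is not the authors' evaluation code.

```python
# Minimal cross-modal retrieval sketch with BiomedCLIP embeddings.
# Assumes the public open_clip packaging of the BiomedCLIP checkpoint;
# "cxr.png" and the report strings are hypothetical placeholders.
import torch
import open_clip
from PIL import Image

HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = open_clip.create_model_from_pretrained(HUB_ID)
tokenizer = open_clip.get_tokenizer(HUB_ID)
model.eval()

reports = [
    "No acute cardiopulmonary abnormality.",
    "Right lower lobe opacity concerning for pneumonia.",
]

with torch.no_grad():
    image = preprocess(Image.open("cxr.png").convert("RGB")).unsqueeze(0)
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(reports))
    # L2-normalize so that dot products are cosine similarities
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(0)  # one similarity per report

# Reports ranked from most to least relevant to the image query
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {reports[idx]}")
```

Text-to-image retrieval is the same computation in the other direction: a report embedding queried against a bank of normalized image embeddings.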

Paper Summary

Problem
The main problem this paper addresses is the limitation of general-purpose vision-language models in accurately retrieving clinically relevant radiology reports using chest X-ray (CXR) images. These models are not optimized for fine-grained medical retrieval tasks, leading to poor performance in distinguishing normal from abnormal CXR studies.
Key Innovation
The paper proposes a multi-task learning framework that fine-tunes BiomedCLIP, a biomedical vision-language foundation model, to improve CXR image-text retrieval. The framework adds a lightweight MLP projector head trained with a composite loss that combines a binary cross-entropy loss (normal vs. abnormal classification), a supervised contrastive loss (intra-class consistency), and a CLIP loss (cross-modal alignment). This lets the model learn subtle distinctions between visually similar images while capturing the specificity and complexity of medical data.
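A minimal sketch of such a composite objective is shown below, assuming PyTorch. The projector size, loss weights, and temperatures are illustrative assumptions rather than the paper's reported hyperparameters, and the supervised contrastive term follows the standard SupCon formulation (Khosla et al., 2020).

```python
# Hedged sketch of a multi-task composite loss combining BCE (normal vs.
# abnormal), supervised contrastive (intra-class consistency), and CLIP
# (cross-modal alignment) terms. Dimensions, weights, and temperatures
# are illustrative assumptions, not the paper's reported values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectorHead(nn.Module):
    """Lightweight MLP projector over backbone embeddings, plus a binary head."""
    def __init__(self, dim_in=512, dim_out=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_in, dim_in), nn.ReLU(), nn.Linear(dim_in, dim_out)
        )
        self.classifier = nn.Linear(dim_out, 1)  # normal vs. abnormal logit

    def forward(self, x):
        z = F.normalize(self.mlp(x), dim=-1)  # unit-norm projected embedding
        return z, self.classifier(z).squeeze(-1)

def clip_loss(img_z, txt_z, temperature=0.07):
    """Symmetric InfoNCE over matched image-report pairs in the batch."""
    logits = img_z @ txt_z.T / temperature
    targets = torch.arange(len(img_z), device=img_z.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def supcon_loss(z, labels, temperature=0.07):
    """Supervised contrastive loss: pull same-label embeddings together."""
    sim = z @ z.T / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)  # avoid -inf * 0 below
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    return -((log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()

def composite_loss(img_z, txt_z, logits, labels,
                   w_bce=1.0, w_con=1.0, w_clip=1.0):  # assumed equal weights
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    return (w_bce * bce
            + w_con * supcon_loss(img_z, labels)
            + w_clip * clip_loss(img_z, txt_z))
```

In training, img_z and txt_z would be the projector outputs over BiomedCLIP's image and text encodings of paired CXRs and reports, logits the binary head's output on the image branch, and labels the normal/abnormal annotations.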
Practical Impact
This research has significant practical implications for the field of medical imaging. The fine-tuned model can be applied to improve the accuracy and efficiency of radiology report generation, clinical decision support, and automated diagnosis. The model's enhanced diagnostic sensitivity can also lead to better patient outcomes and reduced healthcare costs. Furthermore, the proposed framework can be adapted to other medical imaging modalities and applications, such as computed tomography (CT) images and magnetic resonance imaging (MRI) scans.
Analogy / Intuitive Explanation
Imagine a medical librarian who needs to retrieve a specific book (a radiology report) from a vast library (a medical database) based on a picture a patron hands over (a CXR image). General-purpose vision-language models are like a librarian who doesn't speak the language of the library (medical terminology) and can't navigate its complex cataloging system (medical imaging modalities). The proposed multi-task learning framework is like a librarian trained in that language who has learned the cataloging system, and who can therefore retrieve the correct book (radiology report) with high accuracy.
Paper Information
Categories: cs.CV, cs.AI, cs.IR
arXiv ID: 2601.05399v1
