GLAM: Geometry-Guided Local Alignment for Multi-View VLP in Mammography

Computer Vision & Multimodal AI
Published: arXiv:2509.10344v1
Authors

Yuexi Du, Lihui Chen, Nicha C. Dvornek

Abstract

Mammography screening is an essential tool for early detection of breast cancer. The speed and accuracy of mammography interpretation have the potential to be improved with deep learning methods. However, the development of a foundation vision-language model (VLM) is hindered by limited data and domain differences between natural and medical images. Existing mammography VLMs, adapted from natural images, often ignore domain-specific characteristics, such as multi-view relationships in mammography. Unlike radiologists, who analyze both views together to exploit ipsilateral correspondence, current methods either treat the views as independent images or fail to properly model cross-view correspondence, losing critical geometric context and producing suboptimal predictions. We propose GLAM: Global and Local Alignment for Multi-view mammography, a geometry-guided approach to VLM pretraining. By leveraging prior knowledge about the multi-view imaging process of mammograms, our model learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. Pretrained on EMBED [14], one of the largest open mammography datasets, our model outperforms baselines across multiple datasets under different settings.

Paper Summary

Problem
Mammography screening is a crucial tool for early breast cancer detection. However, the accuracy and speed of mammography interpretation can be improved. Deep learning methods have the potential to enhance mammography analysis, but existing models often ignore the unique characteristics of mammography, such as multi-view relationships.
Key Innovation
The researchers propose a new model called GLAM (Geometry-Guided Local Alignment for Multi-View), which leverages prior knowledge about the multi-view imaging process of mammograms. GLAM learns local cross-view alignments and fine-grained local features through joint global and local, visual-visual, and visual-language contrastive learning. This approach allows the model to better understand the relationships between the two views of the breast.
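To make the contrastive-learning component concrete, below is a minimal sketch of the symmetric InfoNCE-style objective that such joint visual-visual alignment typically builds on. The function name, embedding shapes, and temperature value are illustrative assumptions, not the paper's implementation; in particular, GLAM's geometry guidance would additionally constrain which local patch pairs count as positives, which is not modeled here.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.07):
    """Symmetric InfoNCE loss for paired embeddings (a sketch, not GLAM itself).

    view_a, view_b: (N, D) arrays where row i of each array comes from the
    same breast (e.g., CC and MLO view embeddings). Row i of view_a should
    be pulled toward row i of view_b and pushed away from all other rows.
    """
    # L2-normalize so the dot product is cosine similarity
    view_a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    view_b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)

    logits = view_a @ view_b.T / temperature  # (N, N) similarity matrix
    labels = np.arange(len(view_a))           # positives lie on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(probs[labels, labels]).mean()

    # Average both matching directions: a-to-b and b-to-a
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Correctly paired embeddings yield a low loss, while shuffled (mismatched) pairs yield a high one, which is what drives the two view encoders toward a shared representation.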
Practical Impact
The GLAM model has the potential to improve the accuracy and speed of mammography interpretation. By learning the relationships between the two views of the breast, the model can better detect tumors and other abnormalities. This can lead to earlier detection and treatment of breast cancer, which can improve patient outcomes.
Analogy / Intuitive Explanation
Imagine trying to identify a specific object in a room by looking at it from different angles. From a single angle you might miss important features or details, but from multiple angles you get a more complete understanding of the object. Similarly, in mammography the two views of the breast provide complementary information, and GLAM combines them to build a better understanding of the breast tissue, supporting more accurate diagnoses.
Paper Information
Categories: cs.CV cs.AI cs.LG
arXiv ID: 2509.10344v1