Data distribution impacts the performance and generalisability of contrastive learning-based foundation models of electrocardiograms

AI in healthcare
Preprint: arXiv:2509.10369v1
Authors

Gul Rukh Khattak, Konstantinos Patlatzoglou, Joseph Barker, Libor Pastika, Boroumand Zeidaabadi, Ahmed El-Medany, Hesham Aggour, Yixiu Liang, Antonio H. Ribeiro, Jeffrey Annis, Antonio Luiz Pinho Ribeiro, Junbo Ge, Daniel B. Kramer, Jonathan W. Waks, Evan Brittain, Nicholas Peters, Fu Siong Ng, Arunashis Sau

Abstract

Contrastive learning is a widely adopted self-supervised pretraining strategy, yet its dependence on cohort composition remains underexplored. We present the Contrasting by Patient Augmented Electrocardiograms (CAPE) foundation model and pretrain on four cohorts (n = 5,203,352) from diverse populations across three continents (North America, South America, Asia). We systematically assess how cohort demographics, health status, and population diversity influence downstream performance on prediction tasks, also evaluating two additional cohorts from another continent (Europe). We find that downstream performance depends on the distributional properties of the pretraining cohort, including demographics and health status. Moreover, while pretraining with a multi-centre, demographically diverse cohort improves in-distribution accuracy, it reduces out-of-distribution (OOD) generalisation of our contrastive approach by encoding cohort-specific artifacts. To address this, we propose the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances OOD robustness. This work provides important insights for developing clinically fair and generalisable foundation models.
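The abstract describes contrastive pretraining in which positives are drawn from the same patient. As a rough illustration of that idea (not the authors' implementation), the sketch below shows an InfoNCE-style loss in PyTorch where two ECGs from the same patient form the positive pair and the remaining samples in the batch act as negatives; the function name, tensor shapes, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def patient_contrastive_loss(z_a, z_b, temperature=0.1):
    # z_a, z_b: (batch, dim) embeddings of two different ECGs per patient.
    # Row i of z_a and row i of z_b come from the same patient (positive pair);
    # every other row in the batch serves as a negative.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature              # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: each view is asked to pick out its patient-matched counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))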

Paper Summary

Problem
The paper addresses the limited understanding of how data distribution affects the performance and generalisability of contrastive learning-based foundation models for electrocardiogram (ECG) analysis. Specifically, it examines how the composition of the pretraining data shapes the learned representations and the downstream performance of these models.
Key Innovation
The paper proposes the In-Distribution Batch (IDB) strategy, which preserves intra-cohort consistency during pretraining and enhances out-of-distribution (OOD) robustness. Rather than learning spurious, cohort-specific technical features, the model learns more robust features that retain performance when tested on external cohorts; a sketch of the idea follows below.
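If the IDB strategy amounts to composing each pretraining batch from a single cohort, so that contrastive negatives cannot be told apart by cohort-specific recording artifacts, it could be realised with a batch sampler along the lines of the sketch below. The class name, seeding, and drop-last behaviour are assumptions for illustration, not the authors' code.

import random
from collections import defaultdict
from torch.utils.data import Sampler

class InDistributionBatchSampler(Sampler):
    # Yields index batches in which every sample comes from the same cohort.
    def __init__(self, cohort_labels, batch_size, seed=0):
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.by_cohort = defaultdict(list)
        for idx, cohort in enumerate(cohort_labels):   # cohort_labels[i]: cohort ID of item i
            self.by_cohort[cohort].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.by_cohort.values():
            self.rng.shuffle(indices)
            # Keep only full, single-cohort batches; the trailing remainder is dropped.
            for start in range(0, len(indices) - self.batch_size + 1, self.batch_size):
                batches.append(indices[start:start + self.batch_size])
        self.rng.shuffle(batches)                      # interleave cohorts across training steps
        return iter(batches)

    def __len__(self):
        return sum(len(v) // self.batch_size for v in self.by_cohort.values())

Passed to a DataLoader as batch_sampler, this keeps every contrastive batch in-distribution while still cycling through all cohorts over an epoch.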
Practical Impact
This research has significant implications for the development of clinically fair and generalisable foundation models in healthcare. By understanding how data distribution affects model performance, researchers and clinicians can design pretraining protocols that improve generalisability and reduce the risk of biased or inaccurate predictions, ultimately supporting better patient outcomes and more equitable healthcare systems.
Analogy / Intuitive Explanation
Imagine learning to recognise cars from photo collections supplied by several different dealerships. If each dealership photographs its cars with its own lighting and backdrop, you may end up keying on those backdrops rather than on the cars themselves: you do well on photos from the dealerships you studied, but poorly on photos from a new one. Similarly, a model pretrained on ECGs pooled from several cohorts can latch onto cohort-specific recording artifacts rather than the underlying cardiac signal, performing well in-distribution but poorly on new populations. The IDB strategy pushes the model to learn the patterns in the ECGs themselves rather than the idiosyncrasies of the cohorts it was trained on.
Paper Information
Categories: cs.LG, cs.AI, eess.SP, q-bio.TO
arXiv ID: 2509.10369v1
