Preventing Data Leakage in EEG-Based Survival Prediction: A Two-Stage Embedding and Transformer Framework

AI in healthcare
Published: arXiv:2603.25923v1
Authors

Yixin Zhou, Zhixiang Liu, Vladimir I. Zadorozhny, Jonathan Elmer

Abstract

Deep learning models have shown promise in EEG-based outcome prediction for comatose patients after cardiac arrest, but their reliability is often compromised by subtle forms of data leakage. In particular, when long EEG recordings are segmented into short windows and reused across multiple training stages, models may implicitly encode and propagate label information, leading to overly optimistic validation performance and poor generalization. In this study, we identify a previously overlooked form of data leakage in multi-stage EEG modeling pipelines. We demonstrate that violating strict patient-level separation can significantly inflate validation metrics while causing substantial degradation on independent test data. To address this issue, we propose a leakage-aware two-stage framework. In the first stage, short EEG segments are transformed into embedding representations using a convolutional neural network with an ArcFace objective. In the second stage, a Transformer-based model aggregates these embeddings to produce patient-level predictions, with strict isolation between training cohorts to eliminate leakage pathways. Experiments on a large-scale EEG dataset of post-cardiac-arrest patients show that the proposed framework achieves stable and generalizable performance under clinically relevant constraints, particularly in maintaining high sensitivity at stringent specificity thresholds. These results highlight the importance of rigorous data partitioning and provide a practical solution for reliable EEG-based outcome prediction.

Paper Summary

Problem
Cardiac arrest is a leading cause of death worldwide, and for patients who survive initial resuscitation, predicting neurological recovery remains difficult. Clinicians need reliable decision-support systems that make outcome predictions under stringent safety constraints, in particular a near-zero rate of falsely pessimistic predictions for patients who may still recover. However, current deep learning models for EEG-based outcome prediction are often compromised by subtle forms of data leakage, which inflate validation performance and lead to poor generalization on independent test data.
Key Innovation
The researchers propose a leakage-aware two-stage framework to prevent data leakage in EEG-based survival prediction. In the first stage, short EEG segments are transformed into embedding representations using a convolutional neural network with an ArcFace objective. In the second stage, a Transformer-based model aggregates these embeddings to produce patient-level predictions, with strict isolation between training cohorts to eliminate leakage pathways.
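The isolation requirement can be illustrated with a minimal sketch that assigns every patient, and therefore every segment from that patient, to exactly one cohort. This is not the authors' code: the cohort names, the 40/40/20 proportions, the hash-based assignment, and the toy patient IDs are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical cohort names and percentage shares (assumed, not from the paper).
COHORTS = (("stage1_train", 40), ("stage2_train", 40), ("test", 20))


def assign_cohort(patient_id: str) -> str:
    """Map a patient ID to exactly one cohort via a stable hash bucket in [0, 100)."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, share in COHORTS:
        upper += share
        if bucket < upper:
            return name
    raise ValueError("cohort shares must sum to 100")


def partition_segments(segments):
    """Group (patient_id, segment_index) pairs by the cohort of their patient."""
    by_cohort = defaultdict(list)
    for patient_id, segment_index in segments:
        by_cohort[assign_cohort(patient_id)].append((patient_id, segment_index))
    return dict(by_cohort)


def patients_shared(partition):
    """Return patient IDs that appear in more than one cohort (should be empty)."""
    seen = defaultdict(set)
    for cohort, items in partition.items():
        for patient_id, _ in items:
            seen[patient_id].add(cohort)
    return {p for p, cohorts in seen.items() if len(cohorts) > 1}


# Toy data: 50 hypothetical patients with 6 segments each.
segments = [(f"patient_{i:03d}", j) for i in range(50) for j in range(6)]
partition = partition_segments(segments)
print(patients_shared(partition))  # an empty set means no patient crosses cohorts
```

Because the cohort depends only on the patient ID, re-segmenting a recording or adding new windows cannot move a patient's data into another cohort. Running the `patients_shared` check before each training stage is a cheap guard against the kind of cross-stage reuse described above.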
Practical Impact
This research has practical implications for predicting neurological recovery in post-cardiac-arrest patients. The proposed framework achieves stable and generalizable performance under clinically relevant constraints, particularly in maintaining high sensitivity at stringent specificity thresholds. If this performance holds in clinical use, such a model could support more informed decisions about patient care and help reduce the risk of premature withdrawal of life-sustaining therapy or prolonged treatment of non-recoverable patients.
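To make "sensitivity at a stringent specificity threshold" concrete, the following minimal sketch computes the best sensitivity achievable while keeping specificity at or above a required level. The function name, the toy scores, and the toy labels are assumptions for illustration; this summary does not specify which outcome the paper treats as the positive class, so the sketch only assumes that label 1 is positive and that higher scores favor it.

```python
def sensitivity_at_specificity(scores, labels, min_specificity):
    """Best sensitivity over thresholds whose specificity >= min_specificity.

    scores: higher means the model favors the positive class more strongly.
    labels: 1 for positive, 0 for negative.
    A case is predicted positive when score >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Predicting nobody positive gives specificity 1.0 and sensitivity 0.0.
    best = 0.0
    # Candidate thresholds are the observed scores.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        specificity = 1.0 - fp / negatives
        if specificity >= min_specificity:
            best = max(best, tp / positives)
    return best


# Toy example: 4 positive and 4 negative cases (made-up values).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(sensitivity_at_specificity(scores, labels, 1.0))   # 0.5
print(sensitivity_at_specificity(scores, labels, 0.75))  # 1.0
```

In the toy data, allowing no false positives caps sensitivity at 0.5 because one negative case scores 0.80, while tolerating a single false positive raises it to 1.0. Reporting this metric on a held-out cohort, rather than on validation data that shares patients with training, is what keeps the number meaningful.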
Analogy / Intuitive Explanation
Imagine trying to judge a person's personality from a series of short video clips. If clips of the same person appear both when you learn what to look for and when you test yourself, you may simply recognize the person rather than learn anything general, and your test score will look better than it should. Similarly, in EEG-based outcome prediction, when segments from the same patient are reused across training stages or appear in validation data, the model can pick up patient-specific label information, which inflates validation performance. The proposed framework is like a strict "clip editor" that keeps each person's clips in exactly one group, giving a more honest estimate of how well the predictions (in this case, of the patient's neurological recovery) generalize to new patients.
Paper Information
Categories: cs.LG
arXiv ID: 2603.25923v1
