Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

AI in healthcare
Published: arXiv: 2603.26544v1
Authors

Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa

Abstract

Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata.

Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC.

Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes (SOCs). Drugs had a median of 48 AEs across 14 SOCs.

Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.
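The time-indexing step described in the Methods can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record layout, the toy product names and AEs, and the function names are hypothetical and are not taken from the paper. The rule that an AE present in a product's earliest SmPC version counts as pre-marketing is also an assumption, since the abstract does not state how the pre-/post-marketing split was operationalized.

```python
from datetime import date

# Hypothetical toy records: (product, SmPC version date, AEs listed in Section 4.8).
smpc_versions = [
    ("drug_a", date(2010, 3, 1), {"nausea", "headache"}),
    ("drug_a", date(2013, 6, 15), {"nausea", "headache", "pancreatitis"}),
    ("drug_b", date(2015, 1, 10), {"rash"}),
]

def time_index(versions):
    """Map each (product, AE) pair to the date of the earliest SmPC version listing it."""
    first_seen = {}
    for product, version_date, aes in versions:
        for ae in aes:
            key = (product, ae)
            if key not in first_seen or version_date < first_seen[key]:
                first_seen[key] = version_date
    return first_seen

def first_version_dates(versions):
    """Map each product to the date of its earliest SmPC version."""
    earliest = {}
    for product, version_date, _ in versions:
        if product not in earliest or version_date < earliest[product]:
            earliest[product] = version_date
    return earliest

def label_timing(versions):
    """Label each pair 'pre-marketing' if it appears in the product's first
    SmPC version, else 'post-marketing' (an assumed rule, not the paper's)."""
    index = time_index(versions)
    initial = first_version_dates(versions)
    return {
        pair: "pre-marketing" if inclusion == initial[pair[0]] else "post-marketing"
        for pair, inclusion in index.items()
    }

index = time_index(smpc_versions)
timing = label_timing(smpc_versions)
```

In this toy data, pancreatitis for drug_a first appears in the 2013 version, so it is indexed to that date and labelled post-marketing, while nausea and headache are indexed to the 2010 version.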

Paper Summary

Problem
The paper addresses the lack of reliable reference datasets for evaluating the performance of signal detection methods in pharmacovigilance. Current datasets are limited in scope or size, or are outdated, and they do not record when an adverse event was officially recognized by regulators. This makes it difficult to compare methods and to develop more effective ones. The issue is particularly significant in the European Union (EU), where regulatory agencies face a time- and resource-demanding procedure to validate statistical alerts.
Key Innovation
The key innovation of this paper is the development of a time-indexed reference dataset for the EU, incorporating the timing of adverse event (AE) inclusion in product labels along with regulatory metadata. This dataset is designed to capture the timing of AE recognition by regulatory authorities, enabling the evaluation of signal detection methods' ability to detect new safety signals before regulatory confirmation.
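As a rough illustration of how such a dataset could be used, the sketch below scores a signal detection method only on associations it flagged before their label inclusion date. The reference entries, the flag dates, and the `early_detection` function are all hypothetical, and the metric shown (share of reference pairs flagged before confirmation, plus lead time in days) is one plausible choice rather than the evaluation protocol of the paper.

```python
from datetime import date

# Hypothetical time-indexed reference: (drug, AE) -> date the AE entered the SmPC.
reference = {
    ("drug_a", "pancreatitis"): date(2013, 6, 15),
    ("drug_a", "hepatitis"): date(2016, 2, 1),
    ("drug_b", "rash"): date(2015, 1, 10),
}

# Hypothetical output of a signal detection method: (drug, AE) -> date first flagged.
flagged = {
    ("drug_a", "pancreatitis"): date(2012, 1, 20),  # flagged before label inclusion
    ("drug_a", "hepatitis"): date(2017, 5, 3),      # flagged only after inclusion
}

def early_detection(reference, flagged):
    """Return the share of reference pairs flagged strictly before label
    inclusion, and the lead time in days for each such pair."""
    lead_days = {}
    for pair, label_date in reference.items():
        flag_date = flagged.get(pair)
        if flag_date is not None and flag_date < label_date:
            lead_days[pair] = (label_date - flag_date).days
    sensitivity = len(lead_days) / len(reference) if reference else 0.0
    return sensitivity, lead_days

sensitivity, lead_days = early_detection(reference, flagged)
```

Here only the pancreatitis association counts as an early detection: hepatitis was flagged after it was already on the label, and rash was never flagged. Without inclusion dates, the hepatitis flag would be indistinguishable from a genuine early detection.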
Practical Impact
By providing a time-indexed reference dataset, the work gives researchers and regulators a common benchmark for comparing signal detection methods, including how early each method detects AEs that were later added to product labels. Better-evaluated methods could in turn reduce the number of false-positive statistical alerts, supporting patient safety and more efficient use of resources.
Analogy / Intuitive Explanation
Imagine searching for a needle in a haystack that is constantly being rearranged. Signal detection in pharmacovigilance is similar: the underlying data keep changing, which makes new safety signals hard to identify. The time-indexed reference dataset developed in this paper works like a dated map that records where each needle was eventually found and when, so a detection method can be judged on whether it would have found the needle before its location was officially confirmed.
Paper Information
Categories: cs.CL, q-bio.QM
arXiv ID: 2603.26544v1