Benchmarking signal detection algorithms: the reference set problem

If you want to know whether a signal detection algorithm works, you need something to test it against. That sounds obvious, but in pharmacovigilance it turns out to be one of the hardest methodological problems in the field.

The core challenge is this: to evaluate whether a disproportionality method correctly identifies safety signals, you need a set of drug–event pairs where the truth is already known. Some pairs should be positive controls — drugs that genuinely cause a particular adverse event. Others should be negative controls — drugs that do not cause the event. You then run your algorithm, compare its output to the reference set, and measure performance.1

The problem is that in drug safety, “truth” is rarely as clean as that framing implies.

What a reference set is and why it matters

A pharmacovigilance reference set is a curated list of drug–adverse event pairs classified as either positive or negative associations. The most widely used example is the reference set developed by Ryan and colleagues as part of the Observational Medical Outcomes Partnership (OMOP). It contains 165 positive controls and 234 negative controls across four serious outcomes: acute liver injury, acute kidney injury, acute myocardial infarction, and upper gastrointestinal bleeding. The drugs span several therapeutic classes including NSAIDs, antibiotics, antidepressants, antihypertensives, antiepileptics, and glucose-lowering agents.1

A second widely used set is the EU-ADR reference standard, which covers 10 adverse events with 44 positive and 50 negative control pairs, and was developed as part of the European project Exploring and Understanding Adverse Drug Reactions.2

These reference sets serve as the shared benchmarks against which different signal detection methods are compared. Without them, performance claims are largely unanchored. With them, researchers can compute metrics like AUC, sensitivity, specificity, and positive predictive value, and can compare methods head to head.
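
To make this concrete, here is a minimal sketch of how a benchmarking run is typically scored, assuming the reference set has already been reduced to a simple lookup of drug–event pairs and the algorithm's flagged signals are available as a set. The pairs and counts below are invented for illustration; a real evaluation would use the full OMOP or EU-ADR sets and the output of an actual disproportionality analysis.

```python
# Minimal sketch: scoring a signal detection run against a reference set.
# All drug-event pairs and the flagged set below are invented examples.

# Each entry: (drug, adverse_event) -> True for a positive control, False for a negative.
reference_set = {
    ("drug_a", "acute liver injury"): True,
    ("drug_b", "acute liver injury"): False,
    ("drug_c", "acute kidney injury"): True,
    ("drug_d", "acute kidney injury"): False,
}

# Pairs the (hypothetical) algorithm flagged as signals.
flagged = {
    ("drug_a", "acute liver injury"),
    ("drug_b", "acute liver injury"),
}

tp = sum(1 for pair, is_pos in reference_set.items() if is_pos and pair in flagged)
fp = sum(1 for pair, is_pos in reference_set.items() if not is_pos and pair in flagged)
fn = sum(1 for pair, is_pos in reference_set.items() if is_pos and pair not in flagged)
tn = sum(1 for pair, is_pos in reference_set.items() if not is_pos and pair not in flagged)

sensitivity = tp / (tp + fn)   # share of positive controls detected
specificity = tn / (tn + fp)   # share of negative controls left unflagged
ppv = tp / (tp + fp)           # share of flagged pairs that are positive controls

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} ppv={ppv:.2f}")
```

AUC works the same way, except that the algorithm's continuous score for each pair (for example the lower confidence limit of a disproportionality statistic) is ranked against the reference labels rather than dichotomised at a single threshold.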

The positive control problem

Defining positive controls requires deciding what counts as sufficient evidence that a drug causes an adverse event. In the OMOP reference set, positive controls were identified through systematic literature review and natural language processing of structured product labeling. Some were supported by randomised clinical trial evidence. Others were based only on published case reports or case series cited in standard drug-injury compendia.1

That is a reasonable approach, but it introduces heterogeneity. A drug–event pair supported by randomised trial evidence is a much stronger positive control than one supported by a handful of case reports. Lumping them together means the reference set contains positive controls of varying strength, and the algorithm’s ability to detect well-established associations may be quite different from its ability to detect weaker or more contested ones.

There is also a temporal circularity problem. Many positive controls are derived from drug labels, which were themselves informed by spontaneous reporting data. If a signal detection algorithm is then evaluated on a spontaneous reporting database using those same label-derived controls, the evaluation is partially circular: the algorithm is being tested on associations that the database helped establish in the first place.3

The negative control problem is harder

Positive controls are difficult to define. Negative controls are even harder.

A negative control is a drug–event pair where the drug is known not to cause the event. But in pharmacovigilance, absence of evidence is not the same as evidence of absence. Just because a particular adverse reaction has not been reported or studied in connection with a particular drug does not mean the association does not exist. It may mean the association is rare, has not yet been investigated, or occurs only in populations that have not been studied.

This is not a hypothetical concern. Hauben and colleagues examined the OMOP negative controls and found evidence of misclassification in a meaningful proportion of them. By searching the medical literature for associations between drugs and events classified as negative controls by OMOP, they found that approximately 17% of negative controls had published evidence suggesting a possible association.4

That level of misclassification has direct consequences. If negative controls are actually weak positives, then an algorithm that correctly flags them will appear to have poor specificity when measured against the reference set. Conversely, an algorithm that misses them will appear to perform better than it actually does.
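
To see the size of the effect, here is a back-of-the-envelope sketch using the figures above: 234 negative controls of which roughly 17% are actually weak positives. The algorithm's characteristics in the sketch (90% specificity against genuinely negative pairs, 60% detection of the mislabelled ones) are assumptions chosen purely to show the direction of the bias, not estimates from any study.

```python
# Hypothetical illustration: how mislabelled negative controls distort measured specificity.
# The algorithm's performance figures below are assumptions, not empirical estimates.

n_negative_controls = 234        # size of the OMOP negative control set
misclassified_fraction = 0.17    # share with published evidence of a possible association

n_truly_negative = round(n_negative_controls * (1 - misclassified_fraction))  # ~194
n_weak_positive = n_negative_controls - n_truly_negative                      # ~40

true_specificity = 0.90          # assumed: 90% of genuine negatives left unflagged
weak_positive_detection = 0.60   # assumed: 60% of the mislabelled pairs correctly flagged

# Against the reference set, every flagged "negative control" counts as a false positive,
# even when the pair is in fact a weak positive the algorithm was right to flag.
unflagged = (n_truly_negative * true_specificity
             + n_weak_positive * (1 - weak_positive_detection))
measured_specificity = unflagged / n_negative_controls

print(f"specificity against genuine negatives: {true_specificity:.2f}")
print(f"measured specificity against the reference set: {measured_specificity:.2f}")
```

Under those assumptions the measured specificity drops from 0.90 to roughly 0.81, and the drop gets larger the better the algorithm is at catching exactly the pairs the reference set mislabels.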

Design choices shape performance estimates

Beyond the classification of individual pairs, the overall design of the reference set — its size, composition, and inclusion criteria — can substantially affect measured performance.

A study by Candore and colleagues investigated this systematically for drug–drug interaction signal detection. They generated reference sets of varying sizes and compositions, applying different design criteria such as event background prevalence, theoretical evidence strength, and restriction to designated medical events. They found that some criteria had a large impact on measured performance, with different signal detection algorithms being affected to different degrees by different criteria.5
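
One way to picture this is to think of each design criterion as a filter over a pool of candidate control pairs; which filters are switched on determines which benchmark you end up measuring against. The sketch below is an invented illustration of that idea, not the actual criteria or data structures used by Candore and colleagues.

```python
# Illustrative only: assembling a reference set from candidate pairs under different
# design criteria. Field names and thresholds are assumptions for the example.

candidates = [
    {"drug": "drug_a", "event": "event_x", "background_prevalence": 0.002,
     "evidence": "trial", "designated_medical_event": True},
    {"drug": "drug_b", "event": "event_y", "background_prevalence": 0.050,
     "evidence": "case_report", "designated_medical_event": False},
    # ... more candidate pairs
]

def build_reference_set(candidates, max_prevalence=None, required_evidence=None,
                        dme_only=False):
    selected = []
    for pair in candidates:
        if max_prevalence is not None and pair["background_prevalence"] > max_prevalence:
            continue
        if required_evidence is not None and pair["evidence"] not in required_evidence:
            continue
        if dme_only and not pair["designated_medical_event"]:
            continue
        selected.append(pair)
    return selected

# Two defensible designs, two different benchmarks, two different performance estimates.
strict_set = build_reference_set(candidates, max_prevalence=0.01,
                                 required_evidence={"trial"}, dme_only=True)
broad_set = build_reference_set(candidates)
print(len(strict_set), len(broad_set))
```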

The practical implication is that two studies comparing the same signal detection methods but using different reference sets may reach different conclusions about which method is better. That is not necessarily because one study is wrong. It is because the benchmark itself is a variable.

This is a well-known problem in machine learning, where benchmark design can inflate or deflate apparent performance in ways that do not reflect real-world utility. In pharmacovigilance, the stakes are higher, because performance estimates for signal detection methods inform decisions about which tools to deploy in actual regulatory and industry safety workflows.

Variability across model specifications is larger than expected

A study using the OMOP reference set to evaluate multiple model specifications of two widely used disproportionality approaches — the reporting odds ratio and the Bayesian confidence propagation neural network — found considerable variability in results. Depending on which model specification was used, both positive and negative signals could be generated for 60% of all drug–event pairs.6

That finding is worth pausing on. It means that for the majority of pairs in the reference set, whether a signal was detected or not depended on analytic choices rather than on the underlying data alone. The authors argued that this variability should be leveraged rather than hidden, and that presenting a range of sensitivity analyses is more informative than reporting a single result.
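
To make the mechanics tangible, here is a minimal sketch of a reporting odds ratio computed from a 2×2 contingency table, evaluated under two plausible signal criteria: one that only asks for a point estimate above 1, and one that also requires the lower 95% confidence limit to exceed 1 and at least three reports. The counts are invented; the point is only that the same data can fall on different sides of the threshold depending on the specification.

```python
import math

def ror_with_ci(a, b, c, d):
    """Reporting odds ratio and 95% CI from a 2x2 table.
    a: reports with drug and event, b: drug without the event,
    c: event without the drug, d: neither."""
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return ror, math.exp(math.log(ror) - 1.96 * se), math.exp(math.log(ror) + 1.96 * se)

# Invented counts for a single drug-event pair.
a, b, c, d = 4, 1200, 900, 400_000
ror, lower, upper = ror_with_ci(a, b, c, d)

signal_spec_1 = ror > 1                  # point estimate only
signal_spec_2 = lower > 1 and a >= 3     # lower confidence limit plus a report-count floor

print(f"ROR={ror:.2f} (95% CI {lower:.2f}-{upper:.2f})")
print("spec 1:", "signal" if signal_spec_1 else "no signal")
print("spec 2:", "signal" if signal_spec_2 else "no signal")
```

With these counts the first specification calls a signal (ROR about 1.48) and the second does not (lower limit about 0.55), which is exactly the kind of specification-dependent flip the study describes at scale.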

For benchmarking, this has a sobering implication. A single performance number for a signal detection algorithm, derived from a single reference set using a single model specification, is a thin summary of a much messier reality.

What would a better benchmarking practice look like?

Several improvements have been proposed, though none fully resolves the problem.

First, reference sets should be larger and more diverse. Covering only four outcomes, as in the OMOP set, limits generalisability. Efforts to build broader reference sets, incorporating more outcomes, more drug classes, and more evidence sources, would provide a more representative benchmark.7

Second, the strength of evidence for each positive and negative control should be explicitly graded. Not all positive controls are equally well established, and not all negative controls can be asserted with equal confidence. Treating them as binary categories loses important information.

Third, reference sets should be population-specific. The OMOP and EU-ADR sets were designed for general adult populations. Paediatric pharmacovigilance requires its own reference sets with drugs and events relevant to children, as the work by de Groot and colleagues has argued.8

Fourth, benchmarking studies should report performance across multiple model specifications and sensitivity analyses, not just a single point estimate; a minimal sketch of what that might look like follows this list. If the result changes meaningfully with different analytic choices, that instability is itself informative and should be reported.

And fifth, the pharmacovigilance community should treat reference sets as living resources that need regular updating. Drug knowledge evolves. Associations that were negative controls a decade ago may now be recognised as positive. Reference sets that are not maintained become progressively less reliable.
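
On the fourth point, reporting across specifications does not need heavy machinery: run the same statistic over a small grid of analytic choices and publish the spread. The sketch below is a self-contained illustration using the reporting odds ratio; the grid of thresholds is an invented example, not a recommended set of specifications.

```python
import math
from itertools import product

def ror_lower_bound(a, b, c, d):
    # Lower 95% confidence limit of the reporting odds ratio from a 2x2 table.
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(math.log(ror) - 1.96 * se)

# Invented counts for a single drug-event pair.
a, b, c, d = 6, 1500, 1100, 500_000

# A small grid of analytic choices: minimum report count and the threshold applied
# to the lower confidence limit. Both axes are illustrative assumptions.
min_report_options = [1, 3, 5]
lower_bound_thresholds = [0.0, 1.0]

results = {}
for min_reports, threshold in product(min_report_options, lower_bound_thresholds):
    results[(min_reports, threshold)] = (
        a >= min_reports and ror_lower_bound(a, b, c, d) > threshold
    )

flagged = sum(results.values())
print(f"signal in {flagged} of {len(results)} specifications")
for spec, is_signal in sorted(results.items()):
    print(spec, "signal" if is_signal else "no signal")
```

Reporting "signal in 3 of 6 specifications" alongside the grid is a more honest summary than any single cell of it.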

The honest conclusion

Benchmarking signal detection algorithms is essential. Without it, we have no way to compare methods, calibrate expectations, or identify failure modes. But the reference sets we use are imperfect in ways that directly affect the conclusions we draw.

That does not mean benchmarking is useless. It means that performance numbers should always be interpreted in the context of the reference set that produced them. What outcomes does it cover? How were controls classified? What evidence thresholds were applied? How old is it?

Those are not footnote questions. They are central to knowing what a performance number actually means.

  1. Ryan P, Schuemie MJ, Welebob E, Duke J, Valentine S, Hartzema AG. Defining a reference set to support methodological research in drug safety. Drug Safety. 2013;36(Suppl 1):S33–S47.

  2. Coloma PM, Avillach P, Salber R, et al. A reference standard for evaluation of methods for drug safety signal detection using electronic healthcare record databases. Drug Safety. 2013;36(1):13–23. 

  3. Harpaz R, DuMouchel W, Shah NH, Madigan D, Ryan P, Friedman C. Novel data-mining methodologies for adverse drug event discovery and analysis. Clinical Pharmacology & Therapeutics. 2012;91(6):1010–1021. 

  4. Hauben M, Hung E, Engberg S. Evidence of misclassification of drug–event associations classified as gold standard ‘negative controls’ by the Observational Medical Outcomes Partnership (OMOP). Drug Safety. 2016;39(5):421–432. 

  5. Candore G, Juhlin K, Manlik K, et al. Exploring the impact of design criteria for reference sets on performance evaluation of signal detection algorithms: the case of drug–drug interactions. Pharmacoepidemiology and Drug Safety. 2024;33(3):e5758. 

  6. Kreimeyer K, Maro JC, Engberg S, et al. Leveraging the variability of pharmacovigilance disproportionality analyses to improve signal detection performances. Frontiers in Pharmacology. 2021;12:668765. 

  7. Seo H, Kim E, Lee S, et al. A data-driven reference standard for adverse drug reaction (RS-ADR) signal assessment: development and validation. JMIR Medical Informatics. 2022;10(10):e40164. 

  8. de Groot MCH, van Puijenbroek EP, van Eijk ME, et al. Pediatric drug safety signal detection: a new drug–event reference set for performance testing of data-mining methods and systems. Drug Safety. 2015;38(2):207–217. 
