Why retrieval-augmented generation matters more than model size in pharmacovigilance
There is a recurring assumption in conversations about AI and drug safety that goes something like this: if we just use a bigger model, the outputs will be more reliable. More parameters, better performance, safer answers.
That assumption deserves pushback, especially in pharmacovigilance, where the consequences of a fluent but wrong answer can ripple into signal detection, case assessment, and regulatory decisions.
The more interesting question is not how large the model is. It is whether the model’s output can be traced back to a specific, authoritative source. And that is where retrieval-augmented generation (RAG) starts to matter more than raw model scale.
The hallucination problem is not abstract in drug safety
Large language models generate text by predicting likely continuations of input. That mechanism is powerful but also inherently risky: a model can produce a sentence that sounds completely convincing and is entirely fabricated. In low-stakes domains, hallucination is an annoyance. In pharmacovigilance, it is a safety issue.1
Consider a scenario where an LLM is used to help extract adverse events from a drug label or a case narrative. If the model hallucinates an adverse event that is not actually in the source document, that error could enter a safety database, distort a disproportionality analysis, or mislead a reviewer. If it mishandles a negation and reports an event the narrative actually rules out, or invents a causal connection that is not there, the downstream consequences are real.1,2
This is not a theoretical concern. Published work has shown that LLM hallucination is especially problematic in low-probability, high-stakes scenarios, which describes much of what pharmacovigilance deals with on a daily basis.3
What RAG actually does
RAG addresses this by changing the information architecture around the model. Instead of asking the LLM to answer from its internal parameters alone, the system first retrieves relevant documents from an external knowledge base and then asks the model to generate a response grounded in that retrieved context.1
In pharmacovigilance terms, that means the model is not guessing what a label says. It is reading the label, or the case narrative, or the regulatory guidance, and then constructing its output from that source material. The result is not a replacement for human review, but it is a much more defensible starting point than a model improvising from training data.
This is also why retrieval-grounded approaches align more naturally with the expectations of regulated environments. If a model points to the exact section of a label from which it derived its output, that traceability is something a reviewer can check. If the model simply produces an answer from its parametric memory, there is no source to verify against.4
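To make that flow concrete, here is a minimal sketch of a retrieval-grounded query over label sections. The `embed` and `generate` callables are placeholders for whatever embedding model and LLM a team actually uses, and the section IDs are illustrative; nothing here assumes a specific vendor API.

```python
# Minimal sketch of retrieval-grounded generation with traceable sources.
# `embed` and `generate` stand in for whatever embedding model and LLM the
# system actually uses; nothing here assumes a specific vendor API.
from dataclasses import dataclass

import numpy as np


@dataclass
class LabelSection:
    section_id: str      # e.g. "ADVERSE REACTIONS 6.1" (illustrative)
    text: str
    vector: np.ndarray   # embedding of `text`


def retrieve(query_vec: np.ndarray, sections: list[LabelSection], k: int = 3) -> list[LabelSection]:
    """Return the k sections most similar to the query by cosine similarity."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(sections, key=lambda s: cosine(query_vec, s.vector), reverse=True)
    return ranked[:k]


def answer_with_sources(question: str, sections: list[LabelSection], embed, generate) -> dict:
    """Ground the LLM in retrieved label text and keep section IDs for review."""
    hits = retrieve(embed(question), sections)
    context = "\n\n".join(f"[{s.section_id}]\n{s.text}" for s in hits)
    prompt = (
        "Answer using only the excerpts below and cite the section ID "
        f"for every claim.\n\n{context}\n\nQuestion: {question}"
    )
    return {
        "answer": generate(prompt),               # grounded in retrieved context
        "sources": [s.section_id for s in hits],  # what a reviewer checks against
    }
```

The retrieval details will vary from system to system; what matters architecturally is that the output is never separated from the sections it was generated from.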
Why model size alone does not solve the problem
Larger models tend to be more capable in terms of fluency, reasoning, and general knowledge. But capability and reliability are not the same thing. A more capable model can produce a more convincing hallucination. It can sound more authoritative while being just as wrong.
There is also a practical dimension. Smaller fine-tuned models have been shown to outperform larger general-purpose models on specific drug safety tasks. One recent study found that a fine-tuned Phi-3.5 model achieved higher sensitivity and accuracy for drug-drug interaction prediction than larger proprietary alternatives across multiple validation datasets.5 That result is consistent with a broader pattern: in domain-specific applications, task adaptation often matters more than parameter count.
For pharmacovigilance, this has real implications. Smaller models are cheaper to deploy, easier to audit, more compatible with privacy-sensitive environments, and often better suited to the specific linguistic patterns of safety data. A well-adapted small model with good retrieval infrastructure may outperform a massive general-purpose model operating without grounding.
RAG fits the way pharmacovigilance should work
One of the most important features of RAG in this context is that it supports the human-in-the-loop model that drug safety demands. The best use case is not a model that gives a final answer. It is a model that retrieves relevant evidence, organizes it, and presents it for expert review.4
The MALADE system, for example, used multi-agent LLM orchestration with RAG to extract adverse drug event information from FDA drug labels. The system did not just produce binary labels. It generated structured outputs with justifications, probability scores, and evidence trails, achieving an AUC of 0.90 against the OMOP reference set.6
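As a rough illustration of what such a structured output can look like (this is not MALADE's actual schema, just a sketch of the general shape), consider:

```python
# Illustrative shape of a structured, reviewable assessment (not MALADE's
# actual schema): a label plus the justification, confidence, and evidence
# a reviewer would need to audit it, rather than a bare yes/no.
from dataclasses import dataclass, field


@dataclass
class ADEAssessment:
    drug: str
    event: str            # e.g. a MedDRA preferred term
    category: str         # "associated" / "not associated" / "uncertain"
    confidence: float     # 0.0-1.0
    justification: str    # why the system reached this conclusion
    evidence: list[str] = field(default_factory=list)  # quoted source text with section refs


# Hypothetical example of what a populated assessment might look like.
assessment = ADEAssessment(
    drug="drug X",
    event="acute kidney injury",
    category="associated",
    confidence=0.87,
    justification="Listed in the postmarketing experience section of the label.",
    evidence=["[ADVERSE REACTIONS 6.2] Acute kidney injury has been reported ..."],
)
```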
That kind of architecture is interesting not because of any single performance number, but because it shows what becomes possible when retrieval is treated as a first-class design choice rather than an afterthought.
The real frontier is grounding, not scaling
RAG is not a magic solution. Retrieval quality depends on how well documents are indexed, chunked, and matched to queries. If the retrieval step returns irrelevant or incomplete context, the generation step will still produce poor outputs. And even with good retrieval, the model can still misinterpret or selectively attend to the retrieved text.
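As a simple illustration of how much the chunking decision alone matters, here is a sketch that splits a label on its section headings rather than fixed-size windows; the all-uppercase-heading heuristic is an assumption about the label format, not a general rule.

```python
# Sketch of one chunking decision retrieval quality hinges on: splitting a
# drug label on its section headings (crudely assumed here to be all-uppercase
# lines) so each retrieved chunk is a coherent section rather than a fragment
# cut mid-sentence by a fixed-size window.
def chunk_by_section(label_text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, one per label section."""
    chunks: list[tuple[str, str]] = []
    heading, body = "PREAMBLE", []
    for line in label_text.splitlines():
        if line.strip() and line.isupper():   # crude heading heuristic
            if body:
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    if body:
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```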
But those are engineering problems with known mitigation strategies. The harder challenge, and the one that model scaling alone does not address, is building systems whose outputs can be traced, verified, and defended in a regulatory context.
The guardrails paper makes this point clearly: RAG and guardrail frameworks should be used together, because they address complementary failure modes. RAG reduces hallucination by grounding outputs in real sources. Guardrails catch errors that slip through, such as incorrect drug names, anomalous documents, or uncertain outputs that need to be flagged rather than presented as confident answers.1
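A minimal sketch of what that layering can look like follows; the drug vocabulary and confidence threshold are illustrative assumptions standing in for whatever a real deployment would use.

```python
# Sketch of two guardrail-style checks layered on top of a RAG pipeline:
# verify drug names against a known vocabulary and route low-confidence
# answers to a reviewer. The vocabulary and threshold are illustrative.
KNOWN_DRUGS = {"metformin", "lisinopril", "warfarin"}   # e.g. loaded from a drug dictionary
CONFIDENCE_THRESHOLD = 0.8


def apply_guardrails(output: dict) -> dict:
    """Attach flags rather than silently accepting the model's answer."""
    flags = []
    for drug in output.get("drugs_mentioned", []):
        if drug.lower() not in KNOWN_DRUGS:
            flags.append(f"unrecognized drug name: {drug}")
    if output.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        flags.append("low confidence: route to human review")
    return {**output, "flags": flags, "requires_review": bool(flags)}
```

The specific checks will differ by deployment; what matters is that failures surface as flags for a reviewer rather than disappearing into a fluent answer.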
What this means in practice
For teams building LLM-based tools for drug safety, the practical takeaway is straightforward. Investing in retrieval infrastructure, source document management, and output traceability is likely to yield more reliable systems than simply upgrading to the next larger model.
That means investing in things that do not always look exciting: clean document pipelines, well-structured knowledge bases, citation mechanisms, confidence indicators, and reviewer interfaces that make it easy to verify what the model actually read before it generated its answer.
Those are infrastructure problems, not model architecture problems. And they are the problems most likely to determine whether LLMs become genuinely useful in pharmacovigilance or remain impressive but untrustworthy demos.
1. Hakim JB, Painter JL, Ramcharran D, et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Scientific Reports. 2025;15:27886. doi:10.1038/s41598-025-09138-0
2. Gisladottir U, Zietz M, Kivelson S, et al. Leveraging large language models in extracting drug safety information from prescription drug labels. Drug Safety. 2025. doi:10.1007/s40264-025-01594-x
3. Kovalerchuk S, Sordo M, Ostojic D, et al. A medically grounded LLM agent–based tool to detect patient safety events in medical records. medRxiv. 2025. doi:10.64898/2025.12.16.25342438
4. Wu L, Qu Y, Xu J, et al. A framework enabling LLMs into regulatory environment for transparency and trustworthiness and its application to drug labeling document. Regulatory Toxicology and Pharmacology. 2024;148:105597. doi:10.1016/j.yrtph.2024.105597
5. De Vito G, Ferrucci F, Angelakis A. Enhancing medication safety with LLMs. Ital-IA 2025: 5th National Conference on Artificial Intelligence. 2025.
6. Choi J, Palumbo N, Chalasani P, et al. MALADE: Orchestration of LLM-powered agents with retrieval augmented generation for pharmacovigilance. arXiv. 2024. arXiv:2408.01869.