Multi-agent LLM architectures for drug safety workflows

There is a pattern that keeps repeating in early LLM experiments for pharmacovigilance: someone builds a single prompt that tries to do everything at once. Extract the drug name. Identify the adverse event. Determine negation. Assign a MedDRA code. Assess seriousness. Summarize the narrative. All in one pass.

It sometimes works. On a clean, well-structured case narrative with a single drug and a clearly stated adverse event, a single-prompt approach can produce a reasonable output. But the moment the input gets messy — multi-drug regimens, ambiguous temporality, partially negated events, conflicting information across sections — the monolithic prompt starts to degrade in ways that are difficult to predict, diagnose, or fix.

That is not surprising. It is essentially what happens in any engineering system when you ask a single component to handle too many concerns at once. And it is why multi-agent architectures are starting to look like the more natural fit for drug safety workflows.

Figure 1. A multi-agent architecture for pharmacovigilance, where specialized LLM agents handle distinct sub-tasks with validation checkpoints between them.

What multi-agent means in practice

A multi-agent LLM system is not a committee of chatbots debating each other. It is a software architecture in which distinct LLM-powered components are orchestrated to handle specific sub-tasks, passing structured outputs between each other in a defined sequence. Each agent has a narrower scope, a more focused prompt, and, ideally, a more testable failure mode. [1]

In the context of pharmacovigilance case processing, that might look something like this. A first agent reads a source document and extracts candidate entities: drug names, adverse events, dates, dosages, outcomes. A second agent validates those extractions against the original text, checking for negation, speculation, and attribution errors. A third agent maps the validated terms to a controlled vocabulary such as MedDRA. A fourth agent assembles the structured output and flags uncertainties for human review. [2]
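The four-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Finding` type, the agent function names, the tiny lexicons, and the `PT-0001`-style codes are all hypothetical placeholders (they are not real MedDRA codes), and simple keyword rules stand in for what would be LLM calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Finding:
    kind: str                     # "drug" or "event"
    text: str                     # surface form found in the narrative
    negated: bool = False         # set by the validation stage
    code: Optional[str] = None    # set by the coding stage
    flags: list = field(default_factory=list)

# Hypothetical lexicons standing in for model calls; codes are placeholders.
DRUG_TERMS = ("warfarin",)
EVENT_TERMS = ("nausea", "rash", "dizziness")
PLACEHOLDER_CODES = {"nausea": "PT-0001", "rash": "PT-0002"}
NEGATION_CUES = ("no ", "denies ", "without ")

def extraction_agent(narrative):
    """Stage 1: propose candidate drugs and events."""
    lowered = narrative.lower()
    found = [Finding("drug", d) for d in DRUG_TERMS if d in lowered]
    found += [Finding("event", e) for e in EVENT_TERMS if e in lowered]
    return found

def validation_agent(narrative, findings):
    """Stage 2: check each event against the source text for negation."""
    lowered = narrative.lower()
    for f in findings:
        if f.kind == "event":
            start = lowered.find(f.text)
            window = lowered[max(0, start - 15):start]
            f.negated = any(cue in window for cue in NEGATION_CUES)
    return findings

def coding_agent(findings):
    """Stage 3: map validated, non-negated events to a vocabulary code."""
    for f in findings:
        if f.kind == "event" and not f.negated:
            f.code = PLACEHOLDER_CODES.get(f.text)
    return findings

def review_agent(findings):
    """Stage 4: flag anything that could not be coded for human review."""
    for f in findings:
        if f.kind == "event" and not f.negated and f.code is None:
            f.flags.append("needs_human_review")
    return findings

def run_pipeline(narrative):
    findings = extraction_agent(narrative)
    findings = validation_agent(narrative, findings)
    findings = coding_agent(findings)
    return review_agent(findings)

result = run_pipeline(
    "Patient on warfarin developed nausea and dizziness. No rash observed."
)
for finding in result:
    print(finding)
```

The point of the sketch is the shape of the interfaces: each stage takes and returns the same structured record, so any one stage can be swapped for a different model or a different technique without touching the others.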

Each of these steps is a distinct NLP task with distinct failure modes. Extraction agents need to handle messy input. Validation agents need to catch negation and hedge language. Coding agents need to know the difference between a preferred term and a lowest-level term. Review agents need to know when confidence is too low to proceed without human input.

Trying to handle all of these in a single prompt is like writing one function that does parsing, validation, business logic, and error handling. It can technically work, but it does not scale, and it is nearly impossible to debug.

MALADE and what it showed

The most detailed published example of a multi-agent approach in pharmacovigilance is the MALADE system, which used multiple LLM agents orchestrated with retrieval-augmented generation to extract adverse drug event information from FDA drug labels. [1]

What made MALADE interesting was not just the headline performance (an AUC of 0.90 against the OMOP reference set) but the architecture itself. The system divided the task into stages: label retrieval, extraction, structured output generation, and evidence assembly. Each stage had its own agent, its own prompt, and its own set of expected outputs. The result was a system that could provide not only binary classifications but also probability scores, justifications, and evidence trails. [1]

That traceability is critical. In a single-prompt system, if the model produces a wrong answer, you are left staring at a black box wondering what went wrong. In a multi-agent system, you can inspect each stage independently. Did the extraction agent miss the term? Did the validation agent incorrectly mark it as negated? Did the coding agent map it to the wrong MedDRA term? Each failure mode becomes isolable, which means each one becomes fixable.
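One way to make that kind of stage-by-stage inspection concrete is to snapshot the output of every stage as the pipeline runs. The sketch below assumes a pipeline expressed as a list of named functions; `run_with_trace`, `first_stage_where`, and the three toy stages are hypothetical names for illustration, and the validation stage contains a deliberately simulated bug.

```python
from copy import deepcopy

def run_with_trace(stages, payload):
    """Run stages in order, keeping an independent snapshot after each one."""
    trace = []
    for name, stage in stages:
        payload = stage(deepcopy(payload))
        trace.append((name, deepcopy(payload)))
    return payload, trace

def first_stage_where(trace, predicate):
    """Return the name of the first stage whose snapshot satisfies predicate."""
    for name, snapshot in trace:
        if predicate(snapshot):
            return name
    return None

# Toy stages standing in for LLM-backed agents.
def extract(case):
    case["events"] = [{"term": "rash", "negated": False}]
    return case

def validate(case):
    # Simulated bug: every event is marked as negated.
    for event in case["events"]:
        event["negated"] = True
    return case

def code(case):
    # Negated events are dropped before coding.
    case["events"] = [e for e in case["events"] if not e["negated"]]
    return case

final, trace = run_with_trace(
    [("extract", extract), ("validate", validate), ("code", code)],
    {"narrative": "Patient developed a rash."},
)

# The final output is missing "rash". Ask the trace where an affirmed
# "rash" first stopped being present.
def rash_lost(snapshot):
    return not any(
        e["term"] == "rash" and not e["negated"] for e in snapshot["events"]
    )

culprit = first_stage_where(trace, rash_lost)
print(culprit)
```

Here the trace localizes the fault to the validation stage rather than to extraction or coding, which is the isolation property the paragraph above describes.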

Why orchestration matters more than model choice

One of the underappreciated findings in recent pharmacovigilance LLM work is that system design often matters more than model selection. A smaller model with good orchestration can outperform a larger model operating as a monolith. [3]

This has practical implications. If the extraction agent needs to be fast and cheap because it runs on every incoming document, you can use a smaller, fine-tuned model for that step. If the validation agent needs to handle nuanced linguistic reasoning, you might allocate a more capable model there. If the coding agent needs to operate deterministically against a fixed vocabulary, you might not need a generative model at all — a retrieval-based lookup with fuzzy matching might suffice.
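As an illustration of the last option, a coding step can be built with no generative model at all. The sketch below uses `difflib.get_close_matches` from the Python standard library for fuzzy matching; the `VOCAB` table, its `X001`-style codes, the `lookup_term` helper, and the 0.8 cutoff are all assumptions made for the example, not real MedDRA content or a recommended threshold.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary; the codes are placeholders, not MedDRA codes.
VOCAB = {
    "nausea": "X001",
    "headache": "X002",
    "dizziness": "X003",
    "rash": "X004",
}

def lookup_term(verbatim, cutoff=0.8):
    """Map a reported term to a vocabulary entry: exact match first, then fuzzy."""
    key = verbatim.strip().lower()
    if key in VOCAB:
        return key, VOCAB[key], "exact"
    matches = get_close_matches(key, list(VOCAB), n=1, cutoff=cutoff)
    if matches:
        return matches[0], VOCAB[matches[0]], "fuzzy"
    return None, None, "unmatched"

for reported in ("Nausea", "headach", "chest pain"):
    print(reported, "->", lookup_term(reported))
```

Because the lookup is deterministic, the same input always yields the same code, and the returned match type ("exact", "fuzzy", or "unmatched") gives a downstream review step something concrete to act on.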

Multi-agent architectures let you match the right tool to each sub-task, rather than forcing every task through the same general-purpose model. That is not just more efficient. It is more auditable, because each component can be tested and validated independently. [4]

The coordination problem

Multi-agent systems are not free of their own difficulties. The most obvious one is coordination. When agents pass outputs to each other, errors can propagate. A wrong extraction in stage one becomes a wrong validation in stage two and a wrong code in stage three. If the system does not include checkpoints for catching upstream errors, the pipeline can produce confidently structured nonsense.

This is why the validation and review stages are not optional. They are load-bearing parts of the architecture. A multi-agent system without validation checkpoints is just a more complicated way of making the same mistakes a single-prompt system would make.
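A checkpoint between stages can be very simple and still stop a bad extraction from flowing downstream. The sketch below is one possible shape, under stated assumptions: the `Extraction` type, the `checkpoint` function, and the 0.7 confidence threshold are hypothetical, and the confidence score is assumed to be supplied by the upstream agent.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    term: str
    confidence: float  # assumed to come from the upstream extraction agent

def checkpoint(narrative, extractions, min_confidence=0.7):
    """Split extractions into those safe to pass on and those held for review."""
    accepted, held = [], []
    text = narrative.lower()
    for ex in extractions:
        if ex.term.lower() not in text:
            # The term is not grounded in the source document at all.
            held.append((ex, "not found in source text"))
        elif ex.confidence < min_confidence:
            held.append((ex, "low confidence"))
        else:
            accepted.append(ex)
    return accepted, held

narrative = "Patient reported headache after the second dose."
proposed = [
    Extraction("headache", 0.95),
    Extraction("seizure", 0.90),    # not present in the narrative
    Extraction("dose", 0.40),       # present, but low confidence
]
accepted, held = checkpoint(narrative, proposed)
print("accepted:", [ex.term for ex in accepted])
print("held:", [(ex.term, reason) for ex, reason in held])
```

Only the accepted items continue to the next agent; everything else is held with a stated reason, so the error is surfaced at the stage where it occurred.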

There is also the question of latency. Running multiple sequential LLM calls is slower than running one. For real-time applications, that matters. For batch processing of safety case intake, where throughput matters more than single-case latency, it is usually acceptable. But the trade-off should be explicit in any system design. [2]
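The batch case can be illustrated with a small simulation. In the sketch below each case still passes through its four stages strictly in sequence, so per-case latency is unchanged, but several cases are in flight at once. The stage names, the 0.01 second simulated call time, and the `max_in_flight` limit are assumptions for the example; `asyncio.sleep` stands in for a real model call.

```python
import asyncio
import time

STAGES = ("extract", "validate", "code", "review")
STAGE_SECONDS = 0.01  # placeholder for the latency of one model call

async def call_agent(stage, case_id):
    """Simulated agent call; a real one would await a model API."""
    await asyncio.sleep(STAGE_SECONDS)
    return f"{stage}:{case_id}"

async def process_case(case_id, limiter):
    """Run one case through all stages in order, respecting the limit."""
    async with limiter:
        return [await call_agent(stage, case_id) for stage in STAGES]

async def process_batch(case_ids, max_in_flight=10):
    limiter = asyncio.Semaphore(max_in_flight)
    return await asyncio.gather(*(process_case(c, limiter) for c in case_ids))

start = time.perf_counter()
results = asyncio.run(process_batch(range(20)))
elapsed = time.perf_counter() - start

one_at_a_time = 20 * len(STAGES) * STAGE_SECONDS
print(f"batch wall time: {elapsed:.2f}s")
print(f"estimate if cases ran one at a time: {one_at_a_time:.2f}s")
```

The concurrency limit is the explicit trade-off knob: it bounds load on the model backend while keeping total batch time well below the one-case-at-a-time estimate.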

Where this connects to regulatory expectations

Regulatory frameworks for pharmacovigilance do not yet prescribe specific AI architectures. But they do increasingly expect transparency, traceability, and explainability in automated safety processes. Multi-agent systems are naturally more compatible with these expectations than monolithic ones, because each agent’s output is an inspectable intermediate step. [4]

If a regulatory reviewer asks, “why did the system classify this event as serious?”, a multi-agent system can point to the extraction agent’s output, the validation agent’s confirmation, and the coding agent’s mapping. A monolithic system can only point to its final output and the original prompt.

That difference may seem abstract now. It will not seem abstract when these systems are handling thousands of cases per day and regulatory auditors want to understand how they work.

The honest assessment

Multi-agent architectures are not a silver bullet. They add complexity. They require careful orchestration. They demand good inter-agent interfaces and robust error handling. And they are still only as good as the prompts, the training data, and the retrieval infrastructure that support each individual agent.

But they represent a genuinely more mature way of thinking about LLMs in drug safety. Instead of asking one model to be good at everything, they ask each component to be good at one thing. That is not a new idea in software engineering. It is just a surprisingly new idea in how most teams think about deploying LLMs.

The best pharmacovigilance AI systems will probably not be the ones with the most impressive model. They will be the ones with the most thoughtful decomposition of the task.

1. Choi J, Palumbo N, Chalasani P, et al. MALADE: Orchestration of LLM-powered agents with retrieval augmented generation for pharmacovigilance. arXiv. 2024. arXiv:2408.01869.

2. Roemming H-J, Hauben M, Wannhoff W, et al. How LLMs can advance safety case intake — points to consider and insights from a proof of concept. Therapeutic Advances in Drug Safety. 2025;16. doi:10.1177/20420986251386222.

3. De Vito G, Ferrucci F, Angelakis A. Enhancing medication safety with LLMs. Ital-IA 2025: 5th National Conference on Artificial Intelligence. 2025.

4. Hakim JB, Painter JL, Ramcharran D, et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Scientific Reports. 2025;15:27886. doi:10.1038/s41598-025-09138-0.
