Fine-tuning small models for drug safety: when, why, and how much data you actually need

One of the most counterintuitive findings in recent pharmacovigilance AI research is that bigger is not always better. A fine-tuned Phi-3.5 model, small enough to run on a single GPU, achieved higher sensitivity and accuracy for drug-drug interaction prediction than larger proprietary models across 13 validation datasets [1]. That result should reshape how pharmacovigilance teams think about model selection.

The instinct to reach for the largest available model is understandable. Larger models generally perform better on general benchmarks, handle more varied inputs, and require less task-specific adaptation. But in drug safety, the task is often narrow, the data is domain-specific, and the deployment constraints — privacy, cost, auditability, reproducibility — favour smaller, locally controlled models.

Fine-tuning is the bridge. It takes a pre-trained model that knows language and adapts it to know your specific task. Done well, it produces a system that is cheaper to run, easier to audit, and often more accurate on the exact problem you care about than a model ten times its size.

Figure 1. A practical decision framework for choosing between fine-tuning a small model and using a general-purpose large model for drug safety tasks. The decision depends on task specificity, data availability, deployment constraints, and auditability requirements.

When fine-tuning makes sense

Fine-tuning is not always the right choice. It makes the most sense when three conditions are met.

First, the task is well-defined and narrow. Extracting adverse events from case narratives is a well-defined task. "Understanding drug safety" is not. Fine-tuning works best when you can specify exactly what the model should produce given a particular input. If the task is open-ended or highly variable, a larger general-purpose model with careful prompting may be more flexible.

Second, domain-specific data is available. Fine-tuning adapts a model using examples of the task you want it to perform. If you have annotated case narratives, coded adverse event reports, or labelled extraction datasets, those examples are the fuel for fine-tuning. Without them, you are relying on the base model's general knowledge, and fine-tuning has nothing to adapt from.

Third, deployment constraints matter. If you need to process data on-premises because of patient privacy requirements, produce reproducible outputs for regulatory submissions, or run inference at scale without per-query API costs, a fine-tuned small model is dramatically more practical than a large model accessed through an external API [1].

What fine-tuning actually does

Pre-trained language models learn general language patterns from massive text corpora. Fine-tuning adjusts the model's parameters on a smaller, task-specific dataset so that its outputs become aligned with the target task. The model retains its general language understanding but becomes specialised in the patterns it sees during fine-tuning.

Full fine-tuning updates all model parameters. This produces the most complete adaptation but requires significant compute and memory, even for relatively small models. For a 3-billion parameter model, full fine-tuning requires holding the model, its gradients, and the optimiser states in memory simultaneously.
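To make that memory requirement concrete, here is a rough back-of-the-envelope sketch. The byte counts are assumptions, not measurements: 16-bit weights and gradients, plus 12 bytes per parameter for an Adam-style optimiser in mixed precision (32-bit master weights and two 32-bit moment estimates). Activations are left out, so real usage will be higher.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough memory for weights + gradients + optimiser state, in GB (1e9 bytes).

    Assumed defaults: 16-bit weights and gradients, and 12 bytes per parameter
    of optimiser state (Adam in mixed precision). Activations are excluded.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 1e9


estimate = full_finetune_memory_gb(3e9)
print(f"3B parameters, full fine-tuning: ~{estimate:.0f} GB before activations")
# prints: 3B parameters, full fine-tuning: ~48 GB before activations
```

Under these assumptions even a 3-billion parameter model exceeds the memory of a typical single consumer GPU, which is the gap that parameter-efficient methods aim to close.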

Parameter-efficient fine-tuning methods such as LoRA (Low-Rank Adaptation) and QLoRA offer a more practical alternative. LoRA freezes the original model parameters and injects small, trainable rank-decomposition matrices into each layer. This reduces the number of trainable parameters by orders of magnitude, typically to less than 1% of the original model, while retaining most of the performance of full fine-tuning [2].
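The idea can be illustrated with a minimal, dependency-free sketch. It is not how a production library implements LoRA; it only shows the arithmetic: a frozen weight matrix W, two small trainable matrices A and B of rank r, and an output of W x + (alpha / r) B A x. The layer size (4096 x 4096), the rank of 8, and the scaling value are illustrative assumptions.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of one frozen d_out x d_in matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # A is rank x d_in, B is d_out x rank
    return lora / full


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen path W x plus the scaled low-rank update (alpha / rank) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative layer size and rank (assumptions, not taken from any specific model).
print(f"{lora_trainable_fraction(4096, 4096, 8):.2%} of the layer is trainable")

# B starts at zero, so the adapted layer initially reproduces the frozen one.
W = [[1, 2, 3], [4, 5, 6]]
A = [[0.1, -0.2, 0.3]]  # rank 1, shape 1 x 3
B = [[0.0], [0.0]]      # shape 2 x 1, zero-initialised
x = [1, 0, -1]
print(lora_forward(W, A, B, x, alpha=16, rank=1) == matvec(W, x))
```

Only A and B receive gradient updates during training, which is why the optimiser state shrinks along with the trainable parameter count.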

QLoRA goes further by combining LoRA with 4-bit quantisation of the base model, so that models that would otherwise require multi-GPU setups can be fine-tuned on a single consumer GPU. A 7-billion parameter model, whose weights alone occupy about 28 GB in 32-bit precision (14 GB in 16-bit), can typically be fine-tuned in under 10 GB using QLoRA, depending on sequence length and batch size [3].
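The weight-storage part of that saving is simple arithmetic, sketched below. The function covers only the memory for the base model's weights at a given precision; adapter weights, optimiser state, activations, and the small overhead of quantisation constants are deliberately ignored, so treat the numbers as a lower bound.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# prints:
# 7B model at 32-bit: 28.0 GB
# 7B model at 16-bit: 14.0 GB
# 7B model at 4-bit: 3.5 GB
```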

These techniques have made fine-tuning accessible to teams that do not have large-scale compute infrastructure. For pharmacovigilance teams in pharmaceutical companies, academic medical centres, or regulatory agencies, that accessibility removes one of the main practical barriers to building task-specific models.

How much data do you actually need?

This is the question every team asks first, and the honest answer is: it depends, but probably less than you think.

For classification tasks, such as determining whether a narrative describes a serious adverse event, published work suggests that fine-tuned models can achieve strong performance with as few as 500 to 2,000 labelled examples, depending on the complexity of the classification and the quality of the labels [4].

For extraction tasks, such as identifying drug names and adverse events in free text, the data requirements are somewhat higher because the model needs to learn to recognise entity boundaries and types. Published work on biomedical named entity recognition suggests that 1,000 to 5,000 annotated examples typically produce competitive performance, with diminishing returns beyond that point [5].

For MedDRA coding, which maps free-text descriptions to preferred terms, the data requirements depend on the vocabulary size and the specificity of the target terms. Mapping to the 100 most common preferred terms requires far less data than mapping to the full MedDRA dictionary, which contains more than 20,000 preferred terms and over 80,000 lowest level terms. A practical approach is to start with the most frequent terms, evaluate performance, and expand the scope as more annotated data becomes available.
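One way to choose that starting scope is to ask how many of the most frequent terms are needed to cover a target share of historical cases. The sketch below assumes you have a flat list of previously coded preferred terms, one entry per coded event; the function name and the toy counts are hypothetical.

```python
from collections import Counter


def terms_for_coverage(coded_terms, target_coverage):
    """Return the most frequent terms needed to reach target_coverage of all codes.

    coded_terms: iterable of term strings, one per historical coded event.
    target_coverage: fraction between 0 and 1.
    """
    counts = Counter(coded_terms)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected, covered = [], 0
    for term, count in counts.most_common():
        if covered / total >= target_coverage:
            break
        selected.append(term)
        covered += count
    return selected, covered / total


# Toy history with made-up frequencies, for illustration only.
history = ["Nausea"] * 50 + ["Headache"] * 30 + ["Rash"] * 15 + ["Dizziness"] * 5
terms, coverage = terms_for_coverage(history, target_coverage=0.8)
print(terms, f"{coverage:.0%}")
```

The terms left outside the selected set can be routed to manual coding until enough annotated examples accumulate to bring them into scope.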

The most important data quality factor is annotation consistency. A thousand examples with clean, consistent labels will produce a better model than ten thousand examples with noisy or contradictory annotations. In pharmacovigilance, where coding decisions can be subjective and where different reviewers may code the same event differently, ensuring inter-annotator agreement is often more important than increasing dataset size.
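Inter-annotator agreement can be measured before any training run. A common statistic for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version; the two label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented seriousness labels from two hypothetical reviewers.
reviewer_a = ["serious", "serious", "non-serious", "non-serious", "serious", "non-serious"]
reviewer_b = ["serious", "non-serious", "non-serious", "non-serious", "serious", "serious"]
print(f"kappa = {cohens_kappa(reviewer_a, reviewer_b):.2f}")
```

Here the reviewers agree on four of six cases, but half of that agreement would be expected by chance, so kappa is only about 0.33. A low value like this is a signal to tighten the annotation guidelines before collecting more labels.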

The performance curve is not linear

The relationship between training data size and model performance often follows a roughly logarithmic curve. Early additions of data produce large improvements. Later additions produce progressively smaller gains. This has a practical implication: the first 500 annotated examples are far more valuable than the next 500.
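If you record a validation score at a few training-set sizes, you can fit this kind of curve and estimate what the next batch of annotation is likely to buy. The sketch below fits score = a + b ln(n) by ordinary least squares. The data points are hypothetical and chosen only to show the shape; the logarithmic form itself is an assumption that should be checked against your own results.

```python
import math


def fit_log_curve(sizes, scores):
    """Least-squares fit of score = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predicted_gain(b, n_from, n_to):
    """Predicted score change when growing the dataset from n_from to n_to."""
    return b * math.log(n_to / n_from)


# Hypothetical F1 scores at four dataset sizes (illustrative, not real results).
sizes = [250, 500, 1000, 2000]
scores = [0.70, 0.76, 0.82, 0.88]
a, b = fit_log_curve(sizes, scores)
print(f"500 -> 1000 examples:  +{predicted_gain(b, 500, 1000):.3f}")
print(f"1000 -> 1500 examples: +{predicted_gain(b, 1000, 1500):.3f}")
```

Under this fit, the same 500 additional examples yield a noticeably smaller gain the second time, which is the diminishing-returns pattern described above.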

That is why a phased approach often works well. Start with a small, high-quality dataset. Fine-tune. Evaluate. Identify the failure modes. Annotate additional examples that specifically target those failure modes. Fine-tune again. This iterative approach is more efficient than trying to build a comprehensive dataset upfront, and it produces a model whose training data directly addresses its weaknesses.
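The "identify the failure modes, then annotate for them" step can be made explicit. The sketch below assumes each evaluation error has already been tagged with a failure-mode label during manual review, and splits the next annotation budget in proportion to how often each mode occurred. The labels and the proportional rule are illustrative assumptions; a team might reasonably weight by clinical severity instead.

```python
from collections import Counter


def allocate_annotation_budget(error_modes, budget):
    """Split an annotation budget across failure modes by their error frequency.

    error_modes: one failure-mode label per misclassified evaluation example.
    budget: number of new examples the team can annotate in the next round.
    """
    counts = Counter(error_modes)
    total = sum(counts.values())
    if total == 0:
        return {}
    allocation = {mode: budget * count // total for mode, count in counts.items()}
    # Integer division can leave a few examples unassigned; give them to the
    # most frequent failure modes, one each.
    leftover = budget - sum(allocation.values())
    for mode, _ in counts.most_common(leftover):
        allocation[mode] += 1
    return allocation


# Hypothetical error tags from reviewing one evaluation run.
errors = ["negation"] * 5 + ["abbreviation"] * 3 + ["misspelled_drug"] * 2
print(allocate_annotation_budget(errors, budget=200))
```

Rerunning the same tally after each fine-tuning round shows whether the targeted examples actually reduced the failure mode they were collected for.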

For pharmacovigilance, this iterative strategy also aligns with how safety knowledge evolves. New drugs enter the market. New adverse events are identified. The MedDRA dictionary is updated twice a year. A fine-tuning pipeline that supports incremental updates is more sustainable than one that requires a complete retrain each time the task changes.

What to fine-tune on

The choice of base model matters more than most people expect. Not all small models are created equal, and the best base for fine-tuning depends on the task and the domain.

For pharmacovigilance tasks, biomedical pre-trained models (those trained on PubMed, clinical notes, or drug label corpora) tend to produce better results after fine-tuning than models pre-trained only on general web text. The base model's existing familiarity with medical terminology, drug names, and clinical language means that the fine-tuning step has less work to do [5].

Among more general models, the current sweet spot for fine-tuning in domain-specific tasks appears to be in the 3B to 8B parameter range. Models in this range are large enough to have strong language capabilities but small enough to fine-tune on realistic hardware. The Phi-3.5 results in pharmacovigilance are a concrete example of what this class of model can achieve with targeted adaptation [1].

Models smaller than 1B parameters can work for very narrow tasks — binary classification, simple extraction — but tend to struggle with the linguistic complexity that pharmacovigilance text often presents. Models larger than 13B parameters offer diminishing returns for most drug safety tasks and significantly increase infrastructure requirements.

The auditability advantage

Beyond performance and cost, fine-tuned small models have an advantage that is especially important in a regulated domain: auditability.

A fine-tuned model has a fixed set of parameters, a documented training dataset, a reproducible training process, and effectively deterministic inference behaviour (with greedy decoding, that is, temperature set to zero, on a fixed software and hardware stack). You can version-control the model, the training data, and the training configuration. You can reproduce the same outputs from the same inputs, provided the runtime environment is pinned along with the model. You can inspect what the model was trained on and verify that it was not exposed to data it should not have seen.
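One lightweight way to support that kind of traceability is to record content hashes of the training data and configuration alongside each model version. The sketch below is an assumption about how such a manifest might look, not a regulatory standard: the field names, the example record, and the base model identifier are all hypothetical. In practice you would also hash the resulting weight files and record library versions.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(training_records, config, base_model):
    """Fingerprint the inputs of a fine-tuning run for later audit.

    Records and config are serialised with sorted keys so that the hashes do
    not depend on dictionary ordering.
    """
    data_blob = "\n".join(json.dumps(r, sort_keys=True) for r in training_records)
    config_blob = json.dumps(config, sort_keys=True)
    return {
        "base_model": base_model,
        "n_records": len(training_records),
        "data_sha256": sha256_hex(data_blob.encode("utf-8")),
        "config_sha256": sha256_hex(config_blob.encode("utf-8")),
    }


# Hypothetical inputs, for illustration only.
records = [{"text": "Patient developed rash after dose 2.", "label": "non-serious"}]
config = {"lora_rank": 8, "learning_rate": 2e-4, "seed": 42, "epochs": 3}
manifest = build_manifest(records, config, base_model="example-base-model-v1")
print(json.dumps(manifest, indent=2))
```

If a single label or hyperparameter changes, the corresponding hash changes too, so an auditor can confirm that a given model version was trained on exactly the dataset and settings that were documented.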

None of this is true for a large model accessed through an external API. The provider may update the model at any time. The inference behaviour may change without notice. The training data is proprietary. And the outputs may be non-deterministic by default.

For pharmacovigilance teams that need to defend their automated processes to regulatory auditors, the ability to point to a specific model version trained on a specific dataset and produce reproducible outputs is not a nice-to-have. It is a requirement that shapes the entire technology choice [6].

The honest assessment

Fine-tuning small models is not always the answer. If your task is broad, your data is limited, or your inputs are highly variable, a well-prompted large model may still be the better choice. If you need to handle many different safety tasks with a single system, maintaining separate fine-tuned models for each one becomes a management burden.

But for the many pharmacovigilance tasks where the problem is well-defined, data is available, and deployment constraints are real (adverse event extraction, case triage, MedDRA coding, seriousness classification), fine-tuning a small model is often the most practical path to a system that is accurate, affordable, auditable, and deployable in a regulated environment.

The field has spent a lot of energy asking which model is the most capable. The more useful question, in most cases, is which model can be shaped to do exactly what you need.


  1. De Vito G, Ferrucci F, Angelakis A. Enhancing medication safety with LLMs. Ital-IA 2025: 5th National Conference on Artificial Intelligence. 2025. 

  2. Hu EJ, Shen Y, Wallis P, et al. LoRA: Low-rank adaptation of large language models. Proceedings of the International Conference on Learning Representations (ICLR). 2022. 

  3. Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient finetuning of quantized language models. Advances in Neural Information Processing Systems (NeurIPS). 2023. 

  4. Sun C, Qiu X, Xu Y, Huang X. How to fine-tune BERT for text classification. China National Conference on Chinese Computational Linguistics. 2019:194–206. 

  5. Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–1240. 

  6. Hakim JB, Painter JL, Ramcharran D, et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Scientific Reports. 2025;15:27886. doi:10.1038/s41598-025-09138-0. 
