How LLMs handle (and mishandle) medical terminology and ontologies

There is a moment in almost every pharmacovigilance LLM project where someone discovers that the model “understands” a medical concept but cannot reliably map it to the right controlled term. The model knows what hepatotoxicity means. It can discuss it fluently. It can distinguish it from nephrotoxicity. But when you ask it to produce the correct MedDRA preferred term for a specific adverse event description, it sometimes invents a term that does not exist, picks a term from the wrong hierarchy level, or conflates two related but distinct concepts.

That gap — between language understanding and ontology compliance — is one of the most practically important failure modes for LLMs in drug safety. And it starts at a level most people never think about: tokenization.

Figure 1. The same medical concept seen through two lenses: LLM tokenization breaks terms into subword pieces based on training frequency, while MedDRA organises them into a rigid hierarchy. The model’s internal representation of a term may bear no structural relationship to its position in the ontology.

The tokenization problem no one talks about

Large language models do not process text as words. They process it as tokens: subword units derived from the training corpus. The exact tokenization depends on the model’s vocabulary, which is built from statistical patterns in the data it was trained on.[1]

For common English words, this is rarely a problem. But medical terminology is different. Drug names, chemical compounds, and clinical terms follow naming conventions that are often poorly represented in general-purpose training data. A term like “rhabdomyolysis” might be tokenized as “rhab” + “domy” + “ol” + “ysis”, cutting across its actual morphemes (rhabdo-, myo-, -lysis) and losing the medical meaning they carry. A drug name like “pembrolizumab” might be broken into fragments that share subword tokens with completely unrelated terms.[2]

This matters because the model’s ability to reason about a medical concept depends in part on how coherently it can represent that concept internally. If the tokenizer fragments a term in a way that does not preserve its meaningful components, the model’s “understanding” of that term is built on a noisier foundation than it would be for a common English word.

In pharmacovigilance, this is not just an academic concern. When an extraction system needs to correctly identify the drug name “lixisenatide” in a case narrative, the tokenization determines how the model initially perceives that string. If the tokens do not correspond to meaningful morphemes, the model has to work harder — and may fail more often — to recognise the term as a single entity.
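
To make the fragmentation concrete, here is a toy sketch of greedy longest-match subword splitting. The vocabulary is invented for illustration and is not taken from any real model. Real tokenizers learn their vocabularies from corpus statistics and may split these terms quite differently.

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right.

    Falls back to a single character when no longer piece matches.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical vocabulary: one common word kept whole, plus the fragments
# used in the rhabdomyolysis example above.
TOY_VOCAB = {"liver", "rhab", "domy", "ol", "ysis"}

print(greedy_subword_tokenize("liver", TOY_VOCAB))
print(greedy_subword_tokenize("rhabdomyolysis", TOY_VOCAB))
```

With this made-up vocabulary the common word survives as one token, while the clinical term comes out as four pieces, none of which lines up with rhabdo-, myo-, or -lysis.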

MedDRA is not a language. It is a hierarchy.

The Medical Dictionary for Regulatory Activities (MedDRA) is the controlled terminology used worldwide for adverse event coding in pharmacovigilance. It has a rigid five-level hierarchy: System Organ Class, High Level Group Term, High Level Term, Preferred Term, and Lowest Level Term. Each term sits at a defined level with defined parents (a Preferred Term has one primary System Organ Class, although it can also be linked to secondary ones), and the distinctions between neighbouring terms can be subtle.[3]
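
As a rough picture of what “hierarchy” means in data terms, the sketch below stores each term with its level and a link to its parent, then walks upward. The entries are placeholders, not real MedDRA content, and the single-parent link is a simplification I am assuming for brevity: it models only one path to a System Organ Class.

```python
from dataclasses import dataclass
from typing import Optional

# The five levels, from broadest to most specific.
LEVELS = ("SOC", "HLGT", "HLT", "PT", "LLT")


@dataclass(frozen=True)
class Term:
    name: str
    level: str              # one of LEVELS
    parent: Optional[str]   # name of the term one level up; None for an SOC


# Placeholder entries only. A real system would load a licensed MedDRA release.
TERMS = {t.name: t for t in [
    Term("Example SOC", "SOC", None),
    Term("Example HLGT", "HLGT", "Example SOC"),
    Term("Example HLT", "HLT", "Example HLGT"),
    Term("Example PT", "PT", "Example HLT"),
    Term("Example LLT", "LLT", "Example PT"),
]}


def path_to_soc(name):
    """Walk parent links from a term up to its SOC; empty list if unknown."""
    path = []
    current = TERMS.get(name)
    while current is not None:
        path.append((current.level, current.name))
        current = TERMS.get(current.parent) if current.parent else None
    return path


for level, name in path_to_soc("Example LLT"):
    print(f"{level:>4}: {name}")
```

A string that is not a key in `TERMS` has no path at all, however plausible it sounds. That is the property free-text generation does not give you.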

LLMs were not trained to navigate this hierarchy. They were trained to predict the next token. When a model produces a MedDRA-like term, it is generating a plausible-sounding string based on patterns in its training data. It is not performing a lookup against the official MedDRA dictionary. That means the output may look correct — “drug-induced liver injury” sounds like a real preferred term — but may not match the exact term that MedDRA uses, or may correspond to a different hierarchy level than intended.

I call this term drift: the model generates something close enough to be convincing but not exact enough to be usable in a regulatory submission. In a domain where the difference between “hepatic failure” and “hepatic failure acute” matters for case classification and signal detection, close is not good enough.

The “understanding” illusion

It is tempting to conclude from a model’s fluent discussion of a medical topic that it truly understands the underlying concepts. And in some functional sense, it does. A well-trained LLM can explain the mechanism of drug-induced liver injury, distinguish it from other forms of hepatotoxicity, and discuss its clinical presentation. That is genuinely useful.

But understanding a concept in natural language and correctly mapping it to a specific node in a controlled vocabulary are different cognitive tasks. The first requires semantic knowledge. The second requires ontological precision — knowing exactly where a concept sits in a predefined classification system, and respecting the boundaries between adjacent terms.

Research on adverse event extraction from drug labels has shown that LLM performance varies significantly depending on the linguistic complexity of the target concept. Negated terms, discontinuous mentions, and terms that span multiple phrases are harder to extract correctly. The same pattern likely applies to coding: the more ambiguous or context-dependent the term, the more likely the model is to produce an imprecise mapping.[4]

Where this breaks in practice

The practical consequences show up in several places. First, in automated coding workflows, where an LLM is asked to assign MedDRA preferred terms to free-text adverse event descriptions. If the model invents a term that does not exist in MedDRA, the downstream system has to either reject the output or attempt a fuzzy match, both of which add complexity and error risk.

Second, in extraction pipelines where the boundary between two related concepts determines case classification. Is a reported event “injection site reaction” or “injection site pain”? The distinction matters for aggregate analysis, but from a language model’s perspective, these are nearly identical concepts separated by a fine ontological line.

Third, in signal detection, where term-level precision affects disproportionality calculations. If a model consistently maps events to a slightly different preferred term than a human coder would, the resulting database contains systematic coding drift that can inflate some signals and suppress others.
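
A small worked example shows how little drift it takes. I use the proportional reporting ratio (PRR) here as one common disproportionality measure; that choice is my assumption for the illustration. All counts are invented, and I assume the drift affects only the target drug’s reports.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: target drug, target event     b: target drug, other events
    c: other drugs, target event     d: other drugs, other events
    """
    return (a / (a + b)) / (c / (c + d))


# Invented counts: the drug has 1000 reports in total.
# Human coding puts 40 of them on term X and 10 on the neighbouring term Y.
x_human = prr(40, 960, 200, 9800)
y_human = prr(10, 990, 100, 9900)

# Drifted coding moves 10 of the term X reports to term Y.
x_model = prr(30, 970, 200, 9800)
y_model = prr(20, 980, 100, 9900)

print(f"Term X: {x_human:.2f} -> {x_model:.2f}")
print(f"Term Y: {y_human:.2f} -> {y_model:.2f}")
```

In this made-up table, moving ten reports lowers the PRR for term X from 2.0 to 1.5 and raises it for term Y from 1.0 to 2.0. A team screening at a PRR of 2 (a hypothetical cut-off for this example) would lose one signal and gain another without a single new case arriving.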

Guardrails and grounding strategies

The most promising approaches to this problem do not try to make the LLM itself ontology-aware. Instead, they constrain the model’s output space so that it can only produce valid terms.

Retrieval-augmented generation is one such approach. Instead of asking the model to recall MedDRA terms from its parametric memory, the system retrieves candidate terms from the MedDRA dictionary and asks the model to select the best match from a constrained set. This converts the task from open-ended generation to informed selection, which is inherently less prone to hallucination.[5]
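
A minimal sketch of that selection pattern follows. Everything in it is an assumption for illustration: the term list reuses strings quoted in this article as a stand-in for a licensed MedDRA release, `difflib` lexical similarity stands in for whatever retriever a real system would use (typically embeddings), and `ask_model` is a hypothetical callable wrapping your LLM.

```python
import difflib

# Stand-in term list; a real system would load the current MedDRA release.
DICTIONARY = [
    "Injection site pain",
    "Injection site reaction",
    "Hepatic failure",
    "Hepatic failure acute",
]


def retrieve_candidates(description, terms, k=3):
    """Return up to k dictionary terms most similar to the description."""
    lowered = {t.lower(): t for t in terms}
    hits = difflib.get_close_matches(
        description.lower(), list(lowered), n=k, cutoff=0.0
    )
    return [lowered[h] for h in hits]


def build_prompt(description, candidates):
    """Present the candidates as numbered options so the reply is a number."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    return (
        f"Adverse event description: {description}\n"
        "Reply with the number of the best option, or 0 if none fits.\n"
        f"{options}"
    )


def select_term(description, terms, ask_model):
    """Return a dictionary term chosen by the model, or None."""
    candidates = retrieve_candidates(description, terms)
    reply = ask_model(build_prompt(description, candidates)).strip()
    # Accept only a valid option number; any other reply is treated as no choice.
    if reply.isdecimal() and 1 <= int(reply) <= len(candidates):
        return candidates[int(reply) - 1]
    return None


# A fake model that always picks option 1, just to exercise the flow.
print(select_term("pain where the injection was given", DICTIONARY, lambda p: "1"))
```

The design point is in `select_term`: whatever the model replies, the function can only return a string that came out of the dictionary, or nothing.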

Another approach is post-processing validation: let the model generate freely, then check its output against the official dictionary and flag or correct mismatches. This is less elegant but practically effective, especially when combined with confidence scoring that routes uncertain cases to human review.[6]
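
The sketch below shows one way such a check could look. The term names are placeholders, the 0.8 threshold is arbitrary, and `confidence` is assumed to arrive from somewhere upstream (for example a separate scorer); none of this comes from a specific published system.

```python
def normalise(term):
    """Case-fold and collapse whitespace so formatting noise is not a mismatch."""
    return " ".join(term.split()).casefold()


def validate_term(generated, preferred_terms, llt_to_pt, confidence, threshold=0.8):
    """Check a generated term against the dictionary and decide its route."""
    key = normalise(generated)
    pt_index = {normalise(t): t for t in preferred_terms}
    llt_index = {normalise(llt): pt for llt, pt in llt_to_pt.items()}

    if key in pt_index:
        term, status = pt_index[key], "valid_pt"
    elif key in llt_index:
        # Right concept, wrong level: map the LLT up to its preferred term.
        term, status = llt_index[key], "mapped_from_llt"
    else:
        # Invented or misspelled term: never accept automatically.
        return {"term": None, "status": "not_in_dictionary", "route": "human_review"}

    route = "auto_accept" if confidence >= threshold else "human_review"
    return {"term": term, "status": status, "route": route}


# Placeholder dictionary content, not real MedDRA terms.
PREFERRED = {"Example PT A", "Example PT B"}
LLT_TO_PT = {"Example LLT A1": "Example PT A"}

print(validate_term("example pt a", PREFERRED, LLT_TO_PT, confidence=0.95))
print(validate_term("Example LLT A1", PREFERRED, LLT_TO_PT, confidence=0.95))
print(validate_term("Example PT Z", PREFERRED, LLT_TO_PT, confidence=0.99))
```

Note that a term missing from the dictionary goes to human review regardless of how confident the model claims to be.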

A third direction is fine-tuning on ontology-specific data. Models trained on corpora that include MedDRA-coded examples learn the mapping between natural language descriptions and controlled terms more reliably than general-purpose models. The trade-off is that fine-tuning requires curated training data and periodic updates as the terminology evolves.[7]

The deeper issue: ontologies evolve, models do not

MedDRA is updated twice a year. Terms are added, deprecated, merged, and reclassified. A model trained or fine-tuned on version 26.0 may produce outputs that are inconsistent with version 27.1. That version drift is manageable in a retrieval-augmented system, where the dictionary can be swapped out. It is much harder to manage in a system that relies on the model’s internal knowledge of the terminology.

This is one of the quieter arguments for keeping ontological knowledge outside the model and in the retrieval layer. The model handles language. The knowledge base handles structure. When the structure changes, you update the knowledge base, not the model.
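
As a sketch of what “update the knowledge base” can mean operationally, the snippet below compares two term lists and finds stored codings that the newer list no longer contains. The version contents and case identifiers are placeholders, and a plain set difference is my simplification; how a real release marks changed terms is not modelled here.

```python
def diff_versions(old_terms, new_terms):
    """Return terms added to and removed from the dictionary between versions."""
    old, new = set(old_terms), set(new_terms)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


def stale_codings(coded_cases, new_terms):
    """Return ids of cases whose stored term is absent from the new version."""
    valid = set(new_terms)
    return sorted(cid for cid, term in coded_cases.items() if term not in valid)


# Placeholder term lists standing in for two dictionary versions.
OLD_VERSION = ["Example PT A", "Example PT B", "Example PT C"]
NEW_VERSION = ["Example PT A", "Example PT C", "Example PT D"]

# Hypothetical cases coded against the old version.
CASES = {"case-001": "Example PT A", "case-002": "Example PT B"}

print(diff_versions(OLD_VERSION, NEW_VERSION))
print(stale_codings(CASES, NEW_VERSION))
```

In a retrieval-based design this check runs against data you control. There is no equivalent operation for terms a model has memorised.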

What this means for pharmacovigilance teams

If you are building an LLM-based system for any part of the pharmacovigilance workflow that touches MedDRA coding, the single most important design decision is not which model to use. It is how to constrain the model’s output to valid terms.

Do not trust the model to produce correct MedDRA terms from memory. Do not assume that because the model can discuss a medical concept fluently, it can code it correctly. And do not evaluate your system’s coding accuracy using only surface-level metrics such as string similarity to the reference term. Measure exact-match accuracy against the dictionary. Check for term-level drift across hierarchy levels. Test on edge cases where two preferred terms are semantically close but ontologically distinct.
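
Those three checks can be sketched as a small evaluation function. The metric names, the placeholder terms, and the definition of a near miss (wrong term, same parent one level up) are my assumptions for illustration, not an established benchmark.

```python
def evaluate_coding(predictions, gold, dictionary, parent_of):
    """Compute exact-match, invalid-term, and near-miss rates for coded terms."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    pairs = list(zip(predictions, gold))
    n = len(pairs)

    # Exact string agreement with the human-assigned term.
    exact = sum(p == g for p, g in pairs)
    # Predicted term does not exist in the dictionary at all.
    invalid = sum(p not in dictionary for p, _ in pairs)
    # Valid but wrong term that shares a parent with the gold term.
    near_miss = sum(
        p != g and p in parent_of and parent_of[p] == parent_of.get(g)
        for p, g in pairs
    )
    return {
        "exact_match": exact / n,
        "invalid_rate": invalid / n,
        "near_miss_rate": near_miss / n,
    }


# Placeholder terms and parents, not real MedDRA content.
PARENT_OF = {"PT A": "HLT 1", "PT B": "HLT 2", "PT C": "HLT 2", "PT D": "HLT 3"}
predicted = ["PT A", "PT B", "PT X", "PT D"]
reference = ["PT A", "PT C", "PT C", "PT D"]

print(evaluate_coding(predicted, reference, set(PARENT_OF), PARENT_OF))
```

Reporting the near-miss rate separately matters because those are the errors a similarity metric will score as almost correct.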

The models are good at language. They are not good at being dictionaries. The system design needs to account for that difference.

  1. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016:1715–1725. 

  2. Zhang K, Jiang X, et al. BiomedBERT-based tokenization and its impact on biomedical NLP tasks. Journal of Biomedical Informatics. 2024;150:104599. 

  3. ICH. MedDRA Introductory Guide, Version 27.0. International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. 2024. 

  4. Gisladottir U, Zietz M, Kivelson S, et al. Leveraging large language models in extracting drug safety information from prescription drug labels. Drug Safety. 2025. doi:10.1007/s40264-025-01594-x. 

  5. Wu L, Qu Y, Xu J, et al. A framework enabling LLMs into regulatory environment for transparency and trustworthiness and its application to drug labeling document. Regulatory Toxicology and Pharmacology. 2024;148:105597. 

  6. Hakim JB, Painter JL, Ramcharran D, et al. The need for guardrails with large language models in pharmacovigilance and other medical safety critical settings. Scientific Reports. 2025;15:27886. doi:10.1038/s41598-025-09138-0. 

  7. De Vito G, Ferrucci F, Angelakis A. Enhancing medication safety with LLMs. Ital-IA 2025: 5th National Conference on Artificial Intelligence. 2025. 
