May 1, 2026

Context windows, chunking strategies, and what gets lost when safety documents are too long

A full prescribing information document for a marketed drug is often 20 to 40 pages long. Some exceed 60. They contain warnings, adverse reaction tables, clinical trial summaries, pharmacokinetic data, drug interaction sections, and use-in-specific-populations guidance. All of it is safety-relevant. Much of it is cross-referenced internally. A warning in section 5 may depend on context established in section 6. An adverse reaction listed in section 6.1 may be qualified by language in section 5.1.

When an LLM is asked to process such a document, the first practical question is whether it can even see all of it at once. And even when the context window is technically large enough, a growing body of evidence suggests that models do not use all parts of a long input equally well. They tend to retrieve and use information near the beginning and end of the input more reliably than information in the middle. And the middle is often exactly where the safety-critical information lives.

Figure 1. The "lost in the middle" effect. In long-context LLM processing, retrieval accuracy and attention tend to be highest for information at the beginning and end of the input, with a measurable drop for content in the middle of the context window. In drug labeling, safety-critical sections like Warnings and Precautions often fall in this low-attention zone.

The context window is not what you think it is

Modern LLMs have context windows ranging from 8,000 to over 1,000,000 tokens, depending on the model. That might suggest that document length is a solved problem. But context window size describes the maximum input the model can accept, not the maximum input it can process reliably.

A study by Liu and colleagues demonstrated what is now called the “lost in the middle” effect. They showed that when relevant information is placed in the middle of a long input context, language model performance degrades significantly compared to when the same information appears at the beginning or end. This pattern held across multiple model families and multiple tasks.¹

For pharmacovigilance, this is a direct practical concern. A full drug label with safety information scattered across multiple sections cannot be reordered to put everything important at the beginning. The structure of the document is defined by regulatory convention, not by what the model finds easiest to attend to.

Why chunking matters

Given these limitations, most practical LLM systems for document processing use chunking: breaking the input into smaller segments, processing each segment separately, and then combining the results. This is standard practice in retrieval-augmented generation, where documents are split into chunks, indexed, and retrieved in response to queries.²

But chunking introduces its own problems, and in pharmacovigilance, those problems are not trivial.

The most fundamental issue is that chunking can break cross-references. A warning that says “see section 6.1 for rates of hepatotoxicity observed in clinical trials” only makes sense if the model can see both section 5 and section 6.1. If those sections are in different chunks, the model processes each one in isolation and may miss the connection between them.

This is particularly important for drug labels, where the relationship between the Warnings and Precautions section and the Adverse Reactions section is often the most safety-relevant part of the document. A model that reads the warnings without the supporting adverse reaction data, or vice versa, is working with an incomplete picture.³
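One way to keep a warning and the data it points to in the same processing context is to detect explicit cross-references and build a composite chunk. The sketch below is illustrative only: it assumes cross-references are phrased like "see section 6.1", that the label has already been split into a dictionary keyed by section number, and the names `find_cross_references` and `composite_chunk` are hypothetical. The sample label text is invented.

```python
import re

# Assumption: cross-references are written as "see section <number>".
XREF = re.compile(r"\bsee\s+section\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def find_cross_references(text: str) -> list[str]:
    """Return referenced section numbers in order of first appearance."""
    seen: list[str] = []
    for number in XREF.findall(text):
        if number not in seen:
            seen.append(number)
    return seen


def composite_chunk(section_id: str, sections: dict[str, str]) -> str:
    """Join a section with the text of every section it explicitly references."""
    parts = [f"[Section {section_id}]\n{sections[section_id]}"]
    for ref in find_cross_references(sections[section_id]):
        if ref in sections and ref != section_id:
            parts.append(f"[Referenced section {ref}]\n{sections[ref]}")
    return "\n\n".join(parts)


# Invented toy label, not taken from any real product.
label = {
    "5.1": "Hepatotoxicity has been reported; see section 6.1 for rates observed in clinical trials.",
    "6.1": "Elevated ALT occurred in 4% of treated patients.",
}
chunk = composite_chunk("5.1", label)
print(chunk)
```

The composite chunk is larger than either section alone, which is the trade-off: more tokens per call in exchange for the model seeing the warning and its supporting data together.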

How you chunk determines what you find

The choice of chunking strategy has a direct impact on extraction quality. Consider three common approaches.

Fixed-size chunking splits the document into segments of uniform token length, typically with some overlap between adjacent chunks. This is the simplest approach and the most common in general-purpose RAG systems. Its main disadvantage is that it ignores document structure. A chunk boundary can fall in the middle of a sentence, in the middle of a table, or between a header and its associated content.
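A minimal sketch of fixed-size chunking with overlap follows. It assumes whitespace-separated words stand in for model tokens (a real pipeline would use the model's own tokenizer), and `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reaches the end of the document
    return chunks


# Assumption: whitespace tokens as a stand-in for model tokens.
text = "Serious hepatotoxicity has been reported . Monitor liver enzymes at baseline and monthly ."
tokens = text.split()
chunks = fixed_size_chunks(tokens, size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

In this toy input the second window begins with the tail of the first sentence and ends partway through the second, which is the structure-blind behaviour described above.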

Section-based chunking splits the document along structural boundaries — headers, section numbers, or labeled divisions. For drug labels, which follow a standardised section numbering system, this is more natural. Each section becomes a chunk. The disadvantage is that sections vary enormously in length. The Adverse Reactions section of a label can be thousands of tokens long, while the Contraindications section may be a single paragraph.
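As a rough illustration, section-based splitting can be done with a header pattern. The sketch assumes every section header sits on its own line as a number followed by a title, such as "5.1 Hepatotoxicity". That assumption is fragile: a body line that happens to start with a number (a dose, for example) would be misread as a header, so a real pipeline would validate matches against the expected section list. The sample label text is invented and `split_by_section` is a hypothetical name.

```python
import re

# Assumption: headers look like "<number> <title>" on a line of their own.
HEADER = re.compile(r"^(\d+(?:\.\d+)?)[ \t]+(\S.*)$", re.MULTILINE)


def split_by_section(label_text: str) -> dict[str, str]:
    """Map each section number to the body text that follows its header."""
    matches = list(HEADER.finditer(label_text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(label_text)
        sections[match.group(1)] = label_text[match.end():end].strip()
    return sections


# Invented toy label text.
label_text = """4 CONTRAINDICATIONS
Known hypersensitivity to the active substance.
5 WARNINGS AND PRECAUTIONS
5.1 Hepatotoxicity
Monitor liver enzymes. See section 6.1.
6 ADVERSE REACTIONS
6.1 Clinical Trials Experience
Adverse reactions are summarised in the table below.
"""
sections = split_by_section(label_text)
for number, body in sections.items():
    print(number, len(body.split()), "words")
```

Printing the word count per section makes the length imbalance visible, including parent sections whose own body is empty because all their content lives in subsections.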

Semantic chunking uses the model itself or a separate embedding model to identify natural topic boundaries and create chunks that correspond to coherent segments of meaning. This is the most sophisticated approach but also the most computationally expensive and the hardest to validate.
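The core idea can be sketched as boundary detection between adjacent sentences: start a new chunk whenever similarity to the previous sentence drops below a threshold. In the sketch below, word-overlap (Jaccard) similarity is a deliberately crude stand-in for embedding similarity, and the threshold and sentences are invented for illustration. A real system would swap in an embedding model and tune the threshold on labelled data.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity; a stand-in for cosine similarity of embeddings."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Group consecutive sentences; open a new chunk when similarity falls below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        if chunks:
            previous = set(chunks[-1][-1].lower().split())
            if jaccard(words, previous) >= threshold:
                chunks[-1].append(sentence)
                continue
        chunks.append([sentence])
    return chunks


# Invented example sentences.
sentences = [
    "liver enzymes were elevated in some patients",
    "elevated liver enzymes resolved after discontinuation",
    "the drug is eliminated mainly by renal excretion",
]
chunks = semantic_chunks(sentences, threshold=0.2)
print(chunks)
```

Even in this toy form the validation difficulty is visible: the chunk boundaries depend entirely on the similarity measure and the threshold, neither of which corresponds to anything in the regulatory structure of the document.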

Each strategy makes different trade-offs between preserving context and fitting within processing limits. And each one creates different blind spots. Fixed-size chunking may split a safety-relevant passage across two chunks, diluting it in both. Section-based chunking may isolate a critical cross-reference into a separate chunk where it loses its context. Semantic chunking may group information in ways that do not align with the regulatory structure of the document.

The overlap problem

A common mitigation for chunking artifacts is to include overlap between adjacent chunks. If chunks overlap by, say, 200 tokens, then information near a chunk boundary appears in both the current chunk and the next one. This reduces the chance that a passage is split in a way that destroys its meaning.

But overlap is not free. It increases the total number of tokens processed, which increases cost and latency. It can also create duplication artifacts: if the same adverse event mention appears in two overlapping chunks, the extraction system may count it twice unless there is deduplication logic downstream.
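The token overhead is easy to estimate. The sketch below assumes a simple sliding window in which each chunk starts `size - overlap` tokens after the previous one. The 20,000-token label length and 1,000-token chunk size are illustrative assumptions; the 200-token overlap matches the example above. `overlap_overhead` is a hypothetical helper name.

```python
import math


def overlap_overhead(doc_tokens: int, size: int, overlap: int) -> tuple[int, int]:
    """Return (number of chunks, total tokens processed) for a sliding window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if doc_tokens <= size:
        return (1 if doc_tokens else 0), doc_tokens
    step = size - overlap
    n_chunks = 1 + math.ceil((doc_tokens - size) / step)
    # Every chunk after the first re-reads `overlap` tokens already seen.
    total = doc_tokens + (n_chunks - 1) * overlap
    return n_chunks, total


# Assumed sizes for illustration only.
n_chunks, total = overlap_overhead(doc_tokens=20_000, size=1_000, overlap=200)
print(n_chunks, total, f"{total / 20_000 - 1:.0%} extra tokens")
# → 25 24800 24% extra tokens
```

Under these assumptions a 20 percent overlap costs roughly a quarter more tokens, before counting any per-call prompt overhead.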

In pharmacovigilance, duplicate extraction is a real problem. If a model processes a label in overlapping chunks and extracts the same adverse event from two chunks, the downstream database will contain a duplicate entry unless the pipeline explicitly handles it. Duplicate adverse events in a safety database can inflate disproportionality scores and distort signal detection.
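One possible shape for that deduplication step is sketched below. It assumes each extracted mention carries character offsets that have been mapped back to the full document, so that the same passage seen through two overlapping chunks produces the same key. The `Mention` type, the helper names, and the offsets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    term: str   # adverse event text as extracted
    start: int  # character offset in the full document, not in the chunk
    end: int


def to_document_offsets(chunk_start: int, local_start: int, local_end: int) -> tuple[int, int]:
    """Convert chunk-local offsets to document-level offsets."""
    return chunk_start + local_start, chunk_start + local_end


def deduplicate(mentions: list[Mention]) -> list[Mention]:
    """Keep one mention per (normalised term, document span), in document order."""
    seen: set[tuple[str, int, int]] = set()
    unique: list[Mention] = []
    for m in sorted(mentions, key=lambda m: (m.start, m.end)):
        key = (m.term.strip().lower(), m.start, m.end)
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique


# Invented example: one mention sits in the overlap of two chunks.
from_chunk_1 = Mention("Hepatotoxicity", *to_document_offsets(0, 950, 964))
from_chunk_2 = Mention("hepatotoxicity", *to_document_offsets(800, 150, 164))
later_mention = Mention("hepatotoxicity", 2400, 2414)
unique = deduplicate([from_chunk_1, from_chunk_2, later_mention])
print(len(unique))
```

Keying on the document span removes chunk-overlap duplicates while keeping genuinely separate mentions of the same event. Whether repeated mentions in different sections should also collapse into a single database entry is a separate pipeline decision that this sketch does not make.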

Long case narratives present different challenges

Drug labels are at least structurally predictable. Case narratives — the free-text summaries that accompany individual case safety reports — are not. They vary in length from a few sentences to several pages. They are written in different styles by different reporters. They may include medical history, concomitant medications, event descriptions, lab results, and follow-up information, all in a single unstructured paragraph.

For long case narratives, the chunking challenge is different. There is no standard section structure to split on. The temporal sequence of events matters — the order in which symptoms appeared, drugs were administered, and outcomes were observed is often essential for causal assessment. A chunking strategy that breaks a narrative in the middle of a temporal sequence can make it impossible for the model to assess whether an event followed drug exposure or preceded it.

This is one of the quieter arguments for processing entire case narratives as single inputs whenever possible, even if it means using a larger model or accepting higher processing costs. The cost of a missed temporal relationship can be greater than the cost of extra compute.

What this means in practice

For teams building LLM pipelines for pharmacovigilance documents, several principles follow from these observations.

First, test your chunking strategy on the actual document types your system will process. A chunking approach that works well on biomedical literature may perform poorly on drug labels or case narratives, because the information density, cross-referencing patterns, and structural conventions are different.

Second, measure extraction quality as a function of information position within the document. If your system consistently misses adverse events that appear in the middle sections of long labels but finds comparable events near the start or end, the likely cause is chunking or position-dependent attention rather than an inability to recognise adverse events.

Third, preserve cross-references wherever possible. If your document contains explicit cross-references between sections, consider strategies that keep referenced sections together in the same processing context. This may mean creating composite chunks that include a section and its referenced content, even at the cost of larger chunk sizes.

Fourth, deduplicate at the pipeline level. If your chunking strategy uses overlap, or if the same information appears in multiple sections of a document, your extraction pipeline needs explicit deduplication logic before results are written to a safety database.

And fifth, be realistic about what long-context models actually deliver. A 128,000-token context window does not mean the model will reliably use every token. The lost-in-the-middle effect is real, and it is not fully solved by larger context windows. It appears to be a property of how current attention-based models use long inputs, and it will likely persist as a practical concern even as context windows grow.
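The second principle above, measuring quality by position, can be sketched as a simple bucketed recall calculation. The sketch assumes you have a gold-standard set of adverse events, each with a relative position in the document (0.0 for the start, 1.0 for the end) and a flag for whether the system extracted it. The function name and the sample records are invented for illustration.

```python
from typing import Optional


def recall_by_position(
    records: list[tuple[float, bool]], n_buckets: int = 3
) -> list[Optional[float]]:
    """Recall per position bucket; None where a bucket has no gold items."""
    hits = [0] * n_buckets
    totals = [0] * n_buckets
    for position, found in records:
        bucket = min(int(position * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(found)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Invented records: (relative position of gold item, was it extracted?)
records = [
    (0.05, True), (0.10, True),
    (0.45, False), (0.50, True), (0.55, False),
    (0.90, True),
]
start, middle, end = recall_by_position(records)
print(start, middle, end)
```

A clear dip in the middle bucket relative to the outer ones is the pattern to look for. With real data, each bucket needs enough gold items for the comparison to be meaningful.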

The bottom line

Context windows are getting larger. That is good. But larger context windows do not eliminate the need for thoughtful document processing strategies. How you chunk a drug label or a case narrative is not a minor implementation detail. It is a design decision that directly affects what your system sees, what it misses, and whether the downstream safety database reflects the source material accurately.

In pharmacovigilance, the information you lose to a bad chunking strategy is often the information you needed most.

1. Liu NF, Lin K, Hewitt J, et al. Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics. 2024;12:157–173.
2. Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: a survey. arXiv. 2024. arXiv:2312.10997.
3. Gisladottir U, Zietz M, Kivelson S, et al. Leveraging large language models in extracting drug safety information from prescription drug labels. Drug Safety. 2025. doi:10.1007/s40264-025-01594-x.
