Long context vs. retrieval: should you just paste the whole PDF?

The 2026 temptation

For years, the context window was the hard limit on what you could ask a model about. That ceiling has lifted. By 2026 frontier and open-weight models advertise context windows of 128K, 200K, even a million tokens, and a million tokens is roughly a few thousand pages of text. So a reasonable question follows: if the model can read the whole document set at once, why build retrieval at all? Why not paste the entire PDF (or the whole shelf of PDFs) into the prompt and ask your question?

It is a fair question. The honest answer is "sometimes." This article is about telling the two cases apart, because the wrong default gets expensive fast.

Why "paste everything" is seductive

The appeal is real, and it is mostly about simplicity.

There is no pipeline to build. No chunking strategy, no embeddings, no vector database to host, no re-indexing job to schedule. You concatenate the text and send it.
The model sees all of it. Nothing was filtered out by a retrieval step that might have dropped the one passage you needed. The full document is right there in front of the model.
It is the shortest path to a working answer. For a single file and a one-off question, pasting it in is genuinely the right move.

When the simplest thing works, do the simplest thing. The catch is that "paste everything" stops being the simplest thing the moment your use case grows past one document and one question.

Why it often loses

Cost scales with every query. On a cloud API you pay per input token, on every request. If your corpus is 300K tokens and you ask forty questions a day, you re-read those 300K tokens forty times, which is 12 million input tokens a day, whether or not the answer lived in a single paragraph. Retrieving only the relevant section cuts that to a few thousand tokens per query, roughly a hundredfold reduction. The difference compounds quickly.
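The arithmetic is worth running once with your own numbers. Here is a minimal sketch that uses the 300K-token corpus and forty daily questions from the example above. The 3,000-token retrieved slice and the price per million input tokens are illustrative assumptions, not any provider's actual rate.

```python
def daily_input_tokens(tokens_per_query: int, queries_per_day: int) -> int:
    """Input tokens sent per day when every query carries the same prompt size."""
    return tokens_per_query * queries_per_day


def daily_input_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD for a number of input tokens at a flat per-million rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


CORPUS_TOKENS = 300_000    # from the example above
QUERIES_PER_DAY = 40       # from the example above
RETRIEVED_TOKENS = 3_000   # assumption: stands in for "a few thousand tokens"
USD_PER_MILLION = 3.00     # assumption: placeholder price, check your provider

full = daily_input_tokens(CORPUS_TOKENS, QUERIES_PER_DAY)
sliced = daily_input_tokens(RETRIEVED_TOKENS, QUERIES_PER_DAY)

print(f"full context: {full:,} tokens/day, ${daily_input_cost(full, USD_PER_MILLION):.2f}")
print(f"retrieval:    {sliced:,} tokens/day, ${daily_input_cost(sliced, USD_PER_MILLION):.2f}")
print(f"ratio:        {full // sliced}x")
```

The sketch counts input tokens only. It ignores output tokens and any caching discount a provider may offer, so treat the ratio as a rough guide rather than a bill.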

Latency scales too. A model has to process the entire input before it produces the first token of output: the prefill step. More input means a longer wait. On a cloud endpoint that is noticeable; on local hardware, where you trade rented GPUs for your own, prefilling a quarter-million tokens for every question can be the difference between a snappy tool and one nobody wants to use.
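A back-of-the-envelope estimate makes the wait concrete. The sketch below assumes prefill time grows linearly with input length and uses a made-up local prefill throughput; both are simplifying assumptions. Real prefill time typically grows faster than linearly at long lengths, so read the result as an optimistic figure and measure on your own hardware.

```python
def prefill_seconds(input_tokens: int, prefill_tokens_per_second: float) -> float:
    """Rough time before the first output token, assuming linear prefill cost."""
    if prefill_tokens_per_second <= 0:
        raise ValueError("prefill_tokens_per_second must be positive")
    return input_tokens / prefill_tokens_per_second


# Assumption: hypothetical prefill throughput for a local machine.
LOCAL_PREFILL_TPS = 2_000.0

for label, tokens in [("full context", 250_000), ("retrieved slice", 3_000)]:
    seconds = prefill_seconds(tokens, LOCAL_PREFILL_TPS)
    print(f"{label}: ~{seconds:.1f} s to first token")
```

Under these assumed numbers the quarter-million-token prompt waits about two minutes before the first token, while the retrieved slice waits under two seconds.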

Lost in the middle. This is the subtle one. Models do not attend uniformly across a long context. A well-documented finding, often called the "lost in the middle" effect, is that information buried in the middle of a very long input is recalled less reliably than the same information near the beginning or end. The practical consequence: the usable portion of a long context window is usually smaller than the headline number. A 200K window does not guarantee 200K tokens of equal attention. Context engineering treats this as a first-class design constraint rather than a footnote.

No provenance, no determinism. When you dump an entire corpus into the prompt and get an answer back, you often cannot say which passage grounded it. That matters for two reasons. First, citations: regulated work (legal, medical, financial) needs to point at the exact source text, and a full-context dump makes that hard to reconstruct. Second, counting: if you ask "how many invoices exceed the threshold?", a model skimming a giant context may return an approximate number rather than an exact one. Targeted retrieval narrows the input to the relevant records, so ordinary code can produce a deterministic count over them. (The deeper version of this argument, where the model finds the structure and code does the counting, is its own topic.)
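The counting half of that argument can be sketched in a few lines. The example assumes the invoice records have already been extracted into structured form. The `Invoice` class, its field names, and the sample records are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    invoice_id: str   # hypothetical field names, not from any real schema
    source: str       # where the record came from, kept for citation
    amount: Decimal   # Decimal avoids float rounding on currency values


def invoices_over(invoices: list[Invoice], threshold: Decimal) -> list[Invoice]:
    """Return every invoice strictly above the threshold, in input order."""
    return [inv for inv in invoices if inv.amount > threshold]


# Made-up records standing in for what a retrieval step would return.
records = [
    Invoice("INV-001", "invoices.md#inv-001", Decimal("950.00")),
    Invoice("INV-002", "invoices.md#inv-002", Decimal("1200.50")),
    Invoice("INV-003", "invoices.md#inv-003", Decimal("1000.00")),
    Invoice("INV-004", "invoices.md#inv-004", Decimal("4300.00")),
]

hits = invoices_over(records, Decimal("1000.00"))
print(len(hits))  # exact count, identical on every run
for inv in hits:
    print(inv.invoice_id, inv.amount, inv.source)
```

Because the comparison runs in code, the count is the same on every run, and each hit carries the source pointer needed for a citation.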

When long context genuinely wins

Being fair to the other side: there are real cases where pasting it all is the right call.

A single, medium-sized document. One contract, one paper, one report that comfortably fits the window. Building retrieval infrastructure for it would be over-engineering.
Global synthesis. "Summarize the argument of this whole book," "find every place these two chapters contradict each other," "what is the overall tone?" These need the model to hold the entire document at once. Retrieval, which by design fetches parts, is the wrong tool here.
One-off questions. If you will ask this once and never again, the cost and latency of a full-context read are irrelevant. There is nothing to amortize a pipeline against.

Long context is a genuine capability, not a trap. The mistake is treating it as a substitute for retrieval rather than a complement to it.

When retrieval wins

The balance flips as soon as scale or repetition enters the picture.

Large or growing corpora. A large knowledge base will not fit in any window, and even when it nearly does, lost-in-the-middle erodes the benefit.
Repeated queries. Ask the same corpus a hundred questions and re-reading it a hundred times is pure waste. Retrieve the relevant slice each time instead.
Exact, cited, auditable answers. When you need to show which sentence supports a claim, retrieval gives you provenance by construction.
Bounded, structured knowledge. A converted manual, a policy set, a product spec, all organized so a targeted search lands on the right section every time. This is the regime where retrieval without a vector database shines: grep and direct reads over plain files, no index to maintain.
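To make "grep and direct reads over plain files" concrete, here is a minimal sketch of a search over a folder of markdown files, with no index involved. It illustrates the general idea and is not any particular tool's implementation. The `Hit` type, the `grep_markdown` function, and the two demo files are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import NamedTuple


class Hit(NamedTuple):
    path: str      # file path relative to the search root
    heading: str   # nearest markdown heading above the match ("" if none)
    line_no: int   # 1-based line number, usable as a citation
    text: str      # the matching line, stripped of surrounding whitespace


def grep_markdown(root: Path, pattern: str) -> list[Hit]:
    """Case-insensitive regex search over *.md files, tracking the enclosing heading."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[Hit] = []
    for path in sorted(root.rglob("*.md")):
        heading = ""
        lines = path.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, 1):
            # Simplification: any line starting with "#" counts as a heading.
            if line.startswith("#"):
                heading = line.lstrip("#").strip()
            if regex.search(line):
                rel = str(path.relative_to(root))
                hits.append(Hit(rel, heading, line_no, line.strip()))
    return hits


# Demo corpus: two made-up concept files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "warranty.md").write_text(
        "# Warranty\n\n## Coverage period\nThe warranty lasts 24 months.\n",
        encoding="utf-8",
    )
    (root / "returns.md").write_text(
        "# Returns\n\n## Time limit\nReturns are accepted within 30 days.\n",
        encoding="utf-8",
    )
    results = grep_markdown(root, r"\bmonths\b")
    for hit in results:
        print(f"{hit.path}:{hit.line_no} [{hit.heading}] {hit.text}")
```

Each hit carries a file, a heading, and a line number, so the passage handed to the model arrives with its citation attached. Plain pattern matching only finds what the wording literally contains, which is why this approach suits a bounded, consistently structured corpus.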

pdf2okf's take

pdf2okf sits squarely on the "use the window deliberately" side. It converts a PDF into OKF-compatible markdown (structured concept files an agent can search by filename and heading) and feeds the model only the concepts a question actually touches, along with the relevant figures. There is no vector database; retrieval is just grep over plain files. (OKF is Google's Open Knowledge Format standard; pdf2okf is compatible with it, not its author.)

The result combines the strengths of both approaches. Answers stay exact and cited because the model is grounded in specific passages, not a probabilistic blur of the whole corpus. You do not pay to re-read everything on every query. And because retrieval is targeted, the context window is spent on signal, not filler. That is how you limit lost-in-the-middle: a short, focused context leaves little middle for a passage to get lost in. When a query genuinely needs to traverse several concepts, agentic retrieval reads them one after another and synthesizes.

Long context and retrieval are not rivals. Long context is the room; retrieval decides what to put in it. The lesson of 2026 is not "the window is big enough, stop retrieving." It is "the window is big enough that how you fill it is now the whole game." A bigger context window is a reason to be more deliberate about what goes in, not less. pdf2okf uses it deliberately, never as a dumping ground.