Context engineering: the discipline that contains RAG

Prompt engineering has a scope problem

Prompt engineering assumes the main variable is what you write in the text box. The real variable is larger: everything that enters the model's context window (documents, memory, tool outputs, conversation history, system instructions) and the decisions about when each piece arrives. That broader discipline has a name now: context engineering.

What context engineering actually means

A language model is, at its core, a function of its context window. It cannot see anything outside that window. Everything it knows at inference time (facts, constraints, examples, retrieved content) was placed there deliberately or accidentally. Context engineering is the deliberate side: deciding exactly what goes in, in what order, and when.

The term gained traction in the AI practitioner community around 2025, as teams moved beyond prompt templates toward systematic management of context composition. Andrej Karpathy, among others, endorsed it as a better framing than prompt engineering: a broader discipline that subsumes it rather than replaces it.

Prompt engineering is one lever inside context engineering. RAG is another. So are dynamic tool calling, memory summarization, conversation pruning, system prompt structuring, and retrieval-then-rerank. Context engineering is the meta-layer that decides which levers to pull and when. It is the answer to the question "what should the model know, and when should it know it?"
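To make the "meta-layer" idea concrete, here is a minimal sketch of one such decision: choosing which pieces of context fit into a fixed token budget. Everything in it is an assumption for illustration, including the `ContextPiece` type, the priority scheme, and the rough four-characters-per-token estimate. A real system would use the model's own tokenizer and a richer policy.

```python
from dataclasses import dataclass


@dataclass
class ContextPiece:
    source: str    # hypothetical label: "system", "memory", "retrieved", "history"
    text: str
    priority: int  # hypothetical scheme: lower number = more important


def estimate_tokens(text):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assemble_context(pieces, budget):
    """Keep the highest-priority pieces that fit within the token budget."""
    chosen = []
    used = 0
    # Visit pieces from most to least important, remembering original position.
    by_priority = sorted(enumerate(pieces), key=lambda pair: (pair[1].priority, pair[0]))
    for index, piece in by_priority:
        cost = estimate_tokens(piece.text)
        if used + cost <= budget:
            chosen.append((index, piece))
            used += cost
    # Restore the original order so the assembled prompt reads naturally.
    return [piece for _, piece in sorted(chosen, key=lambda pair: pair[0])]
```

With a tight budget, the long conversation history is the first thing dropped while the system instructions and retrieved content survive. That is one policy among many; the point is that the policy is written down and testable rather than implicit.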

Why RAG is a tool, not the discipline itself

RAG (Retrieval-Augmented Generation) solves one specific context problem: how to inject relevant external content into the prompt when the corpus is too large to include wholesale. It is a useful technique. It is not a complete solution to context quality.

RAG as typically implemented still makes coarse decisions. It retrieves a fixed number of chunks regardless of query complexity. It passes those chunks to the context in whatever order the similarity search produced. It has no built-in mechanism to detect that the wrong chunk was retrieved or that the right answer is in a section the embedding model ranked low.
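The coarse decisions described above show up clearly in a stripped-down retrieval step. This is a sketch of a generic top-k lookup, not any particular library's implementation. The vectors are hand-written toy values standing in for real embeddings, and `naive_top_k` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def naive_top_k(query_vec, chunks, k=3):
    """chunks is a list of (text, vector) pairs.

    Returns the k most similar texts, in similarity order.
    Note what is missing: k does not adapt to the query, the order is
    whatever the similarity scores produce, and nothing checks whether
    the returned chunks actually contain the answer.
    """
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

A simple lookup and a question that spans five sections both get exactly `k` chunks back. Any adaptivity has to be added around this function, which is the gap a context engineering view exposes.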

A context engineering lens makes these limitations visible. The question shifts from "did we do RAG?" to "does the model have exactly what it needs, and nothing it does not need?" Those are different questions with different answers.

Long-context degradation: why stuffing everything in fails

Longer context windows are genuinely useful. But they do not make context quality irrelevant. Liu et al. (2023), in "Lost in the Middle: How Language Models Use Long Contexts", and later studies have documented a recurring pattern: model performance degrades when the relevant information sits in the middle of a very long context. This effect is commonly called the "lost in the middle" problem. A model with a 128K context window is not uniformly attentive to all 128K tokens. Material near the beginning and end of the window tends to be used more reliably than material buried in the center.

This means loading the entire knowledge base into the context is not a strategy. It is a hope. If a relevant section lands in the middle of a 100K-token context alongside thousands of irrelevant tokens, the model may not weight it correctly. Targeted retrieval of the few right things tends to outperform a full-corpus dump, even when the corpus technically fits in the window.

Context quality is not just about what the model can see. It is about what the model can use.
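When several items genuinely must go into the context together, one commonly used mitigation for the position effect is to order them so the most relevant sit at the start and end. The sketch below assumes the items arrive already ranked, most relevant first. The function name `place_at_edges` is hypothetical, and this is a complement to targeted retrieval, not a substitute for it.

```python
def place_at_edges(ranked):
    """Reorder items ranked most-relevant-first.

    Items alternate between the front and the back of the result, so the
    two most relevant land at the two edges and the least relevant end up
    in the middle.
    """
    front, back = [], []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so relevance rises again toward the end.
    return front + back[::-1]
```

For five items ranked `a` through `e`, this yields `a, c, e, d, b`: the top two at the edges and the weakest in the center.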

Why grep over an OKF bundle is good context engineering

An OKF bundle (structured markdown concept files organized by topic) is built to support exact, targeted retrieval. An agent searching an OKF bundle does not retrieve chunks by cosine similarity. It reads filenames and headings, identifies the relevant concept file, and reads only that file.

The context the model receives is small and precise: the few concept files the answer actually requires, not a probabilistic selection of text fragments assembled by an embedding model trained on different data. This is context engineering applied concretely: put the right things in, leave the noise out.

The search step is also deterministic. For a given search term over a given bundle, a grep-style lookup returns the same concept files every time. The answer is auditable. You can inspect what went into the context and why.
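A grep-style lookup of this kind can be sketched in a few lines. The bundle layout here is an assumption, not the OKF specification: a directory of `.md` concept files whose headings start with `#`. The function names are hypothetical too. The sketch matches a search term against filenames and headings, reads only the matching files, and returns an audit list of what was used.

```python
from pathlib import Path


def find_concept_files(bundle_dir, term):
    """Return concept files whose filename or headings mention the term."""
    needle = term.lower()
    matches = []
    # Sorting the paths makes the result order stable across runs.
    for path in sorted(Path(bundle_dir).rglob("*.md")):
        # Simplification: any line starting with "#" is treated as a heading.
        headings = [
            line.lstrip("#").strip().lower()
            for line in path.read_text(encoding="utf-8").splitlines()
            if line.startswith("#")
        ]
        if needle in path.stem.lower() or any(needle in h for h in headings):
            matches.append(path)
    return matches


def build_context(bundle_dir, term):
    """Read only the matching files and record which ones were used."""
    files = find_concept_files(bundle_dir, term)
    context = "\n\n".join(p.read_text(encoding="utf-8") for p in files)
    audit = [p.relative_to(bundle_dir).as_posix() for p in files]
    return context, audit
```

Because the match is a plain substring test over a sorted file list, running it twice on the same bundle with the same term gives the same files in the same order, and the audit list states exactly which files entered the context.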

The contrast with top-k chunk retrieval matters: in a vector search, two semantically similar queries might retrieve different chunks, because small differences in wording shift the query embedding and change which chunks make the top-k cutoff. When an agent reads an OKF bundle, the relevant concept file is either there or it is not. The reasoning is transparent.

What this means for knowledge base design

The shift from "do RAG" to "practice context engineering" is a shift in what you optimize. Instead of tuning similarity thresholds or chunk sizes, you ask: is the knowledge organized so that a targeted search finds exactly what any reasonable question requires?

For a bounded, structured knowledge base (a product specification, a policy document, a technical manual converted from PDF), the answer is almost always yes, when the conversion is done well. The organizing work happens once, at conversion time, not on every query.

pdf2okf converts PDFs into OKF-compatible bundles designed for exactly this kind of retrieval. The structure produced during conversion (concept files, consistent headings, cross-references) is the context engineering infrastructure. An agent reading that bundle does not search a haystack. It opens the right section.