RAG without a vector database: grep-based retrieval

What classic RAG actually does

Classic Retrieval-Augmented Generation (RAG) follows a five-step pipeline. Take a source document, split it into chunks, run each chunk through an embedding model to produce a dense numeric vector, store those vectors in a dedicated vector database (Pinecone, Weaviate, Chroma, Qdrant), then at query time embed the incoming question and retrieve the top-k most similar vectors. Those chunks land in the LLM prompt. The model answers.
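
To make the pipeline concrete, here is a minimal sketch of it in plain Python. It is an illustration under loud assumptions, not how any particular vector database works: the `embed` function is a toy bag-of-words counter standing in for a real embedding model (which would return a dense vector), the "database" is an in-memory list, and every function name is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector.
    A real model would return a dense float vector instead."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(document: str, size: int = 40) -> list[tuple[str, Counter]]:
    """Chunk, embed, and store. The 'store' here is just a list in memory."""
    return [(c, embed(c)) for c in chunk(document, size)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(chunks: list[str], question: str) -> str:
    """Place the retrieved chunks in the prompt that would go to the LLM."""
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Note that `retrieve` always returns `k` chunks when the index holds at least that many, even if nothing in the index resembles the question.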

It works. But the seams show.

The operational tax nobody mentions

Every step beyond "ask a question, get an answer" has a cost that vendors undersell.

A vector database to run. Whether cloud-managed or self-deployed, it is another moving part: credentials, availability SLAs, pricing that scales with index size, schema migrations when your chunk strategy changes.

Re-embedding on every update. A document changes? Re-chunk, re-embed, upsert into the index. Miss a cycle and your index silently drifts from the source. There is no "out of sync" warning in most setups. You schedule re-indexing jobs or you fly blind.

Opaque failure modes. A numeric vector carries no semantics a human can read. When retrieval goes wrong (on short queries, domain jargon, or documents that discuss the same concept in different phrasing), debugging means comparing cosine-similarity scores. The failure is invisible until the answer is wrong.

Token inefficiency at the margins. Top-k retrieval fetches the same number of chunks regardless of query complexity. A simple yes/no lookup gets the same chunk-dump as a multi-hop reasoning task.
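
The silent drift described under re-embedding can at least be made visible. One way to do that, sketched here as an assumption rather than a feature of any vector database, is to record a content hash for each document at embed time and compare it against the current source. The names `fingerprint` and `stale_documents` are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_documents(sources: dict[str, str], indexed: dict[str, str]) -> list[str]:
    """Return ids of documents that need re-embedding.

    sources: document id -> current text
    indexed: document id -> fingerprint recorded when the document was last embedded
    A document is stale if it was never indexed or its text has changed since.
    """
    return sorted(
        doc_id
        for doc_id, text in sources.items()
        if indexed.get(doc_id) != fingerprint(text)
    )
```

This is one more job to build and schedule, which is the point: the warning does not exist until you write it.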

The alternative: let the agent grep

There is a fundamentally different model. Instead of pre-processing documents into numeric representations, you keep them as plain markdown concept files and let the agent search them directly, with grep, glob, and targeted reads.

The agent behaves like an experienced developer reading a codebase: it scans filenames and headings, reads the relevant section, follows a cross-reference if needed, and stops when it has enough context. It does not retrieve a fixed-k blob of possibly-irrelevant text. It retrieves what the query actually needs.
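
The scan, search, and targeted-read moves described above can be sketched with nothing but the standard library. This is a rough illustration, not the implementation of any specific agent: the function names are hypothetical, and it assumes concept files are `.md` files whose headings start with `#` (a `#` line inside a code fence would be misread as a heading).

```python
import re
from pathlib import Path


def list_headings(root: Path) -> dict[str, list[str]]:
    """Scan: map each markdown file (path relative to root) to its heading lines."""
    out = {}
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        out[path.relative_to(root).as_posix()] = [ln for ln in lines if ln.startswith("#")]
    return out


def grep(root: Path, pattern: str) -> list[tuple[str, int, str]]:
    """Search: return (file, 1-based line number, line) for each case-insensitive regex match."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(root.rglob("*.md")):
        for n, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
            if rx.search(line):
                hits.append((path.relative_to(root).as_posix(), n, line))
    return hits


def read_section(path: Path, line_no: int) -> str:
    """Targeted read: return the section around line_no, from the nearest
    heading at or above it down to (not including) the next heading."""
    lines = path.read_text(encoding="utf-8").splitlines()
    start = line_no - 1
    while start > 0 and not lines[start].startswith("#"):
        start -= 1
    end = line_no
    while end < len(lines) and not lines[end].startswith("#"):
        end += 1
    return "\n".join(lines[start:end]).strip()
```

A typical round would call `grep` with a term from the question, then pass the first hit's file and line number to `read_section`, so only that one section enters the context window.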

The advantages are concrete:

Nothing to host. The knowledge lives in files. Files are portable, versionable, copyable.

Nothing to re-embed. Edit the markdown, save. The agent sees the change on the next query. No pipeline to trigger.

Human-readable at every layer. A concept file is just a file. You can open it, audit it, diff it in git.

Fewer tokens for bounded corpora. On a structured, finite knowledge base, an agent navigates to the right file rather than pulling a flat chunk-dump. The context window stays lean.

The 2026 shift that legitimized this approach

Two public signals made grep-based retrieval harder to dismiss in early 2026.

Anthropic described how Claude Code operates internally: it dropped an embeddings and vector-DB pipeline in favor of agentic file search using grep, glob, and direct file reads. The tool that edits large codebases for millions of developers uses no vector index. It navigates.

Around the same time, LlamaIndex published a reframing of the standard RAG architecture: naive top-k vector retrieval (retrieve once, answer once) is the wrong unit of computation for complex queries. An agent that searches iteratively and refines what it reads before answering outperforms a single similarity lookup across a wide range of tasks. The phrase "RAG is dead, long live agentic retrieval" circulated widely because it named something practitioners had already noticed.

Neither claim means vector databases are obsolete. Both suggest that the default assumption has shifted: indexing-first is no longer self-evidently correct.
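
The difference between "retrieve once, answer once" and iterative retrieval is easiest to see as a loop. The sketch below is an assumption-laden illustration, not LlamaIndex's API or Claude Code's internals: `search` and `next_query` are hypothetical callables, and in a real agent `next_query` would be the LLM deciding what to look for next.

```python
from typing import Callable, Optional


def iterative_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_rounds: int = 4,
) -> list[str]:
    """Search, inspect the evidence, refine the query, and repeat.

    Stops when next_query returns None (enough context) or after
    max_rounds searches, whichever comes first.
    """
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_rounds):
        if query is None:
            break
        for passage in search(query):
            if passage not in evidence:  # keep each passage once
                evidence.append(passage)
        # Stand-in for the model deciding what is still missing.
        query = next_query(question, evidence)
    return evidence
```

A single similarity lookup is the special case `max_rounds=1`. The `max_rounds` cap is a design choice: without it, a refiner that never returns `None` would search forever.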

When vector search still wins

Embedding-free RAG has an honest limit. At very large scale (millions of heterogeneous, unstructured documents), it burns more tokens and is slower than a pre-built index. Milvus, Weaviate, and their peers were built for exactly that regime. They earn their complexity at that scale.

But an OKF bundle is not that regime. An OKF bundle is a bounded, structured knowledge base: one document or one knowledge domain, organized into concept files with consistent frontmatter. At this scale, structure beats embeddings. The agent navigates by filename and heading, not by cosine distance. Each search is an exact string or pattern match, so the same grep over the same files returns the same lines, and every step the agent took can be audited.
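
Consistent frontmatter is what makes a bundle cheap to navigate: an agent can build a catalog of every concept file without reading a single body. The sketch below assumes frontmatter is a block of simple `key: value` lines between two `---` markers at the top of each file. That layout and the field names in the usage note are assumptions for illustration, not taken from the OKF format itself, and the function names are hypothetical.

```python
from pathlib import Path


def read_frontmatter(path: Path) -> dict[str, str]:
    """Parse simple `key: value` frontmatter delimited by `---` lines.

    Returns an empty dict if the file has no frontmatter block or the
    block is never closed. Nested or multi-line values are not handled.
    """
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # opening marker without a closing one


def catalog(root: Path) -> dict[str, dict[str, str]]:
    """Map each markdown file (path relative to root) to its frontmatter fields."""
    return {
        p.relative_to(root).as_posix(): read_frontmatter(p)
        for p in sorted(root.rglob("*.md"))
    }
```

Given a file that begins with a hypothetical `title: Refunds` field, `catalog` lets the agent pick a file by its metadata and then read only that file.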

Not every corpus needs to be a haystack. A well-structured bundle is a feature, not a limitation. Scale is not always a virtue.

This is exactly what pdf2okf does

pdf2okf converts a PDF into an OKF-compatible bundle of structured markdown concept files. An agent (Claude Code, Cursor, Hermes Agent, Odysseus, OpenClaw, or any MCP-capable tool) greps that bundle directly. No vector database. No embedding pipeline. No re-indexing job to schedule.

The PDF becomes something an agent can read the way a developer reads a repository: directly, precisely, without an intermediary index standing between the question and the answer.

That is embedding-free RAG. In 2026, it is not a workaround. It is the approach Anthropic's own flagship developer tool chose.