The hidden cost of RAG: re-embedding, hosting, and token bills

The pitch vs. the invoice

The pitch for cloud-based RAG is simple: upload your documents, ask questions, get answers. The operational reality is messier. Three cost centers are consistently undersold in vendor documentation, conference talks, and the first excited blog posts. They appear later: in infrastructure invoices, in engineering hours, in API bills that grow faster than the project's value.

Cost center 1: hosting the vector database

Every RAG pipeline needs somewhere to store vectors. In a self-hosted setup that means running a Chroma, Weaviate, Milvus, or Qdrant instance: a server or cluster, persistent storage, backups, uptime monitoring, and someone who knows how to upgrade it when the schema changes.

In a managed cloud setup, pricing is typically per stored vector, per query, or per index size, with a free tier that disappears the moment the index grows beyond proof-of-concept scale.

Either path carries ongoing cost: money for the managed service or engineering time for the self-hosted one. Unlike a database that holds your source of truth, a vector index holds a derived representation of it. When the source changes, the index must change too. The hosting cost does not go away when the project stabilizes. It recurs for as long as the index exists, and it grows as the corpus grows.
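
To get a feel for how index size drives that bill, here is a rough sketch of the raw storage behind a vector index. It assumes 32-bit floats and ignores index structures, metadata, and replication, all of which add to the total. The corpus size and dimension count are illustrative, not measurements from any real deployment.

```python
def raw_index_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors alone, before index overhead or metadata.

    bytes_per_value defaults to 4, i.e. one 32-bit float per dimension (an assumption).
    """
    return num_vectors * dimensions * bytes_per_value


# Illustrative numbers only: one million chunks embedded at 1,536 dimensions.
size = raw_index_bytes(1_000_000, 1536)
print(f"{size / 1e9:.2f} GB")  # prints "6.14 GB"
```

Whatever the pricing unit (stored vectors, queries, or index size), this is the number that moves when the corpus grows.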

Cost center 2: re-embedding whenever documents change

A vector index is a snapshot in time. The moment a source document changes (a policy updated, a product spec revised, a clause corrected), the affected chunks need to be re-chunked, re-embedded, and re-upserted into the index.

This creates a maintenance obligation. In practice it means a scheduled pipeline: detect changed documents, re-process them, push the updates. Missing a cycle creates silent staleness: the index answers queries using the old version of a document while the source says something different. Most setups have no built-in drift warning. You discover the problem only when the model gives a confidently wrong answer.
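
The "detect changed documents" step can be sketched as a content-hash comparison against a manifest saved at the last sync. This is a minimal illustration, not a complete pipeline: the manifest file, the function names, and the directory layout are all hypothetical, and the re-chunk, re-embed, and upsert steps are deliberately left out.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes; any edit to the file changes the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changes(source_dir: Path, manifest_path: Path):
    """Compare current files against the digests recorded at the last sync.

    Returns (changed, deleted, current):
      changed - new or modified files that need re-chunking and re-embedding
      deleted - files that disappeared, whose vectors should be removed
      current - the fresh digest map, to be saved once the sync succeeds
    The manifest is assumed to live outside source_dir.
    """
    previous = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {
        str(p.relative_to(source_dir)): file_digest(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in current.items() if previous.get(name) != digest]
    deleted = [name for name in previous if name not in current]
    return changed, deleted, current


def save_manifest(manifest_path: Path, current: dict) -> None:
    """Record digests only after the re-embed and upsert step has succeeded."""
    manifest_path.write_text(json.dumps(current, indent=2))
```

Saving the manifest last is the design choice that matters: if the upsert fails midway, the next run still sees the files as changed instead of silently skipping them.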

The re-embedding cost accumulates over time. Each individual embedding API call or local GPU cycle is small, but for a living knowledge base (documentation that is actively maintained, policies that update on regulatory timelines, product content that changes frequently), the cumulative cost is non-trivial, both in compute and in the engineering time needed to keep the pipeline reliable.

Cost center 3: tokens re-sent on every query

A vector retrieval system sends the retrieved chunks to the language model on every query. The size of those chunks multiplies directly against query volume. Even with deduplication, the same background context gets repackaged and transmitted repeatedly.

For a bounded knowledge base (a product FAQ, a technical specification, a document set that answers a defined class of questions), most of this is redundant. The relevant background stays largely the same while only the question changes. Every query pays the token cost of re-delivering context the model does not need to receive again.

At low query volumes this is invisible. At production scale (hundreds of queries per day, each pulling several chunks of a few hundred tokens), the token bill becomes a meaningful recurring expense. Teams frequently underestimate it when projecting what a RAG system will cost in production.
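
The arithmetic is simple enough to sketch. The figures below are illustrative values picked from the ranges just described, and the price per million tokens is a placeholder to be replaced with your provider's actual input-token rate, not a quote from any vendor.

```python
def monthly_context_tokens(queries_per_day: int, chunks_per_query: int,
                           tokens_per_chunk: int, days: int = 30) -> int:
    """Tokens spent per month on retrieved context alone (questions and answers excluded)."""
    return queries_per_day * chunks_per_query * tokens_per_chunk * days


def monthly_context_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Convert a token count into a bill; the price must be supplied by the reader."""
    return tokens / 1_000_000 * price_per_million_tokens


# Illustrative: 500 queries per day, 5 chunks per query, 400 tokens per chunk.
tokens = monthly_context_tokens(500, 5, 400)
print(tokens)  # prints 30000000

# Placeholder price of 1.00 per million input tokens (hypothetical).
print(monthly_context_cost(tokens, price_per_million_tokens=1.00))  # prints 30.0
```

Every factor is a straight multiplier, so doubling the chunk count or the query volume doubles the context bill.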

What the OKF approach removes

Grep over a structured OKF bundle eliminates the first two cost centers and shrinks the third.

No database to host. An OKF bundle is a directory of plain markdown files. It lives on a filesystem. An agent reads it with standard file I/O. No database process, no managed service, no index to back up or migrate.

No re-embedding on change. When a concept file changes, it changes. The agent reads the current version on the next query. There is no pipeline to trigger, no re-indexing job to schedule, no sync to verify.

Fewer tokens per query. An agent searching an OKF bundle reads only the concept files the query actually requires: typically a small number of focused files, not a fixed top-k batch of chunks pulled on every query. For a bounded, structured corpus, navigating by heading and filename can be significantly more token-efficient than top-k retrieval.
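
As a rough illustration of navigating by filename and heading, the sketch below scans a directory of markdown files and returns those whose name or heading lines mention a search term. It assumes only what is stated above (a bundle is a directory of plain markdown files); the function name and matching rule are hypothetical, and a real agent would more likely call grep or a similar tool directly.

```python
from pathlib import Path


def find_concept_files(bundle_dir: Path, term: str) -> list[Path]:
    """Return markdown files whose filename or any heading line mentions the term."""
    needle = term.lower()
    matches = []
    for path in sorted(bundle_dir.rglob("*.md")):
        # A filename match is enough; no need to open the file.
        if needle in path.stem.lower():
            matches.append(path)
            continue
        # Otherwise look only at heading lines (those starting with '#').
        text = path.read_text(encoding="utf-8")
        headings = (line for line in text.splitlines() if line.lstrip().startswith("#"))
        if any(needle in heading.lower() for heading in headings):
            matches.append(path)
    return matches
```

The agent then reads only the files this returns. Nothing is pre-computed, so an edited file is picked up on the very next search.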

The approach trades away something real: at very large scales (many thousands of heterogeneous, unstructured documents), navigating by filename and heading costs more compute than a pre-built similarity index. That trade-off is genuine and worth understanding. For a bounded knowledge base, it rarely applies.

A different architecture, not a shortcut

The three hidden costs of RAG are not inevitable features of AI-assisted document retrieval. They are the cost of one specific architectural choice: pre-indexing content into an opaque numeric representation stored in a dedicated database.

A different architecture (structured plain-text concept files navigated by an agent directly) removes them by design. The choice is not about avoiding work. It is about where the work is done and who pays for it over time.

pdf2okf converts PDFs into OKF-compatible bundles: structured markdown concept files that any agent can read directly. No embedding pipeline to run, no vector database to provision, no re-indexing job to maintain. The converted knowledge stays lean, readable, and free of infrastructure overhead. The hosting and re-embedding costs that usually follow a RAG deployment do not appear on the invoice, and the per-query token bill shrinks.