OKF vs. a vector database: two ways to give an AI your documents
Two ways to give an AI your documents
The challenge is the same in both cases: a language model has a fixed context window. Most documents will not fit. You need a way to put the right information in front of the model at query time. A vector database and an OKF bundle are two fundamentally different answers to that problem.
How a vector database works
A vector database pipeline has five steps. First, split the source documents into chunks: overlapping text segments of a few hundred tokens each. Second, run each chunk through an embedding model; this converts the text into a dense numeric vector, a list of hundreds or thousands of floating-point numbers. Third, store those vectors in a specialized database: Pinecone, Weaviate, Chroma, Qdrant, or a self-hosted Milvus. Fourth, at query time, embed the incoming question the same way and run a similarity search: find the top-k vectors closest to the query vector in high-dimensional space. Fifth, those chunks land in the LLM prompt and the model answers.
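The five steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy word-hashing function standing in for a real embedding model, the "database" is an in-memory list rather than Pinecone or Qdrant, and chunks are measured in words instead of tokens. All function names are hypothetical.

```python
import hashlib
import math
import re

DIM = 64  # toy dimension; real embedding models emit hundreds or thousands of values


def chunk(text, size=8, overlap=2):
    """Step 1: split text into overlapping chunks of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Step 2 (toy stand-in): hash each word into a bucket, then L2-normalise."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity; a plain dot product because both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def build_index(documents):
    """Step 3: store (vector, chunk) pairs; a list plays the role of the database."""
    return [(embed(piece), piece) for doc in documents for piece in chunk(doc)]


def top_k(index, question, k=2):
    """Step 4: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [piece for _, piece in ranked[:k]]


def build_prompt(index, question, k=2):
    """Step 5: place the retrieved chunks in front of the question."""
    context = "\n---\n".join(top_k(index, question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Even in this toy version, the index holds nothing a person can read: to see why a chunk was retrieved, you have to recompute similarity scores.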
The system works at scale. The problem is what it costs to operate.
The vectors are opaque. A sequence of floating-point numbers encodes the meaning of "force majeure clause" in a way no human (and no audit tool) can read directly. When retrieval goes wrong, debugging means comparing cosine-similarity scores.
You host the database. Whether it is a managed cloud service or a self-run container, it is a stateful system: credentials, backups, scaling, schema migrations.
You re-embed on every change. When a document changes, you re-chunk it, re-embed it, and upsert the result into the index. Miss a cycle and the index silently drifts from the source documents, with no out-of-sync warning.
You cannot share the index easily. Vectors are model-specific. A different embedding model means a different index. A collaborator needs the source documents and a running database, not just the knowledge.
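To make the re-embedding cost concrete, here is a minimal sketch of the bookkeeping needed just to notice drift. It assumes you keep a content hash next to every indexed document; the function names and the dictionary layout are hypothetical, not part of any particular vector database's API.

```python
import hashlib


def fingerprint(text):
    """Content hash stored alongside each document when it is embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_documents(sources, indexed_fingerprints):
    """Compare current sources against what the index was built from.

    sources: {doc_id: current text}
    indexed_fingerprints: {doc_id: fingerprint recorded at embedding time}
    Returns (ids to re-chunk and re-embed, ids to delete from the index).
    """
    to_embed = sorted(
        doc_id for doc_id, text in sources.items()
        if indexed_fingerprints.get(doc_id) != fingerprint(text)
    )
    to_delete = sorted(set(indexed_fingerprints) - set(sources))
    return to_embed, to_delete
```

Nothing in the index itself performs this check; it is extra state you have to build and keep in step with the documents.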
How an OKF bundle works
An OKF bundle (following the format Google published as the Open Knowledge Format on 2026-06-12) stores knowledge as plain markdown concept files with YAML frontmatter. There are no vectors. There is no database. There is a directory of text files you can open in any editor.
At query time, an agent greps the bundle: scan filenames and headings, read the matching concept files, follow cross-references if needed. The agent navigates the way a developer reads a codebase: directly, with file I/O, without an opaque intermediary index.
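The navigation described above can be sketched with nothing but file I/O. The bundle layout here is an assumption made for illustration: the file names, the `title` frontmatter field, and the use of relative markdown links as cross-references are hypothetical and not taken from the OKF specification.

```python
import re
import tempfile
from pathlib import Path

HEADING = re.compile(r"^#{1,6}[ \t]+(.*)$", re.MULTILINE)
LINK = re.compile(r"\]\(([^)#]+\.md)\)")  # relative markdown links to other concept files


def make_demo_bundle():
    """Write a tiny, hypothetical bundle to a temp directory and return its path."""
    root = Path(tempfile.mkdtemp())
    (root / "force-majeure.md").write_text(
        "---\ntitle: Force majeure\n---\n"
        "# Force majeure\nSee [Termination](termination.md).\n",
        encoding="utf-8",
    )
    (root / "termination.md").write_text(
        "# Termination\nNotice period is 30 days.\n", encoding="utf-8"
    )
    (root / "payment.md").write_text("# Payment terms\n", encoding="utf-8")
    return root


def find_concepts(bundle, term):
    """Return concept files whose filename or any heading mentions `term`."""
    term = term.lower()
    hits = []
    for path in sorted(Path(bundle).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        headings = " ".join(HEADING.findall(text)).lower()
        if term in path.stem.lower() or term in headings:
            hits.append(path)
    return hits


def read_with_references(path, depth=1):
    """Read a concept file plus the files it links to, up to `depth` hops away."""
    seen = {}
    queue = [(Path(path).resolve(), 0)]
    while queue:
        current, hops = queue.pop(0)
        if current in seen or not current.is_file():
            continue
        seen[current] = current.read_text(encoding="utf-8")
        if hops < depth:
            for target in LINK.findall(seen[current]):
                queue.append(((current.parent / target).resolve(), hops + 1))
    return seen
```

For example, `find_concepts(make_demo_bundle(), "force majeure")` matches on the heading, and `read_with_references` on that hit also pulls in `termination.md`. Every step is a filename, a heading, or a link that a person can inspect.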
pdf2okf produces OKF-compatible bundles. The knowledge is human-readable, version-controlled in git, portable as a zip file, and shareable without any database or embedding model.
Side-by-side comparison
| Property | Vector database | OKF bundle |
|---|---|---|
| Storage format | Opaque numeric vectors | Plain markdown + frontmatter |
| Human-readable | No | Yes, open in any editor |
| Hosting required | Yes, DB process or managed service | No, a directory of files |
| Re-embedding on change | Yes, mandatory pipeline step | No, edit and save |
| Portable / shareable | Model-specific, DB-specific | Any agent, any machine |
| Version control | External / manual | Native (git diff works) |
| Vendor lock-in | High, tied to DB and embedding model | None, plain files |
| Debugging retrieval | Similarity scores | Read the file directly |
When a vector database still makes sense
Vector databases earn their complexity at genuine scale: millions of heterogeneous, unstructured documents, where the corpus is too large to navigate by filename and heading. A medical literature archive, a legal discovery set spanning hundreds of thousands of contracts, a multilingual customer-support history: these are the regimes where embedding-based similarity search justifies its operational overhead.
If the corpus is bounded, structured, and owned (a product specification, a policy manual, a technical knowledge base, a curated set of research documents), structure beats embeddings. You do not need a haystack search when the knowledge is already organized. At this scale, an agent navigating concept files by heading is faster, cheaper, and more exact than cosine similarity over a probabilistic index.
Scale is not always a virtue. A well-structured bundle is a feature, not a limitation.
Where pdf2okf fits
pdf2okf converts PDFs into OKF-compatible bundles: structured markdown concept files that any agent can grep directly. No vector database to run, no embedding pipeline to maintain, no index to re-sync when source documents change. The knowledge stays legible, portable, and fully under your control.
That is not a workaround. It is the right architectural choice for bounded, sovereign knowledge, and the approach that Anthropic's own flagship developer tool, Claude Code, adopted when it dropped vector search in favor of direct file navigation.