pdf2okf·

Wiki

What is the Open Knowledge Format (OKF)?

A standard for AI knowledge: published, not invented here

On 2026-06-12, Google Cloud published the Open Knowledge Format (OKF) v0.1: an open, vendor-neutral specification for the knowledge an AI model or agent reads. The surprising part is how little there is to it. An OKF knowledge base is a directory of Markdown files with YAML frontmatter. That's the whole substrate. No proprietary account, no SDK, no managed runtime that has to be alive for your data to mean anything. If you can open a folder of text files, you can read an OKF knowledge base. So can an agent.

That modesty is the design. The format is human- and agent-readable by the same mechanism (plain prose with a little structure on top), so the people who write the knowledge and the models that consume it are looking at the same artifact.

What the spec actually says

OKF is small enough to hold in your head. A knowledge base is a folder; each concept is a Markdown file; each file opens with a YAML frontmatter block describing it.

  • type is the one required field. It tells a reader, human or agent, what kind of thing the file represents.
  • title, description, tags, and timestamp are recommended: enough metadata to navigate, filter, and order a corpus without inventing a schema per project.
  • Two filenames are reserved: index.md as the entry point to a knowledge base, and log.md for an append-only record of changes.

There is no database, no index server, no binary blob. The structure is the filesystem; the metadata is text you can read. Validation is "does the frontmatter parse, and is type present," not "did the vendor's service accept it."

Why this shape, and why it feels familiar

OKF didn't arrive from nowhere. It is intentionally close to patterns developers already trust: the LLM-wiki repositories teams keep for their agents, and personal Markdown knowledge bases like Obsidian. The bet is that the right container for machine-read knowledge looks a lot like the container that already works for human-read knowledge (Markdown plus frontmatter), and that standardizing it beats every product inventing its own house format.

That familiarity is what makes the standard useful rather than ceremonial. Nobody has to learn a new mental model; they have to agree on a handful of field names.

A format is not a platform

The reason this matters is ownership, and it's easiest to see by contrast. The default way to feed documents to an AI is to chunk them, embed the chunks into vectors, and load those vectors into a vector database. That works, but look at what you end up holding: an opaque numeric index, tied to the embedding model that produced it, hosted by something that has to keep running, and unreadable to a human. Change the embedding model and you re-embed everything. Move providers and you migrate or rebuild. The knowledge isn't yours in any portable sense. It's trapped inside infrastructure.

A format inverts that. Because OKF is just files:

  • It's portable. Copy the folder; it works anywhere. No export job, no migration.
  • It's version-controlled. It lives in git. You can diff it, review a change, roll it back.
  • It has no lock-in. There's no account to keep, no runtime to keep alive, no vendor whose service is load-bearing for your own data to stay legible.

The knowledge is the files. If you can read a folder, you can read your knowledge base. If you can copy a folder, you can move it. That is a categorically different relationship to your own information than "it's in the vector DB."

Where pdf2okf fits

To be precise about credit: we didn't invent OKF. Google Cloud published it, and we're glad they did. A vendor-neutral, public spec is worth more to the whole field than any single product's private format. What pdf2okf does is produce OKF-compatible output.

Point pdf2okf at a PDF and it builds an OKF-compatible bundle of small, linked concept files: text, tables, and diagrams turned into explicit, greppable Markdown with frontmatter. It does this on your hardware, or against your own key: no page is uploaded to anyone. The standard defines the shape; pdf2okf is the sovereign, self-hosted way to get your documents into that shape. Compatibility with the standard is the point; ownership of the process is the difference.

From a format to a bundle

A format tells you what a single knowledge base looks like. The next question is what happens when you want to move one: version it, hand it to a colleague, commit it to a repo, give it to another agent. That's where our own extension comes in: the OKFZ, the portable, shareable knowledge bundle, built once from your PDF and shared without re-processing or a vector database in tow. The format makes your knowledge legible; the bundle makes it travel.

Sources

pdf2okf.com

Be there when it opens.

pdf2okf is in private build, self-hosted, sovereign. Leave an email and you'll be first in.