Read your documents from any agent: CLI & MCP

Any agent that runs shell commands can read your documents

Most document AI tools are silos. You upload a PDF to a cloud service, the service chunks it into a proprietary vector store, and you query it through that service's API, or not at all. You cannot point your own AI agent at those chunks. You cannot run queries offline. You cannot inspect what the retrieval layer actually saw before it answered.

pdf2okf breaks that pattern. The tool converts a PDF into an OKF-compatible bundle: a directory of plain Markdown files, one per concept, plus a lightweight metadata index. The bundle is nothing more than plain files on your local filesystem, and plain files are something every agent already knows how to read.

Build the bundle once, answer forever

The conversion step runs once:

pdf2okf convert my-document.pdf --output ./my-bundle/

The result is a directory of .md files. Nothing proprietary. No database. No service to keep running.

From that point forward, any agent or CLI tool that can grep or read files can query your knowledge:

grep -r "liability clause" ./my-bundle/concepts/

cat ./my-bundle/concepts/liability-clause.md

An agent running those commands gets back the exact text, along with a file path that doubles as a citation. The answer is grounded and traceable. The retrieval layer cannot hallucinate, because there is no opaque retrieval layer: grep returns only text that is actually in the files.
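To make the path-as-citation idea concrete, here is a minimal sketch. It builds a tiny stand-in bundle by hand, because the exact output of pdf2okf is not documented here: the file names and contents are invented for illustration, and only the concepts/ directory is taken from the commands above. The cite function is a hypothetical helper, not part of pdf2okf.

```shell
#!/bin/sh
# Stand-in bundle created by hand; a real pdf2okf bundle may look different.
bundle=$(mktemp -d)
mkdir -p "$bundle/concepts"
printf '%s\n' '# Liability clause' '' \
  'The supplier limits its liability to the contract value.' \
  > "$bundle/concepts/liability-clause.md"
printf '%s\n' '# Termination' '' \
  'Either party may terminate with 30 days notice.' \
  > "$bundle/concepts/termination.md"

# cite KEYWORD BUNDLE_DIR (hypothetical helper)
# -r recurses, -i ignores case, -n prints line numbers.
# Each output line has the form path:line:matching text,
# so every hit carries its own citation.
cite() {
  grep -rin -- "$1" "$2/concepts/"
}

cite "liability" "$bundle"
```

Each line of output names the file and the line the text came from, so an agent can quote the passage and a reader can open the same file to check it.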

How an agentic loop uses the bundle

The typical pattern with an autonomous agent looks like this:

  1. The agent receives a question.
  2. It runs grep -ri "keyword" ./my-bundle/concepts/ to find relevant concept files.
  3. It reads those files with cat or a file-read tool.
  4. It composes an answer citing the file names as sources.

Every retrieval step runs locally. Nothing leaves your machine unless you explicitly route the agent to a remote model, which is your choice.
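Steps 2 and 3 of the loop above can be sketched as a single shell function. This is an illustration under stated assumptions rather than anything pdf2okf ships: retrieve is a hypothetical helper, the SOURCE: prefix is an invented convention, and the bundle is a hand-made stand-in with made-up file names.

```shell
#!/bin/sh
# Stand-in bundle created by hand; a real pdf2okf bundle may look different.
bundle=$(mktemp -d)
mkdir -p "$bundle/concepts"
printf '%s\n' '# Liability clause' 'Liability is capped at the contract value.' \
  > "$bundle/concepts/liability-clause.md"
printf '%s\n' '# Payment terms' 'Invoices are due within 30 days.' \
  > "$bundle/concepts/payment-terms.md"

# retrieve KEYWORD BUNDLE_DIR (hypothetical helper)
# Step 2: grep -rli lists the concept files that mention the keyword.
# Step 3: each matching file is printed in full, preceded by a SOURCE line
# that the agent can repeat as its citation in step 4.
retrieve() {
  keyword=$1
  dir=$2
  grep -rli -- "$keyword" "$dir/concepts/" | sort | while IFS= read -r file; do
    printf 'SOURCE: %s\n' "${file##*/}"
    cat -- "$file"
    printf '\n'
  done
}

retrieve "liability" "$bundle"
```

An agent would place this output in the model's context and instruct it to answer only from the quoted files. A keyword with no matches produces no output, which the agent can treat as "not found in the bundle".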

This pattern works identically across agents that support shell-command execution: Hermes Agent, Odysseus, OpenClaw, Claude Code, Cursor, Codex CLI, and any custom agent loop you write yourself. The OKF bundle does not care which agent reads it.

No plugin, no SDK, no lock-in

Closed cloud document assistants have a different architecture. Your PDF goes to their server. Their proprietary embeddings index it. Their API returns answers. If you want to access that knowledge from a different tool, you cannot. The chunks live in their vector store, behind their authentication.

With pdf2okf, the knowledge lives in a directory you own. You can copy it, version it with git, ship it as a .okfz archive, or symlink it into any agent's working directory. The tool that produced the bundle and the tool that reads it are fully decoupled.
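Because the bundle is an ordinary directory, standard filesystem tools are enough to share it between agents. The sketch below shows a symlink and a plain copy; the directory names (agent-workspace, knowledge, my-bundle-copy) and the file contents are invented for illustration, and the bundle is a hand-made stand-in rather than real pdf2okf output.

```shell
#!/bin/sh
# Stand-in bundle and agent workspace; all names here are illustrative.
work=$(mktemp -d)
mkdir -p "$work/my-bundle/concepts" "$work/agent-workspace"
printf '%s\n' '# Liability clause' 'Liability is capped at the contract value.' \
  > "$work/my-bundle/concepts/liability-clause.md"

# Expose the bundle inside an agent's working directory without copying it.
ln -s "$work/my-bundle" "$work/agent-workspace/knowledge"

# Make an independent copy, for example to hand to another machine.
cp -R "$work/my-bundle" "$work/my-bundle-copy"

# The same query works through the symlink and against the copy.
# The symlink is part of the path given on the command line, so it is resolved
# before grep starts recursing. With GNU grep, use -R instead of -r if symlinks
# deeper inside the tree must also be followed.
grep -rl "Liability" "$work/agent-workspace/knowledge/concepts/"
grep -rl "Liability" "$work/my-bundle-copy/concepts/"
```

The symlink keeps a single source of truth that several agents can read, while the copy is a snapshot that will not change when the original bundle is rebuilt.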

This also means the bundle works offline. An agent running on a laptop with no internet connection can still query it. That matters in air-gapped environments, while traveling, and anywhere bandwidth is limited or sensitive data must not leave the building.

The MCP path (on the roadmap)

MCP (Model Context Protocol) is an emerging standard for connecting agents to data sources through a typed server interface rather than raw file reads. pdf2okf has an MCP server on its roadmap.

When that ships, agents that speak MCP natively, including Claude Code and Cursor, will be able to point at an OKF bundle via a server config entry and query it without running shell commands at all. The underlying content will be the same plain files; only the interface will be higher-level.

Until then, the shell-command path described above is fully functional and requires nothing beyond what any agent already supports.

What cloud document AI cannot do

Cloud document assistants are convenient for casual use. They are not a fit when:

  • Documents contain personal data, medical records, legal advice, or trade secrets that must not leave your network.
  • Your queries need to be auditable: you need to know exactly which text the answer came from.
  • You operate in a regulated environment where GDPR, HIPAA, or legal privilege rules restrict data processing by third-party services.
  • You need the same knowledge accessible from multiple agents and workflows without re-uploading.

pdf2okf is built for exactly these constraints. The bundle is your data, in your filesystem, readable by any tool you choose. That is the foundation of a sovereign document pipeline. It starts with a single CLI command.

Start here

Once you have the CLI, convert a PDF, inspect the output, then run a simple grep query against the concepts directory. See what an agent can do when the retrieval layer is just files. pdf2okf is currently in private build; join the waitlist at pdf2okf.com to be notified when the CLI is available.
