Air-gapped document AI: fully offline, no network

When "offline" isn't offline enough

Most privacy-minded tools promise they work "offline." Press on that word and it usually means offline-capable: the thing can run without a live connection, but it still expects one now and then, whether for a license check at startup, a model download on first run, a telemetry ping, or an update notification. For most people that's perfectly fine. For an air-gapped environment, it's disqualifying.

An air gap is the strictest isolation there is: a machine (or an entire network) with no connection to the internet, and often no connection to any untrusted network at all. Nothing routes in; frequently nothing routes out either. The "gap" is literal: a span of air where a cable, a radio, or a route would otherwise be. It's an architecture for situations where a single outbound packet is treated as a breach.

Who actually needs one

Air gaps aren't paranoia for its own sake. They're standard practice where the cost of exfiltration is catastrophic and the threat model includes capable, persistent adversaries: classified and defence work, the control systems behind critical infrastructure (power, water, manufacturing), and the more sensitive corners of healthcare, legal, and industrial settings. They also guard plain commercial trade secrets (the formula, the source code, the unannounced deal), where a leak is existential even when no regulator is involved.

The common thread is simple. In these places, "probably encrypted in transit" is not an acceptable answer. The only network you fully trust is the one that doesn't exist.

Why most AI can't cross the gap

Cloud AI is out by definition: the entire product is sending your document to someone else's computer. The surprise is how many local tools also fail the test. "Local" and "air-gapped" are not the same claim.

A tool can run inference on your hardware and still assume a network for everything around it: pulling model weights or an embedding model on first use, validating a license, checking for updates, shipping anonymous usage stats, resolving a remote vector database or an external embedding API at query time. None of that is malicious. All of it is fatal in a true air gap, because the environment forbids any runtime network, not just the obviously risky kind.
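One way to catch these hidden assumptions before a tool goes anywhere near the gap is to run it with the network deliberately broken and see what complains. The sketch below is a minimal illustration, not part of pdf2okf: it assumes the component under test is Python code running in the same process, and the names `no_network` and `NetworkAttempt` are made up for this example. It only covers the standard `socket` connect and DNS paths, so it is a smoke test and no substitute for real isolation.

```python
import socket
from contextlib import contextmanager


class NetworkAttempt(RuntimeError):
    """Raised when code inside the guard tries to reach the network."""


@contextmanager
def no_network():
    """Make outbound connects and DNS lookups fail loudly inside the block."""
    original_connect = socket.socket.connect
    original_getaddrinfo = socket.getaddrinfo

    def blocked_connect(self, address):
        raise NetworkAttempt(f"blocked connect to {address!r}")

    def blocked_getaddrinfo(host, *args, **kwargs):
        raise NetworkAttempt(f"blocked DNS lookup for {host!r}")

    # Patch the two entry points most Python networking code passes through.
    socket.socket.connect = blocked_connect
    socket.getaddrinfo = blocked_getaddrinfo
    try:
        yield
    finally:
        # Always restore, even if the guarded code raised.
        socket.socket.connect = original_connect
        socket.getaddrinfo = original_getaddrinfo
```

Wrapping a first-run code path in `with no_network():` turns a silent model download or license check into an immediate `NetworkAttempt`, which is far easier to find on a connected test bench than inside the gap.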

This is also where one popular sovereignty feature quietly drops out. BYOK (bring your own key) lets you point a local tool at a frontier model on your own API key, a great pattern for running locally by default and reaching for the cloud only when you must. But an API key needs a network to reach the API. In a true air gap, BYOK simply does not apply: there is no "your own key" to a service you cannot route to. Air-gapped means a local open-weight model, full stop.

How fully-offline document AI actually works

Strip it down and the recipe is short:

  • A local open-weight model. The reading and the answering happen on the box, with no API call leaving it. On-device AI in its strictest form.
  • Controlled media for everything that has to come in. Models, tooling, and updates arrive by deliberate, audited transfer, the "sneakernet": a vetted file on a controlled drive, carried across the gap by hand rather than pulled from a registry.
  • No telemetry, no phone-home, no background sync. If a component can't be made silent on the network, it can't be inside the gap.
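The "deliberate, audited transfer" step in the list above can be made concrete with a hash manifest. What follows is a minimal sketch under stated assumptions: the manifest is a plain mapping of relative file names to SHA-256 hex digests, approved on the connected side and carried separately from the media. The function names and the manifest format are hypothetical and are not something pdf2okf prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large model weights don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(media_root: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the media matches the manifest."""
    problems = []
    for relative_name, expected in manifest.items():
        candidate = media_root / relative_name
        if not candidate.is_file():
            problems.append(f"missing: {relative_name}")
        elif sha256_of(candidate) != expected.lower():
            problems.append(f"hash mismatch: {relative_name}")
    # Files nobody approved are as suspicious as files that were altered.
    present = {
        p.relative_to(media_root).as_posix()
        for p in media_root.rglob("*")
        if p.is_file()
    }
    for extra in sorted(present - set(manifest)):
        problems.append(f"unlisted: {extra}")
    return problems
```

Running `verify_transfer` on the receiving side before anything is copied off the drive gives a simple rule: an empty list means proceed, and anything else means the media goes back across the gap unopened.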

pdf2okf fits this shape unusually well, because of what it doesn't need. It greps OKF-compatible Markdown directly, so there's no vector database to host, no index to sync, and no embedding service phoning out at query time. An OKFZ is just files: a portable bundle of plain Markdown you can carry across the air gap on the same controlled media as everything else, then read on the far side with nothing but a local model and a shell. Build the bundle once, move it deliberately, and it behaves identically inside the gap and out. (OKF is Google's open standard; pdf2okf is compatible with it, not the inventor of it.)
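To show how little machinery "grep the Markdown" requires, here is a rough sketch of retrieval without an index. It is an illustration only and not pdf2okf's actual implementation: it assumes the bundle has already been unpacked into a directory of `.md` files (the internal layout of an OKFZ is not described here), and `grep_markdown` is a hypothetical helper name.

```python
from pathlib import Path


def grep_markdown(bundle_dir: Path, term: str, context: int = 1) -> list[tuple[str, int, str]]:
    """Case-insensitive search over every .md file under bundle_dir.

    Returns (relative path, 1-based line number, snippet) for each matching line,
    where the snippet includes `context` lines before and after the match.
    """
    needle = term.casefold()
    hits = []
    for md_file in sorted(bundle_dir.rglob("*.md")):
        lines = md_file.read_text(encoding="utf-8", errors="replace").splitlines()
        for index, line in enumerate(lines):
            if needle in line.casefold():
                start = max(0, index - context)
                snippet = "\n".join(lines[start:index + context + 1])
                relative = md_file.relative_to(bundle_dir).as_posix()
                hits.append((relative, index + 1, snippet))
    return hits
```

The snippets returned this way can be handed to the local model as context for its answer. Everything involved is the standard library and the filesystem, so nothing in this path has a reason to open a socket.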

What an air gap buys you

The payoff is sharp and worth naming. With no network at runtime, you remove an entire class of risk at the root: there is no path for network exfiltration, no remote compromise of the inference service, no foreign cloud whose jurisdiction can reach across a border to your documents. This is data sovereignty in its most literal form (not "hosted in the right country" but "not reachable from anywhere"), and it settles the whole tangle of residency, jurisdiction, and stack control by making it moot. The data can't leave because there's nowhere for it to go.

What it does not buy you

Here honesty matters more than the pitch. An air gap is an architecture, not a compliance certificate. And it moves work onto you rather than removing it.

When nothing arrives automatically, you own all of it: patching the OS and the tooling, updating models through controlled media, taking and testing backups, and the physical security of the machines themselves. The same gap that keeps attackers out also keeps your update pipeline out, and closing that loop is now your job. It does nothing about insider risk, either. The person already inside the gap, with a drive in their pocket, is a threat that isolation by itself cannot stop.

And isolation does not exempt you from your obligations. Self-hosting and air-gapping do not make you something other than a GDPR controller; they don't suspend your internal compliance program or your sector's rules. An air gap can be the cleanest way to meet those duties (it's the strict end of the spectrum laid out in "On-premise vs. EU cloud"), but it never performs them for you. You still do the paperwork; you just get to do it on your own terms.

The honest summary

An air gap is the strongest possible answer to "could this data ever leave?" It is also the most demanding to operate. It removes remote reach, for attackers and for you alike, and leaves every operation in your hands. pdf2okf is built to live comfortably on the far side of that gap: a local model, plain Markdown, no index, no phone-home. It's a tool that works the same whether or not there's a network, because it never needed one in the first place.
