What hardware do you need for local document AI?
Memory is the binding constraint
When people ask what machine they need to run a local model on their own documents, they usually picture raw speed: a fast GPU, lots of cores. The real gate is simpler: can the model fit in memory at all? If it fits, it runs; if it doesn't, no clock speed saves you. So before anything else, size the memory.
The honest news is that you almost certainly need less than you think. For grounded document Q&A (reading a passage and answering from it), a mid-sized open model on a well-specced laptop or a single consumer GPU is enough. You do not need a datacenter.
The Q4 rule of thumb
Open models are usually published at 16 bits per weight, which is too big for most machines. So you run a quantized copy: the same weights stored at lower precision. The community sweet spot is Q4_K_M: 4-bit weights that come out roughly 75% smaller than the 16-bit original for about a 3% quality loss. For document work, that trade is almost always worth taking.
That gives you the one formula worth remembering:
A 4-bit model needs roughly half its parameter count in gigabytes of RAM or VRAM for the weights, plus headroom for the context.
So an ~8B model wants around 4–5 GB, a ~14B model around 7–9 GB, a ~27–32B model lands in the mid-to-high teens of GB, and a ~70B model sits north of 35 GB, all before context. Treat every number here as a rough guide, not a spec sheet: exact footprints shift with the quantization, the runtime, and how much context you load.
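The rule of thumb is simple enough to write as arithmetic. The sketch below assumes exactly 4 bits per weight and decimal gigabytes; real Q4_K_M files average somewhat more than 4 bits per weight, so treat the result as a floor rather than a measurement. The function name `weight_gb` is only illustrative.

```python
def weight_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Approximate size of the weights alone, in decimal gigabytes.

    Assumption: every weight is stored at the same bit width, with no
    overhead for metadata, the runtime, or the context window.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (8, 14, 32, 70):
        print(f"~{size}B at 4-bit: about {weight_gb(size):.1f} GB before context")
```

Passing `bits_per_weight=16` gives the unquantized size for comparison, which is where the "roughly 75% smaller" figure comes from.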
Three ways to hold a model in memory
There are three hardware paths, and they trade price, speed, and ceiling differently.
Discrete GPU. The fastest option, and the most rigid. A GPU is quick because the weights live in its dedicated VRAM, but that VRAM is a hard wall. A consumer card with 12, 16, or 24 GB runs anything that fits and refuses anything that doesn't. A 24 GB card comfortably holds a quantized ~27–32B model; a ~70B model means a bigger card, or two of them. Buy for the VRAM number first; speed follows.
Apple Silicon. A Mac's trick is unified memory: one pool of RAM shared by CPU and GPU, with no separate VRAM ceiling. A 32, 64, or 128 GB Mac can devote most of that pool to a model, so it holds models that would otherwise demand a far pricier discrete GPU. It's the most forgiving path for larger models on a single machine, and it's why a Mac makes such a comfortable local-AI box, as covered in "local document AI on a Mac".
CPU + system RAM. The cheapest path, and the slowest. Any machine with enough RAM can run a model on the CPU alone, no GPU required. It works, and for the occasional question it's perfectly usable, but generation is markedly slower than on a GPU or Apple Silicon. Good for trying things out; less good if you live in the tool all day.
A sizing table by model
Rough RAM/VRAM at Q4, with an example machine. Use it to match a model to hardware you already own, or plan to buy.
| Model size | RAM/VRAM at Q4 (rough) | Example machine |
|---|---|---|
| ~8B | ~5–8 GB | a recent laptop, or any 16 GB machine |
| ~14B | ~10–12 GB | a 16 GB GPU, or a 16–32 GB Mac |
| ~27–32B | ~18–24 GB | a 24 GB GPU, or a 32 GB Mac |
| ~70B | ~40 GB+ | a 64–128 GB Mac, or multiple GPUs |
The headroom above the bare weight size is for the operating system and the context window. Real models map onto these bands cleanly: Gemma 4's 31B dense and 26B mixture-of-experts variants and Qwen3.5's 27B dense sit in the ~27–32B row; their smaller siblings and Mistral's small 24B-class models cover the lighter rows.
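If you want to match hardware to a row mechanically, the table can be read as a lookup. This is a minimal sketch, not a sizing tool: it takes the upper end of each band as the conservative requirement, and since the ~70B row is open-ended, its 40 GB figure is only a floor. The names `TIERS` and `largest_tier_that_fits` are made up for the example.

```python
from typing import Optional

# (label, conservative GB needed at Q4), using the upper end of each band in
# the table. The ~70B row is "40 GB+", so 40 is a minimum, not a comfortable fit.
TIERS = [
    ("~8B", 8),
    ("~14B", 12),
    ("~27-32B", 24),
    ("~70B", 40),
]


def largest_tier_that_fits(available_gb: float) -> Optional[str]:
    """Return the largest model tier whose conservative footprint fits, if any."""
    best = None
    for label, needed_gb in TIERS:  # tiers are listed smallest to largest
        if needed_gb <= available_gb:
            best = label
    return best


if __name__ == "__main__":
    for memory in (8, 16, 24, 64):
        print(f"{memory} GB available -> {largest_tier_that_fits(memory)}")
```

Pass the memory you expect to have free for the model: the card's VRAM on a discrete GPU, or the share of RAM you are willing to give up on a Mac or a CPU-only machine.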
Context window costs memory too
The weights aren't the only thing in memory. Every token you feed the model also occupies space in its key-value cache, which grows with how much text you load. A long document pasted in whole can cost as much extra memory as a slice of the model itself. This is the quiet reason "just dump everything in" scales badly: longer context means more RAM and slower responses.
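To put a rough number on that cost, the sketch below assumes a standard transformer that caches one key and one value vector per layer for every token, stored at 16 bits. The layer count, head count, and head size are illustrative defaults, not the specification of any model named in this article; read the real values from your model's configuration.

```python
def kv_cache_gb(
    n_tokens: int,
    n_layers: int = 32,        # assumed, illustrative
    n_kv_heads: int = 8,       # assumed, illustrative
    head_dim: int = 128,       # assumed, illustrative
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> float:
    """Approximate key-value cache size in decimal gigabytes."""
    # Factor of 2: one key vector and one value vector per layer per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    for tokens in (2_000, 8_000, 32_000):
        print(f"{tokens:>6} tokens: about {kv_cache_gb(tokens):.2f} GB of cache")
```

With these assumed defaults each token costs 128 KiB, so a 32,000-token context adds roughly 4.2 GB on top of the weights. Runtimes that quantize the cache or use a different attention layout will land elsewhere.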
The cheaper move is to retrieve only the relevant parts and hand the model those. Smaller context, less memory, faster answers. And, as it happens, often better answers, since a model handed the right paragraph beats one drowning in a hundred wrong ones.
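As a toy illustration of "retrieve only the relevant parts", the sketch below splits a document into paragraphs and keeps the few that share the most words with the question. Plain word overlap is an assumption made for brevity and stands in for whatever retrieval method you actually use; `top_passages` is a hypothetical helper, not part of any tool mentioned here.

```python
import re
from typing import List


def _words(text: str) -> set:
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def top_passages(document: str, question: str, k: int = 2) -> List[str]:
    """Return up to k paragraphs that share the most words with the question."""
    terms = _words(question)
    passages = [p.strip() for p in document.split("\n\n") if p.strip()]

    def score(passage: str) -> int:
        return len(terms & _words(passage))

    ranked = sorted(passages, key=score, reverse=True)
    # Drop paragraphs with no overlap at all so they never reach the model.
    return [p for p in ranked[:k] if score(p) > 0]


if __name__ == "__main__":
    doc = (
        "The warranty lasts two years.\n\n"
        "Shipping takes five days.\n\n"
        "Returns are accepted within 30 days."
    )
    print(top_passages(doc, "How long does the warranty last?", k=1))
```

Only the selected paragraphs go into the prompt, so the context, and the memory it occupies, stays small however long the source document is.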
What to actually buy for document Q&A
The sweet spot for document Q&A is a quantized ~27–32B model on a 32 GB Mac or a 24 GB GPU. That gives you a genuinely capable model with room for a healthy context window, on a single machine you can buy without a second mortgage. If your documents are straightforward, a ~8–14B model on a 16 GB machine is plenty. Size stopped being the bottleneck for grounded retrieval a while ago; retrieval quality matters more. For which model to load onto that hardware, see "which open model for your documents"; for the runtimes that serve it, see "running AI locally in 2026".
If you'd rather not host a model at all, BYOK is the other door: point the tool at your own API key (your provider, your region, your bill) and skip the local hardware question entirely.
The hardware you don't need
Here's the part that changes the budget. Classic RAG needs a vector database, and usually an embedding server to populate it: infrastructure you stand up, host, keep in sync, and re-embed every time a document changes. That's real hardware and real ops.
pdf2okf uses none of it. It greps OKF-compatible markdown directly, so there's no index to host and nothing to re-embed. The structure lives in the OKFZ bundle, not in a database. That removes a whole tier of the stack: you size for the model and nothing else. Run it fully on-device, where no page ever leaves the machine, or through your own key. Either way, the only hardware question left is the simple one this article answers. (OKF is Google's open standard; pdf2okf is compatible with it, and did not invent it.)
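To show why plain-text search needs no index, here is a minimal sketch of grepping a folder of markdown files with the Python standard library. It is not pdf2okf's code and assumes nothing about the OKF or OKFZ formats beyond the files being readable UTF-8 markdown; `grep_markdown` and the `docs` folder are hypothetical.

```python
import re
from pathlib import Path
from typing import List, Tuple


def grep_markdown(root: str, pattern: str) -> List[Tuple[str, int, str]]:
    """Return (path, line number, line) for every matching line under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    # Walk the files directly on each query: nothing is built or kept in sync.
    for path in sorted(Path(root).rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in grep_markdown("docs", r"warranty"):
        print(f"{path}:{lineno}: {line}")
```

Because every query reads the files as they are on disk, an edited document is searchable immediately, with no re-embedding step.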