Wiki
Running AI locally in 2026: a guide for document Q&A
The shape of a local stack
In 2026 the question is no longer whether you can run a capable model on your own hardware. You can. The real question is which pieces to assemble. A local document-Q&A stack has three parts: a runtime that loads and serves the model, an open model that does the reading, and a quantization that shrinks the model to fit your RAM. Get those three right and you have a private assistant for your documents that never phones home.
Pick a runtime
The runtime is the program that turns a model file into something you can talk to.
- Ollama is the easiest on-ramp. One command pulls a model and serves it on
localhost:11434. It speaks an OpenAI-compatible API, and since v0.19 it can use an MLX backend on Apple Silicon. Start here if you just want it working. - llama.cpp is the engine underneath much of the ecosystem: a lean inference core that runs GGUF model files and exposes an OpenAI-compatible server on
:8080. Choose it when you want control and minimal overhead. - LM Studio is the graphical option: a desktop GUI that downloads, runs, and chats with models (both MLX and GGUF) and serves an API on
:1234. Good for people who prefer not to live in a terminal. - vLLM is for servers: high-throughput GPU serving when you need to handle many requests at once, not a single laptop.
- MLX and oMLX are the Apple Silicon path, covered below.
All of these expose an OpenAI-compatible endpoint, which is the quiet superpower: anything that can call that API, pdf2okf included, can point at your local model instead of a cloud one.
Pick a model
Four families cover almost every local document-Q&A need, and all four are genuinely open:
- Gemma 4 (Google, released 2026-04-02, Apache-2.0): multimodal, with a 128K–256K context window, offered from tiny edge variants (E2B/E4B) up to a 26B mixture-of-experts and a 31B dense model. It runs offline on consumer GPUs and Apple Silicon, and the Apache-2.0 license is the headline: no usage-restriction strings.
- Qwen3.5 (Alibaba, 2026-03, Apache-2.0): a strong local all-rounder with long context, available in dense sizes and a mixture-of-experts variant.
- Mistral (France/EU): its small models have historically shipped under Apache-2.0, part of their sovereignty appeal; check the license on each model card.
- OLMo 3 (Ai2) is the "fully open" reference: weights, training data, code, and checkpoints, so the whole model is auditable.
For a model-by-model comparison, including EU-origin options, see which open model for your documents. One license note worth carrying: Llama 4's Community License prohibits EU use, so an EU-facing stack should reach for Gemma, Qwen, Mistral, or OLMo instead. The difference between these "open" labels is unpacked in open weights vs. open source vs. fully open.
Quantization and the RAM rule of thumb
A full-precision model is too big for most machines, so you run a quantized version: the same weights stored at lower precision. The community sweet spot is Q4_K_M: roughly 75% smaller than full precision for about a 3% quality loss. That trade is almost always worth it.
The handy rule for sizing: a 4-bit model needs about half its parameter count in gigabytes of RAM, plus some overhead for the context window (the KV cache). So a 7B model fits in roughly 4 to 5 GB, and a 27B-class model lands around 14 to 16 GB. Match the model to the memory you have and it runs.
Apple Silicon specifics
Macs are unusually good at this because of unified memory: the GPU and CPU share one pool, so a model can use most of your RAM without a discrete graphics card. Apple's MLX framework is built for that architecture, and oMLX is a dedicated, OpenAI-compatible MLX server for Apple Silicon (a menu-bar app) that exposes a local endpoint. pdf2okf can point straight at it, BYOK-style, except the "provider" is your own Mac, and nothing leaves it.
Are they good enough?
Honest answer: for grounded document Q&A, yes. When the job is to read a passage you have already retrieved and answer from it, today's small open models are genuinely capable. Size is not the bottleneck. Retrieval quality is. A 7B model handed the right paragraph beats a frontier model handed the wrong one. Counterintuitively, heavy reasoning can reduce faithfulness on extraction: a model that "thinks" too hard embellishes instead of quoting.
Where frontier cloud models still pull ahead is hard, multi-hop reasoning: chaining many facts across a long argument. That is exactly what BYOK (bring your own key) is for: run everything locally by default, and reach for a frontier model on your own key only for the rare question that needs it.
Where pdf2okf fits
pdf2okf is model-agnostic. It produces an OKF-compatible bundle of plain Markdown concept files, and the structure lives in the bundle, not in the model. That makes the model a swappable part: run Gemma 4 today, Qwen tomorrow, a frontier model on your key for one stubborn question. The bundle never changes, and your answers stay grounded in it. Point any runtime above at pdf2okf and your documents are read where they live, on hardware you control: the cleanest version of GDPR-compliant AI, because no page is ever uploaded. Own the model, own the format, and no one can switch off your ability to read your own files.