Ollama vs. llama.cpp vs. LM Studio vs. vLLM

Four tools, one job: run an open model locally

Once you have decided to run an open model on your own hardware, you have to pick the thing that actually loads it and answers requests: the runtime. Four names come up constantly, and they are not really competitors so much as different shapes of the same job: the easy on-ramp, the engine, the GUI, and the production server. Here is what each is for and, more usefully, which one to point pdf2okf at.

The comparison at a glance

| Tool | What it is | Default port | Best for | |---|---|---|---| | Ollama | The easiest on-ramp; one command pulls and serves a model | :11434 | Getting running in minutes; single user | | llama.cpp | The foundational C/C++ inference engine (CPU + GPU + Metal); it originated GGUF | :8080 | Control and minimal overhead; it underpins much of the ecosystem | | LM Studio | A desktop GUI with a model browser; ships both GGUF and MLX runtimes | :1234 | People who want a graphical app, not a terminal | | vLLM | Production GPU serving (PagedAttention, continuous batching) | – | Serving many users on NVIDIA/AMD GPUs (not a laptop tool) |

What sets them apart

Ollama is the gentle start. One command pulls a model and serves it on :11434, and since v0.19 it has an MLX backend for Apple Silicon. If you just want a local model talking back to you today, start here.

llama.cpp is the engine under much of the field: a lean C/C++ inference core that runs across CPU, GPU, and Apple's Metal, exposes an OpenAI-compatible API on :8080, and originated the GGUF format that nearly everything else loads. Reach for it when you want maximum control and minimal overhead, or when you are building on top of the engine directly.

LM Studio is the graphical option: a desktop app with a built-in model browser to find and download models, a chat UI to use them, and a server on :1234. It ships both GGUF and MLX runtimes, so on a Mac you can pick the faster Apple-native path without leaving the app. Good for anyone who would rather not live in a terminal.

vLLM is the odd one out, on purpose. It is built for production GPU serving. PagedAttention and continuous batching keep a fleet of NVIDIA or AMD GPUs saturated so you can serve many concurrent users efficiently. It is not a laptop tool. If you are one person on one machine, vLLM is the wrong answer; if you are standing up a shared inference server for a team, it is the right one.

They all speak the same API

Here is the quiet superpower that makes the choice low-stakes: all four expose an OpenAI-compatible API. Ollama on :11434, llama.cpp on :8080, LM Studio on :1234, vLLM on its server endpoint: same request shape underneath. So pdf2okf points at any of them the exact same way. You change a base URL, not your tooling. Swap Ollama for LM Studio on a Tuesday and pdf2okf neither knows nor cares.

Single user vs. a team server

The decision collapses to one question: who is this serving?

One person, one machine: pick Ollama for the fastest start, LM Studio if you want a GUI, or llama.cpp if you want the bare engine. On a Mac, LM Studio's MLX runtime (or oMLX) is the fast path.
A team, a shared box: that is vLLM on a GPU server, sized for concurrent load.

A note on quantization

Whichever runtime you choose, you will run a quantized model: the same weights stored at lower precision so they fit your memory. GGUF is the standard quantized container (from llama.cpp; loaded by Ollama and LM Studio too), and Q4_K_M is the widely agreed sweet spot: about 75% smaller than full precision for roughly 3% quality loss. To size it, use the rule of thumb: a 4-bit model needs roughly half its parameter count in gigabytes of RAM, plus a little for the context window's KV cache. So a 7B fits in about 4 to 5 GB and a 27B-class model in about 14 to 16 GB.

Where pdf2okf fits

pdf2okf is model- and runtime-agnostic by design, and that is the whole point of this comparison being low-stakes. It produces an OKF-compatible bundle of plain Markdown concept files, and the structure lives in the OKF bundle, not in the runtime. Ollama today, vLLM when you scale to a team, LM Studio on the laptop you travel with: the bundle is identical and your answers stay grounded in it. Pick the runtime that fits who you are serving, point pdf2okf at its OpenAI-compatible endpoint, and keep your documents on hardware you control. For the Apple Silicon path specifically, see local document AI on a Mac; for the bigger picture, running AI locally in 2026.