Local document AI on a Mac: Apple Silicon, MLX & oMLX

Why a Mac is a surprisingly good local-AI box

The thing that makes Apple Silicon punch above its weight for local language models is unified memory. On a Mac, the CPU and GPU share one physical pool of RAM instead of copying tensors back and forth across a PCIe bus to a separate graphics card. There is no host-to-device transfer, no VRAM ceiling that is smaller than your system memory. A model can use almost the entire RAM of the machine, and a 32 GB or 64 GB Mac quietly out-punches many desktop GPUs that ship with only 8 to 12 GB of dedicated memory. For document Q&A, where you want a capable model resident and responsive, that is exactly the property you want.

MLX, MLX-LM, and the mlx-community models

MLX is Apple's open array and machine-learning framework, built specifically for the unified-memory architecture of Apple Silicon. Because arrays live in shared memory, operations can run on the CPU or the GPU without an explicit copy, which is where a lot of the efficiency comes from.

On top of it sits MLX-LM, the package that runs and fine-tunes large language models with MLX. You do not have to convert and quantize models yourself, either: the mlx-community organization on Hugging Face publishes pre-quantized MLX builds of the open families you would actually want: Gemma, Qwen, Mistral, Llama. Pick a model, pick a quantization, and it is ready to load.

The practical payoff: on Apple Silicon, MLX is generally faster than llama.cpp and GGUF for smaller models. If your machine is a Mac and your model is in the small-to-mid range that fits comfortably in memory, the MLX path is usually the quickest one to a responsive local assistant.

oMLX: a local server pdf2okf can point at

A framework is not yet a server. That is what oMLX is: a specific local inference server for Apple Silicon, built on MLX, that exposes both OpenAI- and Anthropic-compatible endpoints. It runs as a menu-bar app. You start it, it serves a model, and anything that speaks those APIs can talk to it. It requires Apple Silicon, macOS 15+, and at least 16 GB of RAM.

This is the piece that matters for pdf2okf. Because pdf2okf points at any OpenAI-compatible endpoint, you can aim it straight at oMLX's local endpoint and run the whole pipeline on the Mac in front of you. It is BYOK-shaped (bring your own key), except the "provider" is your own laptop and there is no key and no network call. Nothing leaves the machine.

If you would rather not run oMLX, you have two other on-ramps to MLX on the same hardware: Ollama added an MLX backend in v0.19 (March 2026), and LM Studio ships an MLX runtime alongside its GGUF one. All three expose an OpenAI-compatible endpoint, so from pdf2okf's point of view they are interchangeable.

How much RAM do you actually need?

The sizing rule is simple. A 4-bit model needs roughly half its parameter count in gigabytes of RAM, plus a little overhead for the context window (the KV cache). So:

| Model size | Approx. RAM (4-bit) | Comfortable on | |---|---|---| | 7B | ~4–5 GB | any modern Mac | | 27B-class | ~14–16 GB | a 32 GB Mac |

A capable 27B-class model in a 4-bit quantization lands around 14 to 16 GB, which fits a 32 GB Mac with room for the OS and the context to spare. The standard quantization to reach for is Q4_K_M in the GGUF world (about 75% smaller than full precision for roughly 3% quality loss), and the mlx-community builds offer the equivalent 4-bit MLX quantizations. Match the model to the memory you actually have and it runs; overshoot and it crawls or refuses to load.

MLX vs. GGUF on a Mac

Both work on Apple Silicon. GGUF (run by llama.cpp, Ollama, or LM Studio) is the portable standard that runs everywhere, including Macs via Metal. MLX is the Apple-native path that tends to be faster for the smaller models that fit a laptop. The honest guidance: if you are on a Mac and chasing speed on a model that fits in memory, try the MLX build first; if you want one format that travels to every machine you own, GGUF is the safe default. You do not have to choose forever. Your documents do not care which one you run.

Where pdf2okf fits

pdf2okf is model- and runtime-agnostic by design. It produces an OKF-compatible bundle of plain Markdown concept files, and the structure lives in the bundle, not in the model, so the model and the server underneath it are swappable parts. Point pdf2okf at oMLX, at Ollama's MLX backend, or at LM Studio, and your Mac runs the entire thing: the documents stay on disk, the inference happens in unified memory, and nothing leaves the machine. It is the cleanest possible shape of local AI, and on Apple Silicon it is also a fast one. To see the runtimes side by side, read Ollama vs. llama.cpp vs. LM Studio vs. vLLM; for the broader picture, running AI locally in 2026.