The model finds the structure, code does the counting

The problem with letting a model count

Language models are trained to predict plausible continuations of text. That makes them exceptional at understanding, classifying, and structuring information. It makes them unreliable at arithmetic over large sets.

Ask a model how many items are in a long list and it will answer confidently, fluently, and often incorrectly. Not always wrong by a lot. Often off by one or two. Sometimes by ten. The failure is invisible because the wrong number arrives in the same confident tone as the right one.

This is not a bug that will be fixed in the next model release. Counting over context is a structural weakness of autoregressive generation: the model must hold a running tally in its attention patterns across thousands of tokens, and that tally drifts. The longer the document, the worse this gets.

The design principle

pdf2okf is built around a clean division of labour: the model finds structure; code does the counting.

Here is what that means in practice. The model reads the source PDF and identifies the conceptual units: sections, clauses, entries, rows, whatever makes sense for the document type. Each unit becomes a concept file in the OKF bundle. When a user asks "how many X are in this document?", the code layer counts the concept files directly. The model receives that count from the code and reports it. It does not generate a number from its own estimation.

The model is never asked to produce a number by counting. It only reports numbers the code has already computed. The difference sounds subtle. The reliability difference is not.
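The division of labour can be sketched in a few lines of Python. This is a minimal illustration, not pdf2okf's actual code: the bundle layout (one `.md` file per concept in a single directory), the function names `count_concepts` and `build_count_prompt`, and the prompt wording are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


def count_concepts(bundle_dir: Path, pattern: str = "*.md") -> int:
    """Count concept files in a bundle directory.

    Assumption: one file per concept, all matching `pattern`.
    """
    return sum(1 for p in bundle_dir.glob(pattern) if p.is_file())


def build_count_prompt(question: str, count: int) -> str:
    """Embed the code-computed count so the model only has to report it."""
    return (
        f"Question: {question}\n"
        f"Verified count (computed by code, do not recompute): {count}\n"
        "Answer the question using exactly this number."
    )


# Demo with a throwaway bundle of five hypothetical concept files.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    for i in range(1, 6):
        (bundle / f"clause-{i:03d}.md").write_text(f"Clause {i}\n", encoding="utf-8")

    demo_count = count_concepts(bundle)
    demo_prompt = build_count_prompt(
        "How many clauses are in this document?", demo_count
    )

print(demo_prompt)
```

The number in the prompt comes from the filesystem, not from the model. Whatever model receives this prompt only has to repeat a value it was handed.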

A concrete example

A software project has a changelog. The changelog has 40 release entries, each with a version number, a date, and a list of changes.

A naive pipeline passes the full changelog as text and asks the model to count the releases. The model returns "approximately 40" or "around 38–42", because counting 40 distinct entries across a long context is genuinely hard for a language model: not impossible, just unreliable.

pdf2okf parses the changelog into 40 concept files, one per release. The code layer counts: 40 files. The model reports: 40 releases.

That number is not approximate. It is not an estimate. It is what the code returned. The same approach works for contract clauses, invoice line items, changelog entries, regulatory requirements, and patient records. It fits any document where the exact count matters.
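The changelog example can be made concrete with a short Python sketch. Two things here are assumptions for illustration only: the heading format (`## 1.2.0 - 2024-01-31`) and the concept file layout. A regular expression also stands in for the model's structure-finding step, purely so the example runs without a model; in pdf2okf that step belongs to the model.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical release heading, e.g. "## 1.2.0 - 2024-01-31".
RELEASE_HEADING = re.compile(
    r"^## (\d+\.\d+\.\d+) - (\d{4}-\d{2}-\d{2})$", re.MULTILINE
)


def split_releases(changelog: str) -> list[dict]:
    """Split changelog text into one record per release heading."""
    matches = list(RELEASE_HEADING.finditer(changelog))
    releases = []
    for i, m in enumerate(matches):
        # A release body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(changelog)
        releases.append(
            {
                "version": m.group(1),
                "date": m.group(2),
                "body": changelog[m.end():end].strip(),
            }
        )
    return releases


def write_concept_files(releases: list[dict], out_dir: Path) -> None:
    """Write one file per release (file naming is an assumption)."""
    for r in releases:
        path = out_dir / f"release-{r['version']}.md"
        path.write_text(
            f"version: {r['version']}\ndate: {r['date']}\n\n{r['body']}\n",
            encoding="utf-8",
        )


# Synthetic changelog with 40 releases, mirroring the example above.
changelog_text = "\n".join(
    f"## 1.{n}.0 - 2024-01-{(n % 28) + 1:02d}\n- Change for release {n}\n"
    for n in range(40)
)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    write_concept_files(split_releases(changelog_text), out)
    # The count is a filesystem operation, not a model estimate.
    release_count = len(list(out.glob("release-*.md")))

print(f"{release_count} releases")
```

Once the entries exist as separate files, the count is the number of files on disk, however long the original changelog was.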

Why this makes smaller models trustworthy

The model-finds-structure, code-counts pattern has an important side effect: it makes a small, locally run open model as reliable as a frontier model for these tasks.

A 7B-parameter model running on a laptop cannot reliably count 40 items in a long context. But it can read 40 concept files, understand each one, and report a number the code calculated. The model's role is scoped to what it is good at: reading, classifying, and synthesizing. The arithmetic is handled deterministically by code.

That is why local, open-weight models (Gemma 4, Qwen3.5, or OLMo 3, running via Ollama or oMLX on Apple Silicon) are not a compromise in the pdf2okf architecture. Their limitations are worked around by design, not compensated for by scale. Deterministic LLM answers do not require a frontier model. They require a design that does not ask the model to do things models are bad at.

Auditable by construction

Every concept file has a stable identifier. A count answer can be audited: ask for the list of concept files that were counted, inspect any of them, and verify the total independently. The answer traces back to discrete, human-readable files on disk.
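An auditable count might look like the following Python sketch. It assumes that a concept's stable identifier is its file name without the extension, and that concept files end in `.md`; the names `CountAnswer`, `count_with_evidence`, and `audit` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile


@dataclass(frozen=True)
class CountAnswer:
    """A count together with the identifiers that were counted."""
    count: int
    concept_ids: tuple[str, ...]


def count_with_evidence(bundle_dir: Path, pattern: str = "*.md") -> CountAnswer:
    # Assumption: the stable identifier is the file stem.
    ids = tuple(sorted(p.stem for p in bundle_dir.glob(pattern) if p.is_file()))
    return CountAnswer(count=len(ids), concept_ids=ids)


def audit(answer: CountAnswer, bundle_dir: Path, suffix: str = ".md") -> bool:
    """Independently verify that the count matches real, distinct files."""
    all_exist = all(
        (bundle_dir / f"{cid}{suffix}").is_file() for cid in answer.concept_ids
    )
    no_duplicates = len(set(answer.concept_ids)) == len(answer.concept_ids)
    return all_exist and no_duplicates and answer.count == len(answer.concept_ids)


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    for i in range(1, 4):
        (bundle / f"clause-{i:03d}.md").write_text(f"Clause {i}\n", encoding="utf-8")

    answer = count_with_evidence(bundle)
    audit_ok = audit(answer, bundle)

    # A count that does not match its evidence fails the audit.
    tampered = CountAnswer(count=4, concept_ids=answer.concept_ids)
    tampered_ok = audit(tampered, bundle)

print(answer.count, answer.concept_ids, audit_ok, tampered_ok)
```

Because the answer carries its own evidence, a reviewer can open any listed file and recompute the total without trusting the model or the pipeline.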

Compare that to a count generated entirely by a model: there is nothing to inspect. The number emerged from learned parameters and cannot be checked. This is the gap between a confident-sounding answer and an auditable one. In legal, finance, or compliance contexts, only one of those is actionable.

How this sits inside pdf2okf

When pdf2okf processes a document, the result is an OKF-compatible bundle where every meaningful unit is its own concept file. Counting reduces to a filesystem operation. Retrieval reduces to finding the right files. Analysis becomes a matter of passing the right context to the model, with the heavy deterministic work already done by code.
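Retrieval as "finding the right files" can be sketched the same way. The example below assumes each concept file begins with simple `key: value` header lines followed by a blank line; that layout, the field names, and the helpers `read_header` and `find_concepts` are illustrative assumptions, not the OKF format itself.

```python
from pathlib import Path
import tempfile


def read_header(path: Path) -> dict[str, str]:
    """Parse leading 'key: value' lines up to the first blank line."""
    header: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            break
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header


def find_concepts(bundle_dir: Path, **criteria: str) -> list[Path]:
    """Return concept files whose header matches every given criterion."""
    matches = []
    for path in sorted(bundle_dir.glob("*.md")):
        header = read_header(path)
        if all(header.get(k) == v for k, v in criteria.items()):
            matches.append(path)
    return matches


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    currencies = ["EUR", "USD", "EUR", "EUR"]
    for i, currency in enumerate(currencies, start=1):
        (bundle / f"line-{i:03d}.md").write_text(
            f"type: line_item\ncurrency: {currency}\n\nInvoice line {i}\n",
            encoding="utf-8",
        )

    matched = find_concepts(bundle, type="line_item", currency="EUR")
    matched_names = [p.name for p in matched]

print(len(matched_names), matched_names)
```

Only the matched files would then be passed to the model as context, and the number of matches is again computed by code.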

The pipeline runs locally, works offline, and is model-agnostic. The determinism comes from the architecture, not from a proprietary API. If the model gets better, the answers get better, but the count is exact either way.