Hallucination-free document Q&A: cited, deterministic answers

Why document Q&A hallucinates

Ask an LLM a question about a document and you expose it to three failure modes at once.

The first is memory bleed: the model supplements gaps in the retrieved text with knowledge from training. The answer sounds plausible because it often is, just for some other document, or for a general truth that doesn't hold for yours. A contract analyst asking "does this agreement cap liability at three times annual fees?" might receive a confident yes that reflects how liability clauses are typically written, not what this clause actually says.

The second is context degradation. When a long document is fed to the model as a single block, the model's ability to retrieve details falls sharply outside the first and last few thousand tokens. The middle of a 60-page PDF is effectively invisible to a naive pipeline.

The third is that chunking destroys structure. Traditional RAG splits documents into overlapping text windows, so a clause that straddles a chunk boundary arrives in fragments, and a table split across two chunks is read as prose.
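The chunk-boundary problem is easy to reproduce. The sketch below is an illustration only: it uses fixed character windows (real pipelines usually count tokens), and the sample document, the `chunk_fixed` helper, and the window sizes are all made up for the example.

```python
def chunk_fixed(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Hypothetical sample text; the clause is longer than one window.
clause = "Liability is capped at two times the annual fees paid."
document = "Section 8. Term and renewal. " + clause + " Section 10. Notices."

chunks = chunk_fixed(document, size=40, overlap=10)

# True only if some single chunk contains the whole clause.
clause_is_intact = any(clause in chunk for chunk in chunks)

for chunk in chunks:
    print(repr(chunk))
print("clause intact in one chunk:", clause_is_intact)
```

Because the clause is longer than the window, no single chunk can hold it, and a retriever that returns one chunk hands the model only a fragment. Overlap reduces how often this happens but cannot rule it out.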

The result is answers that are fluent, confident, and untraceable. In a regulated context, an untraceable answer is useless even when it happens to be right.

The fix: ground every answer in a source file

The structural fix is to convert each meaningful unit of a document (a section, a clause, a table, a definition) into its own concept file before any question is asked. The concept file carries the original text, a stable identifier, and enough structure that a retrieval step can locate it precisely.
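One way to picture a concept file is as a small record. This is a minimal sketch, not a description of any particular file format: the `ConceptFile` fields, the `make_concept_id` helper, and the hashing scheme are assumptions chosen for illustration.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class ConceptFile:
    """Hypothetical in-memory shape of one concept file."""
    concept_id: str  # stable identifier that citations point to
    kind: str        # e.g. "section", "clause", "table", "definition"
    title: str
    text: str        # the original text, kept verbatim


def make_concept_id(doc_name: str, kind: str, title: str) -> str:
    """Derive the ID from fixed inputs so repeated conversions give the same ID."""
    digest = hashlib.sha256(f"{doc_name}|{kind}|{title}".encode("utf-8")).hexdigest()
    return f"{kind}-{digest[:12]}"


# Hypothetical example record.
clause = ConceptFile(
    concept_id=make_concept_id("msa.pdf", "clause", "9.2 Limitation of liability"),
    kind="clause",
    title="9.2 Limitation of liability",
    text="Liability is capped at two times the annual fees paid.",
)
print(clause.concept_id)
```

Deriving the identifier from the document name, kind, and title is just one option. What matters is that the ID does not change between runs, so a citation recorded today still resolves tomorrow.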

When a question arrives, the system retrieves the relevant concept files and passes them to the model. The model's task is then narrow: synthesize an answer from the files in context and cite the file it drew from.
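The retrieve-then-answer step might look roughly like this. It is a sketch under stated assumptions: the word-overlap scoring is a toy stand-in for a real retriever, and the file IDs, sample texts, and the `retrieve` and `build_prompt` helpers are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, files: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k file IDs ranked by word overlap with the question."""
    q = tokenize(question)
    ranked = sorted(files, key=lambda fid: (-len(q & tokenize(files[fid])), fid))
    # Drop files that share no words with the question at all.
    return [fid for fid in ranked[:k] if q & tokenize(files[fid])]


def build_prompt(question: str, files: dict[str, str], ids: list[str]) -> str:
    """Put only the retrieved files in context, each labelled with its ID."""
    sources = "\n\n".join(f"[{fid}]\n{files[fid]}" for fid in ids)
    return (
        "Answer using only the sources below. "
        "Cite the source ID in square brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


# Hypothetical concept files keyed by ID.
files = {
    "clause-9.2": "Liability is capped at two times the annual fees paid.",
    "clause-12.1": "Either party may terminate with ninety days written notice.",
    "definition-fees": "Fees means the annual subscription fees set out in the order form.",
}
question = "Does this agreement cap liability at three times annual fees?"

ids = retrieve(question, files)
prompt = build_prompt(question, files, ids)
print(prompt)
```

Labelling each source with its ID inside the prompt is what lets the model cite a specific file rather than the document as a whole.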

That citation is what converts an answer into an auditable record. A reviewer can open the cited concept file, read the exact clause or row, and verify the answer in seconds. The answer is no longer a vibe. It is a verifiable claim with a pointer to the evidence.

This is what "AI with citations" actually means at the architectural level. It is not a post-hoc formatting choice; it is a constraint on how the model is allowed to answer.
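Treating citations as a constraint implies that an answer can be rejected. A minimal sketch of such a gate follows, assuming citations appear as IDs in square brackets; that format and the `check_citations` helper are illustrative assumptions.

```python
import re

# Assumed citation format: a source ID in square brackets, e.g. [clause-9.2].
CITATION = re.compile(r"\[([^\[\]]+)\]")


def check_citations(answer: str, context_ids: set[str]) -> tuple[bool, list[str]]:
    """Accept an answer only if it cites something and every cited ID was in context.

    Returns (accepted, unknown_ids).
    """
    cited = CITATION.findall(answer)
    unknown = sorted(set(cited) - context_ids)
    return (bool(cited) and not unknown, unknown)


context_ids = {"clause-9.2", "definition-fees"}

print(check_citations("No. The cap is two times annual fees [clause-9.2].", context_ids))
print(check_citations("Yes, liability is capped at three times fees.", context_ids))
print(check_citations("Yes, see [clause-7.1].", context_ids))
```

This check only confirms that each citation points at a file that was actually in context. Whether the cited text supports the claim is still for the reviewer to confirm by opening the file.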

Determinism: when the number has to be exact

Citations address what the answer says. Exact numbers need a separate mechanism.

Counting over long context is one of the most reliable ways to produce a wrong-but-plausible answer. A model asked to count 40 release entries in a changelog may answer "around 40", or miss three, or add two. The error is silent because the number arrives in the same confident tone as the right answer.

The reliable design is a clean division of labour: the model identifies structure, code executes the count. Concept files make this straightforward: a code layer counts them directly and returns the number. The model reports what the code returned; it does not estimate. The result is "40 releases", not "approximately 40". The sibling article, "The model finds the structure, code does the counting", works through this pattern in detail.
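In code, the counting side of that division can be very small. The sketch below assumes each concept file carries a `kind` field; the field names and the changelog data are made up for the example.

```python
from collections import Counter


def count_by_kind(files: list[dict]) -> Counter:
    """Count concept files per kind. The code does the counting, not the model."""
    return Counter(f["kind"] for f in files)


# Hypothetical bundle: 40 release entries plus one introductory section.
files = [{"id": f"release-{n:02d}", "kind": "release"} for n in range(1, 41)]
files.append({"id": "section-intro", "kind": "section"})

counts = count_by_kind(files)

# The model is handed this string to report; it never estimates the number.
answer = f"{counts['release']} releases"
print(answer)
```

The same input always yields the same count, which is the property the model alone cannot guarantee.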

An honest note on model size

The common assumption is that hallucination-free document Q&A requires a frontier model. For grounded extraction tasks (reading what the document says and reporting it), the evidence points elsewhere. When the right text is in context, small open models now perform nearly as well as frontier ones. The bottleneck is retrieval quality, not parameter count.

One finding here is counterintuitive: reasoning modes can reduce faithfulness on extraction tasks. Extended-thinking or chain-of-thought prompting encourages the model to draw on background knowledge to improve the answer. For multi-hop analysis, that is often what you want. For source-faithful extraction, it is the exact failure mode you are trying to prevent. The practical rule: enable reasoning for analysis, disable it for extraction.
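The rule can be written down as a tiny routing function. This is a sketch only: the `Task` categories and the `reasoning` settings key are hypothetical and would need mapping onto whatever option the model API in use actually exposes.

```python
from enum import Enum


class Task(Enum):
    EXTRACTION = "extraction"  # report what the document says
    ANALYSIS = "analysis"      # multi-hop reasoning over the document


def generation_settings(task: Task) -> dict:
    """Enable reasoning for analysis and disable it for extraction.

    The "reasoning" key is a placeholder name, not a real API parameter.
    """
    return {"reasoning": task is Task.ANALYSIS}


print(generation_settings(Task.EXTRACTION))
print(generation_settings(Task.ANALYSIS))
```

Making the choice explicit per task keeps a later change of model or default from silently turning reasoning on for extraction.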

Frontier cloud models still lead on hard multi-hop reasoning across very long documents. That is what a bring-your-own-key (BYOK) configuration is for: routing the tasks that genuinely need frontier reasoning to a model of your choice, without locking your data into any one infrastructure.

Why this matters for legal and finance

In regulated industries, an answer without a source is not an answer. It is a liability. A contract team asking whether a specific clause exists cannot act on a confident "yes" they cannot trace back to a paragraph. An auditor asking how many invoices exceed a threshold cannot use "around 130" as a figure in a report.

The cited, source-grounded architecture converts a generative tool into a reference tool. It does not change what the model knows. It changes what the model is permitted to say, restricting it to the evidence in front of it and requiring it to show its work.

How pdf2okf implements this

pdf2okf converts PDFs into OKF-compatible bundles built around concept files. Each file holds one unit of meaning: a clause, a section, a table row, a definition. Retrieval operates at file granularity, not chunk granularity, so citations point to reviewable, human-readable text on disk.

Every answer can be traced to its source; every count is produced by code, not estimation; the entire pipeline runs locally. Deterministic AI is not a property of a particular model. It is a property of the architecture you build around it.