Self-hosted document AI for healthcare and patient data

A clinic's hardest data, sent to someone else's computer

Healthcare runs on documents. Discharge letters, lab reports, imaging findings, consent forms, research datasets: the most sensitive records an organisation can hold, and exactly the ones an AI assistant would be most useful for. The catch is structural: most AI assistants are cloud services, so "let the AI read it" means the patient record leaves the building. For health data, that single step is where the legal weight lands.

This article is about why that step matters so much, why on-premise or local document AI is usually the cleanest answer, and why it does not make the rest of the law disappear.

Health data is a special category: Art. 9 GDPR

Under the GDPR, most personal data can be processed on one of the ordinary lawful bases in Art. 6. Health data is not ordinary. Data concerning health is a special category under Art. 9 GDPR, and processing it is prohibited by default unless one of the specific conditions in Art. 9(2) applies, such as explicit consent, the provision of healthcare, or scientific research under safeguards.

That default-prohibition framing changes the posture. You are not just asking "do I have a lawful basis?" You are asking "which narrow exception lets me process this at all, and have I met the conditions attached to it?" Special-category processing also triggers heightened obligations: appropriate technical and organisational measures (TOMs) sized to the risk, and in many cases a data protection impact assessment (DPIA) before you begin.

None of that is a reason to avoid AI. It is a reason to be deliberate about where the processing happens.

§203 StGB: confidentiality is also criminal law

In Germany a second layer sits on top of the GDPR. §203 StGB makes it a criminal offence for doctors and other professional secret-holders (Berufsgeheimnisträger) to disclose a patient's private secrets without authorisation. The same logic that binds lawyers and tax advisers binds the medical professions.

Here is the nuance that vendor pitches tend to mangle: this does not make cloud AI illegal. A 2017 reform of §203 explicitly permits bringing in outside service providers (mitwirkende Personen), including IT and cloud providers, provided they are properly bound to secrecy and the disclosure stays limited to what the cooperation requires. Using a processor is therefore lawful in principle. What §203 does is raise the stakes: an inadequately bound or over-reaching cloud arrangement is not merely a data-protection finding; it is potential criminal exposure for the practitioner. That makes cloud a risk question to be managed, not a settled prohibition, and it makes "the data never left" an unusually attractive position.

Why on-premise / local is the clean answer

The cleanest way to manage a risk is to remove its cause. If inference runs on hardware you control, whether as on-device AI or against your own key (BYOK), the patient record never leaves your jurisdiction, never reaches a third-party processor, and never becomes a cross-border transfer.

No third-party disclosure. There is no external operator to bind under §203 and no processor chain to audit under Art. 28. You stay the sole controller.
No transfer problem. Nothing crosses a border, so the whole Chapter V third-country analysis, the part that gets US-owned clouds into trouble, simply does not arise. This is the same structural argument laid out for GDPR-compliant AI and, more broadly, for on-premise vs EU cloud.
Real data sovereignty. Residency, jurisdiction, and stack control collapse into one easy answer, because the data is on your disk.

For a hospital, a practice, or a medical-research group, that is the difference between a complicated trust story about a foreign vendor and a one-sentence answer: the records stayed on-site.
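
For the on-premise case, "the records stayed on-site" can be backed by a small technical control as well as a policy. The following is a minimal sketch, not part of any tool named in this article: a hypothetical helper, is_local_endpoint, that accepts an inference endpoint URL only if its host is localhost or a literal loopback or private IP address. It deliberately rejects DNS names, because it cannot know where they resolve. A real deployment would enforce the same rule at the network layer too.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_endpoint(url: str) -> bool:
    """Return True only for 'localhost' or a literal loopback/private IP host.

    Hypothetical helper for illustration; the name and policy are assumptions.
    """
    host = urlsplit(url).hostname
    if host is None:
        # No parseable host at all: refuse rather than guess.
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: we cannot tell where it resolves without a lookup,
        # so this conservative sketch rejects it.
        return False
    return addr.is_loopback or addr.is_private
```

A caller would check the configured endpoint once at start-up and refuse to process any patient document if the check fails, so a misconfigured URL fails closed instead of silently sending records off-site.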

What self-hosting does not do

Be precise here, because overstating this is its own risk. Self-hosting does not exempt you from the GDPR. It removes the third-party transfer, not the duties. You remain the controller, and the special-category machinery still runs:

You still need a valid Art. 9(2) condition to process health data at all.
You still owe appropriate TOMs, and most patient-data AI workflows will still warrant a DPIA.
Data-subject rights, accuracy, retention, and access control are still yours to honour, now for a system you operate yourself.

There is also the EU AI Act to keep in view. It is risk-based and phased, and some medical uses can fall into the high-risk tier; being the deployer of a self-hosted system does not exempt you from the deployer obligations. Self-hosting simplifies the data-protection and confidentiality story; it does not delete the AI Act story. The detail is covered separately under "EU AI Act and self-hosting".

In short: local inference gives you a much stronger starting position. It is not zero obligations.

Where pdf2okf fits

pdf2okf is built for exactly this end of the spectrum. It turns a PDF into OKF-compatible markdown concept files plus extracted figures, packaged as a portable OKFZ workspace, on your own hardware or against your own key. (OKF is Google's open standard; pdf2okf is compatible with it, not the author of it.)

An agent then reads those plain files directly by searching them. There is no vector database, no embedding service, and nothing uploaded to a third party. Because the answer comes from grepping the actual text, the retrieval step is deterministic and every hit can be cited back to its source concept, which matters when an output about a real patient has to be defensible.
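
The search-based retrieval described above can be sketched in a few lines. This is an illustration under stated assumptions, not pdf2okf's actual API: the names Hit and search_concepts are hypothetical, and the workspace is assumed to be an unpacked directory of .md concept files. The sketch shows why such retrieval is repeatable: files are visited in sorted order, and each hit carries the file and line number needed for a citation.

```python
import re
from pathlib import Path
from typing import List, NamedTuple


class Hit(NamedTuple):
    path: str     # concept file, relative to the workspace root
    line_no: int  # 1-based line number inside that file
    text: str     # the matching line, stripped of surrounding whitespace


def search_concepts(root: str, pattern: str) -> List[Hit]:
    """Case-insensitive regex search over all .md files below root.

    Hypothetical helper for illustration; the directory layout is an assumption.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    base = Path(root)
    hits: List[Hit] = []
    # Sort paths so result order never depends on filesystem iteration order.
    for md_file in sorted(base.rglob("*.md")):
        lines = md_file.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, start=1):
            if regex.search(line):
                rel = md_file.relative_to(base).as_posix()
                hits.append(Hit(rel, line_no, line.strip()))
    return hits
```

Calling search_concepts on the workspace directory with a term such as "potassium" returns the same ordered list of hits on every run over unchanged files, and each hit's path and line_no can be quoted alongside the answer as its source.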

The result for a healthcare team is the structural answer that both Art. 9 and §203 reward: the patient record never left your control, so the third-party transfer and the external-disclosure problems never arise. The GDPR duties that remain (the lawful basis, the TOMs, the DPIA) then sit on a clean, self-contained foundation instead of on top of a foreign processor you have to keep trusting.