Document AI for law firms: § 203 StGB, client privilege & self-hosting
The files a law firm cannot afford to leak
A law firm's working material is almost entirely privileged: client correspondence, contracts under negotiation, litigation strategy, evidence, settlement figures. These are exactly the documents that make a strong case for document AI: long, dense, full of facts a lawyer needs to find fast. They are also exactly the documents you are not free to hand to whoever you like.
Pasting a privileged file into a third-party cloud AI is not a neutral productivity choice. It is a disclosure. You are sending a client's secrets to a company you do not control, often in another jurisdiction, frequently under terms that let the provider reuse what you send. For a German lawyer that is not merely a privacy question. It is a question that reaches into the criminal law.
§ 203 StGB: why this is a criminal question, not only a policy one
German law treats a lawyer's duty of confidentiality as a Berufsgeheimnis, a professional secret, and backs it with criminal sanctions. § 203 Abs. 1 StGB makes it an offence for members of certain professions, including lawyers (Rechtsanwälte), to disclose without authorization a secret that was entrusted to them in that capacity. The duty is personal to the Berufsgeheimnisträger, and a breach is a crime, not merely a contractual slip or a regulatory fine.
That framing changes how seriously you have to take "where does this document go." If a tool routes a privileged file to a third party, the first question is not "is this convenient" but "is this an unbefugte Offenbarung", an unauthorized disclosure.
But cloud AI is not automatically illegal
This is where careful reading matters, because the loud version of the story, "cloud AI is illegal for lawyers", is wrong, and acting on it leads to bad decisions.
In 2017 the legislature reformed § 203 precisely because professionals rely on outside help, including IT and cloud providers. § 203 Abs. 3 StGB now expressly recognizes mitwirkende Personen, people who assist in the professional activity, and permits disclosing secrets to them to the extent that the disclosure is necessary for making use of their assistance. The flip side is a hard condition: those assisting persons must themselves be bound to secrecy, and § 203 Abs. 4 StGB makes them criminally liable for a breach of their own.
So the accurate statement is: involving an external IT or cloud service provider is permitted in principle, provided the involvement is necessary and the provider is properly bound to confidentiality. Cloud AI is therefore not subject to a blanket criminal prohibition. It is a disclosure-and-contract problem: you may use it lawfully only if you can bind the provider, document the necessity, and trust that the chain actually holds. The legal risk does not vanish. It moves into the contracts and into your ability to enforce them against a counterparty that may sit under a foreign legal system.
The duties stack: GDPR, CLOUD Act, AI Act
§ 203 is the sharpest edge, but it is not the only one. Three regimes apply at once, and they reinforce each other rather than cancel out.
GDPR. Client files are full of personal data, so processing them is regulated data processing. A cloud LLM that sends those files abroad triggers the GDPR's transfer rules and its controller-processor obligations on top of the confidentiality duty. The mechanics are covered in "GDPR-compliant AI". In short, a lawful basis and a real processor relationship are both required, and "the provider trains on your data" quietly breaks both.
The CLOUD Act. Many firms try to solve this with EU data residency: "our provider stores everything in Frankfurt." That addresses geography, not jurisdiction. Under the US CLOUD Act, a US-owned provider can be compelled to produce data in its control wherever it is stored, so a Frankfurt region with a US parent is still reachable under US law. Residency is not the same as sovereignty; the detail is in "the CLOUD Act explained". For a lawyer, that residual reachability is precisely the kind of exposure that is hard to square with a binding secrecy obligation.
The EU AI Act. Going the other way (self-hosting) does not put you outside the rules either. If a firm runs a model to process its own matters, it is acting as a deployer, and deployer duties (AI literacy, transparency where it applies, and the high-risk obligations if the firm's use ever falls into that tier) still attach. As "the AI Act and self-hosting" sets out, self-hosting trims the provider-side story; it does not hand you an exemption.
Why local or BYOK is the cleanest answer
Put the three regimes together and one approach is structurally simpler than the rest: don't disclose the file to a third party at all.
Run the model on the firm's own hardware (on-device AI) or against your own provider key (BYOK), and the privileged document never leaves your control. There is no mitwirkende Person to bind under § 203, because there is no outside party in the loop. There is no third-country transfer to paper over, and no US-owned processor for a foreign order to reach. The hardest parts of the analysis don't get answered so much as they stop arising.
Two honest caveats, because overclaiming here is its own risk:
- Self-hosting is not an exemption. You remain the GDPR controller, you still owe your clients the full professional confidentiality duty, and your AI-Act deployer obligations still apply. What local inference removes is the disclosure to an outside party, the single riskiest ingredient, not the rest of your duties.
- You still have to run it well. Access control, retention, and security of the local system are now your responsibility instead of a vendor's. That is a stronger position, not a free one.
Where pdf2okf fits
pdf2okf is built for exactly this constraint. It turns a PDF into OKF-compatible markdown concept files plus the figures it extracts, packaged as a portable OKFZ workspace. An agent then reads those plain files directly (it greps the text, with no vector database in the middle), so answers are deterministic and cited: the code counts the exact figures, and the model reports them rather than guessing.
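To make the "grep, count, cite" idea concrete, here is a minimal sketch in plain Python. It is not pdf2okf's actual code or API: the names `Hit`, `grep_workspace`, and `cite`, and the assumption that an unpacked workspace is simply a folder of `.md` files, are hypothetical and chosen for illustration only. The point it shows is that the count and the citations come from a deterministic scan of local files, so the model only has to report them.

```python
import re
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Hit:
    file: str     # path relative to the workspace root
    line_no: int  # 1-based line number, usable as a citation
    text: str     # the matching line, stripped of surrounding whitespace


def grep_workspace(root: Path, pattern: str) -> list[Hit]:
    """Return every line in the workspace's .md files that matches `pattern`.

    Hypothetical helper: assumes the workspace is a local folder of
    markdown files. Nothing is sent over the network.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[Hit] = []
    # Sorted traversal keeps the result order stable from run to run.
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for line_no, line in enumerate(lines, start=1):
            if regex.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append(Hit(rel, line_no, line.strip()))
    return hits


def cite(hits: list[Hit]) -> str:
    """Format an exact count plus file:line citations for the model to report."""
    if not hits:
        return "0 matching lines"
    refs = ", ".join(f"{h.file}:{h.line_no}" for h in hits)
    return f"{len(hits)} matching line(s): {refs}"
```

Calling `cite(grep_workspace(Path("workspace"), r"settlement"))` on a local folder would return the number of matching lines together with their `file:line` locations. Because the scan is an ordinary file read in sorted order, the same workspace and pattern always give the same answer.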
Crucially, all of that runs on your own hardware, or against your own key. No page of a privileged file is uploaded to a third party, so there is no outside disclosure to justify under § 203, no transfer to safeguard under the GDPR, and no foreign processor to worry about under the CLOUD Act. (For the record: OKF, the Open Knowledge Format, is Google's open standard; pdf2okf is compatible with it, not the author of it.)
That doesn't make compliance disappear; nothing can. It removes the one thing a law firm genuinely cannot risk, sending a client's secrets to someone else's computer, and leaves you with a confidentiality story you can state in a single sentence: the file never left.