On-premise document AI for finance: BaFin, MaRisk & DORA

In a bank, a cloud AI prompt is an outsourcing decision

When an analyst pastes a credit file, a KYC dossier, or a draft prospectus into a cloud chatbot, it doesn't feel like procurement. In a regulated financial institution, it is. The document is confidential client data, and the chatbot is an external IT service that now processes it. For a bank, insurer, or asset manager, routing a confidential document into a third-party system is ICT outsourcing, and outsourcing in finance is one of the most heavily supervised activities there is.

That reframing changes the question. It is no longer "is this tool any good?" but "have we met our outsourcing and ICT-risk obligations before the data left the building?" For most cloud AI tools used ad hoc, the honest answer is no.

To be clear up front: using cloud AI is not illegal, and nothing here says it is. It is permitted, common, and governed. The point is precisely that it is governed, and the duties attach to you, the regulated firm, not to the vendor.

MaRisk: outsourcing is allowed, but never free

In Germany, BaFin's MaRisk (Minimum Requirements for Risk Management) is the supervisory rulebook that fleshes out the German Banking Act. Its outsourcing module, AT 9, gives concrete shape to the outsourcing duties in § 25b KWG. The core idea is simple: you can outsource an activity, but you cannot outsource the responsibility for it.

For anything that qualifies as material outsourcing, AT 9 expects you to have done the work in advance:

a documented risk analysis that decides whether the arrangement is material in the first place;
contractual rights to information, audit, and instruction, so you, and your supervisor, can actually look inside the service;
continuity safeguards for an exit, so the function survives if the provider fails or the contract ends;
ongoing monitoring and a regular reassessment of the arrangement.

A cloud AI service that ingests client documents is a strong candidate for material outsourcing. If you cannot show the risk analysis, the audit rights, and the exit plan, the gap is yours, not the provider's.
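
As a minimal sketch of how that gap could be made visible internally, the four AT 9 expectations above can be tracked as a simple record per arrangement. The class and field names are hypothetical and purely illustrative; this is not a BaFin template and does not replace the documented risk analysis itself.

```python
from dataclasses import dataclass


@dataclass
class OutsourcingArrangement:
    """Hypothetical record of one third-party arrangement (illustrative fields only)."""
    name: str
    risk_analysis_documented: bool
    audit_rights_in_contract: bool
    exit_plan_in_place: bool
    monitoring_scheduled: bool

    def gaps(self) -> list[str]:
        """Return the AT 9 expectations from the list above that are not yet evidenced."""
        checks = {
            "documented risk analysis": self.risk_analysis_documented,
            "information, audit, and instruction rights": self.audit_rights_in_contract,
            "exit and continuity safeguards": self.exit_plan_in_place,
            "ongoing monitoring and reassessment": self.monitoring_scheduled,
        }
        return [label for label, done in checks.items() if not done]


# Example: an ad hoc cloud chatbot with nothing but a risk analysis on file.
chatbot = OutsourcingArrangement(
    name="ad hoc cloud chatbot",
    risk_analysis_documented=True,
    audit_rights_in_contract=False,
    exit_plan_in_place=False,
    monitoring_scheduled=False,
)
missing = chatbot.gaps()
```

Anything left in `missing` is work the firm still owes, whatever the vendor's own paperwork says.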

DORA: ICT third-party risk became a hard rule in 2025

The European layer is now even more explicit. DORA, the Digital Operational Resilience Act, Regulation (EU) 2022/2554, has applied since 17 January 2025. It treats dependence on ICT providers as a financial-stability issue, not just an IT detail, and it imposes obligations that map directly onto "we sent documents to an AI vendor":

you must keep a register of information on all ICT third-party arrangements (the first submissions to competent authorities fell due in 2025);
you must manage concentration risk, the danger that everyone leans on the same handful of providers;
the framework adds oversight of critical ICT third-party providers at EU level.

A cloud AI vendor is an ICT third-party provider in DORA's sense. Every confidential workflow you push through one is another line in the register, another dependency to monitor, and another contribution to concentration risk in a market where the underlying models are supplied by very few firms.
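
To make the register and concentration point concrete, here is a rough sketch that treats each workflow as one register line and measures how many of those lines lean on the same underlying model vendor. The field names and the share calculation are assumptions for illustration; they are not the official register-of-information templates, and DORA does not prescribe this particular metric.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterEntry:
    """Hypothetical, simplified register line for one ICT third-party arrangement."""
    workflow: str
    provider: str
    underlying_model_vendor: str


def vendor_shares(entries: list[RegisterEntry]) -> dict[str, float]:
    """Share of register lines that depend on each underlying model vendor."""
    counts = Counter(e.underlying_model_vendor for e in entries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {vendor: n / total for vendor, n in counts.items()}


# Example with placeholder names: two different tools, same model vendor underneath.
register = [
    RegisterEntry("credit file summaries", "Tool A", "Vendor X"),
    RegisterEntry("KYC dossier review", "Tool B", "Vendor X"),
    RegisterEntry("prospectus drafting", "Tool C", "Vendor Y"),
]
shares = vendor_shares(register)
```

In this example two of three workflows resolve to the same vendor, even though they were bought as different tools, which is the pattern concentration-risk monitoring is meant to surface.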

Underneath it all: confidentiality and the CLOUD Act

Two more facts sit beneath the regulatory text. The first is Bankgeheimnis, the bank's duty of confidentiality toward its clients, which doesn't pause because the channel is an AI tool. The second is jurisdiction. Most frontier models run on US-owned infrastructure, and the US CLOUD Act lets US authorities compel a US-controlled provider to produce data it controls, wherever that data is stored. So a Frankfurt region buys you data residency, but not control: EU data residency is not the same as data sovereignty, and "the server is in Europe" is not a defence against a US order. The mechanics are spelled out in the CLOUD Act explained, and the residency-versus-sovereignty gap in on-premise vs EU cloud.

Why on-premise or BYOK is the cleanest answer

There is a structural move that dissolves much of this at once: don't send the document to a third party at all. Run the inference on-premise, on-device, or against your own API key (bring your own key, BYOK), and the third-party ICT dependency for that step largely disappears.

The regulatory consequences follow directly:

MaRisk AT 9: if there is no external provider processing the data, the inference step is no longer a material outsourcing to assess, audit, and unwind. Your own systems are in scope, but the outsourcing chain is gone.
DORA: no new ICT third-party provider to enter in the register, and no added concentration risk on a handful of US model vendors.
Bankgeheimnis and GDPR: the confidential document never leaves your control, so there is no transfer and no foreign jurisdiction to reason about.

This is not a marketing distinction. It is a different architecture that produces a different class of obligation.
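
For the on-premise and on-device case, the architectural difference can be enforced in code rather than in policy alone. The following is a minimal sketch of an egress guard that only accepts inference endpoints on loopback or private network addresses. The function name is hypothetical, the check deliberately treats every DNS name other than localhost as external, and it does not cover the BYOK case, where the endpoint is external by design.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback/private IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: a DNS name is treated as external in this sketch.
        return False
    return addr.is_loopback or addr.is_private


def require_local(url: str) -> str:
    """Raise before any document is sent if the endpoint is not local."""
    if not is_local_endpoint(url):
        raise ValueError(f"refusing non-local inference endpoint: {url}")
    return url
```

A real deployment would pair a check like this with network-level controls, since an application-level guard alone is easy to bypass.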

What self-hosting does not do

Be precise about the limit, because overstating it is its own compliance risk. Self-hosting does not make you exempt from MaRisk, DORA, or the GDPR. You remain the regulated party, and you take on duties rather than shedding them:

Your own ICT risk management under DORA still applies to the systems you now run.
Operational resilience, change management, and security of the local stack are now your responsibility, not a vendor's.
GDPR controller duties (lawful basis, data-subject rights, accuracy, governance) are unchanged.
And, again, cloud AI remains a legitimate, regulated option. Self-hosting is the cleaner answer for confidential financial documents, not the only lawful one.

What changes is the shape of the problem: from supervising a foreign provider you cannot fully see, to governing a self-contained system you fully control. For a regulated firm, that is usually the stronger position.

Where pdf2okf fits

pdf2okf is built for exactly this end of the spectrum. It turns a PDF into OKF-compatible markdown concept files plus extracted figures, packaged as a portable OKFZ workspace. (OKF, the Open Knowledge Format, is Google's open standard; pdf2okf is compatible with it but did not invent it.) An agent then greps those plain files directly (there is no vector database to build, secure, or audit), so answers are deterministic and cited back to the source page rather than reconstructed from an opaque embedding store.
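
To illustrate what "greps those plain files" means in practice, here is a minimal sketch of a plain-text search over a directory of markdown files. It assumes the workspace has already been unpacked to a local folder of `.md` files; the function name and directory layout are hypothetical, and this is not pdf2okf's own API. It cites hits by file and line number only, because how the concept files record source pages is not specified here.

```python
from pathlib import Path


def grep_concepts(root, term):
    """Return (relative path, line number, line text) for each case-insensitive match."""
    root = Path(root)
    needle = term.lower()
    hits = []
    # Sorted traversal keeps the result order stable from run to run.
    for path in sorted(root.rglob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

The same query over the same files always returns the same hits in the same order, which is the sense in which a grep-based lookup is deterministic and auditable.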

Crucially, all of this runs locally, on-device, or against your own key (BYOK), so the confidential document never leaves your control. There is no external processor to enter in a DORA register, no material outsourcing to justify under MaRisk AT 9, and no Bankgeheimnis question to litigate, because nothing was sent. You still owe your own regulators your own controls; pdf2okf simply removes the third-party document-AI dependency that is hardest to defend. For the data-protection mechanics underneath, see GDPR-compliant AI.