GDPR-compliant AI: why local inference removes the transfer problem
Two GDPR problems that fire the moment you hit "send"
Cloud LLMs are useful, but for personal data they trip two distinct parts of the GDPR at the same time, and most teams only budget for one of them.
Problem one: the third-country transfer (Chapter V)
Most frontier models are run by US companies on US-controlled infrastructure. Sending personal data into that pipeline is a transfer to a third country under Chapter V of the GDPR, which permits such transfers only under specific conditions, chiefly an adequacy decision (Art. 45) or appropriate safeguards (Art. 46). You can paper the transfer with Standard Contractual Clauses (SCCs) or lean on the EU-US Data Privacy Framework, but those instruments don't remove the underlying exposure. A US-owned provider stays reachable under the US CLOUD Act wherever it stores the data, so the risk the safeguards are meant to address survives the safeguards themselves. The transfer is the problem, and with a US-operated cloud LLM the transfer is unavoidable.
Problem two: who is the processor (Art. 28)
The GDPR assigns roles. You are the controller; a compliant LLM provider should be a processor that acts only on your documented instructions under a data processing agreement, per Article 28. That chain holds only as long as the provider does nothing with your inputs except what you asked. The moment inputs get reused (to train or improve a model, to evaluate quality, to build features), the provider is acting for its own purposes, and for that activity it is no longer a mere processor: under Article 28(10), a processor that determines purposes and means is treated as a controller for that processing. The clean controller-to-processor relationship the GDPR is built around quietly breaks, often through a clause buried in terms most users never read.
Why local inference neutralizes both
Run the model on your own hardware and both problems lose their cause at once.
- No transfer. If the document never leaves your machine, there is no third country, no Chapter V transfer, and nothing for SCCs or the Data Privacy Framework to patch. The question doesn't get answered. It never arises.
- No external processor. With no outside party touching the data, there's no processor to vet, no Article 28 agreement to negotiate, and no reuse-of-inputs clause to worry about. You remain the sole controller of your data, start to finish.
This is the structural reading the EDPB's Opinion 28/2024 on AI models points toward: the data-protection analysis turns on whether personal data is actually processed and exposed. Remove the exposure and you remove most of the surface. Local inference doesn't argue its way past the GDPR; it changes the facts the GDPR is applied to.
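One way to make "the document never leaves your machine" checkable and not only asserted is to refuse any inference endpoint that is not a loopback address. The sketch below is an illustration under stated assumptions: the function names and the example URLs are hypothetical, and a loopback check is deliberately stricter than "your own hardware" (a server you own on your LAN would be rejected too).

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_endpoint(url: str) -> bool:
    """Return True only if the URL's host is 'localhost' or a literal loopback IP."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as non-local instead of being resolved.
        return False


def require_local(url: str) -> str:
    """Hypothetical guard: raise before any document is sent to a non-local endpoint."""
    if not is_loopback_endpoint(url):
        raise ValueError(f"refusing non-local inference endpoint: {url}")
    return url


# Hypothetical endpoints, for illustration only.
print(is_loopback_endpoint("http://127.0.0.1:8080/v1/completions"))
print(is_loopback_endpoint("https://api.example.com/v1/completions"))
```

A guard like this does not prove compliance. It only turns the "no transfer" claim into something that fails loudly when a configuration change would quietly break it.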
What local inference does not do
Be honest about the limit, because overclaiming here is its own compliance risk. Keeping inference local does not make you "GDPR-exempt," and it does not mean the GDPR is "solved." It removes the transfer and the external-processor problems (two of the hardest ones), but you are still the controller, and a controller's duties don't disappear:
- Lawful basis is still required for whatever you process.
- Data-subject rights (access, rectification, erasure) are still yours to honor.
- Accuracy of what the system outputs about real people is still your responsibility.
- Your own governance (retention, access control, records, security of the local system) is still on you.
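The retention item in the list above is one of the few that lends itself to a mechanical check. As a minimal sketch, assuming processed documents live under a single directory and that file modification time is an acceptable proxy for age (both are assumptions, as are the function name and the 30-day period in the usage comment), a periodic job could list what has outlived the retention period:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List, Optional


def expired_files(
    root: Path, max_age: timedelta, now: Optional[datetime] = None
) -> List[Path]:
    """List files under root whose last modification is older than max_age."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    expired = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        if modified < cutoff:
            expired.append(path)
    return sorted(expired)


# Hypothetical usage: report, but do not delete, anything older than 30 days.
# for path in expired_files(Path("processed"), timedelta(days=30)):
#     print(path)
```

The function only reports candidates. Whether to delete, archive, or keep a file under a legal hold is a decision for your retention policy, not for the script.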
Self-hosting moves you from trusting a chain of foreign processors to holding a clean, self-contained data-protection story. That's a much stronger position. It is not zero obligations.
Where pdf2okf fits
pdf2okf is designed to keep you on the right side of both problems by default. It turns PDFs into OKF-compatible knowledge bundles on your own hardware, or against your own key. Run on your own hardware, no page is sent to a third party, so there's no third-country transfer and no external processor in the loop. You stay the sole controller, with a data-protection story you can explain in one sentence: the data never left. What that buys you against US jurisdiction specifically is covered on the CLOUD Act page; the wider framing is on the data sovereignty for AI page.