
Your Logs Are Tamper-Proof. They Still Can’t Answer the Question.

By Jeroen Janssen · June 2026

Stop treating “everything is logged” as evidence of control. It is evidence of recording. They are not the same thing, and the gap between them is where your next compliance failure lives.

Here is the test that tells you which one you have. Take a determination you would actually have to make for a regulator. Did personal data leave the controlled environment? Did the information barrier hold? Was the human able to intervene? Was the delegated authority valid when it was used? Now go to your records and check whether the answer is recoverable from what you store, or whether you would be reconstructing it by hand, from fragments, and hoping.

If it is reconstruction by hand, you do not have an oversight system. You have an archive, and a supervisor standing in front of it who cannot read from it the fact they are accountable for. The rest of this piece is why that happens even when the logging is flawless, and what separates a record that can answer from one that cannot.

An Agent Composes Its Own Behaviour

Start with why agentic AI is different, because the difference is the whole problem.

A conventional system has its control flow fixed at design time. You can read the code and know, in advance, the paths it can take. An agent does not work that way. It uses a model to decide what to do and a harness to do it, and the sequence of steps it takes, its execution path, is chosen while it runs. You cannot list those paths in advance, and the choice can be steered by the content the agent reads along the way. A tool does what you told it. An agent composes what it does as it goes.

That single property breaks the instruments enterprise governance reaches for. The point-in-time audit, the conformity checklist, the periodic review: all of them are static. They assess the system at a moment, against fixed criteria, and produce a record that is fixed once produced. They were built for software whose behaviour was decided in advance. An agent’s behaviour is not decided in advance. It is generated, continuously, at machine speed, along paths nobody enumerated.

So the record those instruments produce is the wrong shape before anyone even asks a question of it. It is a snapshot of a system that does not hold still. And that is the record you later walk up to and ask a question it was never built to answer.

Integrity Is Not Answerability

A tamper-evident log proves one thing: the record was not altered after the fact. It can be hash-chained, timestamped, and qualified under eIDAS, so that it carries a legal presumption of integrity. That property is real and it is necessary. It is also not the property you need.
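To make the integrity property concrete, here is a minimal sketch of a hash-chained log in Python. It is an illustration under assumptions, not a production design: the event fields, the `GENESIS` value and the function names are all hypothetical, and a real deployment would add trusted timestamps and signatures on top.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain (hypothetical)


def entry_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the hash of the entry before it."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append(chain: list, event: dict) -> None:
    """Append an event, linking it to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"event": event, "hash": entry_hash(prev, event)})


def verify(chain: list) -> bool:
    """Return True if every entry still matches the hash chain."""
    prev = GENESIS
    for entry in chain:
        if entry["hash"] != entry_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"agent": "A", "action": "read", "target": "deal_room"})
append(log, {"agent": "B", "action": "send", "target": "trading_desk"})

intact = verify(log)  # the untouched chain verifies

# Alter a stored event after the fact: the chain no longer verifies.
log[0]["event"]["target"] = "public_wiki"
tamper_detected = not verify(log)
```

Note what `verify` is able to report: whether the stored events are the ones originally written. Nothing in it can say what those events mean, or how one relates to another.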

Integrity tells you the record is intact. Answerability tells you the record contains the fact. A regulator does not ask whether your log was tampered with. A regulator asks whether the barrier held. You can have a perfect, unbroken, cryptographically sealed chain of events and be unable to answer that, because the answer was never in the record to begin with.

The smallest case makes it concrete. Two agents, one organisation. Agent A may read a deal room holding material non-public information. Agent B may send to a trading desk. The barrier is the rule that nothing from the room reaches the desk. Each agent behaves perfectly: A reads, which is allowed; B sends, which is allowed. No single action breaks a single rule, and your log flags nothing.

The breach, if there is one, is not in either action. It is in the relation between them, information from A’s read reaching B’s send. A record of two permitted actions does not contain that relation. The log cannot answer “did the barrier hold,” not because it is incomplete, but because the fact is not a function of what it recorded. You are not missing a row. You are missing a structure.
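The two-agent case can be sketched in a few lines. Everything here is a hypothetical illustration: the permission table, the field names, and in particular `drew_on`, which stands for a fact about what happened that the bare log does not record. The sketch builds two histories, one where the barrier held and one where it did not, and reduces each to what a bare log stores.

```python
# Hypothetical permissions: the (action, target) pairs each agent may perform.
PERMISSIONS = {
    "A": {("read", "deal_room")},
    "B": {("send", "trading_desk")},
}

# The only fields a bare log keeps in this sketch (an assumption).
LOGGED_FIELDS = ("seq", "agent", "action", "target")


def project(history):
    """Reduce what actually happened to the fields a bare log records."""
    return [{k: e[k] for k in LOGGED_FIELDS} for e in history]


def violations(log):
    """Return the events that break a per-action permission rule."""
    return [
        e for e in log
        if (e["action"], e["target"]) not in PERMISSIONS[e["agent"]]
    ]


# Two histories. In the second, B's send drew on A's read.
barrier_held_history = [
    {"seq": 1, "agent": "A", "action": "read", "target": "deal_room", "drew_on": None},
    {"seq": 2, "agent": "B", "action": "send", "target": "trading_desk", "drew_on": None},
]
barrier_breached_history = [
    {"seq": 1, "agent": "A", "action": "read", "target": "deal_room", "drew_on": None},
    {"seq": 2, "agent": "B", "action": "send", "target": "trading_desk", "drew_on": 1},
]

# Per-action checking flags nothing, even for the breached history.
flagged = violations(project(barrier_breached_history))

# Both histories leave exactly the same bare log behind.
same_record = project(barrier_held_history) == project(barrier_breached_history)
```

Here `flagged` is empty and `same_record` is true. Two histories with opposite answers to “did the barrier hold” produce one and the same record, so no query over that record can tell them apart.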


What a Record Needs, and Why Volume Won’t Supply It

For a finding of fact about events and their relations, a record can answer the question only if it carries two things. Typing: that the data was of a protected class, that the channel was external, that this read touched material non-public information. And the relation: that B’s send derived from A’s read, that this action ran under that authority, that the authority was still valid at that instant.
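A record that carries both can be queried for the finding directly. The sketch below is one possible shape, assuming a simplified schema: `data_class` stands in for the typing, `derived_from` and `authority_valid_until` for the relations, and `seq` doubles as a logical clock. None of these names come from a standard, and a real schema would need multiple sources per event and proper timestamps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int                 # position in the record, used here as a logical clock
    agent: str
    action: str
    target: str
    data_class: str          # typing: e.g. "MNPI" or "public" (hypothetical labels)
    derived_from: Optional[int] = None           # relation: seq of the event this one drew on
    authority_valid_until: Optional[int] = None  # relation: last seq at which the authority held


def barrier_held(events) -> bool:
    """True if no send to the trading desk traces back to an MNPI-typed event."""
    by_seq = {e.seq: e for e in events}
    for e in events:
        if e.action != "send" or e.target != "trading_desk":
            continue
        src, seen = e.derived_from, set()
        while src is not None and src not in seen:
            seen.add(src)
            origin = by_seq.get(src)
            if origin is None:
                break  # lineage points outside the record; this sketch stops here
            if origin.data_class == "MNPI":
                return False
            src = origin.derived_from
    return True


def acted_under_valid_authority(e: Event) -> bool:
    """True if the event's authority was still valid at the instant it ran."""
    return e.authority_valid_until is not None and e.seq <= e.authority_valid_until


breached = [
    Event(1, "A", "read", "deal_room", data_class="MNPI", authority_valid_until=5),
    Event(2, "B", "send", "trading_desk", data_class="MNPI",
          derived_from=1, authority_valid_until=5),
]
held = [
    Event(1, "A", "read", "deal_room", data_class="MNPI", authority_valid_until=5),
    Event(2, "B", "send", "trading_desk", data_class="public", authority_valid_until=1),
]

breached_result = barrier_held(breached)              # False: the send derives from the MNPI read
held_result = barrier_held(held)                      # True: no derivation from the room
expired = not acted_under_valid_authority(held[1])    # True: authority lapsed before the send
```

The same two actions appear in both records. What differs is the typing and the relation, and those are what the two findings are read from. The second record also shows that the findings are independent: the barrier held, and the send still ran under an authority that had already lapsed.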

A record with neither does exactly one thing. It raises a suspicion. It flags that something looks off. What it cannot do is establish a finding, and the determinations EU law places on a supervisor are findings, not suspicions. “Something looks wrong” is not an answer to “did protected data leave the environment.” The regulator wants a fact, not a hunch.

This is why capturing more does not help. The standard response to oversight anxiety is volume: more events, more fields, more retention. But if the record does not carry the typing and the relation, doubling the volume answers no new question. The determinations that stay unanswered are unanswerable in principle from a record of that shape, not for want of size. You are not under-logging. You are logging the wrong structure, at scale, with great discipline.

There is a comfortable objection, and it concedes the point. Someone says a properly instrumented trace, with lineage tags and data classifications and cross-agent correlation, would catch the breach. Correct. And a trace instrumented to those categories is no longer a bare log. It carries the typing and the relation. That is the whole argument. The disagreement is only over the name. You do not need a particular product. You need that structure, in some form, and a record without it cannot answer the question however tamper-proof it is.

Why Your Existing Framework Doesn’t Close This

The natural objection at this point is that surely something already handles it. You have a governance framework. You run NIST. You are certified to ISO 42001. One of these must cover it.

None of them do, and the reason is structural, not a flaw in any one of them. The frameworks that govern AI systems split into two kinds, and each kind misses the same thing.

One kind specifies process, not evidence. NIST’s AI Risk Management Framework and ISO 42001 tell you to have functions, owners, reviews, a management system. They certify that you run a process. They say nothing about whether a given record can answer a given finding. The other kind produces records, not findings. Generic logging, observability stacks, and SIEM-style capture record events without typing them to legal categories or carrying the relations that determinations turn on. And the pieces that carry one half do not carry the other: provenance graphs carry the relation but not the legal typing, classification schemes carry the typing but not the relation.

So the frameworks that govern AI systems either produce records without the required structure, or specify processes rather than evidence. Nothing in standard use produces, by itself, a record that can answer the determination. That is not an oversight you can patch inside one framework. It is a gap none of them was built to fill, and it is the reason this is worth writing about at all. The thing that would close it is not yet in your stack. Assuming it is already there, somewhere, under one of those acronyms, is the mistake.

The Law Splits Exactly Here

The EU AI Act does not treat this as one obligation, and neither should you.

Article 12 is a recording duty: log events over the system’s lifetime. A bare, uninterpretable log may satisfy a good part of it. Events are recorded, traceability of a kind exists, and Article 12 does not on its face demand that the log be readable for a legal fact.

Article 14 is the one that bites. It asks that natural persons be able to effectively oversee the system, which means understanding it well enough to intervene and override. Oversight that cannot read the record for the operative fact is not effective oversight. So the two duties come apart, precisely at the typing and the relation. A record can satisfy Article 12 logging and still fail the Article 14 oversight the logging was meant to serve.

State the consequence carefully, because the honest version is conditional. A record that is not semantically and relationally interpretable cannot serve as evidence that effective oversight was possible. Not “is illegal.” Cannot evidence that oversight was possible. The supervisor cannot read from it the fact the law requires to be establishable, because the fact is not there.

One note on timing, because it points at the same gap. The high-risk obligations were set to apply from 2 August 2026 for standalone Annex III systems. The Commission’s Digital Omnibus on AI, agreed politically in May 2026, would defer that, with the new dates not yet adopted or published in the Official Journal at the time of writing. The Commission’s stated reason for the delay is to align the deadline with the availability of harmonised standards and support tools. The standards that would specify what an oversightable record must contain do not yet exist. The obligation is being deferred because the specification is missing, and that specification is the typing and the relation. Verify the current dates against the Official Journal before you rely on them.

The Two Limits Are the Point, Not the Caveat

A method that overclaims is not worth your time, and the two boundaries below are not disclaimers tacked on at the end. They are what makes the claim true rather than sold.

The first is the surface. This only works for flows that cross something you can instrument. A material and uncharacterised fraction of consequential flow in a real agent system does not: information moves through the model’s own internal state, through opaque tool calls, through retrieval, through summarisation that destroys provenance, through a human who reads one screen and types into another. Where the leak runs through one of those channels, no schema senses it. The claim is bounded to flows that cross a capturable surface. Beyond that surface is an instrumentation problem this does not pretend to close, and anyone who tells you their evidence layer covers everything has not found the channels it misses.

The second is sharper, because it is the one that gets sold past. A structured record buys legibility, not truthfulness. It makes a record readable for a finding. It does not make the record honest. A legible record can still be a curated one, typed to show compliance and to omit what would show its absence. What changes is where the failure lives. With a bare log, the supervisor cannot see. With a gamed schema, the supervisor can see what someone chose to show, which is at least auditable in a way blindness is not. Legibility is necessary. It is not the same as the truth, and a schema that promises the truth is making a claim it cannot keep.

That second limit is the whole brand of this work stated in one line. Legibility you can build. Truthfulness you have to prove, every time, against a record that can be gamed. The criterion raises the floor of what a supervisor can read. It does not relieve anyone of the work above the floor.

The One-Line Version

A tamper-proof record that cannot answer the regulator’s question is theatre with a cryptographic seal on it. The question is never whether you have logs. It is whether control is real or whether it is theatre, and the evidence layer is where you find out.

Governance is not what you claim. It is what you can prove. And you can only prove what your record can actually answer.

This post draws on the working paper “From Record to Finding: Why Tamper-Proof Logs Cannot Establish Legal Oversight of Agentic AI” (Janssen, 2026), which states the criterion formally, shows it by construction, and pre-registers an experiment to test it against expert judgement. The working paper is available open access under CC BY 4.0 at doi.org/10.5281/zenodo.21025237. A revised version is on arXiv as “From Runtime Records to Legal Findings: An Evidentiary-Adequacy Criterion for Agentic AI Oversight” (arXiv:2607.00941). The companion supervisory-evidence ontology is deposited on Zenodo under CC BY 4.0.

Sources

Janssen, J. (2026). From Record to Finding: Why Tamper-Proof Logs Cannot Establish Legal Oversight of Agentic AI. Working paper. apparens.nl/essay-record-to-finding. DOI: 10.5281/zenodo.21025237. arXiv:2607.00941.
Janssen, J. (2026). A Supervisory-Evidence Ontology for Agentic AI under EU Law. Zenodo. DOI: 10.5281/zenodo.19758441.
Clarkson, M. R. & Schneider, F. B. (2010). Hyperproperties. Journal of Computer Security, 18(6).
Regulation (EU) 2024/1689 (Artificial Intelligence Act), Articles 12, 14.
European Commission (2026). Digital Omnibus on AI. Shaping Europe’s Digital Future.