The CA our agent wouldn't trust for three weeks
One of our agents rejected a perfectly valid upstream certificate, on every connection, for three weeks. The certificate was fine. The CA that signed it was trusted by the host the agent ran on. The agent simply couldn't see that — because it had read the system trust store once, at boot, and the world moved on without it.
The setup
When the agent splices a TLS connection, it verifies the upstream server's certificate before relaying anything: the chain has to validate against the host's system CA store, or the agent refuses the connection and reports upstream_cert_invalid. That refusal is the safe direction. Faced with an upstream it can't vouch for, the agent fails closed rather than relaying traffic it can't authenticate. Nothing about that design is wrong. The bug was in one word: once.
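For readers who want the shape of that check, here is a minimal sketch in Python. It is not the agent's code: check_upstream and its parameters are hypothetical names, and it assumes Python's default SSL context is an acceptable stand-in for "the host's system CA store". Only the upstream_cert_invalid label comes from the agent itself.

```python
import socket
import ssl
from typing import Optional


def check_upstream(host: str, port: int = 443, timeout: float = 5.0,
                   context: Optional[ssl.SSLContext] = None) -> str:
    """Hypothetical helper: handshake with the upstream and fail closed.

    Returns "ok" if the chain validates, "upstream_cert_invalid" if
    certificate verification fails. Other errors (refused connection,
    timeout) are not certificate problems and are left to propagate.
    """
    # create_default_context() requires a valid chain and a matching
    # hostname, and loads the platform's default CA certificates.
    ctx = context or ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError:
        # Fail closed: nothing is relayed to an upstream we can't vouch for.
        return "upstream_cert_invalid"
```

Note where the context comes from. If the caller builds it once and passes the same object in forever, the check is only as current as the moment that object was created.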
The bug
The trust store was loaded a single time, at startup. On a long-running fleet, the agents had been up since late May. The CA that signed their upstream's certificate was added to the host system store afterward — a normal provisioning step, done while the agents were already running. Nothing told them the store had changed. So each agent kept validating against the snapshot it took at boot, which predated the CA, and kept rejecting a certificate that the host itself — asked fresh — would have accepted without complaint. It did this hundreds of thousands of times over three weeks.
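The mechanism fits in a toy model. The sketch below is an illustration, not the agent's code: HostStore, Agent, and the CA name "upstream-ca" are all made up, and the "store" is just a set of names rather than real certificates.

```python
class HostStore:
    """Stand-in for the host's system CA store (hypothetical)."""

    def __init__(self) -> None:
        self.trusted_cas = set()


class Agent:
    """Toy agent that copies the trust store once, at boot."""

    def __init__(self, host: HostStore) -> None:
        # The bug: a one-time snapshot. Later changes to host.trusted_cas
        # are never seen by this instance.
        self._trusted_at_boot = frozenset(host.trusted_cas)

    def verify_upstream(self, issuer: str) -> str:
        if issuer in self._trusted_at_boot:
            return "ok"
        return "upstream_cert_invalid"  # fail closed


host = HostStore()
agent = Agent(host)                   # agent boots
host.trusted_cas.add("upstream-ca")   # CA provisioned afterward

stale = agent.verify_upstream("upstream-ca")            # "upstream_cert_invalid"
restarted = Agent(host).verify_upstream("upstream-ca")  # "ok"
```

The second agent is the "restart": same host, same certificate, new snapshot, different answer.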
Why it hid
The error was perfect camouflage. upstream_cert_invalid is exactly what you see for a genuinely untrusted upstream: a real misconfiguration, a real bad cert. Nothing in that signal says "your view of trust is stale"; it looks identical to the cert actually being bad. The failures were on a test bucket, not a customer path. And because the failure was fail-closed (refusing a good upstream, never trusting a bad one), it tripped no security alarm. A conservative bug, quietly wrong, wearing the costume of a legitimate rejection.
What ended it
A routine self-update rolled through the fleet and restarted the agents. The rejections dropped to zero instantly. That is the whole tell: when a restart fixes it, you are looking at stale in-memory state. The restarted process re-read the trust store at boot, saw the CA that had been sitting there for weeks, validated the upstream, and went quiet. Nothing about the certificate, the CA, or the network had changed in that moment. Only the agent's snapshot did.
The lessons
Four, and none of them are about cryptography:
• A long-running process's view of system configuration is a snapshot, not a subscription. Read something once at startup and you are frozen at boot time — the CA stores, the DNS, the config files all move on, and you don't. Anything a process treats as input has to be refreshable, or it becomes a slow time bomb.
• "Restart fixes it" is a diagnosis, not a workaround. If bouncing the process clears the error, stop celebrating and go find what is read once and never refreshed. The restart didn't fix the bug; it reset the symptom.
• A failure that looks legitimate is the best place for a bug to hide. Telling "the cert is bad" apart from "my view of trust is stale" needed two sources: the host (which trusted the CA) versus the agent's memory (loaded before it). The discrepancy was the bug. A single source agreeing with itself told us nothing.
• Refresh has to be a first-class operation. The fix isn't a polling timer bolted on; it's making the agent reload its trust store on a signal, so adding or rotating an upstream CA takes effect without a restart. A real deployment hits this the day an operator rotates an internal CA — the same wall, on a path that matters.
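The last lesson can be sketched as a small holder that rebuilds its value on demand, wired to a Unix signal. This is a sketch under assumptions, not our implementation: Reloadable and reload_on_signal are hypothetical names, SIGHUP is an arbitrary choice, and it assumes that a freshly built ssl.create_default_context() reflects the host store as it is at the moment of the call.

```python
import signal
import ssl
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Reloadable(Generic[T]):
    """Holds a value built by `loader` and rebuilds it when asked."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value = loader()
        self.loads = 1  # how many times the loader has succeeded

    def get(self) -> T:
        return self._value

    def reload(self) -> None:
        fresh = self._loader()  # if this raises, the old value stays in place
        self._value = fresh     # single rebind: readers see old or new, never half
        self.loads += 1


def reload_on_signal(store: Reloadable, signum: int = signal.SIGHUP) -> None:
    """Install a handler that reloads `store` (Unix; call from the main thread)."""

    def _handler(_signum, _frame) -> None:
        try:
            store.reload()
        except Exception:
            pass  # keep the previous snapshot; a real agent would log this

    signal.signal(signum, _handler)


# Hypothetical wiring: build the trust context once, but keep it refreshable.
trust = Reloadable(ssl.create_default_context)
```

With reload_on_signal(trust) installed at startup, an operator who adds or rotates a CA can send the process a SIGHUP (kill -HUP with the agent's pid) instead of restarting it. New connections call trust.get() and pick up the rebuilt context; connections already in flight keep the one they started with.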
This is the same lesson as the bug that hid PQC, from the other direction: there, our own constants agreed with each other and were all wrong; here, our own memory disagreed with the live system and was the thing that was wrong. The fix for both is the same instinct: trust the live state, not the cached one. It's also a small piece of why migrating cryptography is more about operations than math, and why operations is where we spend our time.