Penling
11 June 2026 · webdev, ai, programming, devops

I can't tell you why this code is the way it is.

Six months ago one of the engineers on my team shipped a Microsoft SSO integration. Worked first time, tests passed, PR was clean. It went in on a Thursday afternoon.

Last week someone raised a bug. Different flow, adjacent bit of auth. I pulled up the file to understand the shape of what was there before touching it.

The PR was spotless. Fourteen files changed, good commit message, all checks green. So I did what you do and ran git log auth.ts. Three commits. Dates, hashes, one-liners. Nothing I couldn't have inferred from reading the code.

What I actually wanted to know was: why Entra ID and not Okta? Were other providers considered? Was the session timeout deliberate or just the default? These aren't things you can read from a diff. They're the decisions that happened before the code was written — the ones that would have been in the ticket, or the design doc, or the Slack thread, or the engineer's head.

None of that survived the merge.


This is a new failure mode, and I don't think we've fully named it yet.

With human-written code, the record of the reasoning may be imperfect, but the reasoning at least exists somewhere. The engineer who wrote it had to understand the problem well enough to solve it. Even if they didn't document anything, they can usually reconstruct the thinking: why this approach, why not that one, what they were optimising for. The reasoning lives in their head. It's retrievable, if they're still around.

With AI-written code, that's not how it works. The model doesn't remember. The context that drove the implementation — the spec, the clarifications, the constraints — existed for the duration of the build and then evaporated. What you're left with is the output. The code and the tests, both technically correct, neither of which tells you anything about the decisions that shaped them.

The diff shows you what changed. It doesn't show you what was considered and rejected. It doesn't show you what's explicitly out of scope. It doesn't show you who decided, or why.


The review problem is downstream of this.

When I review a PR for AI-written code, I'm approving the implementation without ever seeing the decision tree behind it. I can check whether the code is correct. I can check whether the tests cover what they claim to. What I can't check is whether the right thing was built — because the spec that defined "right" isn't in the PR.

In the old model, a developer wrote the code, so they implicitly validated the approach as they went. They'd push back if the requirements were contradictory. They'd surface tradeoffs. The review was the last check in a chain that started with someone thinking the problem through.

With AI agents, that chain can be shorter than it looks. The code can be correct and the approach can be wrong at the same time, and the PR won't necessarily tell you which it is.


The fix is obvious once you've seen the problem clearly enough. You need the spec — the thing the AI built against — committed to the repository alongside the code it produced.

Not as a comment. Not in a Notion page that'll drift out of sync. In the repo, versioned with everything else, updated when the code is updated, reviewed in the same PR.
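One way to enforce "reviewed in the same PR" is a check that runs over the list of files a PR changes. The sketch below is a minimal illustration, not how any particular tool does it. The `src/` and `specs/` directory names and the function name are assumptions I've made up for the example; substitute whatever layout your repository actually uses.

```typescript
// Hypothetical convention: source lives under src/, specs under specs/.
const SOURCE_DIR = "src/";
const SPEC_DIR = "specs/";

interface SpecCheckResult {
  ok: boolean;
  sourceFiles: string[];
  specFiles: string[];
}

// Given the paths changed in a PR, report whether source changes
// arrived together with at least one spec change.
function checkSpecTravelsWithCode(changedFiles: string[]): SpecCheckResult {
  const sourceFiles = changedFiles.filter((f) => f.startsWith(SOURCE_DIR));
  const specFiles = changedFiles.filter((f) => f.startsWith(SPEC_DIR));
  // A PR that touches no source is fine; one that touches source needs a spec.
  const ok = sourceFiles.length === 0 || specFiles.length > 0;
  return { ok, sourceFiles, specFiles };
}
```

You would feed it the changed-file list from your CI system and fail the build when `ok` is false. It can't tell you whether the spec is any good, only that it was in the room.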

When you do that, the question "why is it built this way?" has a one-file answer. The decision to use Entra ID is there. The explicit exclusion of Okta is there. The session timeout is documented as deliberate, or as a default that was left pending a product decision. The context survives the merge because it was committed with the merge.
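To make that concrete, here is a rough sketch of the kind of record I have in mind, written as a TypeScript type with the SSO example filled in. Every field name here is hypothetical, and so is the content: the rationale strings are placeholders, and I've arbitrarily shown the session timeout as an unresolved default to illustrate that case. It is not a Penling format, just one plausible shape.

```typescript
// Illustrative shape only; all names are assumptions, not a real schema.
type DecisionStatus = "deliberate" | "default-pending-decision";

interface Decision {
  question: string;
  choice: string;
  status: DecisionStatus;
  rejected: string[]; // alternatives considered and ruled out
  rationale: string;
}

interface Spec {
  title: string;
  decisions: Decision[];
  outOfScope: string[];
}

const ssoSpec: Spec = {
  title: "Microsoft SSO integration",
  decisions: [
    {
      question: "Which identity provider?",
      choice: "Entra ID",
      status: "deliberate",
      rejected: ["Okta"],
      rationale: "<placeholder: written by whoever made the call>",
    },
    {
      question: "What session timeout?",
      choice: "left at the default",
      status: "default-pending-decision",
      rejected: [],
      rationale: "<placeholder: awaiting a product decision>",
    },
  ],
  outOfScope: ["Okta support"],
};

// Decisions nobody has actually made yet.
function unresolved(spec: Spec): Decision[] {
  return spec.decisions.filter((d) => d.status === "default-pending-decision");
}

// Which decisions ruled out a given alternative?
function whyNot(spec: Spec, alternative: string): Decision[] {
  return spec.decisions.filter((d) => d.rejected.includes(alternative));
}
```

With something like this in the repo, `whyNot(ssoSpec, "Okta")` points at the identity provider decision, and `unresolved(ssoSpec)` surfaces the timeout as a question still waiting on someone. A Markdown file with the same headings would do the job just as well; the structure matters more than the syntax.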

This is what we mean by spec-driven development. The spec isn't just a prompt. It's the permanent record of what the work was supposed to be — why this, not that, and what's out of scope.


The thing that surprised me, when we started doing this consistently, is how much it changes code review. Not because reviews get easier, exactly, but because they get more honest. You're not just checking whether the code works. You're checking whether it matches the intent. And because the intent is written down and in the PR, you can actually do that check.

The other thing it changes is the debugging conversation six months later. Instead of git log and archaeology, you open one file. The answer is there.
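"Open one file" only works if everyone knows which file. A simple naming convention is enough. The helper below is a sketch under one assumed convention of my own, where each source file under `src/` has a Markdown spec at the mirrored path under `specs/`; the function name and both directory names are hypothetical.

```typescript
// Assumed convention: src/<path>.<ext> is described by specs/<path>.md.
function specPathFor(sourcePath: string): string | null {
  const sourceRoot = "src/";
  if (!sourcePath.startsWith(sourceRoot)) {
    return null; // outside the source tree, no spec expected
  }
  const relative = sourcePath.slice(sourceRoot.length);
  // Drop the final extension, if any, and swap in .md.
  const withoutExtension = relative.replace(/\.[^./]+$/, "");
  return `specs/${withoutExtension}.md`;
}
```

Under that convention, the spec for `src/auth.ts` would sit at `specs/auth.md`. One spec per feature rather than per file is just as reasonable; what matters is that the lookup is mechanical.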

It's a small shift in how the work gets organised. The impact is disproportionate.


Penling is the tool we built to make this the default rather than the exception: a shared workspace where the spec is written before the build starts, handed to the AI agent via the Model Context Protocol (MCP), and committed to your repository when the PR opens. The reasoning travels with the code, permanently.

But the principle works without the tooling. If you're shipping AI code today, the question worth asking is: if someone opens this file in six months, what will they find? The diff, or the reasoning?

This is part of a series on adopting spec-driven development. Part one covers why specs make AI output reliable in the first place. Part three is on what goes wrong when the team skips the spec entirely.