Part 2 · Methodology series · 6 min read

The five-minute check that prevents months of no-op work

Earlier in this verification project, one of our phases ran to completion. The runbook walked it step by step. The tests it added were green. The change merged. Weeks later, during the phase retrospective, we noticed the phase had produced no semantic change in the codebase. The runbook had threaded a parameter through modules whose shapes had already drifted. Every step was syntactically valid against the design. None of them touched the production path the design had meant for them to touch.

The cost was several days of careful, well-reviewed engineering effort that left the codebase exactly as it had been. A five-minute check would have caught the problem before any of that work was written: a handful of grep commands against the live codebase, asking whether the modules the runbook claimed to thread through still existed in the shape the runbook assumed.

Why "trust the design" is the failure mode

The pattern has a name in our internal write-ups. We call it "trust the design." A runbook is a careful artifact. It plans each step against the codebase as the design imagined it. When the design is written, that imagined codebase matches the real one. By the time the runbook executes, the codebase has often moved. A module has been renamed, a function signature has narrowed, or a caller has been refactored out of the path the runbook intends to touch.

The runbook does not know any of this. The runbook is text. It walks its steps in order against whatever the codebase happens to be on the day the work begins. If the steps are syntactically valid against the new codebase, they will run. If the production path the steps were meant to affect is no longer the path those steps actually reach, the phase will complete with green tests and no semantic change.

The retrospective will eventually surface the drift. By then the engineering time is spent. The fix is the same engineering time, repeated against the codebase as it actually is.

The surface audit moves that discovery in time. It runs at the start of the phase, against the real codebase, and answers one question: are the assumptions the rest of this runbook depends on still true?

What the audit actually is

The audit is executable bash. Each phase's runbook opens with a step labelled Step N.0, before any new code is written for that phase. The step is a short script: typically three to six grep commands. Each command checks one surface assumption the rest of the runbook is about to lean on.

The assumptions split into three classes. The first is existence. Does the module the runbook intends to extend still exist under the name the design wrote? The second is shape. Does the function the runbook references still expose the signature the rest of the steps will call against? The third is routing. Is the downstream caller the runbook intends to modify still the entry point through which the production path flows?

Each class becomes a grep. For existence, the script greps for the file path or the module name and counts the matches. For shape, it greps for the function name and verifies that the surrounding signature has not changed since the design was written. For routing, it greps for the call site and confirms that the call still sits on the path from the production entry point, not only on a deprecated branch.
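As a rough illustration, the three classes might be sketched as the shell functions below. This is a minimal sketch, not the script from our runbooks: the function names (check_exists, check_shape, check_routing), the toy files, and the signature string are all hypothetical, and the routing check is only a text-level approximation of reachability.

```shell
#!/usr/bin/env bash
# Sketch only: the function names, file names, and signature strings used
# here are hypothetical stand-ins, not taken from a real runbook.

# Existence: count files under <root> named <file name>; pass if at least one.
check_exists() {
  count=$(find "$1" -type f -name "$2" | grep -c .)
  [ "$count" -ge 1 ]
}

# Shape: the signature recorded in the design must still appear verbatim.
# -F treats the pattern as a fixed string; -q keeps the check silent.
check_shape() {
  grep -qF -- "$2" "$1"
}

# Routing: the production entry point must still contain the call site.
check_routing() {
  grep -qF -- "$2" "$1"
}

# Toy codebase so the sketch runs anywhere.
demo=$(mktemp -d)
printf 'def score(case, threshold):\n    return 0\n' > "$demo/scoring.py"
printf 'from scoring import score\nresult = score(case, 0.5)\n' > "$demo/main.py"

check_exists "$demo" "scoring.py" && echo "existence: ok"
check_shape "$demo/scoring.py" "def score(case, threshold)" && echo "shape: ok"
check_routing "$demo/main.py" "score(" && echo "routing: ok"
```

Fixed-string matching keeps each check literal: a signature that has gained or lost a parameter no longer matches, so the check fails instead of passing on a near miss.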

The script's verdict is binary. Every assumption matches the runbook, or one of them does not. If one does not match, the phase halts. The runbook gets revised against what is actually in the codebase, the audit re-runs, and the phase begins only when every check passes. The audit's results, including the exact grep output, are documented in the runbook at the step we call N.0.a. That documentation survives in the phase retrospective.
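One way to get both the binary verdict and the recorded grep output is to run every check through a small wrapper. The sketch below assumes a hypothetical record_check helper, a hypothetical verdict function, and a log file whose contents would then be pasted into the runbook at N.0.a; none of these names come from our actual runbooks.

```shell
#!/usr/bin/env bash
# Sketch only: record_check, verdict, AUDIT_LOG, and the example checks
# are hypothetical names chosen for illustration.

AUDIT_LOG=${AUDIT_LOG:-$(mktemp)}
failures=0

# Run one check, append the command and its exact output to the log,
# and keep a tally of failures.
record_check() {
  label=$1; shift
  printf '## %s\n$ %s\n' "$label" "$*" >> "$AUDIT_LOG"
  if "$@" >> "$AUDIT_LOG" 2>&1; then
    printf 'PASS\n\n' >> "$AUDIT_LOG"
  else
    printf 'FAIL\n\n' >> "$AUDIT_LOG"
    failures=$((failures + 1))
  fi
}

# Binary verdict: zero failures passes, anything else halts the phase.
verdict() {
  if [ "$failures" -eq 0 ]; then
    echo "AUDIT PASSED"
    return 0
  fi
  echo "AUDIT FAILED: $failures check(s) did not match the runbook"
  return 1
}

# Toy codebase so the sketch runs anywhere.
demo=$(mktemp -d)
printf 'def score(case, threshold):\n    return 0\n' > "$demo/scoring.py"
printf 'from scoring import score\nresult = score(case, 0.5)\n' > "$demo/main.py"

record_check "shape: score signature" grep -nF "def score(case, threshold)" "$demo/scoring.py"
record_check "routing: score call site" grep -nF "score(" "$demo/main.py"
verdict
```

Because grep exits non-zero when nothing matches, a missing surface counts as a failure without any extra parsing, and the log keeps the exact output for the retrospective.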

A passing audit takes about five minutes. A failing audit takes about ten, because it surfaces what is wrong and the runbook needs a small revision before re-running. The discipline pays for itself the first time it catches a no-op phase before the work begins.

What makes it reproducible

Three properties make the audit work.

Executability. The audit is bash. A read-through cannot replace it. A read-through is a narrative inspection: someone looks at the design, looks at the codebase, and makes a judgment about whether the two still agree. Read-throughs are easy to do under load. They are also easy to skip when the design feels familiar. Bash does not skip. The script either runs and returns a passing verdict, or it does not.

Placement. The audit lives at Step N.0, before any new code is written for the phase. Every later step depends on the audit's assumptions holding. An audit that runs last is worth about as much as no audit, because the work it could have cancelled is already done. Putting it first means a failure cancels the phase before the cost of the work has been spent.
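The placement rule can be made mechanical by gating the rest of the phase on the audit's exit status. The sketch below assumes each step lives in its own script; run_phase and the script names are hypothetical, and a real runbook may well sequence its steps differently.

```shell
#!/usr/bin/env bash
# Sketch only: run_phase and the script names passed to it are hypothetical.

# Run the Step N.0 audit first; run the remaining steps only if it passes.
run_phase() {
  audit=$1; shift
  if ! sh "$audit"; then
    echo "Step N.0 audit failed: phase halted before any code was written" >&2
    return 1
  fi
  for step in "$@"; do
    sh "$step" || return 1
  done
}

# Toy phase: an audit that finds drift, and a step that must never run.
demo=$(mktemp -d)
printf 'exit 1\n' > "$demo/audit.sh"
printf 'touch "%s/step1.ran"\n' "$demo" > "$demo/step1.sh"

run_phase "$demo/audit.sh" "$demo/step1.sh" || echo "phase did not start"
[ -e "$demo/step1.ran" ] || echo "step 1 never ran"
```

The gate carries no judgment of its own. It only refuses to spend the cost of later steps when the audit's exit status says an assumption no longer holds.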

Binarity. The audit returns one of two outcomes. Every check passed, or one of them did not. There is no middle outcome. A "mostly passed" verdict would degrade into "passed when we are in a hurry," which is the failure mode the discipline exists to prevent.

The cost is five minutes per phase, ten if the audit finds something. The pay-off is a record of the drift patterns the audit screens for at every phase.

The audit has caught two no-ops in recent phases before any code was written. We also paid, once, the cost of the original incident that produced the rule. A handful of disciplined minutes per phase, against the entire cost of a phase that produced nothing.

Why we publish this

For technical buyers. Surface audits are the cheapest reproducibility win in multi-phase verification work. If a vendor cannot describe the audits that run at the start of each phase of their delivery, their delivery is hostage to the gap between the design that was written and the codebase that is being modified. That gap is where the no-ops live.

For people thinking about defensibility. Process artifacts like this one compress engineering hours over the life of a system. The audit is one of several disciplines we have codified in writing, with the same shape every time: an audit, with executable verification, at the boundary where the failure mode appears. Each discipline pays its cost in minutes and recovers its cost the first time it catches something. Compounded, the disciplines turn a quarter of careful engineering into a fraction of that.

What's next

The rest of the series unpacks what each of the other disciplines looks like in practice. The bug class that unit tests cannot see by construction, and the discipline that catches it anyway. The three-tier accounting we use to make our trusted assumptions explicit. A retrospective on how the disciplines compose. Each will arrive in this series.

Next arc: Vertical Use-Cases, Part 1 (Medical DDx)

Subscribe to the rest of the series at shellfinity.substack.com.

Evaluating verified AI for regulated work? See our governance posture and join the early-access waitlist on the home page.
