Part 3 · Methodology series · 6 min read

What unit tests can't see and how to find it anyway

A feature in our engine had every unit test green. The change passed code review and the panel pass at its phase boundary. The build pipeline ran the test suite at every commit. It ran clean. The change merged. Under realistic production load, the engine crashed.

The bug was the one I mentioned in a single line in Part 1's list of recent findings: a memory-management primitive used inside a loop, accumulating state across iterations, fatal at production scale. The unit tests had exercised that loop with inputs sized for a unit test, which is to say, small. The production input ran the same loop with iteration counts several orders of magnitude larger.

The unit tests were well written. They missed the bug because it lives in a region of the input space they never reach: the test plan had no case that ran the loop at production size. One test at that size would have surfaced the bug before any of the other tests had finished.

Why unit tests can't see it

The bug class is structurally invisible to unit tests. A unit test exercises one path with one input. Some bugs need the path to be exercised many times, or with a specific input shape, or in interaction with another path that runs at the same time. A test that exercises each path once at small N stops short of the conditions that make the bug fire.

More test cases of the same shape leave it hidden. A different test shape surfaces it.

A test case is a specific input. A test shape is the combination of input size, duration, concurrency, and ordering the test exercises. Adding test cases of the same shape adds redundancy. If the bug class is "this loop accumulates memory across iterations until a threshold is crossed," tests of the loop at unit-test size stay well below the threshold. The bug lives at the production size.
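A minimal sketch of that threshold effect, in Python. The `Arena` class, its capacity, and the 64-byte allocation are hypothetical stand-ins, not the primitive from our engine. The sketch only shows the same loop passing at one shape and failing at another.

```python
class Arena:
    """Hypothetical fixed-capacity allocator; a stand-in, not the real primitive."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0

    def alloc(self, size: int) -> None:
        if self.used_bytes + size > self.capacity_bytes:
            raise MemoryError(f"arena exhausted at {self.used_bytes} bytes")
        self.used_bytes += size


def run_loop(iterations: int, capacity_bytes: int = 1_000_000) -> int:
    """Allocate scratch space on every iteration and never release it."""
    arena = Arena(capacity_bytes)
    for _ in range(iterations):
        arena.alloc(64)  # state accumulates across iterations
    return arena.used_bytes


# Unit-test shape: 100 iterations use 6,400 bytes, far below the threshold.
small = run_loop(100)

# Production shape: 1,000,000 iterations would need 64,000,000 bytes.
try:
    run_loop(1_000_000)
    crashed = False
except MemoryError:
    crashed = True
```

Adding a second or a tenth case at 100 iterations changes nothing here. Only the run at the larger iteration count crosses the capacity.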

This is why "more tests" is sometimes the wrong answer to "we missed a bug." The set of bugs that more tests of the same shape can catch is bounded by the test shape. The bugs outside that bound are invisible to the entire test plan, every case in it. Finding those bugs requires a different shape of test, run in a different stage of the pipeline.

The discipline: perf-stress as discovery

The discipline is perf-stress: run the same code path that production exercises, at production-realistic scale, before the production deploy. The run exists to find what breaks. Verifying the answer is a separate job.

In our internal write-ups we call this perf-stress-as-discovery. The framing matters. A correctness test asks "did the system produce the right output?" A perf-stress run asks "what state does the system enter under production load?" The two test plans look similar from the outside. Both run the system on inputs. They hunt for different classes of failure. A correctness suite misses a memory-accumulation bug because the test finishes long before the memory accumulates. A perf-stress run with the right shape finds it in minutes.
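One way to picture the difference, as a sketch rather than our actual harness: a correctness test compares an output, while a discovery run records the state the process reaches. The `observe_under_load` helper and the `accumulating_workload` function below are hypothetical names. Peak memory is measured with Python's standard `tracemalloc` module.

```python
import time
import tracemalloc
from typing import Callable


def observe_under_load(workload: Callable[[int], object], size: int) -> dict:
    """Run the workload at the given size and record the state it reaches."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        workload(size)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {
        "size": size,
        "peak_bytes": peak_bytes,
        "seconds": time.perf_counter() - started,
    }


def accumulating_workload(size: int) -> int:
    """Hypothetical workload that retains one buffer per iteration."""
    retained = []
    for _ in range(size):
        retained.append(bytearray(256))
    return len(retained)


# Correctness question: is the output right? It is, at any size.
output_ok = accumulating_workload(10) == 10

# Discovery question: what state does the process enter as the size grows?
small = observe_under_load(accumulating_workload, 100)
large = observe_under_load(accumulating_workload, 50_000)
```

The correctness check passes at every size. The observation is what shows peak memory climbing with the iteration count.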

The other half of the discipline is the worked example. Before any production-going code is written for a phase, the phase's runbook articulates a concrete production-scale input the implementation will handle. The example is concrete, not a sketch: a real input, at the size or rate the production deploy will see. The runbook commits to it. The worked example becomes the perf-stress target the moment the implementation is ready.

Pre-implementation timing is the part that surprises most teams. The natural instinct is to ship working code and then test it under load. The discipline reverses that: the load case exists first, the code answers it second. Two consequences fall out. First, design surprises surface against the worked example before they surface against shipped code. Second, when the implementation is done, the stress run is one command away, because the input and the success condition were committed before the implementation began.
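A sketch of what "committed before the implementation began" could look like in code. The `WorkedExample` schema, the `stress` function, and the two toy implementations are assumptions made for illustration, not our runbook format. Each toy implementation simply reports how many bytes it would retain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkedExample:
    """Runbook entry written before any implementation exists (hypothetical schema)."""
    name: str
    iterations: int          # production-scale size, not a unit-test size
    max_retained_bytes: int  # success condition, fixed up front


def stress(example: WorkedExample, implementation: Callable[[int], int]) -> bool:
    """Run at the committed size and apply the committed success condition."""
    retained_bytes = implementation(example.iterations)
    return retained_bytes <= example.max_retained_bytes


# Committed first: the load case and what counts as passing.
TARGET = WorkedExample(
    name="loop-at-production-size",
    iterations=1_000_000,
    max_retained_bytes=1_000_000,
)


# Written second: two toy implementations that report retained bytes.
def accumulating(iterations: int) -> int:
    return 64 * iterations  # keeps 64 bytes per iteration


def releasing(iterations: int) -> int:
    return 64  # reuses one 64-byte buffer


accumulating_passes = stress(TARGET, accumulating)
releasing_passes = stress(TARGET, releasing)
```

Because `TARGET` is frozen and defined apart from either implementation, the stress run reduces to a single call once an implementation exists.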

Each cost is concrete. Building the worked example takes engineering time. The stress run takes machine time. Both are smaller than the cost of finding the bug after deploy.

What makes it reproducible

Three properties make the discipline work.

Concreteness. The worked example is a specific input, not a category description. "A request that hits the rate limiter" is a category. "A burst of one thousand requests in three seconds against the canary deployment" is an example. The first is an abstraction; the second is the literal input the stress run will use. Categories are easy to write and easy to skip past. Examples force the design to confront what the implementation will actually see.

Scale-true. The example runs at the size or rate the production deploy expects to handle. Half-scale stress half-tests the system. The bug class this discipline targets only emerges at full size, so a half-size stress run misses the point. The trade-off is machine time: a full-scale stress run is more expensive than a unit test. That cost buys the bugs that stay hidden from unit tests.

Pre-implementation. The example is committed to the runbook before the implementation begins, which reverses the natural implement-then-stress sequence. The design has to accommodate the example up front, so design surprises surface early, and once the implementation lands the stress run is one command away.
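The first two properties can be sketched together. The burst figures come from the example above. The even spacing of requests and the helper names `burst_offsets` and `require_scale_true` are assumptions, and nothing here sends real traffic.

```python
from typing import List

PRODUCTION_REQUESTS = 1000        # from the burst example above
PRODUCTION_WINDOW_SECONDS = 3.0


def burst_offsets(requests: int, window_seconds: float) -> List[float]:
    """Send times in seconds from burst start; even spacing is an assumption."""
    interval = window_seconds / requests
    return [i * interval for i in range(requests)]


def require_scale_true(requests: int, window_seconds: float) -> None:
    """Refuse a run that is smaller or slower than the committed example."""
    committed_rate = PRODUCTION_REQUESTS / PRODUCTION_WINDOW_SECONDS
    if requests < PRODUCTION_REQUESTS or requests / window_seconds < committed_rate:
        raise ValueError("stress run is below production scale")


# Concrete: the literal schedule the stress run would use.
offsets = burst_offsets(PRODUCTION_REQUESTS, PRODUCTION_WINDOW_SECONDS)

# Scale-true: the full-size run is accepted, the half-size run is rejected.
require_scale_true(1000, 3.0)
try:
    require_scale_true(500, 3.0)
    half_scale_rejected = False
except ValueError:
    half_scale_rejected = True
```

A guard like this makes the half-scale shortcut an explicit failure instead of a quiet default.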

The cost is roughly a day of engineering effort per phase, plus the machine time to run the stress. The pay-off is a record of bug classes the discipline catches:

A memory-management primitive used inside a loop that accumulated state across iterations and crashed at production scale.
A concurrency interaction between two correct paths that surfaced only when both ran simultaneously at production rate.
A buffer-size assumption that held under unit testing and failed when a single production input crossed the assumed bound.
A counter that wrapped around at a value reachable only after extended production runtime.

The first is the case from the opening. The other three are real findings the discipline has produced over recent phases.
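The counter case shows why duration belongs in the test shape. The arithmetic below is illustrative only: the 32-bit width and the rate of 1,000 increments per second are assumptions, not the figures from our finding.

```python
SECONDS_PER_DAY = 86_400


def seconds_until_wrap(counter_bits: int, increments_per_second: float) -> float:
    """Time for an unsigned counter starting at zero to wrap around."""
    return (2 ** counter_bits) / increments_per_second


# Assumed figures: a 32-bit counter incremented 1,000 times per second.
days_to_wrap = seconds_until_wrap(32, 1000) / SECONDS_PER_DAY

# A 60-second test at the same rate reaches only a sliver of the range.
fraction_covered = (60 * 1000) / (2 ** 32)
```

Under those assumptions the counter wraps after a little under 50 days, while a one-minute test covers roughly 0.0014 percent of its range.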

Why we publish this

For technical buyers. The bug class your unit tests structurally miss is the bug class your competitors are silently shipping. If a vendor's release process skips a perf-stress phase against a committed worked example, the vendor ships scale-only defects to production. Their users find them first.

For people thinking about defensibility. The discipline compounds. Each phase produces a worked example. The example rolls forward into the next phase as a regression target. The set of stress-tested cases grows, and the probability that a new defect lands in a region the discipline has yet to stress shrinks. The artifact is portable and reusable, and it accumulates. A vendor that ships this discipline ships a regression history; a vendor that ships ad-hoc load tests ships a snapshot.

What's next

The rest of the series unpacks the disciplines that travel alongside perf-stress: the three-tier accounting we use to make our trusted assumptions explicit instead of hiding them behind one number, and a retrospective on how the disciplines compose, with measurements of how much engineering time the compounded practice saves.

Subscribe to the rest of the series at shellfinity.substack.com.

Evaluating verified AI for regulated work? See our NLP deployment and join the early-access waitlist on the home page.

Direct correspondence: [email protected].