What unit tests can't see and how to find it anyway
A feature in our engine had every unit test green. The change passed code review and the panel pass at its phase boundary. The build pipeline ran the test suite at every commit. It ran clean. The change merged. Under realistic production load, the engine crashed.
The bug was the one I mentioned in a single line in Part 1's list of recent findings: a memory-management primitive used inside a loop, accumulating state across iterations, fatal at production scale. The unit tests had exercised that loop with inputs sized for a unit test, which is to say, small. The production input ran the same loop with iteration counts several orders of magnitude larger.
The unit tests did not miss the bug because they were poorly written. They missed it because the bug lives in a region of the input space the unit tests do not visit. No test in the plan exercised the loop at production size. One that did would have surfaced the bug before any of the other tests had finished.
Why unit tests can't see it
The bug class is structurally invisible to unit tests. A unit test exercises one path with one input. Some bugs need the path to be exercised many times, or with a specific input shape, or in interaction with another path that runs at the same time. A test that exercises each path once at small N will never hit the conditions that make the bug fire.
More test cases of the same shape will not catch it. A different test shape will.
A test case is a specific input. A test shape is the combination of input size, duration, concurrency, and ordering the test exercises. Adding test cases of the same shape adds redundancy. If the bug class is "this loop accumulates memory across iterations until a threshold is crossed," tests of the loop at unit-test size will never visit the threshold. The bug lives at the production size.
This is why "more tests" is sometimes the wrong answer to "we missed a bug." The set of bugs that more tests of the same shape can catch is bounded by the test shape. The bugs outside that bound are invisible to the entire test plan, every case in it. Finding those bugs requires a different shape of test, run in a different stage of the pipeline.
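As a minimal sketch of the difference between a test case and a test shape, the Python below uses a hypothetical `process_batch` function that retains a small buffer on every iteration. None of this is our engine's code: the function, the two input sizes, and the 1 MiB budget are assumptions chosen for illustration. The same check runs at both sizes.

```python
import tracemalloc

_seen = []  # hypothetical leak: state that survives across iterations


def process_batch(n_items):
    """Process n_items and return a checksum. The answer is correct; the loop leaks."""
    total = 0
    for i in range(n_items):
        record = bytes(64)      # per-iteration working buffer
        _seen.append(record)    # bug: the buffer is retained, never released
        total += i
    return total


def peak_bytes(n_items):
    """Run process_batch at the given size and return peak traced memory in bytes."""
    _seen.clear()
    tracemalloc.start()
    try:
        process_batch(n_items)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


BUDGET = 1 << 20  # assumed 1 MiB memory budget for this loop

small = peak_bytes(100)        # unit-test shape
large = peak_bytes(100_000)    # closer to a production shape

print("small within budget:", small <= BUDGET)
print("large within budget:", large <= BUDGET)
```

Under these assumptions the first line prints True and the second prints False. The code and the check are identical in both runs; only the shape changed.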
The discipline: perf-stress as discovery
The discipline is perf-stress: the same code path that production exercises is run at production-realistic scale before the production deploy. The point of the run is not to verify the answer. The point is to find what breaks.
In our internal write-ups we call this perf-stress-as-discovery. The framing matters. A correctness test asks "did the system produce the right output?" A perf-stress run asks "what state does the system enter under production load?" The two test plans look similar from the outside. Both run the system on inputs. They hunt for different classes of failure. A correctness suite cannot find a memory-accumulation bug because the test does not run long enough to accumulate the memory. A perf-stress run with the right shape can find it in minutes.
The other half of the discipline is the worked example. Before any production-going code is written for a phase, the phase's runbook articulates a concrete production-scale input the implementation will handle. The example is concrete. A sketch will not do. It is a real input, at the size or rate the production deploy will see. The runbook commits to it. The worked example becomes the perf-stress target the moment the implementation is ready.
Pre-implementation timing is the part that surprises most teams. The natural instinct is to ship working code and then test it under load. The discipline reverses that: the load case exists first, the code answers it second. Two consequences fall out. First, design surprises surface against the worked example before they surface against shipped code. Second, when the implementation is done, the stress run is one command away, because the input and the success condition were committed before the implementation began.
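One way to picture "the load case exists first" is a runbook entry expressed as data, with the input size and the success condition fixed before any implementation exists. The sketch below is an assumption about form, not our actual runbook format: `WorkedExample`, `StressResult`, `run_stress`, the field names, and every numeric limit are hypothetical.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkedExample:
    """Hypothetical runbook entry, committed before implementation starts."""
    name: str
    iterations: int        # production-scale iteration count
    max_peak_bytes: int    # success condition: memory ceiling
    max_seconds: float     # success condition: wall-clock ceiling


@dataclass(frozen=True)
class StressResult:
    peak_bytes: int
    seconds: float
    passed: bool


def run_stress(example: WorkedExample, impl: Callable[[int], object]) -> StressResult:
    """Run impl at the committed scale and compare against the committed limits."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        impl(example.iterations)
        seconds = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    passed = peak <= example.max_peak_bytes and seconds <= example.max_seconds
    return StressResult(peak, seconds, passed)


# Committed first (all values are illustrative assumptions).
PHASE_EXAMPLE = WorkedExample("phase-loop", iterations=200_000,
                              max_peak_bytes=4 << 20, max_seconds=30.0)


# Written second: a stand-in implementation that keeps no per-iteration state.
def streaming_impl(n: int) -> int:
    return sum(i * i for i in range(n))


result = run_stress(PHASE_EXAMPLE, streaming_impl)
print(result.passed)
```

Because `PHASE_EXAMPLE` carries both the input size and the pass criteria, running the stress is a single call once an implementation exists. An implementation that retains state per iteration would be measured against the same committed ceiling.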
Each cost is concrete. Building the worked example takes engineering time. The stress run takes machine time. Both are smaller than the cost of finding the bug after deploy.
What makes it reproducible
Three properties make the discipline work.
Concreteness. The worked example is a specific input. A category description will not do. "A request that hits the rate limiter" is a category. "A burst of one thousand requests in three seconds against the canary deployment" is an example. The first abstracts; the second is the literal input the stress run will use. Categories are easy to write and easy to skip past. Examples force the design to confront what the implementation will actually see.
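To show the category/example distinction in code, the sketch below writes the burst from the paragraph above as literal data and expands it into a send schedule. The `Burst` type, the even spacing of requests, and the `"canary"` label are assumptions for illustration; a real burst may well be unevenly spaced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Burst:
    """A literal stress input, not a category description."""
    requests: int
    window_seconds: float
    target: str  # hypothetical deployment label

    def send_offsets(self) -> List[float]:
        """Offsets in seconds from burst start, evenly spaced (an assumption)."""
        step = self.window_seconds / self.requests
        return [i * step for i in range(self.requests)]


RATE_LIMITER_BURST = Burst(requests=1000, window_seconds=3.0, target="canary")

offsets = RATE_LIMITER_BURST.send_offsets()
print(len(offsets), offsets[0], offsets[-1] < RATE_LIMITER_BURST.window_seconds)
```

A category such as "a request that hits the rate limiter" cannot be written this way without first choosing the numbers, which is the point of the property.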
Scale-true. The example runs at the size or rate the production deploy expects to handle. Half-scale stress half-tests the system. The bug class this discipline targets only emerges at full size, so a half-size stress run misses the point. The trade-off is machine time: a full-scale stress run is more expensive than a unit test. That cost is the cost of catching bugs unit tests cannot see.
Pre-implementation. The example is committed to the runbook before the implementation begins. This is the property that surprises most readers, because the natural sequence is implement-then-stress and the disciplined sequence is stress-target-first, then implementation. Two effects follow. The design has to accommodate the example up front, which surfaces design surprises early. And once the implementation lands, the stress run is one command away.
The cost is roughly a day of engineering effort per phase, plus the machine time to run the stress. The pay-off is a record of bug classes the discipline catches:
- A memory-management primitive used inside a loop that accumulated state across iterations and crashed at production scale.
- A concurrency interaction between two correct paths that surfaced only when both ran simultaneously at production rate.
- A buffer-size assumption that held under unit testing and failed when a single production input crossed the assumed bound.
- A counter that wrapped around at a value reachable only after extended production runtime.
The first is the case from the opening. The other three are real findings the discipline has produced over recent phases.
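The counter finding shows why duration belongs in the test shape. The width and increment rate of the counter in that finding are not stated here, so the numbers below are assumptions: an unsigned 32-bit counter incremented at a steady 1,000 times per second. The arithmetic shows how far outside a unit test's runtime a wrap point can sit.

```python
def seconds_until_wrap(bits: int, increments_per_second: float) -> float:
    """Time for an unsigned counter starting at 0 to wrap, at a steady rate."""
    return (2 ** bits) / increments_per_second


# Assumed: 32-bit counter, 1,000 increments per second.
days = seconds_until_wrap(32, 1_000) / 86_400
print(f"{days:.1f} days")  # prints "49.7 days"
```

With those assumed values the wrap arrives after roughly seven weeks of continuous runtime. A stress run that wants to reach it has to raise the increment rate or start the counter near its limit.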
Why we publish this
For technical buyers. The bug class your unit tests cannot catch by construction is the bug class your competitors are silently shipping. If a vendor's release process does not include a perf-stress phase against a committed worked example, the vendor is not finding scale-only defects before deploy. Their users are.
For people thinking about defensibility. The discipline compounds. Each phase produces a worked example. The example rolls forward into the next phase as a regression target. The set of stress-tested cases grows. The probability that a new defect is in a region the discipline has not yet stressed shrinks. The artifact is portable, reusable, and cumulative. A vendor that ships this discipline ships a regression history; a vendor that ships ad-hoc load tests ships a snapshot.
What's next
The rest of the series unpacks the disciplines that travel alongside perf-stress: the three-tier accounting we use to make our trusted assumptions explicit instead of hiding them behind one number, and a retrospective on how the disciplines compose, with measurements of how much engineering time the compounded practice saves.