Part 4 · Methodology series · 6 min read

Three tiers of trust: how we account for what our system relies on

Asked how much of our system we trust, we used to give a single number: total lines of code the correctness argument depends on. The number was honest. It was also nearly useless.

The useful question is a deeper one: what kind of trust each piece carries. A line of code that produces a verdict is one kind of trust. A line of code that rejects a malformed input at a boundary is a different kind. A line of code that records a metric is a third. If a verdict line changes, the system can return a wrong answer that slips past every other check. If a metric line changes, the metric is wrong and correctness holds. A single number blurs which is which.

What follows is the tier discipline we use to keep that question answerable, and the log that keeps the answer durable.

Why one number hides risk

The trusted computing base (TCB) of a system is the code its correctness argument leans on. The natural way to report it is a single number. The natural number is total lines.

The natural number is wrong. A few hundred lines of code that produce verdicts carry a different kind of risk than a few hundred lines of code that decorate a response with metrics. Both might be inside the trusted computing base under a permissive definition. Only one of them can return a wrong answer to a user if it changes. Lumping the two together produces a TCB count that is honest about its arithmetic and silent about its risk.

The number invites the wrong management conversation. A budget meeting asks "can we get TCB under five hundred lines?" The answer that minimizes the number cuts the easiest pieces. The easiest pieces are usually the ones that barely mattered for correctness. The number drops. The risk holds.

A single number tells you how much code you trust. It stays silent on what would break if that code changed.

A useful accounting answers a different question. It asks what would break, and how badly, if each piece of trusted code changed. Instead of a number, the accounting produces a partition.
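A minimal sketch of the difference, assuming a hypothetical inventory: the component names and line counts below are made up for illustration and are not measurements of our system. It computes both the single number and the partition, then applies the kind of cut a line budget rewards.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str   # hypothetical component name
    tier: str   # "correctness", "defensive", or "instrumentation"
    lines: int  # illustrative line count, not a measurement


def total_lines(components):
    """The single number: every trusted line counts the same."""
    return sum(c.lines for c in components)


def partition(components):
    """The partition: trusted lines grouped by what breaks if they change."""
    by_tier = Counter()
    for c in components:
        by_tier[c.tier] += c.lines
    return dict(by_tier)


inventory = [
    Component("verdict_marshaller", "correctness", 300),
    Component("pointer_check", "defensive", 40),
    Component("verdict_counters", "instrumentation", 110),
    Component("latency_fields", "instrumentation", 90),
]

# A 500-line budget is met by deleting the piece that is easiest to delete.
after_cut = [c for c in inventory if c.name != "latency_fields"]

print(total_lines(inventory), partition(inventory))
print(total_lines(after_cut), partition(after_cut))
```

In this made-up inventory the total falls from 540 to 450 and clears the budget, while the correctness entry reads 300 before and after. The partition makes that visible; the total does not.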

The three tiers

The cleanest place to see the tiers is at the language boundary the system crosses. There, the handful of lines that hand a verdict from one runtime into another sit next to a pointer-validity check, which sits next to a counter that records how many verdicts were produced. Three pieces of code, one location, three risk levels.

The first piece, the marshaller that hands the verdict across, is correctness-trusted. If the marshaller transposes two fields, the verdict that arrives downstream is wrong, and downstream code treats it as proof of correctness. Every other stage in the system trusts the verdict as it stands. The marshaller is part of the trusted computing base in its strongest sense. Changes to it are the changes that need the strongest review.

The second piece is defensive. The pointer-validity check at the same boundary catches malformed inputs that the runtime above it was meant to keep out. If the check fails open, a malformed input crosses and may crash a downstream consumer, but no verdict is altered along the way. Correctness, in the soundness sense, holds as long as the check rejects malformed inputs and returns an error instead of letting them through.

The third piece is instrumentation. The counters and diagnostic fields attached to each response record what the boundary saw: how many verdicts were produced, how many were rejected at the defensive layer, how long each took. If a counter wraps or an off-by-one slips in, an operator dashboard is wrong. The verdict that produced the count is still correct.
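The three pieces can be sketched side by side. This is an illustration under stated assumptions, not our boundary code: the wire layout, the names (`check_buffer`, `marshal`, `receive`), and the counters are all hypothetical, and because Python has no raw pointers, a type, length, and flag check stands in for the pointer-validity check.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout: verdict flag (1 byte), then claim id (4 bytes, big-endian).
WIRE = struct.Struct(">BI")


@dataclass(frozen=True)
class Verdict:
    claim_id: int
    accepted: bool


class BoundaryError(ValueError):
    """Raised when a buffer is rejected at the boundary."""


# Instrumentation tier: if these drift, a dashboard is wrong, not a verdict.
counters = {"produced": 0, "rejected": 0}


def check_buffer(buf):
    # Defensive tier: reject malformed input with an error instead of passing it on.
    if not isinstance(buf, (bytes, bytearray)) or len(buf) != WIRE.size:
        raise BoundaryError("buffer has the wrong type or length")
    if buf[0] not in (0, 1):
        raise BoundaryError("verdict flag out of range")


def marshal(buf):
    # Correctness-trusted tier: swapping these two fields silently corrupts
    # the verdict, and nothing downstream would notice.
    flag, claim_id = WIRE.unpack(buf)
    return Verdict(claim_id=claim_id, accepted=(flag == 1))


def receive(buf):
    try:
        check_buffer(buf)
    except BoundaryError:
        counters["rejected"] += 1
        raise
    verdict = marshal(buf)
    counters["produced"] += 1
    return verdict


print(receive(WIRE.pack(1, 42)), counters)
```

A bug in `marshal` changes what downstream code believes. A bug in `check_buffer` changes what input reaches `marshal`. A bug in the counter updates changes only what an operator sees.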

Three tiers, one boundary, three risk levels. A useful accounting reports them as three line items: a few hundred correctness-trusted lines, tens of defensive lines, around a hundred instrumentation lines. A budget conversation against that accounting can ask the right question. The right question is whether a piece can be moved down a tier and still hold the correctness argument. The wrong question, the one a single TCB number invites, is whether the total can be cut.

The decisions log as partner discipline

Tier accounting captures the surface. The decisions log captures the choices behind it.

Every entry in our decisions log carries two lines that look optional and are not. One line names what was chosen. The other names what was rejected and why. A year later, a maintainer reading only the first line sees a choice with no alternatives around it; the choice looks arbitrary. The same maintainer reading both lines sees the choice and the alternatives that were weighed against it; the choice looks like the deliberate trade-off it was.

A worked instance: a piece of code that lived in the correctness-trusted tier was a candidate for relegation to the defensive tier, on the grounds that downstream code already implied a check that could be made explicit. The argument for relegation was that it cut trusted lines. The argument against pointed out that the implied check, on close reading, covered only part of the input space the upstream code actually produced. The log records the choice (keep in correctness-trusted) and the alternative (relegate plus add the defensive check). A maintainer next year reads both lines and sees that the relegation was considered and rejected for a reason. The reason is in the log.
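One way to keep the second line from quietly going missing is to make the entry refuse to exist without it. The sketch below is an assumption about shape, not our log format: the `Decision` type and its field names are hypothetical, the rejected alternative and its reason are split into two fields for clarity, and the component name is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    component: str  # what the decision is about (placeholder name below)
    chosen: str     # what was chosen
    rejected: str   # what was rejected
    reason: str     # why the rejected alternative lost

    def __post_init__(self):
        # An entry with a blank line is the entry that looks arbitrary a year later.
        for field_name in ("chosen", "rejected", "reason"):
            if not getattr(self, field_name).strip():
                raise ValueError(
                    f"decision for {self.component!r} is missing {field_name!r}"
                )


# The worked instance above, written as an entry.
entry = Decision(
    component="example_component",
    chosen="keep in correctness-trusted tier",
    rejected="relegate to defensive tier and add the defensive check",
    reason="the implied downstream check covers only part of the input space",
)
print(entry)
```

Constructing a `Decision` with an empty `rejected` or `reason` raises `ValueError`, so a choice-only entry cannot be written at all.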

The two disciplines work as a pair. Tier accounting tells a reader what is trusted. The decisions log tells the reader why each piece is in the tier it is in. Either one alone leaves the next maintainer guessing. Together they make the trust surface a thing a reviewer can read and a manager can budget against.
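The pairing can itself be checked mechanically. As a rough sketch, assuming the tier accounting is a mapping from component to tier and the log is a list of entries that each name a component and the tier it was placed in (both shapes and all names here are hypothetical), the function below lists every component whose current tier has no log entry behind it.

```python
def unexplained(tier_of, log):
    """Return components whose current tier is not backed by a log entry."""
    explained = {(entry["component"], entry["tier"]) for entry in log}
    return sorted(
        name for name, tier in tier_of.items() if (name, tier) not in explained
    )


# Hypothetical tier accounting: what is trusted, and how.
tier_of = {
    "verdict_marshaller": "correctness",
    "pointer_check": "defensive",
    "verdict_counters": "instrumentation",
}

# Hypothetical decisions log: why each piece sits where it does.
log = [
    {"component": "verdict_marshaller", "tier": "correctness"},
    # Stale entry: the accounting now says "defensive" for this component.
    {"component": "pointer_check", "tier": "correctness"},
]

print(unexplained(tier_of, log))
```

Here the check flags `pointer_check`, whose tier changed without a matching entry, and `verdict_counters`, which has no entry at all. Those are the two places a maintainer would be left guessing.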

Why we publish this

For technical buyers. Ask any vendor about their trusted code in tier terms: how much is correctness-trusted, how much is defensive, how much is instrumentation, and what was rejected to put each piece in its tier. A vendor that struggles to answer in tier terms is shipping an uninspected trust surface. The vendor's risk becomes yours, and you inherit it blind.

For people thinking about defensibility. The discipline compounds. Each decisions-log entry carries forward as a record later choices can lean on; each tier reassignment is a recorded choice. The trust surface stays interpretable over the life of the system. A vendor that ships this discipline ships a trust map; a vendor that ships a single TCB number ships an aggregate.

What's next

Part 5 closes the series with a retrospective on how the disciplines compose, including measurements of what the compounded practice produces over a quarter of work.

Subscribe to the rest of the series at shellfinity.substack.com.

Evaluating verified AI for regulated work? See the platform and join the early-access waitlist on the home page.

Direct correspondence: [email protected].