Case study · Verified medical AI · 5 min read

When the transcript invents what wasn't said

The Associated Press ran an investigation in October 2024 that deserved more attention than it got. Whisper, OpenAI's speech-to-text model, hallucinates. Across the audio corpora researchers studied, it generated content that wasn't in the audio in roughly one segment out of a hundred. Sometimes whole sentences. Sometimes demographic descriptions of people the audio didn't describe. In clinical deployments, sometimes made-up medical advice no clinician spoke.

Whisper is the engine inside ambient-listening tools several hospitals are already deploying. Nabla is the most-named example; others exist. The output flows into the EHR, then into billing, then into the next visit, where the prior note is treated as ground truth by the next clinician, the next AI system, whoever opens the chart.

A clinician notices a missing note. Nobody notices a fabricated one.

The bug is structural

This is not an implementation flaw, in the usual sense of that phrase. It is a property of the architecture, present because nobody designed against it.

Speech-to-text models in the current dominant class predict tokens by sampling from a distribution that was learned during training and shaped by the language of the training corpus. At inference, the audio is supposed to condition the prediction. Mostly it does. When the audio is clear and within the distribution the model was trained on, the prediction tracks. When the audio is unclear, briefly silent, or at the edge of distribution, the model still emits a prediction. The prediction has to come from somewhere. It comes from training.

The architecture has no mechanism for refusal. There is no path through the forward pass that lets the model say "the input doesn't support this token; emit nothing." Output is mandatory. Anchoring is optional.

You can add training data. You can tune thresholds. Neither closes the gap. The gap is between "produce an output" and "produce only outputs the input justifies," and those are different problems, addressed by different architectures.

"We tested it on a thousand encounters" is the wrong evidence to ask for. The bug class is shaped by the architecture; the test plan is shaped by the audio researchers thought to record. The two shapes don't reliably overlap.

What a verified engine refuses to commit

The architectural property that closes this kind of gap is straightforward to state, and considerably less straightforward to ship. The engine must refuse to commit output the input does not anchor. Every output element gets traced to evidence. Output elements that can't be traced get a different name: unabsorbed evidence. The engine acknowledges them; it doesn't certify them.

The Shellfinity engine produces, for each ruling, what we call an exclusion certificate. It names what was ruled in, what was ruled out, and what evidence the input was missing that prevented further commitment. The certificate is the artifact that proves the architecture held.

We aren't solving transcription. The question this engine answers is one level upstream: when a system is asked to make a clinical commitment, what stops it from emitting a commitment the input doesn't justify? In most clinical AI shipping today, the stopping is done outside the system. Human review. Flag detection. Vendor reputation. Each is doing work, and the work matters, and none of it is the architecture refusing.

A formal-verification approach puts the refusal inside the system. The invariant (no commitment without anchor) holds because the proof says it holds. The operator doesn't have to catch what the system already refused to emit.

The procurement question this reshapes

Most procurement processes for clinical AI ask a benchmark question. Sensitivity, specificity, accuracy on the published validation set. Useful numbers. Not load-bearing for this class of failure.

The load-bearing question is architectural. What mechanism in your system prevents it from emitting content the input does not justify?

A vendor that can answer with a structural claim (an invariant, a verification target, a refusal mechanism that holds because the proof says so) is a vendor whose worst case is bounded by their architecture. A vendor that answers with a process ("our reviewers catch flagged cases," "we monitor for hallucination patterns") is a vendor whose worst case is bounded by their team. Both are bounded. The architectural bound is one you can model. The team bound is one you have to staff against. Different things, different planning horizons.

Ask anyway. The answers separate two categories.

A note on why this is in our series

We've been writing a methodology series on how we ship a formally verified evaluation engine. Panel review. Surface audit. Perf-stress. Tier accounting. Each addresses a class of failure that the discipline-free approach lets through.

This case study sits adjacent to that series rather than inside it. Same shape of argument, more specific target: the failure class where a system commits to content the input doesn't justify, and the architectural property that closes the gap. Whether it's worth closing is for you to decide. Whether it can be closed is a verification problem, and that's what the series is about.

Differential-diagnosis engine in pre-launch polish. Demo on request. Subscribe to follow the rest.

Subscribe for the rest of the series at shellfinity.substack.com.

Working on verified medical AI? Read the methodology series, or email [email protected] to see the differential-diagnosis demo and discuss pilot interest.

Direct correspondence: [email protected].