Part 1 · Vertical Use-Cases · May 22, 2026

Where the LLM stops and the engine starts: a clinical case

Part 1 of the Shellfinity Vertical Use-Cases series.

The LLM ends at a ranked list; the engine begins there and ends with a record of why.

1. The case

An adult patient presents with sharp central chest pain that worsens with deep inspiration and lying flat. The patient finds relief sitting up and forward. The chart records a recent viral illness the prior week with fever and body aches. The chart is silent on effusion findings, breath sounds read normal, and the history is clear of recent immobilization or surgery.

The LLM ranks five differentials: acute pericarditis (rank 1), pericarditis with a likely viral cause (rank 2), chest pain of unspecified origin (rank 3), pleural effusion (rank 4), and viral myocarditis (rank 5).

The verification step looks at each candidate against the structured evidence in the chart. Acute pericarditis is consistent with the positional pleuritic pain and the recent viral prodrome. The viral-cause variant stays fully compatible with the chart. The unspecified chest-pain label is broadly compatible and uninformative. Pleural effusion fails: the required effusion findings are absent from the chart, so the engine declines to confirm it. Viral myocarditis has supportive overlap through the fever and the recent viral syndrome, and remains under consideration.

After verification the ranking shifts. Acute pericarditis holds rank 1 as a confirmed candidate, with viral myocarditis at rank 2 on supportive overlap and unspecified chest pain at rank 3 as broadly compatible. Pleural effusion drops to the bottom on missing chart evidence.

The engine then surfaces a candidate the LLM left off its list: pulmonary embolism. Pleuritic chest pain after a recent viral illness is a textbook must-rule-out for PE, and the chart features that would lead a clinician to raise that concern are present and uncontradicted.

Every part of the case is grounded in practice. The chart and the differentials are typical of a vignette the system handles today.

2. What the LLM saw vs what the engine saw

The LLM ranking and the verified ranking come from two different operations on the same input. The LLM is pattern-matching the symptom cluster against the training distribution: cases that look like this tend to be that. The pattern-match is fast and produces a ranked list that is approximately right. It is also approximately right in the same way every LLM looking at the same input is approximately right. The priors are the priors.

The verification step is doing a different category of work. For each candidate diagnosis, it checks the diagnostic criteria for that condition against the structured evidence in the chart. Criteria that are met support the candidate. Criteria that are absent demote it. Criteria that are contradicted reject it. The output is a ranking against the evidence in this specific chart, a different target from the training distribution.
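The three outcomes described above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the finding names, the chart-as-dictionary encoding (True for a documented finding, False for a documented negative, no key when the chart is silent), and the verdict labels are all assumptions made for the example.

```python
from enum import Enum


class Status(Enum):
    MET = "met"
    ABSENT = "absent"
    CONTRADICTED = "contradicted"


def check_criterion(criterion: str, chart: dict[str, bool]) -> Status:
    """Classify one criterion against the chart (hypothetical encoding)."""
    if criterion not in chart:
        return Status.ABSENT  # chart is silent on this finding
    return Status.MET if chart[criterion] else Status.CONTRADICTED


def verify(criteria: list[str], chart: dict[str, bool]) -> tuple[str, dict[str, Status]]:
    """Return an illustrative verdict plus the per-criterion statuses."""
    statuses = {c: check_criterion(c, chart) for c in criteria}
    if Status.CONTRADICTED in statuses.values():
        verdict = "rejected"     # a contradicted criterion rejects
    elif Status.MET in statuses.values():
        verdict = "supported"    # met criteria support
    else:
        verdict = "unsupported"  # every criterion absent: demote
    return verdict, statuses


# Hypothetical structured chart loosely modeled on the opening case.
chart = {
    "pleuritic_chest_pain": True,
    "relief_sitting_forward": True,
    "recent_viral_illness": True,
    "recent_immobilization": False,
}

pericarditis, _ = verify(
    ["pleuritic_chest_pain", "relief_sitting_forward", "recent_viral_illness"], chart
)
effusion, _ = verify(["dullness_to_percussion", "effusion_on_imaging"], chart)
print(pericarditis, effusion)
```

The per-criterion statuses are returned alongside the verdict so that the reasoning can be recorded, not just the result.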

Take pleural effusion, the fourth-ranked candidate from the LLM. The clinical criteria that would support it include decreased breath sounds over the affected hemithorax, dullness to percussion, and a documented effusion finding on imaging or auscultation. The chart documents none of them: breath sounds read normal, and no effusion finding is recorded. The verification step treats this as a clean, recorded absence of supporting criteria and demotes the candidate below those whose criteria do appear. The same procedure runs for every candidate the LLM proposed.
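One simple way to picture the demotion is a stable re-rank: candidates with at least one met criterion keep their LLM order, and candidates with none drop below them. The sketch below assumes that rule and uses made-up met-criterion counts on a cut-down version of the case; the engine's actual ordering rule is not specified here.

```python
def rerank(llm_order: list[str], met_counts: dict[str, int]) -> list[str]:
    """Stable re-rank: supported candidates first, LLM order preserved within each group."""
    supported = [c for c in llm_order if met_counts.get(c, 0) > 0]
    unsupported = [c for c in llm_order if met_counts.get(c, 0) == 0]
    return supported + unsupported


# Three of the case's candidates, in the LLM's relative order.
llm_order = ["acute pericarditis", "pleural effusion", "viral myocarditis"]

# Hypothetical counts of met criteria per candidate.
met_counts = {"acute pericarditis": 2, "pleural effusion": 0, "viral myocarditis": 1}

verified_order = rerank(llm_order, met_counts)
print(verified_order)
```

Under these assumptions pleural effusion falls to the bottom while the other two keep their relative order.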

Two operations, same input, different rankings. The LLM ranking is approximately right against the training distribution; the verified ranking is approximately right against the evidence in this chart. The verification step also reads in the other direction. It scans the structured evidence in the chart for criteria that match diagnoses the LLM left unlisted. That is how pulmonary embolism surfaced. Pleuritic pain and a recent viral state are positive criteria for PE, and positive criteria count as evidence for a candidate regardless of the LLM's list.
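The reverse-direction scan can be sketched as a lookup from chart findings to diagnoses the LLM did not list. Everything named here is hypothetical: the criteria index, the finding labels, and the `min_matches` threshold are illustrative stand-ins, and the index entries are not a clinical reference.

```python
def surface_unlisted(
    chart_findings: set[str],
    criteria_index: dict[str, set[str]],
    llm_list: list[str],
    min_matches: int = 2,
) -> list[tuple[str, list[str]]]:
    """Return unlisted diagnoses whose criteria match enough chart findings."""
    surfaced = []
    for diagnosis, criteria in criteria_index.items():
        if diagnosis in llm_list:
            continue  # already proposed by the LLM; handled by forward verification
        matched = sorted(criteria & chart_findings)
        if len(matched) >= min_matches:
            surfaced.append((diagnosis, matched))
    return surfaced


# Hypothetical positive findings extracted from the chart.
chart_findings = {"pleuritic_chest_pain", "recent_viral_illness", "fever"}

# Hypothetical criteria index, for illustration only.
criteria_index = {
    "pulmonary embolism": {"pleuritic_chest_pain", "recent_viral_illness", "recent_immobilization"},
    "pleural effusion": {"dullness_to_percussion", "effusion_on_imaging"},
    "pneumothorax": {"pleuritic_chest_pain", "absent_breath_sounds"},
}

llm_list = ["acute pericarditis", "pleural effusion", "viral myocarditis"]

surfaced = surface_unlisted(chart_findings, criteria_index, llm_list)
print(surfaced)
```

Returning the matched findings with each surfaced diagnosis keeps the same property as the forward check: the candidate arrives with the evidence that raised it.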

3. What it means at scale

Across a year of benchmark runs, the combined system's hit rate moved from the low 90s to near 100 percent as the benchmark exposed edge cases and the system's coverage matured. Across those same runs, the verification step rejected roughly one in three of the LLM's top-ranked proposals. The clinically meaningful number is the rejection rate, because every rejected proposal is a plausible-but-wrong ranking that stops before it reaches the clinician.

A rejection at the verification step is a candidate that the LLM ranked highly and that the structured evidence in the chart leaves unsupported. Across any meaningful volume of clinical encounters, a rejection rate of roughly one in three implies a large and steady share of plausible-but-wrong rankings that would otherwise appear in a differential. Each one is a candidate a clinician might have seen, weighed, and possibly ordered tests against, on the strength of a pattern match the chart itself leaves unbacked.
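The arithmetic behind that claim is simple enough to write down. The sketch below computes a rejection rate from a list of top-ranked verdicts and projects it onto an encounter volume; the verdict list and the 3,000-encounter volume are invented for illustration and are not figures from the benchmark.

```python
from fractions import Fraction


def rejection_rate(top_verdicts: list[str]) -> Fraction:
    """Share of top-ranked proposals that verification rejected."""
    if not top_verdicts:
        raise ValueError("need at least one verdict")
    rejected = sum(1 for v in top_verdicts if v == "rejected")
    return Fraction(rejected, len(top_verdicts))


def expected_rejections(encounters: int, rate: Fraction) -> int:
    """Projected number of rejected top-ranked proposals at a given volume."""
    return round(encounters * rate)


# Hypothetical run log: one rejection in every three top-ranked proposals.
verdicts = ["accepted", "rejected", "accepted"] * 4

rate = rejection_rate(verdicts)
projected = expected_rejections(3000, rate)  # 3000 is a made-up volume
print(rate, projected)
```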

This is why a headline accuracy number is the wrong place to start. A 98 percent top-1 accuracy lacking an audit trail is worse for clinical deployment than a 95 percent top-1 accuracy with a per-diagnosis evidence chain. Take that chain away and the clinician loses any handle on why a single ranking landed where it did. What the verification step produces, beyond a re-ranked list, is the audit trail itself.

4. Why this matters for clinical deployment

The audit trail does concrete work. An accepted diagnosis arrives with the chain of evidence that supported it: which diagnostic criteria the verification step checked, which were met in the structured evidence in the chart, which were absent. A clinician looking at the verified ranking sees the candidates and the reasoning each candidate survived, candidate by candidate. The LLM's ranking is opaque on this dimension. The priors that produced it are baked into the model and stay sealed off from the output. The verification step writes a per-diagnosis record because writing one is what the step does.
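A per-diagnosis record of this kind might look like the following. This is a sketch under stated assumptions: the `EvidenceRecord` class, its field names, the verdict labels, and the JSON serialization are hypothetical, chosen only to show the shape of such a record. The criteria listed are the ones named earlier in this article.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceRecord:
    """Hypothetical per-diagnosis audit record."""
    diagnosis: str
    verdict: str
    criteria_met: list[str] = field(default_factory=list)
    criteria_absent: list[str] = field(default_factory=list)


trail = [
    EvidenceRecord(
        diagnosis="acute pericarditis",
        verdict="accepted",
        criteria_met=["positional pleuritic chest pain", "recent viral prodrome"],
    ),
    EvidenceRecord(
        diagnosis="pleural effusion",
        verdict="not confirmed",
        criteria_absent=["dullness to percussion", "documented effusion finding"],
    ),
]

# Serialize accepted and rejected candidates alike, so the record stands on its own.
serialized = json.dumps([asdict(r) for r in trail], indent=2)
print(serialized)
```

Writing the rejected candidates into the same structure as the accepted one is the design choice that matters: the record then answers both why a diagnosis was accepted and why the alternatives were not.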

The audit trail does a second kind of work too. Rare conditions are scored on the evidence in the chart, and frequency priors give way to that score. The clinical implication shows up in primary care, where long-tail conditions matter more in aggregate than the handful of headline-frequency diagnoses in any single encounter. A system that ranks by base rate misses the rare-but-supported diagnoses at exactly the volume the long tail predicts. The reverse-direction reading from section 2 is how pulmonary embolism surfaced in the opening case. A ranking driven by base rates alone would have buried it; the chart's evidence surfaced it.

When a diagnostic decision is questioned post-hoc, whether by a colleague, an audit, or a legal proceeding, the evidence chain that supported the accepted candidate is the document the clinician produces. That record exists at the moment the diagnosis is accepted, with the checked criteria, the chart evidence that satisfied them, and the rejected candidates and the reasons for those rejections already attached. The reasoning was written down, so the record stands on its own. This is what an audit trail looks like in deployment. The gap between the 98 percent system and the 95 percent system from the previous section is a gap in documentation, and this is where it shows.

5. What comes next

Shellfinity is looking for clinical-informatics partners to evaluate this engine on real chart data. The right partner is a clinical informatics director or healthcare-AI lead with a deployment context: a chart corpus the team can run the engine against, a working definition of what a useful differential looks like in that setting, and the latitude to inspect the per-diagnosis evidence chain on accepted and rejected candidates alike. Evaluation is structured as a bounded engagement on a representative slice of the corpus, with the audit trail as the primary deliverable.

If this is useful:

Subscribe to the rest of the Vertical Use-Cases series: shellfinity.substack.com
Evaluate Shellfinity for medical: shellfinity.com/medical
Direct correspondence: [email protected]

Evaluating verified AI for regulated work? See our medical deployment and join the early-access waitlist on the home page.
