Where the LLM stops and the engine starts: a clinical case
Part 1 of the Shellfinity Vertical Use-Cases series.
The LLM ends at a ranked list; the engine begins there and ends with a record of why.
1. The case
An adult patient presents with sharp central chest pain that worsens with deep inspiration and lying flat. The patient finds relief sitting up and forward. The chart records a recent viral illness the prior week with fever and body aches. There are no documented effusion findings, no decreased breath sounds, and no recent immobilization or surgery.
The LLM ranks five differentials: acute pericarditis (rank 1), pericarditis with a likely viral cause (rank 2), chest pain of unspecified origin (rank 3), pleural effusion (rank 4), and viral myocarditis (rank 5).
The verification step checks each candidate against the structured evidence in the chart. Acute pericarditis is consistent with the positional pleuritic pain and the recent viral prodrome. The viral-cause variant adds nothing the chart contradicts. The unspecified chest-pain label is broadly compatible and uninformative. Pleural effusion fails: the chart has no effusion findings, so the engine declines to confirm it. Viral myocarditis has supportive overlap through the fever and the recent viral syndrome, and remains under consideration.
After verification the ranking shifts. Acute pericarditis holds rank 1 as a confirmed candidate, with viral myocarditis at rank 2 on supportive overlap and unspecified chest pain at rank 3 as broadly compatible. Pleural effusion drops to the bottom on missing chart evidence.
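The shift can be pictured as a sort on two keys: the verification outcome first, the LLM's original rank second. The sketch below is a minimal illustration, not the engine's implementation. The tier names and the `Candidate` type are hypothetical, chosen to mirror the wording of the case, and the viral-cause variant is left out because the case does not state where it lands after verification.

```python
from dataclasses import dataclass

# Hypothetical tier names taken from the wording of the case; lower sorts first.
TIER_ORDER = {
    "confirmed": 0,
    "supportive_overlap": 1,
    "broadly_compatible": 2,
    "missing_evidence": 3,
}


@dataclass(frozen=True)
class Candidate:
    name: str
    llm_rank: int  # position in the LLM's original list
    tier: str      # outcome assigned by the verification step


def rerank(candidates: list[Candidate]) -> list[Candidate]:
    # Verification tier decides the order; the LLM rank only breaks ties within a tier.
    return sorted(candidates, key=lambda c: (TIER_ORDER[c.tier], c.llm_rank))


candidates = [
    Candidate("acute pericarditis", 1, "confirmed"),
    Candidate("chest pain of unspecified origin", 3, "broadly_compatible"),
    Candidate("pleural effusion", 4, "missing_evidence"),
    Candidate("viral myocarditis", 5, "supportive_overlap"),
]

verified = [c.name for c in rerank(candidates)]
```

Under these assumptions, viral myocarditis moves from fifth in the LLM's list to second, and pleural effusion falls to last, which matches the shift described above.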
The engine then surfaces a candidate the LLM did not propose: pulmonary embolism. Pleuritic chest pain after a recent viral illness is a textbook must-rule-out for PE, and the chart features that would lead a clinician to raise that concern are present and uncontradicted.
The case is not a thought experiment. The chart and the differentials are typical of the vignettes the system handles today.
2. What the LLM saw vs what the engine saw
The LLM ranking and the verified ranking come from two different operations on the same input. The LLM is pattern-matching the symptom cluster against the training distribution: cases that look like this tend to be that. The pattern-match is fast and produces a ranked list that is approximately right. It is also approximately right in the same way every LLM looking at the same input is approximately right. The priors are the priors.
The verification step is doing a different category of work. For each candidate diagnosis, it checks the diagnostic criteria for that condition against the structured evidence in the chart. Criteria that are met support the candidate. Criteria that are absent demote it. Criteria that are contradicted reject it. The output is not a ranking against the training distribution; it is a ranking against the evidence in this specific chart.
Take pleural effusion, the fourth-ranked candidate from the LLM. The clinical criteria that would support it include decreased breath sounds over the affected hemithorax, dullness to percussion, and a documented effusion finding on imaging or auscultation. None of these appear in the chart. The verification step does not read this as low confidence in a fuzzy sense. The criteria are simply not present in the structured evidence, so the verification step demotes the candidate below those whose criteria do appear. The same procedure runs for every candidate the LLM proposed.
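The per-candidate check can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's code: the chart is modeled as a hypothetical mapping from a finding name to `True` (documented present) or `False` (documented as ruled out), a finding with no entry counts as absent, and the rule that any contradicted criterion rejects while any met criterion supports is a simplification chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MET = "met"
    ABSENT = "absent"
    CONTRADICTED = "contradicted"


class Verdict(Enum):
    SUPPORTED = "supported"
    DEMOTED = "demoted"
    REJECTED = "rejected"


@dataclass(frozen=True)
class CriterionCheck:
    criterion: str
    status: Status


def check_criterion(criterion: str, chart: dict[str, bool]) -> CriterionCheck:
    # No entry means the finding was never recorded: absent, not contradicted.
    if criterion not in chart:
        return CriterionCheck(criterion, Status.ABSENT)
    status = Status.MET if chart[criterion] else Status.CONTRADICTED
    return CriterionCheck(criterion, status)


def verify(criteria: list[str], chart: dict[str, bool]) -> tuple[Verdict, list[CriterionCheck]]:
    checks = [check_criterion(c, chart) for c in criteria]
    statuses = {c.status for c in checks}
    if Status.CONTRADICTED in statuses:
        verdict = Verdict.REJECTED   # assumed rule: one contradiction rejects
    elif Status.MET in statuses:
        verdict = Verdict.SUPPORTED  # assumed rule: one met criterion supports
    else:
        verdict = Verdict.DEMOTED    # nothing met, nothing contradicted
    return verdict, checks


# Findings from the opening case, with hypothetical key names.
chart = {
    "pleuritic_chest_pain": True,
    "relief_sitting_forward": True,
    "recent_viral_illness": True,
    "fever": True,
}
effusion_criteria = [
    "decreased_breath_sounds",
    "dullness_to_percussion",
    "documented_effusion",
]
verdict, checks = verify(effusion_criteria, chart)
```

With this chart, all three effusion criteria come back absent and the verdict is a demotion, not a rejection. The list of checks is returned alongside the verdict so the reason for the outcome travels with it.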
Two operations, same input, different rankings. The LLM ranking is approximately right against the training distribution; the verified ranking is approximately right against the evidence in this chart.
The verification step also reads in the other direction. It scans the structured evidence in the chart for criteria that match diagnoses the LLM never proposed. That is how pulmonary embolism surfaced. Pleuritic pain and a recent viral state are positive criteria for PE, and positive criteria count as evidence for a candidate whether the LLM listed it or not.
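The reverse-direction scan can be sketched as a loop over a criteria library instead of over the LLM's list. Everything named here is an assumption made for illustration: the `criteria_library` mapping, the finding keys, and the `min_met` threshold are hypothetical, and the criteria lists only restate what this post says about each condition. They are not a clinical reference.

```python
def surface_unproposed(
    chart: dict[str, bool],
    criteria_library: dict[str, list[str]],
    proposed: set[str],
    min_met: int = 2,
) -> dict[str, list[str]]:
    """Return unproposed diagnoses whose criteria the chart supports."""
    surfaced: dict[str, list[str]] = {}
    for diagnosis, criteria in criteria_library.items():
        if diagnosis in proposed:
            continue  # the forward pass already handles these
        if any(chart.get(c) is False for c in criteria):
            continue  # a documented negative blocks surfacing
        met = [c for c in criteria if chart.get(c) is True]
        if len(met) >= min_met:
            surfaced[diagnosis] = met
    return surfaced


chart = {
    "pleuritic_chest_pain": True,
    "relief_sitting_forward": True,
    "recent_viral_illness": True,
    "fever": True,
}

# Hypothetical library entries, restating the criteria named in this post.
criteria_library = {
    "acute_pericarditis": ["pleuritic_chest_pain", "relief_sitting_forward", "recent_viral_illness"],
    "pleural_effusion": ["decreased_breath_sounds", "dullness_to_percussion", "documented_effusion"],
    "pulmonary_embolism": ["pleuritic_chest_pain", "recent_viral_illness"],
}

proposed = {"acute_pericarditis", "pleural_effusion", "viral_myocarditis"}
surfaced = surface_unproposed(chart, criteria_library, proposed)
```

In this toy setup the only surfaced candidate is pulmonary embolism, and it arrives with the two chart findings that matched. The LLM's list plays no part in the match beyond excluding candidates that were already checked.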
3. What it means at scale
Across a year of benchmark runs on this suite, the combined system's hit rate moved from the low 90s to near 100 percent as the benchmark exposed edge cases and the system's coverage matured. Across those same runs, roughly one in three of the LLM's top-ranked proposals is rejected by the verification step. The clinically meaningful number is the rejection rate, because every rejected proposal is a plausible-but-wrong ranking that does not reach the clinician.
A rejection at the verification step is a candidate that the LLM ranked highly and that the structured evidence in the chart will not support. Across any meaningful volume of clinical encounters the system processes, a rejection rate of roughly one in three implies a large and steady share of plausible-but-wrong rankings that would otherwise appear in a differential. Each one is a candidate a clinician might have seen, weighed, and possibly ordered tests against, on the strength of a pattern match the chart does not back.
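The arithmetic behind that claim is simple enough to write down. The sketch below assumes a hypothetical log with one verdict per top-ranked proposal, and the encounter count is an illustrative number, not a figure from the benchmark.

```python
from fractions import Fraction


def rejection_rate(top_ranked_verdicts: list[str]) -> Fraction:
    """Share of the LLM's top-ranked proposals that verification rejected."""
    if not top_ranked_verdicts:
        raise ValueError("no runs to summarize")
    rejected = sum(1 for v in top_ranked_verdicts if v == "rejected")
    return Fraction(rejected, len(top_ranked_verdicts))


def expected_intercepts(encounters: int, rate: Fraction) -> Fraction:
    """Proposals expected to be stopped before reaching a clinician."""
    return encounters * rate


# Illustrative log only: one rejection in every three top-ranked proposals.
sample = ["accepted", "rejected", "accepted"] * 4
rate = rejection_rate(sample)
intercepted = expected_intercepts(3000, rate)
```

At a one-in-three rate, an illustrative 3,000 encounters would mean about 1,000 top-ranked proposals intercepted. The real figure depends on the actual volume and on how stable the rate is across settings.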
This is why a headline accuracy number is the wrong place to start. A 98 percent top-1 accuracy with no audit trail is worse for clinical deployment than a 95 percent top-1 accuracy with a per-diagnosis evidence chain. Without that chain, the clinician has no way to inspect why any single ranking landed where it did. What the verification step produces, beyond a re-ranked list, is the audit trail itself.
4. Why this matters for clinical deployment
The audit trail does concrete work. An accepted diagnosis arrives with the chain of evidence that supported it: which diagnostic criteria the verification step checked, which were met in the structured evidence in the chart, which were absent. A clinician looking at the verified ranking sees the candidates and the reasoning each candidate survived, candidate by candidate. The LLM's ranking is opaque on this dimension. The priors that produced it are baked into the model and are not legible from the output. The verification step writes a per-diagnosis record because writing one is what the step does.
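One way to picture the per-diagnosis record is as a small structure that is filled in during verification and serialized as-is. The field names and the `EvidenceRecord` type below are hypothetical; the post does not specify the engine's actual record format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceRecord:
    diagnosis: str
    verdict: str
    criteria_met: list[str] = field(default_factory=list)
    criteria_absent: list[str] = field(default_factory=list)
    criteria_contradicted: list[str] = field(default_factory=list)

    def criteria_checked(self) -> list[str]:
        # Every criterion lands in exactly one of the three buckets.
        return self.criteria_met + self.criteria_absent + self.criteria_contradicted


def to_audit_json(records: list[EvidenceRecord]) -> str:
    # Accepted and rejected candidates are serialized the same way.
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


records = [
    EvidenceRecord(
        "acute pericarditis",
        "accepted",
        criteria_met=["positional pleuritic pain", "recent viral prodrome"],
    ),
    EvidenceRecord(
        "pleural effusion",
        "demoted",
        criteria_absent=[
            "decreased breath sounds",
            "dullness to percussion",
            "documented effusion finding",
        ],
    ),
]

audit_json = to_audit_json(records)
```

The design point is that the record is a by-product of the check itself. Nothing is reconstructed afterwards, so the serialized trail and the verdict cannot drift apart.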
The audit trail does a second kind of work too. Rare conditions are scored on the evidence in the chart, and frequency priors do not override that score. The clinical implication shows up in primary care, where long-tail conditions matter more in aggregate than the handful of headline-frequency diagnoses in any single encounter. A system that ranks by base rate misses the rare-but-supported diagnoses at exactly the volume the long tail predicts. The reverse-direction reading described in section 2 is how pulmonary embolism surfaced in the opening case. A ranking driven by base rates alone would never have surfaced it; the chart's evidence did.
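The difference between the two ranking rules fits in a toy example. The condition names, base rates, and evidence scores below are invented placeholders, and the scoring itself is assumed to happen elsewhere. The sketch only shows that the sort key decides whether a rare but supported condition can reach the top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scored:
    name: str
    base_rate: float       # hypothetical prevalence, kept for display only
    evidence_score: float  # hypothetical share of criteria met in the chart


def rank_by_evidence(candidates: list[Scored]) -> list[Scored]:
    # base_rate is deliberately not part of the sort key.
    return sorted(candidates, key=lambda c: c.evidence_score, reverse=True)


def rank_by_base_rate(candidates: list[Scored]) -> list[Scored]:
    return sorted(candidates, key=lambda c: c.base_rate, reverse=True)


candidates = [
    Scored("common condition", base_rate=0.20, evidence_score=0.25),
    Scored("rare condition", base_rate=0.001, evidence_score=0.75),
]

by_evidence = [c.name for c in rank_by_evidence(candidates)]
by_base_rate = [c.name for c in rank_by_base_rate(candidates)]
```

Ranked by evidence, the rare condition comes first; ranked by base rate, it comes last, however well the chart supports it.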
When a diagnostic decision is questioned post-hoc, whether by a colleague, an audit, or a legal proceeding, the evidence chain that supported the accepted candidate is the document the clinician produces. That record exists at the moment the diagnosis is accepted, with the checked criteria, the chart evidence that satisfied them, and the rejected candidates and the reasons for those rejections already attached. No reconstructive guessing is required, because the reasoning was written down. This is what an audit trail looks like in deployment. The gap between the 98-percent number and the 95-percent number from the previous section is documentary, and this is where it lands.
5. What comes next
Shellfinity is looking for clinical-informatics partners to evaluate this engine on real chart data. The right partner is a clinical informatics director or healthcare-AI lead with a deployment context: a chart corpus the team can run the engine against, a working definition of what a useful differential looks like in that setting, and the latitude to inspect the per-diagnosis evidence chain on accepted and rejected candidates alike. Evaluation is structured as a bounded engagement on a representative slice of the corpus, with the audit trail as the primary deliverable.
If this is useful:
- Subscribe to the rest of the Vertical Use-Cases series: shellfinity.substack.com
- Evaluate Shellfinity for medical: shellfinity.com/medical
- Direct correspondence: [email protected]