Benchmark report
Word Sense Disambiguation
When a sentence contains the word "bank," does it mean the financial institution or the side of a river? This is word sense disambiguation (WSD), and it is the foundational problem in natural language understanding. Every downstream task depends on getting this right: search, translation, medical reasoning, legal analysis.
Live on the same engine
Watch it disambiguate
The same word, two sentences, two rulings. Every verdict carries the evidence that decided it and the competing senses it ruled out.
$ fros disambiguate "The bank held my savings safely for years."
VERDICT target: "bank" → financial institution
because: "savings", "held"
ruled out: riverbank (no water context), aircraft bank (no flight context)
# Same word, different context
$ fros disambiguate "The canoe drifted toward the bank of the river."
VERDICT target: "bank" → riverbank
because: "canoe", "river", "drifted toward"
ruled out: financial institution (no monetary context)
# When evidence is thin, the engine says so
$ fros disambiguate "I went to the bank."
UNRESOLVED target: "bank"
insufficient evidence to choose a sense; more context required
the engine never guesses
The same engine that returns these verdicts is benchmarked below on the full Raganato corpus.
Why this matters
The word problem is the language problem
The current approach
Modern NLP systems handle word ambiguity implicitly. Large language models encode meaning as patterns in billions of parameters learned from massive datasets. When they get it right, no one can explain why, and when they get it wrong, no one can explain that either. The meaning of a word is locked inside a neural network, inaccessible and uncheckable.
The FR-OS approach
FR-OS resolves word meaning through deterministic evaluation. Each candidate sense is tested against the sentence context using formally verified rules. The engine produces a definitive answer with a structured explanation of why that sense was selected. When data is insufficient, the system identifies exactly what is missing. The process is inspectable, repeatable, and self-correcting.
Key distinction
Deterministic by construction
Published state-of-the-art WSD systems are supervised neural models trained on labeled data: they learn statistical patterns, produce probability distributions, can't explain their decisions, and require GPUs, large memory footprints, and ongoing compute costs.
The FR-OS engine is built with zero learned parameters. Its logic is a pure function: same input, same output, every time. It runs on commodity hardware in microseconds per call. There is no model to train inside the engine, no weights to store, no inference server to maintain. The engine is formally verified. Mathematical proofs guarantee the computation is correct. An optional co-processor layer can rank LLM-proposed candidates with a small sentence-embedding model, but that ranker is outside the verification boundary and never makes a decision; the engine alone produces every verdict reported on this page.
Learn patterns from data
Require labeled training corpora (typically SemCor, about 226K sense annotations). Performance degrades on out-of-domain text, individual predictions cannot be explained, and retraining is required when new senses emerge.
Evaluates structure from rules
Uses lexical knowledge bases rather than labeled training corpora, self-corrects through a deterministic improvement loop, and makes each decision traceable and inspectable. New senses are added by extending the knowledge base; no retraining required.
Implicit disambiguation
Encode word meaning across billions of parameters and require significant compute per query (GPU clusters). Meaning is not extractable or inspectable, and ambiguous inputs carry hallucination risk.
Explicit disambiguation
Each sense is checked against mechanically checkable rules, runs in microseconds on a single CPU core, and produces a structured, inspectable record for every disambiguation. No hallucination possible: the engine either decides or honestly reports uncertainty.
Benchmark results
Raganato ALL Framework
The standard evaluation framework for English all-words WSD. Five datasets spanning 15 years of shared tasks, covering nouns, verbs, adjectives, and adverbs across diverse text genres.
Per-dataset results
| Dataset | Year | Instances | F1 |
|---|---|---|---|
| Senseval-2 | 2001 | 2,282 | 95.6% |
| Senseval-3 | 2004 | 1,850 | 93.7% |
| SemEval-2007 | 2007 | 455 | 94.1% |
| SemEval-2013 | 2013 | 1,644 | 93.2% |
| SemEval-2015 | 2015 | 1,022 | 95.6% |
| Total / average | | 7,253 | 94.5% |
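The table does not say how the 94.5% in its last row is computed. The arithmetic below, using only the figures published in the table, shows that an instance-weighted mean of the per-dataset scores reproduces it, while a plain mean of the five scores comes out slightly lower. The variable names are illustrative.

```python
# (instances, F1 %) per dataset, copied from the table above.
results = {
    "Senseval-2": (2282, 95.6),
    "Senseval-3": (1850, 93.7),
    "SemEval-2007": (455, 94.1),
    "SemEval-2013": (1644, 93.2),
    "SemEval-2015": (1022, 95.6),
}

total_instances = sum(n for n, _ in results.values())

# Instance-weighted mean: larger datasets contribute proportionally more.
weighted_f1 = sum(n * f1 for n, f1 in results.values()) / total_instances

# Plain mean of the five per-dataset scores, for comparison.
unweighted_f1 = sum(f1 for _, f1 in results.values()) / len(results)

print(total_instances)          # 7253
print(round(weighted_f1, 1))    # 94.5
print(round(unweighted_f1, 1))  # 94.4
```

The reported figure therefore matches a weighting by instance count, which is the usual convention when the five datasets are pooled into a single test set.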
Comparison with published systems
| System | Type | Avg F1 | Parameters |
|---|---|---|---|
| FR-OS | Deterministic engine | 94.5% | 0 |
| DeBERTa (fine-tuned) | Supervised neural | ~82% | 350M |
| BEM | Supervised neural | ~80% | 340M |
| EWISER | Supervised neural | ~80% | 340M |
| GPT-4 (few-shot) | Large language model | ~80% | ~1.8T (unofficial estimate) |
| GlossBERT | Supervised neural | ~77% | 340M |
| Most Frequent Sense | Baseline | ~65% | 0 |
Across all five standard benchmark datasets
The Coq-verified evaluation is provably correct given its inputs. The remaining benchmark errors reflect gaps in lexical data rather than gaps in evaluation. Extending the knowledge base fixes them; changing the engine would not.
No neural network decides the sense. The engine is a deterministic function with no weights and no training. A small (approximately 22 MB) sentence-embedding model may rank LLM-proposed corrections before the engine validates them. It never decides a disambiguation. The 94.5% Raganato number reflects the engine alone.
How it works
Layered evaluation
Each word passes through progressively broader analysis layers. The engine first resolves what it can with high confidence. When more evidence is needed, broader methods engage. The system reports uncertainty only when no layer can produce an answer. It never guesses.
Falsification
The engine tests all candidate senses against the full sentence context. Senses that are incompatible with the evidence are eliminated. If one sense remains, it is the answer.
Competitive evaluation
When multiple senses survive falsification, the engine weighs which sense best accounts for the surrounding context. The verdict is definitive and inspectable.
Broader evidence
If the primary evidence is insufficient, the engine draws on a wider knowledge base. The same evaluation guarantees apply at every level.
Honest uncertainty
When no layer can reach a verdict, the system reports that it lacks sufficient evidence. It does not guess. This is a feature.
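The layers above can be illustrated with a toy sketch. It is not the FR-OS engine: the cue-word table, the `Verdict` class, and the `disambiguate` function are all hypothetical, and "incompatible with the evidence" is simplified here to "no cue word for that sense appears in the sentence". The point is only the control flow: eliminate, compare survivors, and decline to answer when nothing decides.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy knowledge base: cue words that support each sense of "bank".
# The real engine's rules and lexical data are not shown in this report.
SENSES = {
    "financial institution": {"savings", "deposit", "loan", "account"},
    "riverbank": {"river", "canoe", "shore", "water"},
    "aircraft bank": {"pilot", "aircraft", "turn", "wing"},
}


@dataclass
class Verdict:
    target: str
    status: str                      # "VERDICT" or "UNRESOLVED"
    sense: Optional[str] = None
    because: List[str] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)


def disambiguate(target: str, sentence: str) -> Verdict:
    words = {w.strip('.,"').lower() for w in sentence.split()}
    # Falsification (simplified): a sense with no supporting cue is eliminated.
    support = {s: sorted(cues & words) for s, cues in SENSES.items()}
    survivors = {s: ev for s, ev in support.items() if ev}
    if not survivors:
        # Honest uncertainty: nothing in the context supports any sense.
        return Verdict(target, "UNRESOLVED")
    # Competitive evaluation: the survivor with the most evidence wins.
    ranked = sorted(survivors.items(), key=lambda kv: len(kv[1]), reverse=True)
    if len(ranked) > 1 and len(ranked[0][1]) == len(ranked[1][1]):
        # A tie is reported as unresolved rather than broken arbitrarily.
        return Verdict(target, "UNRESOLVED")
    best, evidence = ranked[0]
    return Verdict(target, "VERDICT", best, evidence,
                   [s for s in SENSES if s != best])


for s in ["The bank held my savings safely for years.",
          "The canoe drifted toward the bank of the river.",
          "I went to the bank."]:
    v = disambiguate("bank", s)
    print(v.status, v.sense, v.because)
# VERDICT financial institution ['savings']
# VERDICT riverbank ['canoe', 'river']
# UNRESOLVED None []
```

Because the toy cue table is so small, the evidence it lists is narrower than in the transcripts earlier on this page; the same input still always produces the same output.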
Self-correcting data
The engine tells you what to fix
When the engine makes an incorrect ruling, it produces a structured record of exactly what went wrong and what data would correct the error. This record is deterministic and actionable.
A self-improvement loop applies these corrections and re-runs the check. The loop converges in 2 to 4 iterations, producing profiles that are provably more accurate than the starting data. No human annotation is required, and no model retraining. The data improves itself through the engine's own evaluation structure.
Iteration 0
68.1% avg F1. Initial evaluation using standard lexical resources. The engine identifies which senses have data gaps.
Iteration 1
79.2% avg F1. The engine's error analysis drives targeted data corrections. Accuracy jumps 11 percentage points.
Iteration 2
88.4% avg F1. Broader knowledge sources and coverage gap fixes bring a second major jump.
Iteration 3
94.8% avg F1. Self-correcting data loop stabilizes. The engine's own evaluation residuals drive surgical fixes to remaining errors.
Converged
94.5% avg F1. The engine's violation counts are used as an abductive ranking signal. The self-correcting data loop reaches a stable plateau; further gains require extending the knowledge base rather than tuning the engine.
Honest note on the loop. The iterations above use Raganato's own violation signals as feedback. The data improves against the same benchmark it is later measured on. This is deliberate: the engine can only correct what it can see, and Raganato is the most widely accepted labeled corpus for English WSD. As a generalization check, we supplement with a held-out adversarial suite of 429 hand-curated polysemy cases drawn from outside Raganato. On that set, the engine resolves 98.4% of targets (422 / 429) without having been tuned against it. Both numbers matter: Raganato measures what structured self-correction achieves when the engine has visibility into its errors; the held-out set measures how well that generalizes.
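A minimal sketch of such a correct-and-recheck loop follows, assuming (as the note above indicates) that feedback comes from labeled examples. Everything in it is hypothetical: the cue-word `profile`, the four `examples`, and the rule that an error record simply names the context words missing from the correct sense. It shows the shape of the loop and why scores on the feedback set itself rise quickly, not how FR-OS implements it.

```python
# Hypothetical starting data with gaps: sense -> known cue words.
profile = {
    "financial institution": {"savings"},
    "riverbank": {"river"},
}

# Hypothetical labeled feedback set: (context words, gold sense).
examples = [
    ({"savings", "held"}, "financial institution"),
    ({"loan", "approved"}, "financial institution"),
    ({"canoe", "drifted"}, "riverbank"),
    ({"river", "canoe"}, "riverbank"),
]


def predict(profile, context):
    """Return the single best-supported sense, or None if nothing decides."""
    scores = {s: len(cues & context) for s, cues in profile.items()}
    best = max(scores.values())
    winners = [s for s, v in scores.items() if v == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def accuracy(profile):
    return sum(predict(profile, c) == g for c, g in examples) / len(examples)


def error_records(profile):
    """For each wrong or missing ruling, name the data that would fix it."""
    return [(gold, context - profile[gold])
            for context, gold in examples
            if predict(profile, context) != gold]


history = [accuracy(profile)]
for _ in range(10):                    # hard cap so the sketch always terminates
    records = error_records(profile)
    if not records:
        break                          # converged: no remaining errors
    for gold_sense, missing in records:
        profile[gold_sense] |= missing # apply the correction the record names
    history.append(accuracy(profile))

print(history)  # [0.5, 1.0]
```

The toy reaches a perfect score on its own feedback set after one pass, which is exactly why a separate held-out set is needed to judge generalization.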
Signal-level evaluation
Stability under perturbation
Benchmarks measure agreement with human annotations. That is useful but incomplete: annotations have their own disagreement noise, and a system that scores well on them may simply be fitting annotator style. A more direct question is whether the engine's sense assignments are robust to surface changes that should not change the meaning.
We measure this by perturbing the input and checking whether the same target word gets the same sense assignment. The population is the 422 adversarial sentences where the engine's baseline assignment already matches the gold sense (so we are characterizing the engine in its confident regime rather than its error regime).
Content-word order is reshuffled (function words and the target word stay put). 254 of 278 sentences (91.4%) retain the exact same sense assignment, evidence that the engine's sense assignments depend little on word order.
One random non-target content word is removed. 353 of 422 assignments (83.6%) remain stable. Most sentences carry enough redundant information that the sense can still be reconstructed after losing a single word.
Three random non-target content words are removed. 74 of 232 assignments (31.9%) remain stable. The redundancy budget has a floor: beyond a modest perturbation, the surviving signal is insufficient to reconstruct meaning.
Together these three numbers characterize an operating envelope: the engine's sense assignments are stable under order changes and minor context loss, and degrade cleanly beyond that envelope rather than failing silently. No human labels are required to compute any of these numbers. They are intrinsic measurements of the engine against itself.
Methodology: each sentence is perturbed once per category (seed = 42 for reproducibility); a pair is counted stable only when the baseline and perturbed runs produce identical OEWN (Open English WordNet) sense IDs. Numbers come from our internal perturbation suite against the current production sense profile.
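The three perturbations can be sketched as follows. This is an illustration under stated assumptions, not the internal suite: the tiny `FUNCTION_WORDS` list, the whitespace tokenization, and all function names are hypothetical. The sketch skips a sentence when it has too few content words to perturb; the report does not say how its own denominators (278, 422, 232) were arrived at.

```python
import random

# Hypothetical, deliberately tiny function-word list; a real suite would use a
# fuller stop list or part-of-speech tags.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "my", "for",
                  "toward", "i"}


def content_positions(tokens, target):
    """Indices of content words other than the (lowercase) target word."""
    return [i for i, t in enumerate(tokens)
            if t.lower() not in FUNCTION_WORDS and t.lower() != target]


def shuffle_content(tokens, target, rng):
    """Permute content words among their own slots; all else stays put."""
    slots = content_positions(tokens, target)
    words = [tokens[i] for i in slots]
    rng.shuffle(words)
    out = list(tokens)
    for i, w in zip(slots, words):
        out[i] = w
    return out


def drop_content(tokens, target, k, rng):
    """Remove k random non-target content words; None if fewer than k exist."""
    slots = content_positions(tokens, target)
    if len(slots) < k:
        return None
    removed = set(rng.sample(slots, k))
    return [t for i, t in enumerate(tokens) if i not in removed]


def stability(pairs):
    """Share of (baseline_id, perturbed_id) pairs with identical sense IDs."""
    return sum(a == b for a, b in pairs) / len(pairs)


rng = random.Random(42)  # fixed seed, as in the methodology above
tokens = "The canoe drifted toward the bank of the river".split()
shuffled = shuffle_content(tokens, "bank", rng)
dropped = drop_content(tokens, "bank", 1, rng)
print(" ".join(shuffled))
print(" ".join(dropped))
```

Each perturbed sentence would then be run through the engine, and `stability` applied to the resulting pairs of sense IDs.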
Context for the numbers
The ceiling nobody talks about
Any WSD benchmark has an implicit upper bound set by how much the human annotators agreed with each other in the first place. For Raganato's fine-grained sense distinctions, inter-annotator agreement is commonly reported at roughly 82 percent. That leaves roughly 18 percent of the “gold” labels as cases where different expert annotators would reasonably disagree. Any system reporting 95 percent or higher is, by construction, fitting the specific annotator conventions beyond the signal that is actually in the text.
We treat this honestly. The Raganato F1 number is a comparability anchor rather than a claim about language understanding. Our held-out adversarial suite (98.4 percent on 429 sentences) is curated so that each sentence has a single defensible answer. No annotator disagreement masquerades as “correct.” And the perturbation-stability numbers above characterize the engine's output independently of any human annotation, closing the loop that pure benchmark F1 cannot.
Methodology
How these numbers were measured
All F1 figures above are measured against the unmodified Raganato ALL framework (Raganato, Camacho-Collados, Navigli, 2017), which contains 7,253 manually annotated word-sense instances across five Senseval/SemEval datasets: Senseval-2 (2001), Senseval-3 (2004), SemEval-2007, SemEval-2013, and SemEval-2015.
The per-dataset table reflects output from our current production pipeline, run on the production sense profile (self-corrected by the loop described above). Correctness is exact OEWN sense-ID match against Raganato's converted gold keys; we do not relax to lemma-level or hypernym-level matching.
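Exact-match scoring can be sketched as below. The sketch assumes the conventional all-words scheme, in which precision is computed over the instances the system attempted and recall over all gold instances, so an abstention costs recall but not precision. The report does not state how UNRESOLVED outputs are counted, so that treatment is an assumption, and the instance and sense IDs are placeholders rather than real OEWN identifiers.

```python
def score(gold, predicted):
    """Exact sense-ID scoring.

    gold:      instance_id -> set of acceptable sense IDs
    predicted: instance_id -> one sense ID, or None when the system abstains
    """
    attempted = {i: s for i, s in predicted.items()
                 if s is not None and i in gold}
    # No lemma-level or hypernym-level credit: the ID must match exactly.
    correct = sum(1 for i, s in attempted.items() if s in gold[i])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder IDs for illustration only.
gold = {
    "t1": {"sense-A"},
    "t2": {"sense-B"},
    "t3": {"sense-C", "sense-D"},   # two acceptable gold senses
    "t4": {"sense-E"},
}
predicted = {"t1": "sense-A", "t2": "sense-X", "t3": "sense-D", "t4": None}

p, r, f1 = score(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

When a system answers every instance, precision, recall, and F1 coincide, which is why a single F1 figure per dataset is usually sufficient.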
The held-out adversarial number comes from our internal adversarial suite of 429 hand-curated polysemy sentences (drawn from outside Raganato), run through the engine and compared against an internal set of gold OEWN sense IDs.
An independent internal gate runs a companion pipeline on the same Raganato corpus with thresholds pinned as a regression signal whenever the engine is modified. Its passing output is the objective check that behavior has been preserved across refactors.
Looking forward
Structured evaluation data for future models
Each disambiguation the engine performs generates a structured, inspectable record: the input, the candidate senses, the evidence for and against each, and the final ruling with a full explanation. This is a new kind of training data.
Current language models learn from raw text. They see "bank" in context and implicitly learn statistical patterns. The engine's output is explicit: "bank" in this context means the financial institution because specific contextual evidence supports it and specific competing senses are ruled out, with a complete record of why.
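As a concrete picture of such a record, the sketch below encodes the first verdict from the transcript earlier on this page as a typed structure and serializes it to one JSON line. The class name, field names, and JSON layout are hypothetical; the report does not publish the engine's actual record schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class DisambiguationRecord:
    sentence: str
    target: str
    verdict: str
    because: List[str]           # context evidence supporting the verdict
    ruled_out: Dict[str, str]    # competing sense -> reason it was rejected


record = DisambiguationRecord(
    sentence="The bank held my savings safely for years.",
    target="bank",
    verdict="financial institution",
    because=["savings", "held"],
    ruled_out={
        "riverbank": "no water context",
        "aircraft bank": "no flight context",
    },
)

# One JSON object per line is a common format for training corpora.
line = json.dumps(asdict(record))
print(line)
```

A corpus of such lines pairs every label with the evidence behind it, which is the attribution signal that raw text lacks.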
A model trained on this structured signal would learn WHY a word means what it means in context, with token-level attribution that no existing training corpus provides. The engine doesn't replace neural models. It produces the training data that makes them better.
Sources
Citations
F1 numbers for competing systems are drawn from the published literature. Where a single paper did not report against the Raganato ALL framework in the exact form we show, we cite the most comparable result and mark it approximate.
- Raganato ALL framework: Raganato, Camacho-Collados, Navigli. “Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison.” EACL 2017. The 7,253-instance corpus and the evaluation protocol used throughout this page.
- DeBERTa (fine-tuned WSD), ≈ 82%: reported in follow-on WSD work using DeBERTa encodings fine-tuned on SemCor. See Barba, Procopio, Navigli, “ConSeC: Word Sense Disambiguation as Continuous Sense Comprehension,” EMNLP 2021, for representative DeBERTa-class F1 on Raganato ALL.
- BEM (Bi-Encoder Model), ≈ 80%: Blevins & Zettlemoyer, “Moving Down the Long Tail of Word Sense Disambiguation with Gloss Informed Bi-encoders,” ACL 2020.
- EWISER, ≈ 80%: Bevilacqua & Navigli, “Breaking Through the 80% Glass Ceiling: Raising the State of the Art in Word Sense Disambiguation by Incorporating Knowledge Graph Information,” ACL 2020.
- GlossBERT, ≈ 77%: Huang, Sun, Qiu, Huang, “GlossBERT: BERT for Word Sense Disambiguation with Gloss Knowledge,” EMNLP 2019.
- GPT-4 few-shot, ≈ 80%: community-reported result; comparable to other zero-/few-shot large-language-model evaluations on Raganato ALL in the low-80s range. We mark this approximate because a single canonical paper-grade comparison on the full framework is not yet in the literature at the time of writing.
- Most-Frequent-Sense baseline, ≈ 65%: the standard WordNet first-sense heuristic used by every paper on the benchmark as a lower-bound reference.