Combining LLMs Rarely Beats the Single Best Model: the co-failure ceiling

01

If they are all wrong, no vote can win.

Routing, voting, and cascading all hand back one model's answer. So your ceiling is set by how often every model is wrong at once. Call that β. It is not how often models agree (ρ).

Picture a panel of experts where you can only return one expert's answer. Choosing the best one helps, right up until a question lands on a blind spot they all share. Then no rule wins, because the right answer was never in the room.

That is the ceiling, and it is exact. Give a query to a pool of m models. If every one is wrong, no selection policy (router, weighted vote, cascade, debate) can be right, since each returns one member's answer. Accuracy is capped at 1−β, where β = P(all m wrong).
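The bound is easy to check on any graded outcome matrix. The following is a minimal sketch, assuming a hypothetical boolean matrix `correct[q][j]` (query q, model j) and hypothetical helper names. The toy matrix is illustrative data, not the paper's.

```python
from typing import Callable, Sequence

Outcomes = Sequence[Sequence[bool]]  # correct[q][j]: did model j get query q right?


def cofailure_rate(correct: Outcomes) -> float:
    """beta-hat: fraction of queries on which every model is wrong."""
    return sum(1 for row in correct if not any(row)) / len(correct)


def selection_accuracy(correct: Outcomes, pick: Callable[[int], int]) -> float:
    """Accuracy of a policy that returns model pick(q)'s answer on query q."""
    return sum(1 for q, row in enumerate(correct) if row[pick(q)]) / len(correct)


# Toy data (illustrative only): 5 queries, 3 models.
toy = [
    [True, False, True],
    [False, False, False],  # shared blind spot: no selector can recover this one
    [False, True, False],
    [True, True, True],
    [False, False, True],
]

beta_hat = cofailure_rate(toy)
ceiling = 1.0 - beta_hat

# An oracle selector picks a correct model whenever one exists.
oracle = selection_accuracy(toy, lambda q: max(range(3), key=lambda j: toy[q][j]))
# "Always use model 0" is an ordinary, non-oracle policy.
always_first = selection_accuracy(toy, lambda q: 0)
```

In this toy pool the oracle lands exactly on the ceiling 1−β̂, and any realistic policy sits at or below it.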

The field reports pairwise correlation ρ instead, and ρ is provably blind to β. You can hold the entire pairwise law fixed and still move β, a Fréchet-class fact we make exact in the paper. A single-factor copula calibrated on ρ underprices the co-failure tail, a bias that grows with pool size, driven by a common-mode atom that no pairwise number represents.
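A three-model toy case makes the blindness concrete. The sketch below is a textbook construction, not the paper's derivation: three joint laws over binary outcomes (1 = correct, 0 = wrong) whose pairwise distributions are identical, yet whose co-failure rates differ. All names are hypothetical.

```python
from itertools import product


def joint_from_rule(third):
    """Uniform law over (x1, x2), with x3 = third(x1, x2)."""
    return {(a, b, third(a, b)): 0.25 for a, b in product((0, 1), repeat=2)}


def pairwise_law(joint, i, j):
    """Marginal distribution of the pair (x_i, x_j)."""
    out = {}
    for outcome, p in joint.items():
        key = (outcome[i], outcome[j])
        out[key] = out.get(key, 0.0) + p
    return out


def beta(joint):
    """P(all three models wrong)."""
    return joint.get((0, 0, 0), 0.0)


pools = {
    "independent": {xs: 0.125 for xs in product((0, 1), repeat=3)},
    "xor": joint_from_rule(lambda a, b: a ^ b),
    "xnor": joint_from_rule(lambda a, b: 1 - (a ^ b)),
}

pairs = [(0, 1), (0, 2), (1, 2)]
laws = {name: [pairwise_law(j, a, b) for a, b in pairs] for name, j in pools.items()}
betas = {name: beta(j) for name, j in pools.items()}
```

Every pair in every pool is uniform on the four outcomes, so each model is 50% accurate and every pairwise correlation is zero. The co-failure rate still takes three different values across the pools, which no pairwise statistic can distinguish.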

02

Know your ceiling before you build it.

Grade the models once on a held-out set and count the questions all of them missed. That count alone caps what any router could add. No training, no cost. Move the inputs and watch the ceiling.

FIG 1 · Realizability certificate (MATH-500 default). Inputs: all-models-wrong count K, queries n, single-best accuracy, orchestration overhead. Outputs: β̂ = K/n, certified ceiling 1−β_lo (95% CP), certified max gain, and a verdict.

Defaults are the paper's MATH-500 run (K=17, n=330, 67 models, β̂ = 17/330 ≈ 0.052). The Clopper-Pearson lower bound β_lo on β turns the count K/n into a certified ceiling 1−β_lo on achievable accuracy, the most any router, vote, or cascade could reach. Subtracting your measured single-best accuracy from that ceiling upper-bounds the gain, with ≥95% coverage on β from one labelled sample and no router trained. The point-estimate ceiling is 1−β̂ = 0.948. The certified ceiling 1−β_lo is deliberately looser: because β_lo ≤ β̂, it sits at or above the point estimate.
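The certificate can be reproduced with the standard library alone. This is a minimal sketch, not the paper's released code: it assumes a one-sided Clopper-Pearson bound at α = 0.05 (the page says only "95% CP"), finds it by bisection on the exact binomial tail, and uses hypothetical function names. The `single_best` value is a placeholder, not a figure from the paper.

```python
import math


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson_lower(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided lower bound: the p at which P(X >= k) equals alpha."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n  # the tail at p = k/n is well above alpha
    for _ in range(100):  # bisection; the tail is increasing in p
        mid = (lo + hi) / 2
        if binom_upper_tail(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


K, n = 17, 330                      # the page's MATH-500 defaults
beta_hat = K / n
beta_lo = clopper_pearson_lower(K, n)
certified_ceiling = 1.0 - beta_lo   # at or above the point estimate 1 - beta_hat

single_best = 0.90                  # HYPOTHETICAL placeholder: use your own measurement
certified_max_gain = certified_ceiling - single_best
```

A two-sided interval would use α/2 in place of α and give a slightly smaller β_lo, hence a slightly higher certified ceiling.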

03

Add more models and the gap widens.

The usual estimate reads joint failure off pairwise agreement. It runs low, and runs lower as the pool grows, because the models share blind spots that no pair reveals. Drag the slider to see it open up.

FIG 2 · Underpricing vs pool size (MATH-500), composition-bootstrapped. Control: pool size k (default k = 67).

Empirical β over the tetrachoric single-factor prediction: the median across random k-model subsets, with a 5-95% band. At the full pool k=67 the residual is ·: a common-mode atom, not a calibration artifact. Robustness in the rail.
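For intuition about what a single-factor prediction looks like, here is a simplified sketch. It is not the paper's estimator: it assumes an exchangeable pool (one shared error rate `p_wrong` and one latent correlation `rho`, both hypothetical inputs) under a one-factor Gaussian copula, whereas the paper calibrates on tetrachoric correlations from real outcome matrices. Reading "underpricing" as a ratio is also an assumption.

```python
from statistics import NormalDist

_STD = NormalDist()


def one_factor_beta(p_wrong: float, rho: float, k: int, grid: int = 4001) -> float:
    """P(all k models wrong) under an exchangeable one-factor Gaussian copula.

    Model j fails when sqrt(rho)*Z + sqrt(1-rho)*e_j < t, with t = Phi^-1(p_wrong).
    Given the common factor Z, failures are independent, so we integrate over Z.
    Requires 0 < p_wrong < 1 and 0 <= rho < 1.
    """
    t = _STD.inv_cdf(p_wrong)
    lo, hi = -8.0, 8.0
    step = (hi - lo) / (grid - 1)
    total = 0.0
    for i in range(grid):
        z = lo + i * step
        cond = _STD.cdf((t - rho ** 0.5 * z) / (1.0 - rho) ** 0.5)
        weight = 0.5 if i in (0, grid - 1) else 1.0  # trapezoid rule
        total += weight * cond ** k * _STD.pdf(z)
    return total * step


def underpricing_ratio(beta_empirical: float, p_wrong: float, rho: float, k: int) -> float:
    """Empirical co-failure rate divided by the one-factor prediction."""
    return beta_empirical / one_factor_beta(p_wrong, rho, k)
```

With `rho = 0` the prediction collapses to `p_wrong ** k`, the independence baseline. A ratio above 1 means the copula predicts fewer joint failures than were observed.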

04

Two regimes, and the task picks one.

On open-ended math and code, every model trips on some of the same questions, so the ceiling bites. On multiple-choice, someone always lands the answer, so β is near zero and combining only breaks ties.

Co-failure (β > 0, with the same Pearson-trap and full-Σ residual) holds on two mathematics benchmarks and execution-graded competitive code, and inverts on multiple-choice science. The lever is open-endedness, not subject matter.

05

Same questions, new format, the ceiling appears.

Take hard science questions. As multiple-choice, models can guess or eliminate, so someone is always right. Remove the options, make them answer cold, and 10 of 79 now stump every model at once.

FIG 3 · Content-controlled format flip (5-judge panel, κ 0.73-0.92). Toggle: answer format on the same questions. Readouts: co-failure β (≈ 0 in the view shown), mean accuracy (matched models), all-models-wrong items out of 79.

Each cell is one of the 79 GPQA-Diamond items, content held fixed. Toggle only the format and a co-failure block opens at the left edge. β goes from ~0 (multiple-choice) to 0.127 (open-ended): 10 of 79 items where every model is wrong. The tail stays positive under every judge rule (majority 0.127, unanimous 0.241, lenient 0.038), so it is not a grading knob.
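The judge-rule comparison can be sketched as follows. The rule definitions are assumptions, since the page names the rules without defining them: "lenient" accepts an answer if any judge does, "majority" if more than half do, "unanimous" only if all do. The `votes[item][model][judge]` layout and the toy votes are hypothetical.

```python
from typing import Callable, Sequence

Votes = Sequence[Sequence[Sequence[bool]]]  # votes[item][model][judge]

# Assumed rule definitions (not confirmed by the source).
RULES: dict[str, Callable[[Sequence[bool]], bool]] = {
    "lenient": any,
    "majority": lambda v: sum(v) * 2 > len(v),
    "unanimous": all,
}


def beta_under_rule(votes: Votes, rule: str) -> float:
    """Fraction of items on which no model's answer is accepted under the rule."""
    accept = RULES[rule]
    stumped = sum(1 for item in votes if not any(accept(v) for v in item))
    return stumped / len(votes)


T, F = True, False
# Toy panel (illustrative only): 4 items, 2 models, 5 judges.
toy_votes = [
    [[T, T, T, T, T], [F, F, F, F, F]],  # accepted under every rule
    [[T, T, T, F, F], [F, F, F, F, F]],  # fails only the unanimous rule
    [[T, F, F, F, F], [F, F, F, F, F]],  # passes only the lenient rule
    [[F, F, F, F, F], [F, F, F, F, F]],  # every model wrong under every rule
]

betas_by_rule = {rule: beta_under_rule(toy_votes, rule) for rule in RULES}
```

Under these definitions β can only rise as the rule gets stricter, which matches the ordering reported above (lenient 0.038, majority 0.127, unanimous 0.241).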

06

The cast: 67 frontier models, 21 providers.

Every number here recomputes live over one 2026 OpenRouter pool, from $30/Mtok flagships down to $0.03/Mtok open weights. The roster, the matrices, the grading, and the code are all released to rerun.

67 models · 21 providers · priced live · temperature 0 · one 67×67 co-failure matrix

Tier: frontier · mid · cheap / open-weight. Every instrument above draws on this pool, and all of it is released to replicate: the full roster with live prices, the 67×67 outcome matrices, the grading, and the analysis code. Every number on this page regenerates offline.