The Buildacase leaderboard ranks BP debaters from public Tabbycat tabs. Each tournament contributes two numbers per debater: a speaker rating (how their speeches scored relative to that tournament's field) and an outcome rating (how their team placed against the room they were in). The displayed percentile is a 50/50 blend.
This post walks through how each of those numbers is calculated, why we picked Massey over the more theoretically correct Plackett-Luce, how shrinkage handles low-tab debaters, and where the model still falls short.
Jump to: Two signals · Plackett-Luce experiment · Why Massey won · Bayesian shrinkage · Tournament strength · Judge tiers · What we got wrong · What's next
Two signals
The speaker rating is a within-tournament z-score. A 78 at WUDC isn't the same as a 78 at a regional, because score philosophy, panel generosity, and tab culture all vary. Z-scoring within each tournament removes that. We also recency-weight on an 18-month half-life, so a 2020 result counts roughly 6% as much as a 2026 result.
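A minimal sketch of both steps (the within-tournament z-score and the half-life decay), assuming speaker scores live in a pandas DataFrame; the column names and the `speaker_signal` helper are illustrative, not the production code:

```python
import pandas as pd

HALF_LIFE_MONTHS = 18  # recency half-life used for weighting

def speaker_signal(df: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """df has one row per (debater, tournament) with columns:
    'debater', 'tournament', 'avg_speaks', 'tournament_date'."""
    # Z-score within each tournament so a 78 at WUDC and a 78 at a
    # regional are each compared against their own field.
    z = df.groupby("tournament")["avg_speaks"].transform(
        lambda s: (s - s.mean()) / s.std(ddof=0)
    )
    # Exponential recency decay: weight halves every 18 months.
    months_old = (as_of - df["tournament_date"]).dt.days / 30.44
    weight = 0.5 ** (months_old / HALF_LIFE_MONTHS)
    # Weighted average of z-scores per debater.
    num = (z * weight).groupby(df["debater"]).sum()
    den = weight.groupby(df["debater"]).sum()
    return num / den
```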
The outcome rating does something different. Given who was in the room, how good a result was 1st (or 2nd, or 3rd)? Winning a room of four WUDC finalists is worth more than winning a room of first-timers. The rating absorbs strength of schedule automatically.
The two signals agree most of the time (Pearson ~0.89 in our data). When they don't, the gap is useful. A strong speaker on weak teams ends up with a high speaker rating and a middling outcome rating. A consistent room-winner whose speaks are workmanlike ends up the opposite.
The Plackett-Luce experiment
The outcome rating uses a Massey rating: a least-squares fit to team-point margins. Massey treats the gap between 3rd and 4th as equivalent to the gap between 1st and 2nd, which isn't quite right. BP outcomes are discrete ranks worth 3, 2, 1, or 0 team points, not a continuous interval scale.
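As a sketch, a Massey fit on BP rooms looks roughly like this: every pair of teams in a room contributes one equation (rating difference equals team-point margin), and the system is solved by least squares with the mean rating pinned to zero. The team IDs and the `massey_ratings` helper are illustrative:

```python
import numpy as np
from itertools import combinations

def massey_ratings(rooms: list[list[str]]) -> dict[str, float]:
    """rooms: each room lists four team IDs in finishing order
    (1st..4th), worth 3, 2, 1, 0 team points respectively."""
    teams = sorted({t for room in rooms for t in room})
    idx = {t: i for i, t in enumerate(teams)}
    rows, margins = [], []
    for room in rooms:
        points = dict(zip(room, [3, 2, 1, 0]))
        # Every pair in the room contributes one equation:
        # rating[a] - rating[b] = team-point margin between them.
        for a, b in combinations(room, 2):
            row = np.zeros(len(teams))
            row[idx[a]], row[idx[b]] = 1.0, -1.0
            rows.append(row)
            margins.append(points[a] - points[b])
    X = np.array(rows)
    y = np.array(margins, dtype=float)
    # Ratings are only identified up to a constant, so pin the mean to 0.
    X = np.vstack([X, np.ones(len(teams))])
    y = np.append(y, 0.0)
    ratings, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(teams, ratings))
```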
The standard model for "rank these four teams" data is Plackett-Luce (PL). It's a generative model: the 1st-place team is drawn from the pool of four using a softmax over team strength parameters, 2nd is drawn from the remaining three, and so on. It's the textbook approach for preference data in economics and ML. I tried it.
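For reference, the PL log-likelihood of one room's observed ranking looks like this (a sketch; a full fit sums this over all rooms and maximizes over the strength parameters):

```python
import numpy as np

def pl_log_likelihood(strengths: np.ndarray) -> float:
    """Log-probability of observing the ranking 1st > 2nd > 3rd > 4th,
    where strengths[i] is the latent strength of the i-th placed team.
    Each place is a softmax draw from the teams still remaining."""
    logp = 0.0
    for i in range(len(strengths) - 1):
        remaining = strengths[i:]
        logp += strengths[i] - np.log(np.sum(np.exp(remaining)))
    return logp
```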
Two consecutive PL runs on the same code, same data, same database produced completely different top-5 rankings:
| Rank | Run 1 | Run 2 |
|---|---|---|
| 1 | Clement Tsao (Penn, 8 tabs) | Udai Kamath (Sydney, 5 tabs) |
| 2 | Will Ryan (UCLA, 17 tabs) | Chris Mentis Cravaris (Imperial, 6 tabs) |
| 3 | Matt Mauriello (Harvard, 9 tabs) | Tejas Subramaniam (Stanford, 13 tabs) |
| 4 | Laura Pilmark (independent, 8 tabs) | Xiao-ke Lu (Princeton, 4 tabs) |
| 5 | Aidan Woo (Oxford, 10 tabs) | Max Rosen (McGill, 9 tabs) |
Zero overlap. The model wasn't just unstable in some statistical sense; different runs were finding different local optima or hitting the iteration cap before converging.
The outcome rating ranges told the same story. Run 1 had a top-50 spread of 0.83; Run 2 had 1.27; Massey produced 1.60 on the same data. PL was barely differentiating debaters, and what differences it did produce were noise.
There was also a false-positive problem. One mid-field debater showed up in the PL pandemic-era top-100 with an outcome rating 0.73 points above their speaker rating. Massey had them outside the top 50, which is closer to right.
Why the "wrong" model won
The PL failures have technical causes: sparse comparison graphs (most debater pairs never meet directly), an identifiability problem (PL only pins down strength parameters up to an additive constant), and slow convergence through the "hub" tournaments connecting most of the network. All fixable with engineering effort.
Fixing them wouldn't really change the answer, though, because the underlying issue is the generative model itself, not the implementation.
PL assumes judges decide sequentially: pick the best team from all four, then pick the best from the remaining three, then the best from the final two. That's how it constructs the probability of any ranking.
Judges don't think like that. A BP judge looks at a room and thinks in margins. "Team A was clearly the best. Team D clearly the worst. B and C were close, I gave the edge to B." The deliberation is about latent quality scores, not sequential elimination. Ranking is the byproduct of ordering by quality, not the input.
Massey is closer to that. It fits latent quality scores that explain the observed margins, rather than modeling a selection procedure. The math is "wrong" for ranked data in the abstract, but it's the right fit for the data BP judges actually produce.
So the choice isn't between models that are correct or incorrect in the abstract; it's between models that match the data-generating process and ones that don't. PL would probably be the right call for horse racing, or consumer choice, or anything where the decisions are genuinely sequential. For judged-quality data, Massey fits better.
We archived the PL implementation and stayed with Massey.
Bayesian shrinkage
The old inclusion gate was a hard cutoff: 8 tabs to appear in the leaderboard, 4–7 for a provisional rating, fewer than 4 invisible. A debater with 3 tabs was missing; their 4th tab made them appear.
That's been replaced with Bayesian shrinkage. Start by assuming every debater is average, then update toward their observed data as evidence accumulates.
The posterior-mean rating is a weighted average of the population mean (zero) and the debater's observed rating. The weight on observed data grows with sample size. A 2-tab debater gets pulled toward zero. A 15-tab debater is trusted almost entirely. The population variance and observation noise are computed from the dataset's established debaters at runtime, so there are no hand-tuned hyperparameters.
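A minimal sketch of that update, assuming a normal-normal model with a zero-mean prior; `pop_var` and `obs_var` stand in for the population variance and observation noise estimated from established debaters:

```python
def shrunk_rating(observed_mean: float, n_tabs: int,
                  pop_var: float, obs_var: float) -> float:
    """Posterior mean under a normal-normal model: prior N(0, pop_var),
    each tab observed with noise variance obs_var."""
    prior_precision = 1.0 / pop_var
    data_precision = n_tabs / obs_var
    weight_on_data = data_precision / (prior_precision + data_precision)
    return weight_on_data * observed_mean  # prior mean is 0

# With illustrative values pop_var = 1.0 and obs_var = 4.0:
#   2 tabs  -> weight ~0.33, rating pulled two-thirds of the way to 0
#   15 tabs -> weight ~0.79, observed rating mostly trusted
```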
The practical effect: the leaderboard grew from 586 rated debaters to 2,595. The top-100 is structurally unchanged. Low-tab names appear only in the long tail with appropriately uncertain ratings. Brier score (a measure of how well the ratings predict actual round outcomes) dropped from 0.178 to 0.164:
| Model | Brier score | Top-1 accuracy | Rounds |
|---|---|---|---|
| Random baseline | 0.188 | 25.0% | — |
| Massey (pre-shrinkage) | 0.178 | 39.5% | 13,501 |
| Massey + shrinkage | 0.164 | 47.6% | 13,501 |
These are in-sample numbers — same rounds used to compute the ratings, same rounds used to test them. The held-out numbers are in the "What we know we got wrong" section below.
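For reference, here's how a Brier score like this can be computed, assuming it's the per-room mean squared error between the four predicted "takes 1st" probabilities and the one-hot outcome; that definition is consistent with the 0.188 uniform baseline:

```python
import numpy as np

def room_brier(pred_probs: np.ndarray, winner_index: int) -> float:
    """Mean squared error between the predicted 'takes 1st' probabilities
    for the four teams and the one-hot actual outcome."""
    actual = np.zeros(4)
    actual[winner_index] = 1.0
    return float(np.mean((pred_probs - actual) ** 2))

# Uniform guessing: ((1 - 0.25)**2 + 3 * 0.25**2) / 4 = 0.1875 ~ 0.188
print(room_brier(np.full(4, 0.25), winner_index=0))
```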
Tournament strength
A first place at WUDC is worth more than a first place at a novice IV. The tournament strength score captures that.
We compute five signals per tab:
- recentElite: attendees who placed top-50 at a WUDC in the last 3 years
- top10pct: how many people in the overall leaderboard's top 10% showed up
- wudcBrk: WUDC breaker crossover
- acAvg: mean rating of the adjudication core
- p75: 75th-percentile attendee rating
These used to be averaged equally. The five signals all measure roughly the same thing (field prestige), so equal averaging double-counted the dominant axis.
A PCA across all BP tabs gives a first principal component that explains 63.5% of the variance. Each signal now gets weighted by its PC1 loading:
| Signal | PC1 weight |
|---|---|
| recentElite | 0.250 |
| top10pct | 0.241 |
| wudcBrk | 0.233 |
| acAvg | 0.182 |
| p75 | 0.093 |
p75 collapsed to 0.09: it's the most redundant, largely captured by the elite counts. Recent WUDC presence leads because it's the cleanest "the field was strong" signal we have.
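One way to reproduce weights like these (a sketch, not the production pipeline): standardize the five signals across tournaments, take the absolute PC1 loadings, and normalize them to sum to 1, which is consistent with the weights above summing to roughly 1. The function names and the normalization choice are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

SIGNALS = ["recentElite", "top10pct", "wudcBrk", "acAvg", "p75"]

def strength_weights(signal_matrix: np.ndarray) -> dict[str, float]:
    """signal_matrix: one row per tournament, columns in SIGNALS order."""
    scaled = StandardScaler().fit_transform(signal_matrix)
    pca = PCA(n_components=1).fit(scaled)
    loadings = np.abs(pca.components_[0])
    weights = loadings / loadings.sum()  # normalize so weights sum to 1
    return dict(zip(SIGNALS, weights))

def tournament_strength(scaled_row: np.ndarray,
                        weights: dict[str, float]) -> float:
    """Weighted sum of a single tournament's scaled signals."""
    w = np.array([weights[s] for s in SIGNALS])
    return float(scaled_row @ w)
```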
Judge tiers
The leaderboard also rates judges. The old tier-break system found the biggest score gap in positions 5–22 of the ranked field and placed the S/A boundary there. That's pattern-matching on noise: adding or removing one tournament shifted the S-tier boundary by 3–5 names.
Tiers now come from a 3-component Gaussian mixture fit on the top-100 judge scores. Each judge belongs to whichever component gives the highest posterior probability. The components come out cleanly separated:
| Tier | Mean score | Share of top-100 |
|---|---|---|
| S | ~244 | 33% |
| A | ~173 | 40% |
| B | ~122 | 27% |
Removing a single tournament now shifts boundaries by 0–1 names. The tier assignments are stable.
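A minimal sketch of that fit using scikit-learn's `GaussianMixture`; the mapping from components to S/A/B by descending component mean is an assumption about the labeling, not taken from the production code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def judge_tiers(top100_scores: np.ndarray) -> list[str]:
    """Fit a 3-component Gaussian mixture to the top-100 judge scores
    and assign each judge to the component with the highest posterior."""
    X = top100_scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    labels = gmm.predict(X)  # argmax posterior component per judge
    # Map components to tiers by descending component mean: S > A > B.
    order = np.argsort(gmm.means_.ravel())[::-1]
    tier_of = {comp: tier for tier, comp in zip("SAB", order)}
    return [tier_of[c] for c in labels]
```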
What we know we got wrong
The held-out numbers. We ran the held-out version: trained ratings on BP rounds before 2024, then predicted 2024–2025 rounds cold. Best Brier was 0.176 at λ=4 on the cleaner drop-set (rooms with at least one rated team). For honest deployment, we also computed a keep-set where rooms with all four teams unrated stay in and get field-mean imputation: 0.181 Brier on the full 10,429-room test pool. Both decisively beat the 0.188 random baseline. The gap from in-sample (0.164) to held-out drop (0.176) is +0.012 Brier — small enough that the model isn't obviously overfitting.
The honest caveat: 49% of post-2024 teams have both speakers fully unrated (they entered the circuit too recently for pre-2024 data to exist). The drop-set hides those rooms; the keep-set keeps them and pays the predictive cost. We're publishing both because they answer different questions: drop is "is the rating signal real on rooms where the model knows the teams?" — keep is "what's the predictive accuracy across the actual post-2024 field?"
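A sketch of the two evaluation modes; the softmax mapping from team strengths to "takes 1st" probabilities is an illustrative assumption, not necessarily how the production model converts ratings into predictions:

```python
import numpy as np

def softmax(x):
    e = np.exp(np.array(x) - np.max(x))
    return e / e.sum()

def heldout_brier(rooms, ratings, mode="drop"):
    """rooms: iterable of (team_ids, winner_index) for post-2024 rounds.
    ratings: pre-2024 team ratings. mode='drop' skips rooms with no rated
    team; mode='keep' imputes the field mean for unrated teams."""
    field_mean = float(np.mean(list(ratings.values())))
    briers = []
    for team_ids, winner_index in rooms:
        if mode == "drop" and not any(t in ratings for t in team_ids):
            continue  # room where the model knows none of the teams
        strengths = [ratings.get(t, field_mean) for t in team_ids]
        probs = softmax(strengths)  # predicted P(takes 1st)
        actual = np.zeros(len(team_ids))
        actual[winner_index] = 1.0
        briers.append(float(np.mean((probs - actual) ** 2)))
    return sum(briers) / len(briers)
```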
Missing tabs. Pre-pandemic data is patchy. A debater who was dominant in 2018–2019 but retired before COVID may be systematically underrated, because the rating evidence isn't in our dataset.
21 name collisions. A recent fix solved cases where different word-orderings of the same name ("Sher May Nar" vs "Nar Sher May") were creating duplicate rows. That fix exposed a different gap: "Anders Woodruff" in some tabs and "Anders Cairns Woodruff" in others now show as two rows. There are 21 of these. They need a hand-curated alias table or a phonetic secondary key. Neither is shipped.
The 50/50 blend. The composite percentile splits speaker and outcome ratings equally. That split is a default, not a derived choice. The right weight comes from held-out validation. Until that's done, 50/50 is a guess.
What's next
Now that held-out numbers exist, the next priority is tuning the 50/50 blend weight against the held-out Brier instead of leaving it as a default.
Format expansion is on the list too. Everything here is BP-specific. WSDC, APDA, and Australs all have different round structures and different judging conventions. The Massey-vs-PL question might land differently for WSDC, where the deliberation structure differs.
And we're still collecting tabs. The more snapshots we have, the less the shrinkage prior matters, and the more the leaderboard reflects what actually happened rather than what we happened to capture.
If you spot a mistake in any of this — bad number, wrong claim, debater misrated — please let me know.