Methodology · Buildacase

The three leaderboards in one line each

Debaters

BP debaters ranked by a 50/50 blend of speaker rating (how their speeches scored) and outcome rating (how their team placed). Both signals are calibrated so a strong performance at WUDC counts more than the same performance at a small regional.

Judges

BP judges placed into S / A / B / C tiers based on the seats they sat, the seniority of those seats (chair before panel before trainee), how deep the round was (outround before bubble prelim before prelim), and the strength of the tournament.

Tournament difficulty

BP tournaments ranked by how hard it is to break at them. The headline signal is bubble-path strength: the rolling ratings of the opponents that bubble teams faced in their last two prelim rounds. A secondary break-cut signal measures how top-heavy the field was. Round robins are scored on mean field strength instead.

A refresher on BP scoring

British Parliamentary is a four-team debating format. Each round has four teams of two speakers debating a single motion. Judges rank the four teams from best to worst, awarding team points of 3, 2, 1, 0 to 1st, 2nd, 3rd, 4th respectively. Each speaker also gets an individual speaker score, usually somewhere between 70 and 82.

WUDC has 9 prelim rounds; smaller tournaments usually run fewer. After prelims, the top N teams (by team points, with speaker scores as tiebreaker) advance to outrounds, the knockout bracket. Outrounds run as quarter-finals, semi-finals and a grand final, with more or fewer stages depending on the size of the break. "Breaking" means advancing to outrounds. The "bubble" is the round just before the break, where a team's qualification is on the line.

Three concepts the rest of this page uses:

Z-score

A way of comparing performance against a field. A score of +1.0 means "one standard deviation above the field average" — strong but not exceptional. A 78 at WUDC isn't the same as a 78 at a regional, because score philosophies differ. Z-scoring within a tournament removes that.
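A minimal sketch of within-tournament z-scoring, to make the definition concrete. The function name and the sample scores are illustrative, and the use of the population standard deviation is an assumption; the page doesn't say which variant the pipeline uses.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Express each speaker score as standard deviations from the field mean."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD over the tournament's field
    if sigma == 0:
        # Every score is identical, so nobody is above or below the field.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Hypothetical fields: the same raw 78 lands in different places.
generous_field = [74, 76, 78, 80, 82]
stingy_field = [70, 72, 74, 76, 78]

z_generous = z_scores(generous_field)[2]  # 78 is exactly the field mean here
z_stingy = z_scores(stingy_field)[4]      # 78 is the top score here
```

In the first field a 78 is average; in the second it is the best score in the room, which is the difference raw scores hide.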

Percentile

Where someone sits in the overall rated population. The 90th percentile means better than 90% of rated debaters. The displayed composite on the leaderboard is a percentile, not a raw score, so it's comparable across eras.

Provisional

A flag on debaters and judges with fewer than 8 tournaments in our dataset. The number's still shown, but it's based on thin data and could move significantly with more tabs. Don't trust a provisional ranking the same way you'd trust an established one.

Where the data comes from

Every tournament we rate started as a public Tabbycat tab. Tabbycat is the open-source platform most BP tournaments use to run their schedule and publish results. We download a snapshot — speakers, teams, per-round scores, room pairings, motions, breaks — and that snapshot is the input to everything else.

If a tournament didn't run on Tabbycat (older majors used different systems; some grassroots tabs run on Google Sheets), we don't have it. If a tab was hosted on a free Heroku app that's since been deleted (WUDC 2019 Cape Town is the most prominent example), we usually don't have it either, though we sometimes reconstruct from public articles.

High-school tournaments, World Schools tournaments, and middle-school tournaments are all kept in our dataset but excluded from these leaderboards. The university BP leaderboard wouldn't read sensibly if a debater's z-score at a HS tab counted toward their university rating. A high-school ladder using the same data is on the roadmap.

How the debater rating is built

Empirical. From public Tabbycat tabs. Updated weekly.

What it ranks

Every British Parliamentary debater we have data on, by an empirical rating built from public Tabbycat tabs.

The two signals

Speaker rating. Each speaker score gets z-scored against the rest of that tournament's field, which removes the different scoring scales tournaments use. We then take a recency-weighted average per debater with an 18-month half-life. A 2020 result counts about 6% of a 2026 result (six years is four half-lives).

Outcome rating. A Massey rating built from BP team points per round. Each pairing contributes a margin. The system absorbs strength of schedule: beating high-rated opponents moves your rating more than beating low-rated ones.

The two ratings agree most of the time (Pearson ~0.89 in our data). When they don't, the gap says something real. A strong speaker on weak teams ends up with a high speaker rating and a middling outcome rating. A consistent room-winner whose speaks are workmanlike ends up the opposite.
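The recency weighting can be sketched as an exponential decay with an 18-month half-life. This is an illustration under stated assumptions, not the production code: the function names and the (z-score, age in months) input shape are hypothetical.

```python
def recency_weight(age_months, half_life_months=18.0):
    """Weight of a result that is `age_months` old; halves every half-life."""
    return 0.5 ** (age_months / half_life_months)


def recency_weighted_mean(observations, half_life_months=18.0):
    """Average z-scores, counting recent tournaments more.

    observations: list of (z_score, age_months) pairs for one debater.
    """
    weights = [recency_weight(age, half_life_months) for _, age in observations]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(z * w for (z, _), w in zip(observations, weights)) / total


# A strong result today and an average one 18 months ago:
# the older result carries half the weight of the newer one.
rating = recency_weighted_mean([(1.0, 0), (0.0, 18)])
```

Here the weights are 1.0 and 0.5, so the blended rating sits two-thirds of the way toward the recent result.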

Composite

A 50/50 blend of the two signals after normalising both to the same scale. The displayed number is a percentile rank across the rated field. The 50/50 weight is a default, not a derived choice. Tuning it against held-out validation is on the roadmap.
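A rough sketch of the blend-then-percentile step. The normalisation shown (standardising each signal to mean 0 and unit spread) and the "share of the field strictly below you" percentile are assumptions chosen to match the description above; the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardise(values):
    """Rescale a signal to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_percentiles(speaker, outcome, weight=0.5):
    """Blend two ratings per debater and return percentile ranks (0-100)."""
    s = standardise(speaker)
    o = standardise(outcome)
    blend = [weight * a + (1 - weight) * b for a, b in zip(s, o)]
    n = len(blend)
    # Percentile = share of the rated field with a strictly lower blend.
    return [100.0 * sum(other < x for other in blend) / n for x in blend]


# Four hypothetical debaters with speaker and outcome ratings on different scales.
percentiles = composite_percentiles([1.0, 0.0, -1.0, 0.5], [2.0, 0.0, -2.0, 1.0])
```

Changing `weight` is the knob the held-out tuning mentioned above would adjust.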

Inclusion

Any debater with 2+ cached BP tabs gets a row in the database. The displayed top 100 only includes debaters with 3+ tabs; thinner-data names appear in the "near miss" panel below the table.

Bayesian shrinkage handles the low-tab uncertainty problem inside the rating itself. Start by assuming every debater is average (rating = 0), then update toward their observed data as evidence accumulates. A 2-tab debater gets pulled hard toward zero. A 15-tab debater is trusted almost entirely. The population variance and observation noise are computed from the established-debater subset at runtime, so there are no hand-tuned hyperparameters.

The near-miss panel recognises that 1-2 tab elites exist but shouldn't pollute the main ranking. It also doubles as a call to action: if you have a tab that one of those debaters competed at and we don't have, send it our way.
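The shrinkage described above behaves like the textbook normal-normal update with a prior mean of zero. The sketch below assumes that form; the function name is hypothetical, and the variance values are made-up placeholders for the quantities the pipeline estimates at runtime.

```python
def shrunk_rating(observed_mean, n_tabs, population_var, noise_var):
    """Pull a debater's observed mean toward 0 in proportion to how thin the data is.

    population_var: spread of true ratings across debaters (prior variance).
    noise_var: per-tournament observation noise.
    """
    if n_tabs == 0:
        return 0.0  # no evidence yet: stay at the prior (average)
    trust = (n_tabs * population_var) / (n_tabs * population_var + noise_var)
    return trust * observed_mean


# Placeholder variances, for illustration only.
POP_VAR, NOISE_VAR = 1.0, 2.0

thin = shrunk_rating(1.0, 2, POP_VAR, NOISE_VAR)    # 2 tabs: keeps half the signal
deep = shrunk_rating(1.0, 15, POP_VAR, NOISE_VAR)   # 15 tabs: keeps most of it
```

The `trust` factor rises toward 1 as tabs accumulate, which is why no separate minimum-sample hyperparameter is needed inside the rating.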

Why Massey, not Plackett-Luce

Plackett-Luce (PL) is the textbook model for "rank these four teams" data. It's a generative model: the 1st-place team is drawn from a softmax over team strengths, then 2nd from the remaining three, and so on.

We tried it. Two consecutive PL runs on identical code and data produced completely different top-5 rankings, with zero overlap. The solver was finding different local optima or hitting the iteration cap before converging. Outcome rating ranges varied by 0.44 points between runs. Massey on the same data produced stable, differentiated rankings.

The deeper reason PL struggled: it assumes judges decide by sequential elimination. BP judges don't work like that. They think in margins: "Team A was clearly first, Team D clearly last, B and C were close." The deliberation is about assigning quality scores, not running a sequential pick.

Massey fits that process. It estimates latent quality scores which explain observed margins, rather than modelling a selection procedure the judges never ran. What matters is matching the process that generated the data, not picking the model that's most defensible in the abstract.

We archived PL and stayed with Massey. See the [full write-up](/blog/how-buildacase-ranks-debaters) for the table of results.
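For readers unfamiliar with Massey ratings, here is a small sketch of the idea on BP rooms: every pair of teams in a room yields a margin (the difference in team points), and ratings are the least-squares fit to those margins. The input shape, the function name, and the simple Gauss-Seidel solver are all assumptions made for illustration; the production system may solve the same equations differently.

```python
from collections import defaultdict
from itertools import combinations


def massey_ratings(rooms, sweeps=500):
    """Least-squares ratings from pairwise margins.

    rooms: list of dicts mapping team name -> team points (3, 2, 1, 0) in one room.
    """
    games = defaultdict(int)          # pairings played, per team
    net_margin = defaultdict(float)   # summed point margins, per team
    met = defaultdict(lambda: defaultdict(int))  # times each pair shared a room

    for room in rooms:
        for a, b in combinations(room, 2):
            margin = room[a] - room[b]
            games[a] += 1
            games[b] += 1
            net_margin[a] += margin
            net_margin[b] -= margin
            met[a][b] += 1
            met[b][a] += 1

    ratings = {team: 0.0 for team in games}
    if not ratings:
        return ratings

    # Gauss-Seidel sweeps on the Massey normal equations:
    # games[t] * r[t] - sum(n * r[opp]) = net_margin[t]
    for _ in range(sweeps):
        for team in ratings:
            opp_sum = sum(n * ratings[opp] for opp, n in met[team].items())
            ratings[team] = (net_margin[team] + opp_sum) / games[team]
        # Ratings are only defined up to a constant, so pin the mean at 0.
        centre = sum(ratings.values()) / len(ratings)
        for team in ratings:
            ratings[team] -= centre
    return ratings


one_room = massey_ratings([{"A": 3, "B": 2, "C": 1, "D": 0}])
```

Because each team's rating depends on its opponents' ratings, a win over a highly rated team lifts a rating more than the same margin over a weak one, which is the strength-of-schedule property described above.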

How well it predicts

We ran a held-out validation: trained ratings on BP rounds before 2024, then predicted 2024-2025 rounds cold. Best λ = 4.

Drop set (rooms with at least one rated team): Brier 0.176, top-1 accuracy 39.8%.

Keep set (full test pool, field-mean imputation for fully-unrated teams): Brier 0.181, top-1 accuracy 32.4%.

Both beat the random-uniform baseline of 0.188 Brier, and both beat the 25% top-1 accuracy of a random pick among four teams. In-sample Brier (same rounds used to compute and test ratings) is 0.164, so the held-out drop gap is +0.012 Brier and the field-mean cost is another 0.005 on top: the model isn't obviously overfitting.

The two numbers answer different questions. Drop asks "is the rating signal real on rooms where the model knows the teams?" Keep asks "what's the predictive accuracy across the actual 2024-2025 field, including the 49% of rooms where every speaker is a newcomer first seen in 2024 or later?"
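The 0.188 baseline is what you get from a Brier score averaged over the four teams in a room when every team is given a 25% chance of winning. The sketch below assumes that convention (mean squared error against a one-hot "who took the 1st" outcome); the function name and input shape are hypothetical.

```python
def brier(prob_rows, winners):
    """Mean squared error between predicted win probabilities and actual winners.

    prob_rows: one list of four probabilities per room (one per team).
    winners: index of the team that took 1st in each room.
    """
    total = 0.0
    count = 0
    for probs, winner in zip(prob_rows, winners):
        for i, p in enumerate(probs):
            actual = 1.0 if i == winner else 0.0
            total += (p - actual) ** 2
            count += 1
    return total / count


# Knowing nothing: 25% on every team, whoever actually wins.
uniform_baseline = brier([[0.25, 0.25, 0.25, 0.25]], [0])

# A confident, correct call scores much lower (lower is better).
confident_hit = brier([[0.7, 0.1, 0.1, 0.1]], [0])
```

The uniform case works out to (0.75² + 3 × 0.25²) / 4 = 0.1875, which rounds to the baseline quoted above.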

What it doesn't do

Only counts BP tournaments we have a cached Tabbycat snapshot for. Pre-2021 majors are still being indexed.

Names are matched by a sorted-token key to handle word-order variation, but debaters who appear under different middle names across tournaments sometimes split into two rows. There are 21 known cases.

The 50/50 blend weight is a default, not a validated optimum.
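A sorted-token name key of the kind mentioned above can be sketched in a few lines. This is a guess at the general shape, not the actual matcher: the function name is hypothetical, and real matching may also handle accents, punctuation or initials.

```python
def name_key(name):
    """Order-insensitive key: case-folded name tokens, sorted and rejoined."""
    tokens = name.casefold().split()
    return " ".join(sorted(tokens))


# Word order and capitalisation no longer matter...
same_person = name_key("Smith Jane") == name_key("jane smith")

# ...but an extra middle name still produces a different key,
# which is how one debater can split into two rows.
split_rows = name_key("Jane Q Smith") != name_key("Jane Smith")
```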

How the judge tiers are built

Tier-based. Public Tabbycat tabs + Chief Adjudicator data. Updated weekly.

What it ranks

Every BP judge we have data on, by a tier-based rating from public Tabbycat tabs and Chief Adjudicator data.

How it scores

Each round a judge sits gets a value: tournament strength × seat weight (chair > panel > trainee) × round depth (outround > bubble prelim > prelim). Contributions are sorted descending and multiplied by a rank-order weight curve [1.0, 0.75, 0.55, 0.42, 0.32, 0.24, 0.17, 0.12] before summing, so a judge's best contribution counts in full and the eighth-best at 12%. Peak work matters more than padding the count with mid-tier appearances. A busy judge with one WUDC chair and six small-IV chairs scores differently from one with two strong continental-major chairs.

A co-panel multiplier raises the score if you regularly sit with top-rated judges. Chief Adjudicator seats add a bonus on top, weighted by tournament strength. A WUDC Chief Adjudicator seat is worth more than a Chief Adjudicator seat at a small IV.
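The sort-then-weight step can be sketched as follows. The curve is the one quoted above; everything else is an assumption for illustration, including the function names, the idea that contributions beyond the eighth are simply dropped, and the omission of the co-panel multiplier and Chief Adjudicator bonus.

```python
RANK_CURVE = [1.0, 0.75, 0.55, 0.42, 0.32, 0.24, 0.17, 0.12]


def round_value(tournament_strength, seat_weight, depth_weight):
    """Value of one sitting: the three factors multiplied together."""
    return tournament_strength * seat_weight * depth_weight


def judge_score(contributions, curve=RANK_CURVE):
    """Weight a judge's contributions by rank, best first, and sum them."""
    ranked = sorted(contributions, reverse=True)
    # zip stops at the shorter list, so contributions past the end of the
    # curve add nothing (an assumption about how the tail is handled).
    return sum(value * weight for value, weight in zip(ranked, curve))


# Two hypothetical contributions: the larger is counted in full,
# the smaller at 75%, regardless of the order they happened in.
score = judge_score([10.0, 20.0])
```

The steep curve is what makes a few strong sittings outweigh a long list of modest ones.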

Tiers

Judges are placed into S, A, B, C tiers by a 3-component Gaussian mixture fit on the top-100 scores. Each judge belongs to whichever component gives the highest posterior probability. Component means come out cleanly separated:

S: mean score ~244, ~33% of top-100

A: mean score ~173, ~40% of top-100

B: mean score ~122, ~27% of top-100

The old method found the largest score gap in positions 5-22 of the ranked field and placed the S/A boundary there. That approach was pattern-matching on noise: adding or removing one tournament shifted the boundary by 3-5 names. The mixture model shifts it by 0-1 names.
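The assignment step (not the fitting) can be illustrated like this: given fitted components, a judge goes to whichever has the highest posterior. The means and mixture shares below are the approximate figures quoted above; the standard deviation of 20 is a made-up placeholder, since the page doesn't report the fitted spreads, and the function names are hypothetical.

```python
import math

# (tier, mixture weight, mean score, standard deviation)
# Standard deviations are placeholders for illustration only.
COMPONENTS = [
    ("S", 0.33, 244.0, 20.0),
    ("A", 0.40, 173.0, 20.0),
    ("B", 0.27, 122.0, 20.0),
]


def log_density(x, mean, sd):
    """Log of the normal probability density at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def tier_for(score, components=COMPONENTS):
    """Return the tier whose component has the highest posterior for this score."""
    best = max(
        components,
        key=lambda c: math.log(c[1]) + log_density(score, c[2], c[3]),
    )
    return best[0]


example_tiers = [tier_for(s) for s in (250.0, 170.0, 110.0)]
```

Because the boundary comes from where two fitted curves cross rather than from a single gap in the ranked list, one extra tournament moves it very little.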

Inclusion

Same 3-tab minimum as the debater leaderboard: judges with 3+ cached BP tabs appear in the displayed top 100. Anyone with 1-2 tabs whose score would qualify them shows up in the "near miss" panel below the table. The stored rating works on a broader floor (anyone the system has seen sit a panel) but the displayed leaderboard restricts to 3+ for the same reason as debaters: a single-tab elite score isn't a stable signal.

What it doesn't do

Tabbycat only flags Chief Adjudicators and Deputy Chiefs, not the broader tab team, so work done outside those two roles is invisible to us.

Tournament strength is empirical: derived from who actually showed up. A small but elite invitational reads correctly without being on a hardcoded "majors" list.

How tournament strength is scored

Bubble-path difficulty. The question we answer: how hard is it to break here?

The question we're answering

Tournament strength should answer one specific question: how hard is it to break at this tournament if you're aiming for it? This isn't the same as "how impressive is the field on paper." A 400-team WUDC has the most elite debaters in absolute count, but power-pairing isolates the bubble from those elites for most of the tournament. By the time you're fighting for the last break spot, you're likely in a room with other 14-point teams, not with the world champions running away at the top. Sharper, smaller tournaments like LSE Open or Doxbridge concentrate the elite in a narrow break field. There's nowhere to hide — after two or three rounds you're in their rooms whether you like it or not. The bubble path is genuinely harder even though the absolute count of WUDC-level debaters is lower.

What the bubble is, exactly

A team is on the bubble entering the last two rounds if their break/no-break status still depends on those last two rounds. In BP, you can earn 0 to 6 team points in two rounds (two 4ths to two 1sts). So if break_cut is, say, 17 points and your team has 11–16 points after round 7 of a 9-round tournament, your break fate is undecided — you could still make it, you could still miss it. That's the bubble. Teams who have already clinched the break (already at break_cut or above) aren't on the bubble. Teams who are mathematically out (below break_cut − 6) aren't either. The bubble is the contested middle.
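The bubble definition above reduces to two comparisons. A small sketch, with a hypothetical function name and, like the description, no attempt to model speaker-score tiebreaks at the cut:

```python
def bubble_status(points, break_cut, rounds_left=2, max_per_round=3):
    """Classify a team entering the final rounds as clinched, out, or on the bubble."""
    if points >= break_cut:
        return "clinched"
    if points + rounds_left * max_per_round < break_cut:
        return "out"  # even two 1sts would not reach the cut
    return "bubble"


# The worked example: break_cut of 17 after round 7 of 9.
statuses = {pts: bubble_status(pts, 17) for pts in (10, 11, 16, 17)}
```

With a break cut of 17, teams on 11 through 16 are on the bubble, 17 and above have clinched, and 10 or fewer are out.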

How we measure path difficulty

For each bubble team, we look at the opponent teams they played in their last two prelim rounds. That's up to 6 opponent teams (3 per room in BP).

For each opponent, we look up the speakers' debater ratings. We take the rolling average of each opponent speaker's ratings from the 3 tournaments before this one and the 3 after: a 6-tournament window centered on the tab. This catches "current form at the time" rather than career averages that include speakers' future achievements.

We average across all opponent speakers a bubble team faced in their last two rounds. That's the team's path-strength number. We then take the median across all bubble teams at the tournament. That's the tournament's pathRoll.

A higher pathRoll means the bubble had to play harder opponents to break. That's the headline difficulty.
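The aggregation above (mean per bubble team, then median across bubble teams) can be sketched as follows. The input shape and function name are hypothetical, and treating "enough rated opponents" as "at least one" is a simplifying assumption; the two-team minimum is the one this page states for producing a score at all.

```python
from statistics import mean, median


def path_roll(bubble_paths, min_teams=2):
    """Median path strength across bubble teams, or None if there is too little signal.

    bubble_paths: dict of bubble team -> list of rolling ratings for every
    opponent speaker that team faced in its last two prelim rounds.
    """
    strengths = [mean(ratings) for ratings in bubble_paths.values() if ratings]
    if len(strengths) < min_teams:
        return None  # no score rather than a noisy one
    return median(strengths)


# Three hypothetical bubble teams with short opponent lists.
example = path_roll({
    "Team X": [1.0, 0.0],   # path strength 0.5
    "Team Y": [2.0, 2.0],   # path strength 2.0
    "Team Z": [0.0, 0.0],   # path strength 0.0
})
```

Using the median means one bubble team with an unusually brutal or soft draw doesn't swing the tournament's number.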

Why we use a rolling 6-tournament window, not lifetime averages

A speaker's "rating" should reflect their level at the time they competed at this tournament, not their level five years later when they've made the WUDC final. Using career averages would inflate the strength of any tournament that happened to feature someone who got famous afterwards. Using only past observations would miss late-blooming speakers whose strength wasn't visible yet. Three before and three after the tab in question is a fair compromise: it captures recent form without leaning too far into the future.
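The three-before, three-after window can be sketched like this. It assumes each speaker has a chronological list of per-tournament ratings and that the tournament being scored is itself left out of the window; both are readings of the description above rather than confirmed details, and the function name is hypothetical.

```python
def rolling_rating(history, tab_index, span=3):
    """Average a speaker's ratings from `span` tournaments either side of one tab.

    history: the speaker's per-tournament ratings in chronological order.
    tab_index: position in `history` of the tournament being scored.
    """
    before = history[max(0, tab_index - span):tab_index]
    after = history[tab_index + 1:tab_index + 1 + span]
    window = before + after
    if not window:
        return None  # no neighbouring tournaments to judge form from
    return sum(window) / len(window)


# A steadily improving hypothetical speaker, scored at their fifth tournament.
form = rolling_rating([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], tab_index=4)
```

For a speaker near the start or end of their record the window is simply shorter on one side.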

Why round robins use a different formula

At a round robin like HWS RR, every team plays every other team across the prelims — there's no power-pairing. So every bubble team's opponents are identical: the rest of the field. That means the "bubble path" question collapses. The path is the field. So for round robins we use fieldStrength: the mean rolling rating of all speakers in the field. Same idea, simpler math, honest about what the metric is measuring. At HWS RR specifically, this means difficulty is essentially the average strength of the 16 best teams in the world. Those tabs sit at the top of the difficulty ranking on this metric, where they belong.

A secondary signal: break-cut top-heaviness

There's a useful secondary signal for non-round-robin tournaments: break_cut − 2 × prelim_rounds. The 2-points-per-round figure is what a team earns by taking 2nd place every round, which sits above the 1.5-point average across the four places (3, 2, 1, 0). When the break cut sits above that pace, the field was top-heavy: many teams clustered at the top forced the cut up. For a 5-round tab where break_cut is 10, this signal is 0: the break cut sits exactly on that pace. A break_cut of 11 or 12 means the field was unusually top-heavy. A break_cut of 9 means it was unusually thin at the top.

We weight pathRoll 70% and the break-cut signal 30%. The pathRoll is the more direct measure, but the break-cut signal helps with tournaments where individual bubble draws were noisy. WUDC and round robins are exempted from the break-cut signal because their format makes the math non-comparable.
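A sketch of the secondary signal and the 70/30 weighting. Two things here are assumptions rather than documented behaviour: that both inputs are already on comparable scales when blended, and that an exempt tournament falls back to pathRoll alone. The function names are hypothetical.

```python
def break_cut_signal(break_cut, prelim_rounds):
    """How far the break cut sits above a 2-points-per-round pace."""
    return break_cut - 2 * prelim_rounds


def difficulty(path_roll, cut_signal, path_weight=0.7):
    """Blend the two signals; pass cut_signal=None for exempt formats."""
    if cut_signal is None:
        return path_roll  # assumption: exempt tabs use pathRoll on its own
    return path_weight * path_roll + (1 - path_weight) * cut_signal


# The 5-round examples from the text.
signals = [break_cut_signal(cut, 5) for cut in (9, 10, 12)]
```

For a 5-round tab this gives -1, 0 and 2 for break cuts of 9, 10 and 12.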

What this changes vs the old formula

The old formula used five signals about who was in the field: top-decile attendance count, recent-WUDC elite count, AC quality, etc. All measured field composition. None measured what the bubble actually played against. With the new formula, WUDC drops sharply in the rankings — because while it has the deepest field in absolute terms, the bubble is well-protected from the elite by power-pairing. Doxbridge, LSE Open, KCL Open, and similar sharp invitationals rise. HWS Round Robin tabs leap to the top because every team plays the entire elite field, with no protection.

What it doesn't do, honestly

This metric only works on tournaments where we have pairings data: for power-paired tabs, we need to know who faced whom in the last two rounds. Tabs that died with their Heroku dynos before we could fetch them are missing entirely.

We don't adjust for motion quality, organizational running, or how good the judging was beyond what it implies for speaker ratings. The metric answers one question, "how hard was it to break here?", and doesn't pretend to be anything broader.

The bubble itself is defined mechanically, not from final standings. If a tournament's break math is unusual (very small fields, unusual break sizes), the bubble might be sparse or empty. We need at least 2 bubble teams with enough rated opponents to compute a meaningful median. Tabs without enough signal get no strength score rather than a noisy one.

How the three leaderboards relate

Shared source data. Same recency window.

Shared data, shared recency

All three leaderboards share the same source data and recency window. The debater ratings feed into the other two. Tournament strength is computed from the rolling ratings of the bubble teams' opponents — so when debater ratings update, tournament strengths update on the next cron run, and then judge ratings (which use tournament strength as a multiplier) update after that. This means changes propagate automatically. If we add a new tab and recompute the leaderboard, the new tab's strength reflects who actually showed up. Existing tabs' strengths shift slightly because their opponents' rolling ratings now include the new tab. The judge ladder then updates with the new tournament-strength multipliers, all in one cron pass. The tournament strength multiplier (strengthMul) is what flows back into the debater ratings: when computing a debater's composite, observations at higher-strength tabs count for more. This is the explicit handoff between the three leaderboards.

What none of these do

Out of scope

Rate WSDC, AP, schools formats, or any non-BP format. Coming.

Account for non-Tabbycat tournaments. If a tab ran on a different system or a Google Sheet, it isn't here.

Predict future results. The math summarises what happened, not what will.

Make moral claims. A high rank means "appeared often at strong tabs with strong results," not "is a better person" or "should win their next round."

Update in real time. The cron runs after each big import; ratings refresh within a day.