Methodology

Four rankings: debaters, judges, countries, and tournaments. This page explains how the math works for anyone who wants to check it. You don't have to read it for the rankings to make sense. Skip ahead to whichever section answers your question.

The four leaderboards in one line each

Debaters

We blend two things at 50/50: how their speeches scored (speaker rating) and how their team did in the room (outcome rating). Both adjust for tournament strength, so the same performance at WUDC counts more than at a local IV.

Judges

Sorted into six tiers by their score out of 10: CA at the top, then Outround chair, Outround panellist, Prelim chair, Prelim panellist, and Trainee. The score adds up what each judge actually sat: chairs count more than panels, outrounds more than prelims, stronger tournaments more than weaker ones. CA, short for Chief Adjudicator, sits highest, because being handed a whole tournament's adjudication is the surest sign of trust there is.

Tournament difficulty

How hard it actually was to break at this tournament. We look at who the bubble teams faced in their last two rounds and average those opponents' ratings. Round robins use a simpler formula: average the whole field, since everyone plays everyone.

Countries

Ranked by the average standing of each country's strongest debaters, their top handful, so a country with a few world-class debaters and one with lots of solid ones get compared on the same footing.

A refresher on BP scoring

British Parliamentary is a four-team debating format. Each round has four teams of two speakers debating a single motion. Judges rank the four teams from best to worst, awarding team points of 3, 2, 1, 0 to 1st, 2nd, 3rd, 4th respectively. Each speaker also gets an individual speaker score, usually somewhere between 70 and 82.

A standard tournament has 9 prelim rounds. After prelims, the top N teams (by team points, with speaker scores as tiebreaker) advance to outrounds: the knockout bracket. Outrounds run as quarter-finals, semi-finals, grand final, etc. depending on the size of the break. "Breaking" means advancing to outrounds. The "bubble" is the round just before the break, where a team's qualification is on the line.
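The break rule above can be sketched in a few lines. This is a minimal illustration, not the tab software's own logic: the TeamTab record, its field names, and breaking_teams are hypothetical, and real tabs may apply further tiebreakers after speaker scores.

```python
from typing import NamedTuple

# Team points awarded for finishing 1st, 2nd, 3rd, 4th in a BP room.
POINTS_FOR_RANK = {1: 3, 2: 2, 3: 1, 4: 0}


class TeamTab(NamedTuple):
    """Hypothetical end-of-prelims record for one team."""
    name: str
    team_points: int      # sum of 3/2/1/0 across the prelim rounds
    speaker_total: float  # combined speaker scores, used only as the tiebreaker


def breaking_teams(tab: list[TeamTab], break_size: int) -> list[TeamTab]:
    """Return the top `break_size` teams: team points first, speaker total second."""
    ordered = sorted(tab, key=lambda t: (t.team_points, t.speaker_total), reverse=True)
    return ordered[:break_size]
```

With two teams tied on team points, the one with the higher speaker total takes the earlier break slot.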

Three concepts the rest of this page uses:

Z-score

A way of comparing performance against a field. A score of +1.0 means one standard deviation above the field average. Strong, but not exceptional. A 78 at WUDC isn't the same as a 78 at a regional, because score philosophies differ. Z-scoring within a tournament removes that.

Percentile

Where someone sits in the overall rated population. The 90th percentile means better than 90% of rated debaters. The displayed composite on the leaderboard is a percentile, not a raw score, so it's comparable across eras.

Provisional

A flag on debaters with fewer than 8 tournaments in our dataset. The number still shows up, but it's based on thin data and could move significantly with more tabs. Don't trust a provisional ranking the same way you'd trust an established one.
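The three definitions above can be sketched as follows. The function names are hypothetical, and two details are assumptions the page doesn't state: the population standard deviation is used for the z-score, and the percentile counts only strictly lower ratings.

```python
from statistics import mean, pstdev

PROVISIONAL_BELOW = 8  # fewer than 8 tournaments in the dataset, per the text


def z_scores(scores: list[float]) -> list[float]:
    """Z-score each speaker score against its own tournament's field."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD, not sample SD
    if sigma == 0:
        return [0.0 for _ in scores]  # a field with identical scores carries no spread
    return [(s - mu) / sigma for s in scores]


def percentile(value: float, population: list[float]) -> float:
    """Share of the rated population strictly below `value`, on a 0 to 100 scale."""
    below = sum(1 for v in population if v < value)
    return 100.0 * below / len(population)


def is_provisional(tournaments: int) -> bool:
    """True when the record is too thin to treat as established."""
    return tournaments < PROVISIONAL_BELOW
```

Because the z-score is computed within one tournament, a 78 in a generous field and a 75 in a stingy one can land on the same value.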

Where the data comes from

Most tournaments we rate started as a public Tabbycat tab. Tabbycat is the open-source platform most BP tournaments use to run their schedule and publish results. We download a snapshot: speakers, teams, per-round scores, room pairings, motions, breaks. That snapshot is the input to everything else.

Not all of it comes from Tabbycat, though. Where a tournament ran on something else (an older system like Tabbie, a Google Sheet, or only a scanned PDF tab sheet), we reconstruct what we can from the public record and feed that in too. EUDC Tallinn 2017 came from a spreadsheet, WUDC Thessaloniki 2016 from an archived page and a break post, the HWS round robins from scanned tab sheets. These reconstructions don't always carry full room-by-room data, but the speaker tabs and break lists still count.

What we genuinely lose is tabs that died with the free Heroku apps they ran on before anyone archived them. Even then we sometimes rebuild from public articles. WUDC 2019 Cape Town is one we pieced back together that way.

High-school tournaments, World Schools tournaments, and middle-school tournaments are all kept in our dataset but excluded from these leaderboards. The university BP leaderboard wouldn't read sensibly if a debater's z-score at a HS tab counted toward their university rating. A high-school ladder using the same data is on the roadmap.

Tiers and the display score

The number you see next to each debater is mapped onto the speaker-score scale debaters already know. It is not a raw speaker score average. It reflects tournament difficulty, head-to-head outcomes, and consistency, then gets placed on a 66 to 90 scale where the numbers feel like what debaters experience in real tabs: a median debater lands around 76, a strong debater around 80, and the very best around 89 to 90.

The ranking order underneath is set by the rating itself. When two debaters are close, the one with the longer record edges ahead, so a two-tournament hot streak doesn't sit above a proven career on thin evidence. The display score shows what you earned. The rank reflects how sure we are of it.

The five tiers

Tiers appear as section dividers on the leaderboard and as colored indicators on each row. They are not assigned per person. They fall where the score boundaries land in the current field.

Champion tier (82.5+) is roughly the top 2% of rated debaters. These are debaters who could credibly semifinal or win at any tournament they attend. They dominate the strongest fields consistently.

Outrounds tier (80 to 82.5) covers the top 2 to 10%. These debaters break regularly at major IVs and beat most of the field, but don't consistently beat the Champions.

Bubble tier (77.5 to 80) is the top 10 to 25%. This is where break uncertainty lives. These debaters sometimes break, sometimes don't, and their tournament outcomes depend on draws, pairings, and form on the day.

Developing tier (73 to 77.5) covers the 25th to 75th percentile. Most competitive debaters fall here. They're working toward the bubble, attending tournaments regularly, and improving.

Novice tier (below 73) is the bottom 25%. These are debaters early in their competitive careers or with thin records. The tier isn't a ceiling. It is where everyone starts.
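The five score bands above translate into a simple lookup. This is a sketch only: TIER_FLOORS and tier_for are hypothetical names, and putting a score that sits exactly on a boundary into the higher tier is an assumption.

```python
# Lower bound of each tier on the display scale, highest first.
TIER_FLOORS = [
    (82.5, "Champion"),
    (80.0, "Outrounds"),
    (77.5, "Bubble"),
    (73.0, "Developing"),
]


def tier_for(display_score: float) -> str:
    """Return the tier whose band contains `display_score`."""
    for floor, name in TIER_FLOORS:
        if display_score >= floor:  # assumption: boundary scores go to the higher tier
            return name
    return "Novice"  # everything below 73
```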

The distribution chart

The histogram on the leaderboard shows where every rated debater falls on the composite-rating distribution, color-coded by tier. If you're signed in, a "You" marker shows your position. Most debaters cluster in the center (Developing tier). The tails are thin because very few debaters are either at the extreme top or the extreme bottom of the rated field.

How the debater rating is built

What it ranks

Every BP debater we have data on. The rating itself is built mostly from public Tabbycat tabs, plus the reconstructed tabs described under "Where the data comes from".

The two signals

Speaker rating. Each speaker score gets z-scored against the rest of that tournament's field. That kills the difference between a tab where the average judge gives 73s and one where they give 78s. We then take a recency-weighted average per debater on an 18-month half-life. Six years is four half-lives, so a 2020 result counts about 6% of a 2026 one.

Outcome rating. A Massey rating built from BP team-points per round. Each pairing contributes a margin. Strength of schedule comes baked in: beating high-rated opponents moves your rating more than beating low-rated ones.

The two ratings agree most of the time. When they disagree, the gap says something real. A strong speaker on weak teams ends up with a high speaker rating and a middling outcome rating. A consistent room-winner whose speaks are workmanlike ends up the opposite.
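The recency-weighted speaker average can be sketched like this. The function names and the (z-score, age in months) input shape are hypothetical. The only figure taken from the text is the 18-month half-life.

```python
HALF_LIFE_MONTHS = 18.0  # from the text: weights halve every 18 months


def recency_weight(age_months: float, half_life: float = HALF_LIFE_MONTHS) -> float:
    """Exponential decay: 1.0 for a result today, 0.5 one half-life back."""
    return 0.5 ** (age_months / half_life)


def speaker_rating(results: list[tuple[float, float]]) -> float:
    """Recency-weighted mean of z-scores.

    `results` holds (z_score, age_months) pairs, one per tournament (hypothetical shape).
    """
    if not results:
        raise ValueError("need at least one tournament result")
    weights = [recency_weight(age) for _, age in results]
    weighted_sum = sum(z * w for (z, _), w in zip(results, weights))
    return weighted_sum / sum(weights)
```

Dividing by the summed weights keeps the result on the z-score scale no matter how old the record is. Age changes how much each tournament counts and leaves the scale alone.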

The 0 to 100 rating and the ± band

We blend the two signals 50/50 and map the result onto a 0 to 100 scale that spreads the field out, so the top is actually distinguishable instead of a wall of 100%s. Next to each rating sits a ± band. The band is our uncertainty: it's wide when we only have a couple of tournaments on a debater and tight when we have a long record. We rank by the rating, and lean on the confidence band only when two debaters are close: the one with the longer record edges ahead. So a debater with a sky-high score off two tournaments doesn't leapfrog a proven debater on thin evidence, and we do that without lowering anyone's number. The 50/50 blend weight is a default, not a validated optimum.
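A sketch of the blend and the close-call rule, under stated assumptions. The Rated record, the function names, and especially the `close` threshold are hypothetical: the page doesn't say how near two ratings must be before record length decides. The mapping onto the 0 to 100 scale is left out.

```python
from dataclasses import dataclass


@dataclass
class Rated:
    """Hypothetical per-debater record."""
    name: str
    speaker: float    # speaker rating
    outcome: float    # outcome rating
    tournaments: int  # length of the record


def composite(r: Rated, w: float = 0.5) -> float:
    """Blend the two signals; w = 0.5 is the 50/50 default."""
    return w * r.speaker + (1.0 - w) * r.outcome


def ranks_ahead(a: Rated, b: Rated, close: float = 0.05) -> bool:
    """True if `a` should be listed above `b`.

    When the composites are within `close` (an assumed threshold), the longer
    record wins. Otherwise the higher composite wins. Neither number is changed.
    """
    gap = composite(a) - composite(b)
    if abs(gap) <= close and a.tournaments != b.tournaments:
        return a.tournaments > b.tournaments
    return gap > 0
```

A pairwise rule like this only settles the order of two neighbours. Turning it into a full leaderboard sort needs extra care, because "close" is not transitive.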

Inclusion and how we treat thin records

Any debater with 2 or more cached BP tabs gets a row in the database. The displayed board shows 3 or more, except the live 6-month board, which shows 2 or more because recent activity matters more there.

We deliberately do NOT mark a debater's rating down for having few tournaments. Their score stays at face value. What changes is the ± band: fewer tournaments means a wider band, and when two debaters are close, the one with the longer record ranks ahead. So a genuinely strong debater who only has two or three tabs still shows a high score, plainly flagged as uncertain, rather than being quietly penalised. That matters for people who can't fly around the world chasing tournaments.

The one-tournament wonders panel below the board holds the sub-3-tab names. They're often dinos who dominated an earlier era and resurfaced for a pro-am or one sharp tab. It's also a call to action: if you have a tab that one of them competed at and we don't have, send it our way.

Why Massey, not Plackett-Luce

Plackett-Luce (PL) is the textbook model for "rank these four teams" data. It's a generative model: the 1st-place team is drawn from a softmax over team strengths, then 2nd from the remaining three, and so on. We tried it. Two consecutive PL runs on identical code and data produced completely different top-5 rankings, with zero overlap. The solver was finding different local optima or hitting the iteration cap before converging. Outcome rating ranges varied by 0.44 points between runs. Massey on the same data produced stable, differentiated rankings. The deeper reason PL struggled: it assumes judges decide by sequential elimination. BP judges don't work like that. They think in margins. "Team A was clearly first, Team D clearly last, B and C were close." The deliberation assigns quality scores. It doesn't run a sequential pick. Massey fits that process. It estimates latent quality scores which explain observed margins, rather than modelling a selection procedure the judges never ran. What matters is matching the process which generated the data. Defending a model in the abstract is a separate exercise. We archived PL and stayed with Massey.
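For readers who want to see what a Massey fit looks like, here is a minimal pure-Python sketch at team level. Several things are assumptions: the margin for a pairing is taken as the difference in team points, the function names are hypothetical, and how team ratings are passed down to individual debaters is not covered.

```python
from itertools import combinations


def solve(A: list[list[float]], b: list[float]) -> list[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("teams are not all connected through shared rooms")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def massey_ratings(rooms: list[dict[str, int]]) -> dict[str, float]:
    """Least-squares ratings from BP rooms.

    Each room maps team name -> team points (3, 2, 1, 0). Every pair of teams
    in a room counts as one game whose margin is the points difference.
    """
    teams = sorted({t for room in rooms for t in room})
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    M = [[0.0] * n for _ in range(n)]
    p = [0.0] * n
    for room in rooms:
        for a, b in combinations(room, 2):
            i, j = idx[a], idx[b]
            margin = room[a] - room[b]
            M[i][i] += 1.0
            M[j][j] += 1.0
            M[i][j] -= 1.0
            M[j][i] -= 1.0
            p[i] += margin
            p[j] -= margin
    # The Massey matrix is singular, so replace the last equation with
    # "ratings sum to zero" to pin the scale.
    M[-1] = [1.0] * n
    p[-1] = 0.0
    return dict(zip(teams, solve(M, p)))
```

The fit is a single linear solve, so the same input always yields the same ratings.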

How well it predicts

We ran a held-out validation: trained the ratings on BP rounds before 2024, then predicted 2024 and 2025 rounds the model had never seen. It beat a random baseline by a clear margin, and the gap between in-sample and held-out accuracy was small, so the model isn't just memorising the rounds it was built on. We check it two ways, because they answer different questions. One asks "is the rating signal real on rooms where we already know the teams?" The other asks "how well does it predict across the whole recent field, including the many rooms where every speaker is too new to have a rating yet?" The first is the cleaner test of signal. The second is the honest test of deployment. It clears both.

Live (6mo) vs the other eras

The Live (6mo) tab uses a sharper recency curve than the multi-year eras above. A tab in the last 6 months counts at full weight. A tab between 6 and 12 months old ramps linearly down toward zero. Anything older than 12 months drops out entirely. The other eras (recent 5 years, postPandemic, yearly) keep the 18-month half-life curve so older tabs still register at half-weight or so. Live is meant to reflect THIS season's form, not a career average, so the cutoff sits closer. Practical effect: if you tabbed at a strong tournament 8 months ago, that tab now contributes about two-thirds of its full strength to your Live rating. Six months earlier it counted in full; in another 4 months it'll be gone. The window slides forward continuously with no abrupt drop-off the day a tab turns 6 months old.
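The Live curve described above is a piecewise-linear weight. A minimal sketch, with a hypothetical function name and ages measured in months:

```python
FULL_WEIGHT_MONTHS = 6.0  # tabs this recent count in full
DROP_OUT_MONTHS = 12.0    # tabs this old no longer count


def live_weight(age_months: float) -> float:
    """Weight of a tab on the Live board: flat, then a linear ramp to zero."""
    if age_months <= FULL_WEIGHT_MONTHS:
        return 1.0
    if age_months >= DROP_OUT_MONTHS:
        return 0.0
    return (DROP_OUT_MONTHS - age_months) / (DROP_OUT_MONTHS - FULL_WEIGHT_MONTHS)
```

The ramp meets the flat section at exactly 1.0 and the cutoff at exactly 0.0, so there is no jump at either end.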

What it doesn't do

Only counts BP tournaments we have a snapshot for. Pre-2021 majors are still being indexed. Names are matched by a sorted-token key to handle word-order variation, but a debater who appears under different middle names across tournaments can still split into two rows in a handful of cases. The 50/50 blend weight is a default, not a validated optimum.

How the judge rating is built

Why judging is so hard to rate

Debaters leave measurable signal every round: speaker scores, team points, wins and losses against specific opponents. Judges leave almost none. There's no objective measure of how good a call was. The only reliable signal is what the CA did with you: which room you got, whether you chaired or wung, whether you broke, and how deep you went. Even that signal is messy. CAs allocate for reasons that aren't all about quality. Sometimes they panel a judge in a weak room to test them, not reward them. Sometimes a judge only sits four rounds because they showed up to help a friend who's CA-ing and weren't free for the rest, so they don't break despite being trusted. Allocation runs on personal trust, so a CA's friends end up on their panels, and that's often because those people are genuinely good, not because of favouritism. Some judges are quiet workhorses who deliver a correct call in a mid room every single time and never need testing or a marquee seat. At some IVs the final has to be chaired by someone from the host institution, so a GF chair is sometimes a local rule rather than a global signal. And CAs don't usually judge deep rooms at their own tournaments, because they're running the tab. So the signal is noisy. We use it anyway, because it's the only consistent one there is, and we lean on the pattern across many tournaments rather than any single seat.

What we actually score

Each tournament, we look at the deepest seat a judge earned. A grand final chair sits at the top, then a grand final panel or a semi chair, on down through the quarters and octos to the early outrounds. Being Chief Adjudicator scores at the top too: you don't get handed a tournament's adjudication unless you're trusted to chair its biggest room. Chairing counts for more than sitting as a wing, because leading the deliberation is its own kind of trust. A seat in a bubble room counts as well, even though it's a prelim, because CAs hand-pick who they trust with the rounds that decide who breaks. Judging only prelims, with no bubble room and no outround, doesn't score on its own.

Stronger tournaments count for more

A grand final chair at WUDC is not the same as a grand final chair at a small novice IV. So every seat is weighted by the tournament: how strong the field was, and how prestigious the tournament is. Depth at a major outweighs the same depth at a weaker tab. The scarcity of the seat matters too. At a tournament where only a handful of judges make outrounds, breaking is worth more than at one where half the pool does. That keeps a WUDC grand final scarce even though plenty of people break the tournament overall.

From seats to one number out of 10

We add up a judge's best tournaments, leaning on their strongest results rather than counting every weekend equally. One big tournament can't be diluted by a busy season of small ones, and one great weekend can't carry a thin record on its own. Judges who are also strong debaters get a small bonus, since the ability to win rounds tracks the ability to call them. The total maps onto a scale out of 10, the way debaters already read adjudicator feedback: a trusted wing sits in the middle, a chair higher, and the chairs of finals approach 10. Judges on similar records get similar scores on purpose, because a band of equally-trusted judges genuinely is equally good.

What it doesn't do

It can't tell a CA who tested you from a CA who trusted you. It can't see that you would have chaired the semi if you'd been free for rounds 7 to 9. It can't see internal feedback scores either: Tabbycat stores the CA team's private adjudicator ratings, but the API doesn't expose them. If it did, they'd be a strong signal, though one that shifts between CA teams. What it can do is read the pattern across many tournaments and many CA teams. One good tournament could be luck or a friendly CA. Ten of them, at fields of varying strength, with bubble rooms and outround seats, almost certainly isn't. One weekend is a data point. A career is a signal.

How tournament strength is scored

The question we're answering

Tournament strength should answer one specific question: how hard is it to break at this tournament if you're aiming for it? This isn't the same as "how impressive is the field on paper." A 400-team WUDC has the most elite debaters in absolute count, but power-pairing isolates the bubble from those elites for most of the tournament. By the time you're fighting for the last break spot, you're likely in a room with other 14-point teams, not with the world champions running away at the top. Sharper, smaller tournaments like LSE Open or Doxbridge concentrate the elite in a narrow break field. There's nowhere to hide. After two or three rounds you're in their rooms whether you like it or not. The bubble path is genuinely harder even though the absolute count of WUDC-level debaters is lower.

What the bubble is, exactly

A team is on the bubble entering the last two rounds if their break/no-break status still depends on those last two rounds. In BP you can earn 0 to 6 team points in two rounds (two 4ths to two 1sts). So if break_cut is, say, 17 points and your team has 11 to 16 points after round 7 of a 9-round tournament, your break fate is undecided. You could still make it. You could still miss it. That's the bubble. Teams who have already clinched the break (already at break_cut or above) aren't on the bubble. Teams who are mathematically out (below break_cut minus 6) aren't either. The bubble is the contested middle.
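The bubble test above is mechanical enough to write down directly. A minimal sketch with a hypothetical function name. break_cut is the name used in the text.

```python
MAX_POINTS_PER_ROUND = 3  # a 1st place in a BP room


def on_bubble(points: int, break_cut: int, rounds_left: int = 2) -> bool:
    """True if the team's break is still undecided with `rounds_left` rounds to go."""
    clinched = points >= break_cut
    reachable = points + MAX_POINTS_PER_ROUND * rounds_left >= break_cut
    return reachable and not clinched
```

With break_cut at 17 and two rounds left, this returns True for 11 through 16 points and False for everything else, matching the example in the text.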

How we measure path difficulty

For each bubble team, we look at the opponent teams they played in their last two prelim rounds. That's up to 6 opponent teams (3 per room in BP). For each opponent, we look up the speakers' debater ratings. We take the rolling average of each opponent speaker's ratings from the 3 tournaments before this one and the 3 after. A 6-tournament window centered on the tab. This catches "current form at the time" rather than career averages which include speakers' future achievements. We average across all opponent speakers a bubble team faced in their last two rounds. That's the team's path-strength number. We then take the median across all bubble teams at the tournament. That's the tournament's pathRoll. A higher pathRoll means the bubble had to play harder opponents to break. That's the headline difficulty.
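The steps above can be sketched as three small functions. The names and input shapes are hypothetical: a speaker's history is assumed to be a chronological list with one rating per tournament, and each bubble team's path is assumed to arrive as a flat list of its opponents' rolling ratings. The minimum of 2 bubble teams comes from the limitations listed further down the page.

```python
from statistics import mean, median
from typing import Optional


def rolling_rating(history: list[float], i: int, span: int = 3) -> Optional[float]:
    """Mean of up to `span` tournaments before and after index `i`, excluding `i` itself."""
    window = history[max(0, i - span):i] + history[i + 1:i + 1 + span]
    return mean(window) if window else None


def path_strength(opponent_speaker_ratings: list[float]) -> float:
    """Average rolling rating of every opponent speaker faced in the last two rounds."""
    return mean(opponent_speaker_ratings)


def path_roll(bubble_paths: dict[str, list[float]], min_teams: int = 2) -> Optional[float]:
    """Median path strength across bubble teams, or None when there is too little signal."""
    usable = [path_strength(r) for r in bubble_paths.values() if r]
    if len(usable) < min_teams:
        return None
    return median(usable)
```

Taking the median across bubble teams means one team with an unusually brutal or unusually soft draw does not move the tournament's number much.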

Why we use a rolling 6-tournament window, not lifetime averages

A speaker's "rating" should reflect their level at the time they competed at this tournament, not their level five years later when they've made the WUDC final. Using career averages would inflate the strength of any tournament which happened to feature someone who got famous afterwards. Using only past observations would miss late-blooming speakers whose strength wasn't visible yet. Three before and three after the tab in question is a fair compromise. It captures recent form without leaning too far into the future.

Why round robins use a different formula

At a round robin like HWS RR, every team plays every other team across the prelims. There's no power-pairing. So every bubble team's opponents are identical: the rest of the field. That means the "bubble path" question collapses. The path is the field. So for round robins we use fieldStrength: the mean rolling rating of all speakers in the field. Same idea, simpler math, honest about what the metric is measuring. At HWS RR specifically, this means difficulty is essentially the average strength of the 16 best teams in the world. Those tabs sit at the top of the difficulty ranking on this metric, where they belong.

What this changes vs the old formula

The old formula used five signals about who was in the field: top-decile attendance count, recent-WUDC elite count, AC quality, etc. All measured field composition. None measured what the bubble actually played against. With the new formula, WUDC drops sharply in the rankings. The deepest field in absolute terms doesn't matter when the bubble is well-protected from the elite by power-pairing. Doxbridge, LSE Open, KCL Open, and similar sharp invitationals rise. HWS Round Robin tabs leap to the top because every team plays the entire elite field, with no protection.

What it doesn't do, honestly

This metric only works on tournaments where we have pairings data. For power-paired tabs, we need to know who faced whom in the last two rounds. Tabs which died with their Heroku dynos before we could fetch them are missing entirely. We don't adjust for motion quality, how the tournament was run, or how good the judging was beyond what it implies for speaker ratings. The metric is one question. "How hard was it to break here?" That's all. It doesn't pretend to be anything broader. The bubble itself is defined mechanically, not from final standings. If a tournament's break math is unusual (very small fields, unusual break sizes), the bubble might be sparse or empty. We need at least 2 bubble teams with enough rated opponents to compute a meaningful median. Tabs without enough signal get no strength score rather than a noisy one.

How this compares to other rating approaches

Why this section exists

People ask why we don't use Elo. Or TrueSkill. Or the discrete-class system going around the Facebook groups. Fair questions. This section is for the people who care about the math choice, not just the ranking output. The short version: every method that isn't broken produces a roughly similar top of the leaderboard. The arguments are at the margin. Reading this section helps if you want to understand where the margin is and what each method actually does. Reading it isn't required to use the leaderboard.

Elo

Elo is the chess rating you've heard of. Each player has a number. After every game, the winner gains a few points and the loser loses the same amount. The amount you gain or lose depends on how unexpected the result was. Beat someone way above you and you jump. Lose to someone way below and you tank. It works well in chess because chess is 1v1, the matchings are sequential, and the only signal is win or lose. BP isn't any of those things. Each round has four teams of two debaters, ranked 1st through 4th, with separate speaker scores carrying their own information. You can hack Elo to handle ranked outcomes (treat the four-way result as six pairwise wins and losses), but you're throwing away the speaker score signal and ignoring strength of schedule. We tried it on a sample. The ratings drift toward whoever showed up at lots of low-bar tabs because every "win" against a soft opponent still costs them points and rewards you. Massey handles strength of schedule directly. Elo doesn't.
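The "six pairwise wins and losses" hack mentioned above looks roughly like this. It is an illustrative sketch, not the code we ran: the function names are hypothetical, and K = 32 and the 400-point scale are the conventional chess values, used here as assumptions.

```python
from itertools import combinations


def expected(ra: float, rb: float) -> float:
    """Standard Elo expected score for a player rated `ra` against `rb`."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))


def elo_update_room(
    ratings: dict[str, float], finish_order: list[str], k: float = 32.0
) -> dict[str, float]:
    """Treat one BP room as six pairwise results and apply them all at once.

    `finish_order` lists the four teams from 1st to 4th, so in every pair
    drawn from it the earlier team beat the later one.
    """
    delta = {team: 0.0 for team in finish_order}
    for winner, loser in combinations(finish_order, 2):
        surprise = 1.0 - expected(ratings[winner], ratings[loser])
        delta[winner] += k * surprise
        delta[loser] -= k * surprise
    updated = dict(ratings)
    for team, change in delta.items():
        updated[team] = ratings[team] + change
    return updated
```

The update only reads the finishing order, so the speaker scores never enter it.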

Glicko-2 and TrueSkill

Same family. Both are Bayesian extensions of Elo. You track not just a player's rating but also an uncertainty around that rating. New players have wide uncertainty bands. Established players have narrow ones. After each match the system updates both numbers. Useful if you want confidence intervals on the leaderboard. We don't currently surface them. Instead we use Bayesian shrinkage on the Massey side: low-tab debaters get pulled toward zero automatically. Same idea, different exposure. If we ever want to publish "this debater is somewhere between top 10% and top 25% with 90% confidence", Glicko-2 would be the natural fit. For now, shrinkage plus a provisional flag covers it.

Plackett-Luce

The textbook model for "rank these four teams" data. It's a generative model: the 1st-place team is drawn from a softmax over team strengths, then 2nd from the remaining three, and so on. We tried it. Two consecutive runs on identical code and data produced completely different top-5 rankings, with zero overlap. The solver was finding different local optima or hitting the iteration cap before converging. Massey on the same data produced stable rankings. The deeper reason it struggled: PL assumes judges decide by sequential elimination. BP judges don't work like that. They think in margins. "Team A was clearly first, Team D clearly last, B and C were close." The deliberation assigns quality scores. It doesn't run a sequential pick. Massey fits that process. PL doesn't. We archived it.
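The sequential-pick structure of Plackett-Luce is easiest to see in its likelihood. A minimal sketch with a hypothetical function name. It only scores one observed ranking under given strengths, and leaves out the fitting step that gave us trouble.

```python
import math


def pl_log_likelihood(strengths: dict[str, float], finish_order: list[str]) -> float:
    """Log-probability of `finish_order` (1st to 4th) under Plackett-Luce.

    At each step the next-placed team is drawn from a softmax over the
    strengths of the teams not yet placed.
    """
    remaining = list(finish_order)
    total = 0.0
    while len(remaining) > 1:
        log_norm = math.log(sum(math.exp(strengths[t]) for t in remaining))
        total += strengths[remaining[0]] - log_norm
        remaining.pop(0)  # that team is placed; pick again from the rest
    return total
```

With four equal-strength teams every ordering is equally likely, so each has probability 1/24.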

Discrete-class scoring systems

In 2024 someone proposed a class-based scoring system for debating CVs. Every tournament gets sorted into one of eleven discrete classes (S-WUDC at the top, S-E at the bottom). Each (class, achievement) pair has a hand-set point value. Your career score is the average of your top-N achievements, with ghost copies padding the lower slots. We ported it for a side-by-side comparison and then killed the surface. The math itself is reasonable and reads like a debating resume, which is part of its appeal. "I broke at WUDC, SF at EUDC, top-20 at Cambridge IV" maps cleanly to discrete classes and points. The problems are mostly philosophical. Class boundaries are arbitrary (why exactly 72 WUDC-breaker speakers for S-AAA+, why not 75?). The point values are someone's opinion. Two tournaments inside the same class can have very different actual strength, which our continuous strengthZ catches and the discrete system flattens. And running it alongside our composite created the obvious question: "which one is the real score?" That question doesn't have a clean answer because the two systems are measuring different things. If we ever resurface the class view, it'll be on a page like this one as an educational comparison. Not on individual profiles.

So which is right?

None of them are right. They're different ways of summarising the same underlying performance. The interesting question isn't "which algorithm". It's "which question are we asking". Our composite asks "how good were they, on average, recently". The discrete-class CV asks "how impressive is their achievement list". Elo would ask "who would beat whom in a head-to-head". Plackett-Luce asks "what generative model best explains the ranks we observed". Different questions get different answers. Picking a method without thinking about the question is just picking a method. The reassuring part: the same 30 or so people end up at the top across every reasonable method. Whatever you build, the WUDC champions and finalists of the last decade come out near the top. The choice of algorithm shifts ranks by 1 to 3 spots in the middle of the leaderboard. The fights at the margin are the only ones the algorithm choice actually decides. And those fights are usually inside the uncertainty band of whatever method you used. That doesn't mean any method works. A bad method can produce nonsense (we have one in our git history: Plackett-Luce's non-converging top-5s). It does mean that once you're past the "this method is broken" floor, the disagreements between methods at the top are smaller than people instinctively believe.

Why we wrote this

The methodology page exists so anyone can check the math. The math itself isn't where the interesting choices are. The interesting choices are upstream: which tournaments count, how to handle recency, what to do with the 1-tab-elite, whether to display percentile or raw rating, whether to publish provisional names at all. Those decisions move the leaderboard more than the choice between Massey and Plackett-Luce ever could. When you read these ratings, treat the algorithm as a detail. The interesting questions are about the questions.

How the leaderboards relate

Shared data, shared recency

The leaderboards share source data but use independent strength signals, because a tournament's debater difficulty and judge difficulty aren't the same thing. A tournament has TWO strength scores:

• Debater-bubble strength. How hard the bubble teams' last-2-rounds path was. Used as the weighting multiplier in the debater rating computation. Shown in the tournaments pane on the leaderboard.

• Judge-field strength. Mean rating-percentile of every judge in the adjudication pool. Used as the weighting multiplier in the judge rating computation. Not currently surfaced as its own ranking, but it's why the judge ladder reflects judging-field quality rather than debater-bubble difficulty.

These can diverge meaningfully. WUDC has elite debaters AND elite judges, so both scores rank high. A smaller regional tab might have a brutal debate bubble and a regional adjudication pool: high debater strength, lower judge strength. The methodology measures each honestly without forcing them to track each other.

When we recompute the leaderboards, the debater side runs debater ratings, then tournament debater-strength, then updated debater ratings. That takes one iteration: it converges within a single pass since the rolling speaker ratings don't depend on tournament strength. The judge side runs judge ratings, then tournament judge-strength, then updated judge ratings. We iterate this 2 to 3 times for convergence, since judge field strength uses prior judge ratings as input.

What none of these do

Out of scope

WSDC, AP, and other schools formats aren't rated yet. They're on the roadmap.

Tabs we never captured aren't here. Most tournaments come from Tabbycat, and where one ran on another system we reconstruct what we can, but a tab that was never archived anywhere is simply missing.

We don't predict future results. The math summarises what already happened, not what will.

We don't make moral claims. A high rank means "appeared often at strong tabs with strong results". It doesn't mean "is a better person" or "should win their next round".

We don't update in real time. The cron runs once a week. Imports between runs don't move the leaderboard until the next refresh.