This post is aimed at forecasters who are interested in the mathematical details and tradeoffs involved in scoring. We discuss some of the tradeoffs and other considerations that went into our design decisions for the new tournament scoring system.
We designed the new system with an eye toward simplicity, consistency, and a prize distribution that balances rewarding top forecasters with encouraging participation.
Options we considered
We considered two candidates: the Baseline score and the Peer score. Either would let us simplify our overall scoring system by letting us deprecate the now-legacy Relative scores. In either case, your tournament Rank would be determined by the sum of your scores, or “Total score”.
We considered distributing prizes proportionally to each forecaster’s Total Score, but we empirically found that this distributes prizes very flatly: all forecasters get a very small portion of the pool. We want top forecasters to receive a bigger share, to better reward excellence.
We focused on two ways to better concentrate the prizes at the top: (i) taking the exponential of the sum (prize ∝ exp(∑ᵢ sᵢ / 100)) or (ii) taking the square of the sum (prize ∝ (∑ᵢ sᵢ)²).
This gave us four possibilities to consider in more depth:
- Exponential of sum of Baseline scores
- Square of sum of Baseline scores
- Exponential of sum of Peer scores
- Square of sum of Peer scores
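As a rough sketch of how the two operations shape the prize distribution, here is a toy comparison. The totals are hypothetical, and dividing scores by 100 inside the exponential follows the convention used later in this post:

```python
import math

# Hypothetical total scores after 30 questions
# (e.g. averaging +40, +20, +10 and 0 per question).
totals = [1200.0, 600.0, 300.0, 0.0]

def shares(takes):
    """Normalize takes into fractions of the prize pool."""
    pool = sum(takes)
    return [t / pool for t in takes]

# (i) exponential of the sum (scores divided by 100)
exp_shares = shares([math.exp(t / 100) for t in totals])

# (ii) square of the sum (negative totals clipped to 0)
sq_shares = shares([max(t, 0.0) ** 2 for t in totals])

print([round(s, 3) for s in exp_shares])  # top forecaster takes nearly everything
print([round(s, 3) for s in sq_shares])
```

Even in this small example, the raw exponential hands almost the whole pool to the top forecaster, while the square spreads it more gradually.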
Exponential and equivalent number of questions
Taking the exponential of the sum of your scores has a drawback: in the limit of many questions it gives all the prize money to a single user. To address this we considered adding a small complication.
Instead of having the prize proportional to the exponential of the sum of your scores:

prize ∝ exp(∑ᵢ sᵢ / 100)

we make the prize proportional to the exponential of your average score in the tournament (divided by 100), where you get 0 on any question you don’t forecast:

prize ∝ exp((1/N) ∑ᵢ sᵢ / 100)

where N is the number of questions in the tournament. Now, empirically this distributes the prize pool very flatly: everyone only gets a little bit. Thankfully, we’ve seen that having more questions makes the prize distribution sharper. We simulate that by introducing an “equivalent number of questions” M:

prize ∝ exp((M/N) ∑ᵢ sᵢ / 100)
This is what we will mean by “exponential with M equivalent questions” in the rest of this post. In the examples below, we used 20 equivalent questions unless mentioned otherwise.
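In code, a take of this form can be sketched as follows, assuming the exp((M/N) ∑ᵢ sᵢ / 100) formula above with the default of 20 equivalent questions:

```python
import math

def exponential_take(total_score, n_questions, m_equivalent=20):
    """Exponential of the average score (per 100 points), rescaled to
    m_equivalent questions instead of the true number of questions."""
    return math.exp(m_equivalent * total_score / (n_questions * 100))

# 30 questions at an average Baseline score of +40, i.e. a total of +1200.
print(round(exponential_take(1200, 30)))  # 2981
```

A forecaster's prize share is then their take divided by the sum of all forecasters' takes.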
Choice criteria
To choose between the 4 possibilities above (plus a “number of equivalent questions” if we choose the exponential) we explored how they compare across our key objectives. Specifically:
- It should be simple and elegant.
- It should moderately concentrate the prize on the top forecasters. We don’t want the top forecaster to get everything, and we don’t want the top 100 to get 1% each.
- It should moderately punish joining late. We want forecasters to predict early, but we also want someone who joins halfway through to get a moderate share of the prize.
- The late-joining penalty should be independent of skill. A good forecaster who joins late should be punished the same as a bad forecaster who joins late: each should lose the same fraction of the prize they would have gotten had they joined early.
- Adding an easy question should not change the distribution of prizes significantly.
Evaluating the criteria
Simplicity and Elegance
The Baseline score is slightly simpler than the Peer score. It compares each user to a fixed baseline, while the Peer score compares each user to all other users. Both are displayed prominently on each resolved question. We don’t think either the Baseline or Peer score has an edge when considering elegance.
The mathematical operation of squaring is slightly simpler than the exponential. Squaring does not require using an effective number of questions like the exponential method does, but it does require arbitrarily setting the prize of forecasters who get negative scores to 0.
As outlined in this blog post announcing the current Tournament scoring system, taking the exponential of scores is equivalent to using the softmax function, which has precedent in machine learning (and other sciences). And it can be derived from the Kelly criterion.
Overall we feel that the square method is simpler, but the exponential method is more elegant.
Prize concentration
Ideally we want to give most of the prize money to the top few forecasters. We still expect some variation: if one forecaster is excellent and all the others are dart-throwing chimps, we would like the excellent forecaster to get roughly all the prize. Conversely, if all forecasters make exactly the same predictions, we would like them all to receive the same prize. But neither outcome should be the norm for tournaments.
Thankfully, this criterion can actually be evaluated on real data, since we have many past tournaments. For each, we computed how the prize money would have been distributed among forecasters if the tournament had used each of the new scoring methods under consideration. The results are summarised in the following figure: for each past tournament of interest, we plot the fraction of the prize pool each user would have gotten against their rank.
As expected, for the exponential methods, more equivalent questions leads to more prize for the top forecasters.
More surprisingly, for the square methods the Peer score concentrates the prize more, while for the exponential methods it’s the Baseline score that concentrates the prize more.
Next we looked at two statistics that we think are critical: the proportion of the prize pool that went to the best forecaster in the tournament, and that went to all the top 10 forecasters. Here are the averages over all the above tournaments for each statistic, and our thoughts for each:
| Scoring method | Average prize that goes to the best forecaster | Average prize that goes to the top 10 forecasters | Thoughts |
|---|---|---|---|
| Current scoring | 41% | 61% | too concentrated |
| Square Baseline | 12% | 59% | not concentrated enough |
| Square Peer | 21% | 73% | balanced |
| Exponential Baseline (5 equivalent questions) | 4% | 23% | not concentrated enough |
| Exponential Baseline (10 equivalent questions) | 14% | 52% | not concentrated enough |
| Exponential Baseline (15 equivalent questions) | 26% | 76% | balanced |
| Exponential Peer (10 equivalent questions) | 4% | 16% | not concentrated enough |
| Exponential Peer (20 equivalent questions) | 15% | 40% | not concentrated enough |
| Exponential Peer (30 equivalent questions) | 31% | 65% | balanced |
That leaves the square of Peer scores, the exponential of Baseline scores with 15 equivalent questions, and the exponential of Peer scores with 30 equivalent questions as the options passing this criterion.
Lateness penalty
We want to reward early participation, which is the same as punishing late joining. However, we don't want to punish late joining too much: it should still be worth a forecaster’s time to join a tournament halfway through.
Below is a table listing the prize a user gets when joining late, compared to the prize they would have gotten had they joined from the start. We compare the current scoring, the two square methods, and the exponential methods with various numbers of equivalent questions. The examples assume a good forecaster who, had they joined from the start, would have averaged a Relative score of +0.025 under current scoring, a Peer score of +20, and a Baseline score of +40 per question.
| Scoring method | Join 10% through (90% coverage) | Join 50% through (50% coverage) |
|---|---|---|
| Current scoring | get 86% | get 39% |
| Square of Baseline score | get 81% | get 25% |
| Square of Peer score | get 81% | get 25% |
| Exponential of Baseline (5 equivalent questions) | get 75% | get 30% |
| Exponential of Baseline (10 equivalent questions) | get 56% | get 9% |
| Exponential of Baseline (15 equivalent questions) | get 42% | get 3% |
| Exponential of Peer (10 equivalent questions) | get 75% | get 30% |
| Exponential of Peer (20 equivalent questions) | get 56% | get 9% |
| Exponential of Peer (30 equivalent questions) | get 42% | get 3% |
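The square-method rows can be checked directly: with per-question scores held fixed, joining with coverage c scales your total by c and your squared take by c², and (assuming the prize pool is dominated by the rest of the field) the prize ratio equals the take ratio. The exponential rows also depend on the scores of the rest of the field, so they are not reproduced here:

```python
def square_take(total_score):
    """Square of the total score, clipped at 0 for negative totals."""
    return max(total_score, 0.0) ** 2

full_total = 30 * 40.0  # 30 questions at an average Baseline score of +40
for coverage in (0.9, 0.5):
    ratio = square_take(coverage * full_total) / square_take(full_total)
    print(f"coverage {coverage:.0%}: get {ratio:.0%}")  # 81% and 25%
```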
The two square methods, the exponential of the Baseline score with 5 equivalent questions, and the exponential of the Peer score with 10 equivalent questions leave more than 20% of the prize for a forecaster joining halfway through. All the others are extremely punitive even for joining 10% of the way through, so we ruled them out.
Easy question invariance
An “easy question” is a question like “Will the sun rise tomorrow?”, where everyone predicts 99.9% and gets a Baseline score of +100 and a Peer score of 0.
Imagine a tournament where 30 questions have already resolved. A user received an average Baseline score of +40 and an average Peer score of +20, for totals of +1200 and +600 respectively.
How does their Take (which determines the amount of prize they get) change if we add 5 easy questions (as defined above)?
| | Before the 5 easy questions | After the 5 easy questions |
|---|---|---|
| Total Baseline score | 1200 | 1700 |
| Total Peer score | 600 | 600 |
| Take (Baseline, square) | 1440000 | 2890000 |
| Take (Peer, square) | 360000 | 360000 |
| Take (Baseline, exponential) | 2981 | 16552 |
| Take (Peer, exponential) | 55 | 31 |
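These Take values can be reproduced from the formulas used in this post, assuming Take = (total)² for the square methods and Take = exp(M · total / (N · 100)) with M = 20 equivalent questions for the exponential methods:

```python
import math

M = 20  # equivalent number of questions used throughout this post

def square_take(total):
    return max(total, 0.0) ** 2

def exp_take(total, n_questions, m=M):
    return math.exp(m * total / (n_questions * 100))

# Before: 30 resolved questions (totals +1200 Baseline, +600 Peer).
# After: 5 easy questions added (+100 Baseline, 0 Peer each).
print(square_take(1200), square_take(1700))                  # 1440000.0 2890000.0
print(square_take(600), square_take(600))                    # 360000.0 360000.0
print(round(exp_take(1200, 30)), round(exp_take(1700, 35)))  # 2981 16552
print(round(exp_take(600, 30)), round(exp_take(600, 35)))    # 55 31
```

Note how the exponential Peer take shrinks from 55 to 31 even though the Peer total is unchanged: N grows from 30 to 35, shrinking the M/N factor.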
With Baseline scores, adding easy questions has a large impact, since each one adds roughly a hundred points to every user’s total score. The exponential methods are also impacted, since adding questions changes the true number of questions in the denominator of the “equivalent number of questions” calculation.
Only the square of the Peer scores is left unchanged by adding easy questions. We view this as a strong argument for using Peer scores.
Decision matrix and conclusion
Here is a decision matrix summarizing all the previous criteria:
Looking at the above matrix:

We reject the square of the Baseline score, because it doesn’t concentrate the prize enough on the top forecasters.

We reject the exponential of the Baseline score, since it:
- punishes more skilled forecasters more for joining late, and
- has no sweet spot of “number of equivalent questions” where it both concentrates the prize appropriately and punishes lateness fairly.

We reject the exponential of the Peer score, since it:
- punishes more skilled forecasters more for joining late, and
- has no sweet spot of “number of equivalent questions” where it both concentrates the prize appropriately and punishes lateness fairly.
The remaining option is the square of the Peer score. This is what we chose for our new tournament scoring system. It is not the most elegant, but it ticks all of our other boxes.
We look forward to reading your feedback and questions below!