New Tournament Scoring: Trade-offs, Decisions

This post is aimed at forecasters who are interested in the mathematical details and trade-offs involved in scoring. We discuss some of the trade-offs and other considerations that went into our design decisions for the new tournament scoring system.

We designed the new system with an eye toward simplicity, consistency, and a prize distribution that balances rewarding top forecasters with encouraging participation.

Options we considered

We considered both the Baseline and Peer scores. Either would let us simplify our overall scoring system by deprecating the now-legacy Relative scores. In either case, your tournament Rank would be determined by the sum of your scores, or “Total score”.

We considered distributing prizes proportionally to each forecaster’s Total Score, but we empirically found that this distributes prizes very flatly: all forecasters get a very small portion of the pool. We want top forecasters to receive a bigger share, to better reward excellence.

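We focused on two ways to better concentrate the prizes at the top: (i) taking the exponential of the sum, $\exp\left(\sum_q S_q\right)$, or (ii) taking the square of the sum, $\left(\sum_q S_q\right)^2$, with $S_q$ your score on question $q$.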

This gave us four possibilities to consider in more depth:

  • Exponential of sum of Baseline scores
  • Square of sum of Baseline scores
  • Exponential of sum of Peer scores
  • Square of sum of Peer scores

Exponential and equivalent number of questions

Taking the exponential of the sum of your scores has a drawback: in the limit of many questions it gives all the prize money to a single user. To address this we considered adding a small complication.

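Instead of having the prize proportional to the exponential of the sum of your scores:

$$\text{prize} \propto \exp\left(\sum_q S_q\right)$$

we make the prize proportional to the exponential of your average score in the tournament (divided by 100), where you get 0 if you don’t forecast a question:

$$\text{prize} \propto \exp\left(\frac{1}{N} \sum_q \frac{S_q}{100}\right)$$

where $S_q$ is your score on question $q$ and $N$ is the number of questions in the tournament. Empirically, this distributes the prize pool very flatly: everyone only gets a little bit. Thankfully, we’ve seen that having more questions makes the prize distribution sharper. We simulate that by introducing an “equivalent number of questions” $M$:

$$\text{prize} \propto \exp\left(\frac{M}{N} \sum_q \frac{S_q}{100}\right)$$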

This is what we will mean by “exponential with M equivalent questions” in the rest of this post. In the examples below, we used 20 equivalent questions unless mentioned otherwise.
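As a rough illustration, here is a minimal Python sketch of the exponential with M equivalent questions. It assumes the take has the form exp(M · (total score / N) / 100), which is our reading of the description above and agrees with the worked numbers later in this post. The function names `exponential_take` and `prize_shares` and the three total scores are hypothetical, chosen only for the example.

```python
import math

def exponential_take(total_score, n_questions, m_equivalent=20):
    """Take under the exponential method with M equivalent questions.

    Assumed form: exp(M * (total_score / N) / 100). Questions that were
    not forecast simply contribute 0 to total_score.
    """
    average_score = total_score / n_questions
    return math.exp(m_equivalent * average_score / 100)

def prize_shares(takes):
    """Fraction of the prize pool for each forecaster, proportional to take."""
    pool = sum(takes)
    return [take / pool for take in takes]

# Hypothetical total scores for three forecasters in a 30-question tournament.
totals = [1200, 900, 300]
takes = [exponential_take(total, n_questions=30) for total in totals]
shares = prize_shares(takes)
print(shares)
```

Raising `m_equivalent` in this sketch shifts more of the pool toward the highest total, which is the sharpening effect described above.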

Choice criteria

To choose between the 4 possibilities above (plus a “number of equivalent questions” if we choose the exponential) we explored how they compare across our key objectives. Specifically:

  1. It should be simple and elegant.
  2. It should moderately concentrate the prize on the top forecasters. We don’t want the top 1 forecaster to get everything, and we don’t want the top 100 to get 1% each.
  3. It should moderately punish joining late. We want forecasters to predict early, but we also want someone who joins halfway through to get a moderate share of the prize.
  4. The late joining penalty should be independent of skill. A good forecaster who joins late should be punished the same as a bad forecaster who joins late (proportional to the prize each would have gotten had they joined early).
  5. Adding an easy question should not change the distribution of prizes significantly.

Evaluating the criteria

Simplicity and Elegance

The Baseline score is slightly simpler than the Peer score. It compares each user to a fixed baseline, while the Peer score compares each user to all other users. Both are displayed prominently on each resolved question. We don’t think either the Baseline or Peer score has an edge when considering elegance.

The mathematical operation of squaring is slightly simpler than the exponential. Squaring does not require an equivalent number of questions like the exponential method does, but it does require arbitrarily setting the prize of forecasters who get negative total scores to 0.
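The need for that floor at 0 is easy to see in a small sketch. The names `square_take` and `prize_shares` and the three totals below are hypothetical, and the code is only an illustration of the idea, not the production implementation.

```python
def square_take(total_score):
    """Square of the total score, with negative totals floored at 0.

    Without the floor, a total of -150 would square to a positive take
    and a poor forecaster would be rewarded.
    """
    return max(total_score, 0) ** 2

def prize_shares(takes):
    """Fraction of the prize pool for each forecaster, proportional to take."""
    pool = sum(takes)
    return [take / pool for take in takes]

# Hypothetical total scores, including one negative total.
totals = [600, 300, -150]
shares = prize_shares([square_take(total) for total in totals])
print(shares)
```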

As outlined in this blog post announcing the current Tournament scoring system, taking the exponential of scores is equivalent to using the softmax function, which has precedent in machine learning (and other sciences), and it can be derived from the Kelly criterion.
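The softmax equivalence can be checked numerically. The sketch below assumes the exponent for each forecaster is M · total / N / 100 (our reading of the exponential method) and uses made-up totals. Subtracting the maximum before exponentiating is a standard numerical-stability trick that we add here; it is not something this post prescribes.

```python
import math

def softmax(values):
    """Numerically stable softmax: subtracting the max leaves the result unchanged."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

m_equivalent, n_questions = 20, 30
totals = [1200, 900, 300]  # hypothetical total scores

# Exponent of each forecaster's take under the assumed exponential method.
exponents = [m_equivalent * t / n_questions / 100 for t in totals]

# Shares computed directly from the exponential takes...
direct_takes = [math.exp(x) for x in exponents]
direct_shares = [take / sum(direct_takes) for take in direct_takes]

# ...match the softmax of the same exponents.
softmax_shares = softmax(exponents)
print(direct_shares, softmax_shares)
```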

Overall we feel that the square method is simpler, but the exponential method is more elegant.

Prize concentration

Ideally, we want to give most of the prize money to the top few forecasters. We still expect some variation: if one forecaster is excellent and all the others are dart-throwing chimps, we would like the excellent forecaster to get roughly all of the prize. Conversely, if all forecasters make the exact same predictions, we would like them all to receive the same prize. But neither outcome should be the norm for tournaments.

Thankfully, this criterion can be evaluated on real data, since we have many past tournaments. For each, we computed how the prize money would have been distributed among forecasters if the tournament had used one of the new scoring methods under consideration. The results are summarised in the following table: for each past tournament of interest, we plot the fraction of the prize pool each user would have gotten vs. their rank:

[Figure: fraction of the prize pool vs. forecaster rank, for each past tournament and scoring method]

As expected, for the exponential methods, more equivalent questions lead to a larger prize for the top forecasters.

More surprisingly, for the square methods, the Peer score concentrates the prize more, while for the exponential methods, it’s the Baseline score that concentrates the prize more.

Next, we looked at two statistics that we think are critical: the proportion of the prize pool that went to the best forecaster in the tournament, and the proportion that went to the top 10 forecasters combined. Here are the averages over all the above tournaments for each statistic, and our thoughts on each:

| Scoring method | Average prize that goes to the best forecaster | Average prize that goes to the top 10 forecasters | Thoughts |
| --- | --- | --- | --- |
| Current scoring | 41% | 61% | too concentrated |
| Square Baseline | 12% | 59% | not concentrated enough |
| Square Peer | 21% | 73% | balanced |
| Exponential Baseline (5 equivalent questions) | 4% | 23% | not concentrated enough |
| Exponential Baseline (10 equivalent questions) | 14% | 52% | not concentrated enough |
| Exponential Baseline (15 equivalent questions) | 26% | 76% | balanced |
| Exponential Peer (10 equivalent questions) | 4% | 16% | not concentrated enough |
| Exponential Peer (20 equivalent questions) | 15% | 40% | not concentrated enough |
| Exponential Peer (30 equivalent questions) | 31% | 65% | balanced |

This leaves the square of Peer scores, the exponential of Baseline scores with 15 equivalent questions, and the exponential of Peer scores with 30 equivalent questions as the options that pass this criterion.
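For readers who want to compute the same two statistics on their own data, here is a minimal sketch. The function name `concentration_stats` and the 15 total scores are hypothetical; the takes use the square method with negative totals floored at 0, and any other list of takes would work the same way.

```python
def concentration_stats(takes, top_n=10):
    """Return (share of the best forecaster, combined share of the top_n)."""
    pool = sum(takes)
    ranked = sorted(takes, reverse=True)
    best_share = ranked[0] / pool
    top_share = sum(ranked[:top_n]) / pool
    return best_share, top_share

# Hypothetical Peer totals for 15 forecasters: 700, 650, ..., 50, 0.
peer_totals = list(range(700, -1, -50))

# Square method: negative totals would be floored at 0 before squaring.
takes = [max(total, 0) ** 2 for total in peer_totals]

best_share, top10_share = concentration_stats(takes)
print(f"best forecaster: {best_share:.1%}, top 10: {top10_share:.1%}")
```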

Lateness penalty

We want to reward early participation, which is the same as punishing late joining. However, we don't want to punish late joining too much: it should still be worth a forecaster’s time to join a tournament halfway through.

Below is a table listing the prize a user gets when joining late, compared to the prize they would have gotten if they had joined from the start. We compare the square of the score and the exponential with several numbers of equivalent questions (5, 10 and 15 for Baseline scores; 10, 20 and 30 for Peer scores). The examples assume a good forecaster who, had they joined from the start, would have averaged per question a Relative score of +0.025 using current scoring, a Peer score of +20, and a Baseline score of +40.

| | Join 10% through (90% coverage) | Join 50% through (50% coverage) |
| --- | --- | --- |
| Current scoring | get 86% | get 39% |
| Square of Baseline score | get 81% | get 25% |
| Square of Peer score | get 81% | get 25% |
| Exponential of Baseline (5 equivalent questions) | get 75% | get 30% |
| Exponential of Baseline (10 equivalent questions) | get 56% | get 9% |
| Exponential of Baseline (15 equivalent questions) | get 42% | get 3% |
| Exponential of Peer (10 equivalent questions) | get 75% | get 30% |
| Exponential of Peer (20 equivalent questions) | get 56% | get 9% |
| Exponential of Peer (30 equivalent questions) | get 42% | get 3% |

The two square methods, the exponential of the Baseline score with 5 equivalent questions, and the exponential of the Peer score with 10 equivalent questions leave more than 20% of the prize for joining halfway through. All the other ones are extremely punitive for joining even 10% of the way through, so we ruled those out.
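To see why the lateness penalty depends on skill for the exponential but not for the square, consider a deliberately simplified model. It is our assumption, not the calculation behind the table: a forecaster earns a constant average score on the fraction of the tournament they cover and 0 elsewhere, and their prize changes in proportion to their own take. Under this model the square rows follow directly (0.9² = 81%, 0.5² = 25%), while the exponential rows of the table are not reproduced exactly, so the sketch only illustrates the direction of the effect. The function names are hypothetical.

```python
import math

def square_late_ratio(coverage):
    """Take when covering a fraction of the tournament, relative to full coverage.

    Total score scales with coverage, so the squared take scales with
    coverage ** 2, whatever the forecaster's average score is.
    """
    return coverage ** 2

def exponential_late_ratio(coverage, avg_score, m_equivalent):
    """Same ratio for the assumed take exp(M * coverage * avg_score / 100).

    The ratio exp(-M * avg_score * (1 - coverage) / 100) shrinks as
    avg_score grows: better forecasters lose a larger fraction.
    """
    return math.exp(-m_equivalent * avg_score / 100 * (1 - coverage))

for coverage in (0.9, 0.5):
    print(coverage, square_late_ratio(coverage))

# Joining halfway through, with 20 equivalent questions (simplified model):
good = exponential_late_ratio(0.5, avg_score=20, m_equivalent=20)
average = exponential_late_ratio(0.5, avg_score=5, m_equivalent=20)
print(good, average)  # the more skilled forecaster keeps a smaller fraction
```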

Easy question invariance

An “easy question” is a question like “Will the sun rise tomorrow?”, where everyone predicts 99.9% and gets a Baseline score of +100 and a Peer score of 0.

Imagine a tournament where 30 questions have already resolved. A user received on average a Baseline score of +40 and a Peer score of +20, for totals of +1200 and +600, respectively.

How does their Take (which determines the amount of prize they get) change if we add 5 easy questions (as defined above)?

| | Before the 5 easy questions | After the 5 easy questions |
| --- | --- | --- |
| Total Baseline score | 1200 | 1700 |
| Total Peer score | 600 | 600 |
| Take (baseline, square) | 1440000 | 2890000 |
| Take (peer, square) | 360000 | 360000 |
| Take (baseline, exponential) | 2981 | 16552 |
| Take (peer, exponential) | 55 | 31 |

With Baseline scores, adding easy questions has a large impact, since they add hundreds of points to users’ total scores. The exponential methods are also impacted, since adding questions increases the number of questions that the total is divided by in the “equivalent number of questions” calculation.

Only the square of the Peer scores is left unchanged by adding easy questions. We view this as a strong argument for using Peer scores.
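The Take values in the table above can be reproduced with a short sketch, assuming the exponential take is exp(M · total / N / 100) with M = 20 equivalent questions and the square take is the squared total. The function and variable names are ours, for illustration only.

```python
import math

def square_take(total_score):
    """Square of the total score, with negative totals floored at 0."""
    return max(total_score, 0) ** 2

def exponential_take(total_score, n_questions, m_equivalent=20):
    """Assumed exponential take: exp(M * total / N / 100)."""
    return math.exp(m_equivalent * total_score / n_questions / 100)

N_BEFORE, N_EASY = 30, 5
EASY_SCORE = {"baseline": 100, "peer": 0}  # per easy question, as defined above
totals_before = {"baseline": 1200, "peer": 600}

n_after = N_BEFORE + N_EASY
totals_after = {
    kind: total + N_EASY * EASY_SCORE[kind]
    for kind, total in totals_before.items()
}

for kind in ("baseline", "peer"):
    before, after = totals_before[kind], totals_after[kind]
    print(kind, "square:", square_take(before), "->", square_take(after))
    print(
        kind, "exponential:",
        round(exponential_take(before, N_BEFORE)), "->",
        round(exponential_take(after, n_after)),
    )
```

Note that the Peer exponential take drops even though the Peer total is unchanged, because the same total is now averaged over 35 questions instead of 30.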

Decision matrix and conclusion

Here is a decision matrix summarizing all the previous criteria:

Looking at the above matrix:

  • We reject the square of the Baseline score, because it doesn’t concentrate the prize enough on the top forecasters.

  • We reject the exponential of the Baseline score, since it:

    • punishes more skilled forecasters more for joining late,
    • has no sweet spot of “number of equivalent questions” where it both concentrates the prize appropriately, and punishes lateness fairly.
  • We reject the exponential of the Peer score, since it:

    • punishes more skilled forecasters more for joining late,
    • has no sweet spot of “number of equivalent questions” where it both concentrates the prize appropriately, and punishes lateness fairly.

The remaining option is the square of the Peer score. This is what we chose for our new tournament scoring system. It is not the most elegant, but it ticks all of our other boxes.

We look forward to reading your feedback and questions below!
