
Metaculus Tournament Scoring [Updated 4.6.22]


Updated April 6th, 2022

Today we’re simplifying and clarifying some of the terminology we use for tournament scoring to make it even easier to track your forecasting performance. The foundation of the tournament scoring system introduced in November 2021 has not changed, and tournament rankings have not changed. What has changed is that we have redefined and standardized several of the terms that appear on leaderboards and across the site to make them more consistent, as well as a better match for how these terms are used more broadly.

Specifically, we are now using the natural logarithm (instead of the logarithm base 2) for tournament leaderboards and scoreboards. You’ll notice the scores have changed, but importantly, the actual rankings and prize calculations have not changed.

Here are the quick takeaways:

  • Tournaments use natural logarithms and not logarithm base 2
  • Tournament leaderboards’ ‘Take’ and ‘Score’ will appear differently
  • However, these changes do not affect rankings or prize payouts

If you want to dive deeper, the rest of this post goes into the details of the current tournament scoring system — including a review of the core system introduced in November, the updated terms introduced today, and the necessary role of logarithms. We also provide a few simple scenarios to illustrate the most important points.

The original system

Our original tournament scoring system was created in April 2021 and built on the incentive-compatible Kelly betting rule. In November 2021, we added a hidden period (see below) to the original system to better reward forecasting skill.

That’s because the original tournament scoring system (without a hidden period) has a weakness: a simple bot that copies the community prediction on every question ends up doing quite well and winning real money, while providing no value. In the Trade Signal Tournament, such a bot would have finished in first place and won around 15% of the prize pool. More alarmingly, if someone made such a bot (which our more technically proficient users could easily do today) and shared that bot with 10 friends, those 10 bots would almost certainly consume the majority of the prize pool — providing, again, no value.

The Hidden Period

The addition of a Hidden Period closed that loophole. The community median forecast is now hidden for a period of time at the start of a question, so forecasters can’t piggy-back on the work of others. A simple bot that forecasts an ignorance prior when the community median is hidden (like 50% on a binary question) will score quite poorly, allowing the best forecasters to get a sizable scoring lead with their skill during the hidden period. The length of time during which the community median is hidden, and the associated question coverage (explained below), may vary from question to question and from tournament to tournament. We will continuously explore what works best and tailor the parameters to match a given tournament.

If Metaculus’s only goal were to reward forecasting accuracy, then we could hide the community median at all times. But Metaculus also aims to help provide our partners and the world with the most accurate public forecasts possible. We believe that the current tournament scoring system balances these competing interests, rewarding individual skill while also providing a public service.

A Simple Scenario

Let’s explore a small concrete example to illustrate the ideas behind our scoring approach. Consider a simple tournament with the following assumptions:

  • The prize pool is $1,000.
  • The tournament has 3 questions.
  • Each question is scheduled to be open for 4 days. For the first 2 days the community median is hidden and for the final 2 days it is publicly visible.
  • There are 3 forecasters: A, B, and C.
  • There is a bot that mechanically forecasts the community median whenever it is visible and does not forecast otherwise.

This tournament is simple enough to illustrate every step in a spreadsheet. Readers who are interested can follow along here. We hope that by the end of this piece readers will be better able to understand leaderboards, scoreboards, and tournament rankings.

Question 1: a Binary Question

First, let’s consider a binary (Yes/No) question: Will the Tokyo Olympics be held in 2021? (We now know the answer was Yes).

The table below shows the daily forecasts made by each forecaster. On day 1, forecaster A thought there was a 10% chance the Tokyo Olympics would take place in 2021. This forecast was carried forward until day 3, when forecaster A updated their forecast to 55%, where it remained until the question closed on day 4.

Forecasts (f) for Question 1
Forecaster day 1 day 2 day 3 day 4
A 10% 10% 55% 55%
B - 90% 90% 90%
C 20% 25% 30% 35%
bot - - 55% 55%
median (m) 15% 25% 55% 55%

Forecaster B started with a 90% forecast on day 2 and did not update during the life of the question. Forecaster C steadily increased their forecast over the 4 days from 20% to 35%.

The table also includes the forecasts made by the bot described above. The simple bot did not forecast on day 1 or 2, since the community median was not publicly visible. On days 3 & 4, the bot copied the median forecast of 55%.

Given the daily forecasts and question resolution, the tournament question score (s1) for each forecaster can be calculated. The tournament question score is a measure of the accuracy of a forecast, or how close the forecast was to the actual outcome. It is quite important to use a mathematically proper scoring rule, so that forecasters are incentivized to forecast their true beliefs. There is an infinite variety of proper scoring rules, but Metaculus tournament scoring is based on the Relative Log Score.

On a binary question that resolves Yes, a forecaster’s Relative Log Score is the natural log of their forecast divided by the community median. If they forecast 80% and the community median was 40%, then their relative log score is ln(80% / 40%) = ln(2) ≈ 0.69. A positive score indicates that they were closer to the correct outcome than the group.
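
As a rough illustration (a sketch for this post’s examples, not Metaculus’s internal code), the relative log score for a binary question that resolved Yes can be computed like this:

```python
import math

def relative_log_score(forecast: float, median: float) -> float:
    """Relative log score on a binary question that resolved Yes:
    the natural log of the forecast divided by the community median."""
    return math.log(forecast / median)

# The example from the text: an 80% forecast against a 40% community median.
print(round(relative_log_score(0.80, 0.40), 2))  # 0.69, i.e. ln(2); positive means closer to the outcome than the group
```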

The table below shows the daily relative log scores for each forecaster. Forecaster A’s day 1 relative log score is the natural log of their forecast (10%) divided by the day 1 median forecast (15%), or ln(10% / 15%) ≈ -0.41. In this case their score is negative, since 10% is further from the outcome (100%) than the median forecast of 15%. When a forecast matches the median, the relative log score is ln(1) = 0.

Relative Log Scores for Question 1
Forecaster day 1 day 2 day 3 day 4 Question Score (s1)
A -0.41 -0.92 0.00 0.00 -0.330
B 0.00 1.28 0.49 0.49 0.566
C 0.29 0.00 -0.61 -0.45 -0.193
bot 0.00 0.00 0.00 0.00 0.00

For days before a forecaster made their initial forecast, their relative log score is 0, as if they had input the median forecast for that day. Accordingly, the bot always receives a relative log score of 0 for all days on all questions.

To calculate each forecaster’s tournament question score (s1), we take an average of their daily relative log scores. The tournament question score is a measure of forecast accuracy. In this case forecaster B’s 90% forecast was closest to the outcome, so they received the highest score of 0.566.
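
Continuing the sketch above, here is one way to reproduce forecaster A’s question 1 score from the daily numbers in the tables (an illustrative daily approximation, not production code):

```python
import math

# Forecaster A's daily forecasts and the daily community medians for question 1.
forecasts = [0.10, 0.10, 0.55, 0.55]
medians   = [0.15, 0.25, 0.55, 0.55]

daily_scores = [math.log(f / m) for f, m in zip(forecasts, medians)]
question_score = sum(daily_scores) / len(daily_scores)

print([f"{s:.2f}" for s in daily_scores])  # ['-0.41', '-0.92', '0.00', '0.00']
print(f"{question_score:.3f}")             # -0.330
```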

Finally, we calculate a metric that reflects a forecaster’s participation in question 1: the question coverage (c1). Question coverage rewards participation on a question and is factored into the prize calculation. In the next table, there is a 1 if a forecaster was actively forecasting on a particular day and a 0 otherwise. Forecaster A participated every day, so they have all 1s. Forecaster B did not forecast on day 1, so they have a 0 under day 1 but a 1 for the rest of the days. Note that the bot did not forecast on day 1 or 2, when the median was hidden, so it received 0 on those days.

Coverage for Question 1
Forecaster day 1 day 2 day 3 day 4 Question Coverage (c1)
A 1 1 1 1 100%
B 0 1 1 1 75%
C 1 1 1 1 100%
bot 0 0 1 1 50%

To calculate a forecaster’s question coverage (c1), we take a weighted average of the daily participation using the daily coverage weights, which in this case are 25% per day. Forecasters A & C each had a coverage of 100% since they participated every day. All else equal, the higher a forecaster’s coverage, the larger the prize they will win.

The daily coverage weight is set by the tournament organizer and shared with forecasters. If the tournament organizer wanted to only count participation during the hidden period, then they would set the weights to 50% on days 1 & 2 in our example, and 0% on days 3 & 4. We’ll explore why they might do this below. (To date, all tournaments with a hidden period have had question coverage entirely determined by participation during the hidden period.)
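
As a small sketch of the coverage calculation (assuming the 4-day question above; the weights shown are illustrative tournament parameters, not fixed values):

```python
def question_coverage(participation, weights):
    """Weighted average of daily participation (1 = actively forecasting that day, 0 = not)."""
    return sum(p * w for p, w in zip(participation, weights))

bot_days = [0, 0, 1, 1]  # the bot only forecasts once the median becomes visible on day 3

equal_weights = [0.25, 0.25, 0.25, 0.25]  # 25% per day, as in the example above
hidden_only   = [0.50, 0.50, 0.00, 0.00]  # count only the 2-day hidden period

print(question_coverage(bot_days, equal_weights))  # 0.5 -> 50% coverage
print(question_coverage(bot_days, hidden_only))    # 0.0 -> no coverage credit for the bot
```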

Now that we’ve calculated the tournament question score (s1) and question coverage (c1) for question 1, a binary question, let’s consider how to calculate these quantities for question 2, a continuous numeric question.

Question 2: a Continuous Question

Let’s consider how scoring works for a concrete example of a continuous question: What will GDP be next year? In this case, forecasters submit probability distribution functions (PDFs) rather than probabilities. The plot below shows the PDFs for forecasters A, B, and C on day 1. Forecaster A’s PDF is centered on 1.4. Forecaster B’s PDF is centered on 2.5 and is much wider than A’s PDF. Forecaster C’s PDF is centered on 2.4 and is much narrower than the other forecasters’ curves.

[Figure: Day 1 Forecasts, showing the day 1 PDFs submitted by forecasters A, B, and C]

For a continuous question, the amplitude of a forecaster’s curve at the resolution value determines their forecast (f), and therefore their relative log score. Assuming the resolution value is 2, the forecasts (f) for forecasters A, B, and C are based on the amplitude of their PDFs at that point: 0.18, 0.36, and 0.09 respectively. These values are shown in the table below for day 1.

Forecasts (f) for Question 2
Forecaster day 1 day 2 day 3 day 4
A 0.18 0.18 0.18 0.18
B 0.36 0.5 1 2
C 0.09 0.09 0.09 0.09
bot - - 0.18 0.18
median (m) 0.18 0.18 0.18 0.18

The table shows that forecasters A & C did not change their forecasts, while forecaster B updated every day. The bot copied the median PDF on days 3 & 4, when it was visible, and was imputed the median before that, so it effectively held the median forecast every day. Given the forecasts (f) for each day and the median (m), the relative log scores for question 2 (s2) are calculated exactly the same way as question 1. The table below shows this calculation.

Relative Log Scores for Question 2
Forecaster day 1 day 2 day 3 day 4 Question Score (s2)
A 0.00 0.00 0.00 0.00 0.00
B 0.69 1.02 1.71 2.41 1.459
C -0.69 -0.69 -0.69 -0.69 -0.693
bot 0.00 0.00 0.00 0.00 0.00

Interestingly, forecaster C’s 50th percentile forecast (2.4) was closer to the outcome (2.0) than the 50th percentile for either forecaster A (1.4) or forecaster B (2.5), yet forecaster C got a worse score. How can that be the case? The reason is that C’s PDF was so narrow that its amplitude was the lowest at the resolution value of 2.0. Said another way, C assigned a lower probability to the outcome than either A or B, even though the center of C’s PDF was closer to the outcome.
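
To make this concrete, here is a small sketch with hypothetical Gaussian PDFs. The true shapes of the forecasters’ curves are not given in this post, so the centers and widths below are invented to roughly match the day 1 amplitudes in the table; the point is simply that a narrow curve centered near the outcome can still assign a lower density at the resolution value than a wider curve:

```python
import math

def gaussian_pdf(x: float, center: float, width: float) -> float:
    """Density of a normal distribution with the given mean (center) and standard deviation (width)."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2)) / (width * math.sqrt(2 * math.pi))

resolution = 2.0

# Hypothetical curves: C is centered closest to the outcome but is very narrow,
# so its density at the resolution value is the lowest of the three.
print(round(gaussian_pdf(resolution, center=1.4, width=0.30), 2))   # ~0.18 (forecaster A)
print(round(gaussian_pdf(resolution, center=2.5, width=1.00), 2))   # ~0.35 (forecaster B, wide)
print(round(gaussian_pdf(resolution, center=2.4, width=0.155), 2))  # ~0.09 (forecaster C, very narrow)
```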

Question coverage is calculated exactly the same way for a numeric question. For completeness, below is the table for question 2.

Coverage for Question 2
Forecaster day 1 day 2 day 3 day 4 Question Coverage (c2)
A 1 1 1 1 100%
B 1 1 1 1 100%
C 1 1 1 1 100%
bot 0 0 1 1 50%

Question 3: a Binary Question that Closes Early

Our final example considers a binary question that closes early. Imagine the following question opened on Tuesday: Will the S&P 500 close above 4500 by the end of the week? If on Thursday (day 3), the S&P closes above 4500 then the question closes early and no forecasts are collected on Friday (day 4).

The following table shows that Forecaster A forecasts 30% on days 1 & 2, then withdraws from the question on day 3. Like a forecaster who joins a question late, a forecaster who withdraws from a question is imputed the median forecast, but receives no credit for coverage on those days. (We plan to add the ability to withdraw in the future.) Forecaster B starts forecasting 10% on day 2, while forecaster C makes no forecasts. Consistent with joining a question late or withdrawing, forecaster C is imputed the median forecast for all days and receives 0 question coverage and 0 tournament question score for this question.

Forecasts (f) for Question 3
Forecaster day 1 day 2 day 3 day 4
A 30% 30% - -
B - 10% 10% -
C - - - -
bot - - 10% -
median (m) 30% 20% 10% -

To calculate the tournament question score (s3), we take the average of ln(f/m) for each day, including day 4, when all forecasters receive a value of 0. (Averaging over only the first 3 days would make the scores mathematically improper.)

Relative Log Scores for Question 3
Forecaster day 1 day 2 day 3 day 4 Question Score (s3)
A 0.00 0.41 0.00 0.00 0.101
B 0.00 -0.69 0.00 0.00 -0.173
C 0.00 0.00 0.00 0.00 0.00
bot 0.00 0.00 0.00 0.00 0.00

One way to think about scoring for a question that resolves early is to imagine that the question remained open for day 4 and all forecasters were imputed the outcome (100% in this case) when it was known. In this thought experiment, the scores would be exactly as above.
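
For concreteness, here is forecaster A’s question 3 score reconstructed under these rules (a hypothetical sketch): withdrawn days and the unused day 4 contribute zeros, but they still count toward the average.

```python
import math

# Forecaster A on question 3: forecasts on days 1 & 2, imputed the median after withdrawing,
# and day 4 contributes 0 because the question closed early.
daily_scores = [
    math.log(0.30 / 0.30),  # day 1: matches the median, so 0
    math.log(0.30 / 0.20),  # day 2: ln(1.5), about 0.41
    0.0,                    # day 3: withdrawn, imputed the median
    0.0,                    # day 4: the question had already closed
]
print(round(sum(daily_scores) / 4, 3))  # 0.101
```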

Finally, the table below shows the question coverage for question 3. Note that forecaster A receives 50% question coverage because they withdrew halfway through the question. Forecaster C receives 0% question coverage since they didn’t participate at all. Day 4 is treated as if all forecasters withdrew and received 0 participation for that day.

Coverage for Question 3
Forecaster day 1 day 2 day 3 day 4 Question Coverage (c3)
A 1 1 0 0 50%
B 0 1 1 0 50%
C 0 0 0 0 0%
bot 0 0 1 0 25%

We are now ready to see how the tournament question scores and question coverage are combined on the tournament leaderboard.

Tournament Scoring

The table below shows forecaster A’s scoreboard. Their tournament score (-0.229) is the sum of their question scores. Their tournament coverage (83%) is the average of their question coverages. Finally, their tournament take is their tournament coverage multiplied by the exponential of their tournament score, or 0.83 × e^(-0.229) ≈ 0.663. (Note that we redefined tournament “take” to hopefully make scoring easier to understand. To calculate the % prize a forecaster receives, their take is divided by the sum of all forecasters’ takes.)

Forecaster A’s Scoreboard
Question Score Coverage
1 -0.33 100%
2 0.00 100%
3 0.101 50%
Tournament Score -0.229
Tournament Coverage 83%
Tournament Take 0.663

A forecaster’s tournament take determines the proportion of the prize pool that they take home. A high tournament take requires both high tournament coverage and a high tournament score, so it rewards both participation and accuracy. Prizes are distributed such that a forecaster who has twice the tournament take of another forecaster will receive twice the prize. The tournament leaderboard shows the tournament score, tournament coverage, and tournament take for all forecasters, as well as their prizes. On the Metaculus site, tournament leaderboards are sorted by tournament take in descending order. The final two columns on the leaderboard, “% prize” and “completion”, are intended as extra background information.

Leaderboard
Forecaster Score Coverage Take Prize % Prize Completion
A -0.23 83% 0.66 $108 11% 3/3
B 1.85 75% 4.78 $779 78% 3/3
C -0.89 67% 0.27 $45 4% 2/3
bot 0.00 42% 0.42 $68 7% 3/3
Total - - 6.14 $1,000 100% -

In this example, the prize pool is $1,000 and B wins $779, more than half. This makes sense because B was the most accurate forecaster in the tournament and participated for most of the time. Annoyingly, the bot received a prize of $68. In the next section, we’ll explore how this might be reduced.
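
As a final sketch, the leaderboard above can be reproduced from the per-question scores and coverages (again, an illustrative daily approximation rather than Metaculus’s production code):

```python
import math

prize_pool = 1000

# (question scores, question coverages) for each forecaster, taken from the tables above.
forecasters = {
    "A":   ([-0.330,  0.000,  0.101], [1.00, 1.00, 0.50]),
    "B":   ([ 0.566,  1.459, -0.173], [0.75, 1.00, 0.50]),
    "C":   ([-0.193, -0.693,  0.000], [1.00, 1.00, 0.00]),
    "bot": ([ 0.000,  0.000,  0.000], [0.50, 0.50, 0.25]),
}

# Tournament take = average coverage times the exponential of the summed question scores.
takes = {
    name: (sum(cov) / len(cov)) * math.exp(sum(scores))
    for name, (scores, cov) in forecasters.items()
}

total_take = sum(takes.values())
for name, take in takes.items():
    # Each forecaster's prize is their share of the total take, applied to the prize pool.
    print(f"{name}: take {take:.2f}, prize ${prize_pool * take / total_take:.0f}")
```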

Battling the Bot

First, it is worth noting that the original Metaculus tournament scoring system shows the community median at all times. Had that been the case in our example above, the bot would have received the same tournament score of zero, but its coverage would be around twice as big (since it could copy the median for the entire time), so it would win around twice as much.

For the Trade Signal Tournament, a median-copying bot would have finished 1st place and won around 15% of the prize pool, so this is not just a theoretical exercise. Simply put, the median forecast combines the wisdom of many individuals and is a difficult benchmark to outperform. A simple bot can take advantage of this powerful signal.

By hiding the median for half of the time in our example above, we have already reduced the bot’s prize. Is there more that we can do? There is: we can base question coverage entirely on participation during the hidden period, as mentioned earlier. Let’s explore how this affects the bot.

In the example above, if coverage is determined by participation during the hidden period, then a simple median-copying bot will get 0 coverage, since it will not forecast when the median is hidden. Because it is impossible to win any prize money with 0 coverage, the simple bot above will not win any money! Here is what our example leaderboard looks like with coverage determined solely by the hidden period.

Leaderboard
Forecaster Score Coverage Take Prize % Prize Completion
A -0.23 100% 0.80 $149 15% 3/3
B 1.85 67% 4.25 $799 80% 3/3
C -0.89 67% 0.27 $52 5% 2/3
bot 0.00 0% 0.00 $0 0% 3/3
Total - - 5.32 $1,000 100% -

Of course, a more sophisticated bot might adapt to our new scoring system by making a 50% forecast on binary questions when the median is hidden (or a Gaussian PDF centered in the range of a numeric question) in order to boost its coverage, and then copy the median once it is visible. While this bot would indeed win some prize money, we believe that good forecasters should be able to get a sufficient scoring lead during the hidden period to rise above the bot.

In fact, we’ve done some preliminary analysis using data from the VA Lightning Round and Trade Signal tournaments to estimate how a more sophisticated bot would have done. Our analysis shows that by hiding the median for 20% to 30% of each question’s duration, the bot would have finished below the 50th percentile in those tournaments and earned very little prize money.

We plan to adjust the parameters as we learn more about what empirically works the best, continually experimenting and continually improving. We have some more ideas in our development queue that we hope to introduce in the coming months, including forecasting teams.

Readers are encouraged to play around with our simple model to see the trade-offs of different parameter settings by downloading the spreadsheet here. Any number in blue can be edited by the user to see how the prizes change. Numbers in black or other colors are usually calculations and should not be edited.

We hope that this post helps the community understand how the new Metaculus tournament scoring system works. In the example above, we simplified by using only 1 forecast per day. In reality, forecasters can join a question or change their forecasts at any time, so the tournament question scores and question coverage are calculated by integrating over time (rather than summing over days). The concept is exactly the same, but the implementation is slightly more complex than our example. A real tournament will have more questions and more forecasters than our simple example, but it will use the same aggregation explained above. Also, in practice questions will have different hidden periods and overall durations.

We look forward to hearing your feedback and questions in the discussion below!
