We have some big news: we’re massively overhauling our tournament scoring system. Why this change? We’ll get into all the details below, but at a high level it’s because we’re striving to create a fairer system that better rewards forecaster skill. Getting the scoring system right is crucial for forecaster motivation, and we want to be the absolute best platform for forecasters. We believe this will ultimately lead to even more accurate forecasts with terrific commentary, making Metaculus the best place to learn what is going on in the world while helping the world make better decisions.
This post explains how the new Metaculus tournament scoring rule works via a simple example. Our new system builds on the incentive-compatible Kelly betting rule that was created in April 2021 and was used to calculate the leaderboard in the recently completed Trade Signal Tournament. Our new system has the same basic structure with some additional features to better reward the best forecasters.
Note that the details provided below are also provided via this video explainer.
The old tournament scoring system had a weakness: a simple bot that copies the community prediction on every question ends up doing quite well and winning real money, while providing no value. In the Trade Signal Tournament, such a bot would have finished in first place and won around 15% of the prize pool. More alarmingly, if someone made such a bot (which we believe current users can easily do) and shared that bot with 10 friends, those 10 bots would almost certainly consume the majority of the prize pool—again, while providing no value.
Our updated tournament scoring system aims to close that loophole. The community median forecast is hidden for a period of time at the start of a question, so forecasters can’t piggy-back on the work of others. A simple bot that forecasts an ignorance prior when the community median is hidden (like 50% on a binary question) will score quite poorly, allowing the best forecasters to get a sizable scoring lead with their skill during the hidden period. The length of time that the median is hidden, and the weight given to forecasts during the hidden period, may vary from question to question and from tournament to tournament. We plan to experiment to see what works best and to tailor the tournament parameters to match the subject matter of a given project.
If Metaculus’s only goal were to reward forecasting accuracy, then we could hide the community median at all times. But Metaculus also aims to help our partners and the world make better decisions based on the most accurate forecasts. We believe that the new tournament scoring system balances these competing interests, rewarding individual skill while also providing a public service.
Our scoring system is a work in progress. As we run more tournaments, we will continue to adjust the tournament scoring framework and parameters as we learn what works best. We’re always eager for feedback, small suggestions, and bold new ideas. We hope to hear from both forecasters and forecast consumers.
A Simple Scenario
Let’s explore a small concrete example to illustrate the ideas behind our scoring approach. Consider a simple tournament with the following assumptions:
- The prize pool is $1,000.
- The tournament has 3 questions.
- Each question is scheduled to be open for 4 days. For the first 2 days the community median is hidden and for the final 2 days it is publicly visible.
- There are 3 forecasters: A, B, and C.
- There is a bot that mechanically forecasts the community median whenever it is visible and does not forecast otherwise.
This tournament is simple enough to illustrate every step in a spreadsheet. Readers who are interested can follow along here. We hope that by the end of this piece readers will be able to interpret the recently launched leaderboards and understand the new tournament scoring system.
Question 1: Scoring a Binary Question
First, let’s consider a binary (Yes/No) question: Will the Tokyo Olympics be held in 2021? (We now know the resolution is Yes.)
The table below shows the daily forecasts made by each forecaster. On day 1, forecaster A thought there was a 10% chance the Tokyo Olympics would take place in 2021. This forecast was carried forward until day 3, when forecaster A updated their forecast to 55%, where it remained until the question closed on day 4.
Forecaster B didn’t make a forecast until day 2. They forecast 90% and did not update during the life of the question. Forecaster C steadily increased their forecast over the 4 days from 20% to 35%.
The table also includes the forecasts made by the bot described above. The simple bot does not forecast on day 1 or 2, since the community median is not publicly visible. On days 3 & 4, the bot copies the median forecast of 55%.
Given the daily forecasts and question resolution, the question score (s1) for each forecaster can be calculated. The question score is a measure of the accuracy of a forecast, or how close the forecast was to the actual outcome. It is quite important to use a mathematically proper scoring rule, so that forecasters are incentivized to forecast their true beliefs. There are a handful of proper scoring rules, but Metaculus uses a log score.
On a binary question that resolves Yes, a forecast’s accuracy score is the natural log of the ratio of the forecast to the community median. If I forecast 80% on an event that did happen while the community median forecast was 40%, then my log score is log(80%/40%) = log(2) = 0.693. A positive score indicates that I was closer to the correct outcome (Yes = 100% in this case) than the community median.
The table below shows the log score (natural log of the forecast divided by the median) for each forecaster and day. For forecaster A’s day 1 forecast (10%) and the day 1 median forecast (15%), the log score is log(0.1/0.15) = -0.41. In this case the score is negative, since 10% is further from the outcome (100%) than the median forecast of 15%. When a forecast matches the median, the daily score is log(1) = 0.
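The daily log score described above can be sketched in a few lines of Python. The function name `binary_log_score` is illustrative, and the handling of a No resolution (scoring the probability assigned to No) is an assumption made by symmetry, since this post only works through a Yes resolution.

```python
import math

def binary_log_score(forecast, median, resolved_yes=True):
    """Natural-log score of a binary forecast relative to the community median."""
    if not resolved_yes:
        # Assumption: for a No resolution, score the probability assigned to No.
        forecast, median = 1.0 - forecast, 1.0 - median
    return math.log(forecast / median)

print(round(binary_log_score(0.80, 0.40), 3))  # 0.693
print(round(binary_log_score(0.10, 0.15), 2))  # -0.41
print(binary_log_score(0.55, 0.55))            # 0.0, forecast equals the median
```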
On days when a forecaster made no forecast, they receive a value of log(f/m) = 0, as if they had input the median forecast for that day. Accordingly, the bot always has log(f/m) = 0 for all days on all questions.
To calculate each forecaster’s score for question 1 (s1), we take a weighted average of the daily log(f/m) using the daily score weights. The daily score weights are chosen by the tournament organizer before a question is launched and are publicly available. In the example above each day is weighted equally at 25%. (Later, we will show how the parameter that controls the daily score weight, s_weight, can be tuned to put more or less weight on forecasts when the median is hidden, and we discuss why this might be done.)
The score (s1) measures forecast accuracy. In this case forecaster B’s 90% forecast was closest to the outcome, so they received the highest score of 0.566. All else equal, the higher a forecaster’s score the larger the prize they will win.
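As a rough check on the numbers above, the following Python sketch recomputes the question 1 scores from the daily forecasts. Forecaster C’s day 2 and day 3 values (25% and 30%) are an assumption, since the text only says that C rose steadily from 20% to 35%. The names used here are illustrative and not part of any Metaculus API.

```python
import math
from statistics import median

# Daily forecasts for question 1 (resolved Yes); None means no active forecast.
# C's day 2 and day 3 values are assumed, see the note above.
forecasts = {
    "A": [0.10, 0.10, 0.55, 0.55],
    "B": [None, 0.90, 0.90, 0.90],
    "C": [0.20, 0.25, 0.30, 0.35],
}
score_weights = [0.25, 0.25, 0.25, 0.25]

def question_scores(forecasts, weights):
    """Weighted average of daily log(f/m), imputing the median on missing days."""
    n_days = len(weights)
    # Community median per day, taken over the forecasters active that day.
    medians = [
        median(f[d] for f in forecasts.values() if f[d] is not None)
        for d in range(n_days)
    ]
    scores = {}
    for name, daily in forecasts.items():
        total = 0.0
        for d in range(n_days):
            f = daily[d] if daily[d] is not None else medians[d]
            total += weights[d] * math.log(f / medians[d])
        scores[name] = total
    return medians, scores

medians, scores = question_scores(forecasts, score_weights)
for name, s in scores.items():
    print(name, round(s, 3))
# B comes out at about 0.566, matching the score quoted above.
```

A forecaster who is imputed the median on every day, like the bot, would score exactly 0 under this calculation.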
Finally, we calculate a metric that is meant to reflect a forecaster’s participation in question 1. We call this the coverage (c1). The coverage is meant to reward forecaster participation in a question and is factored into the prize calculation.
In the table below, each forecaster received a 1 if they were actively forecasting on that day and a 0 otherwise. Forecaster A participated every day, so they have all 1s. Forecaster B did not forecast on day 1 so they have a 0 under day 1, but a 1 for the rest of the days. Note that the bot did not forecast on day 1 or 2 when the median was hidden, so it received 0 on those days.
To calculate the coverage for question 1 (c1) we take a weighted average of the daily participation with the coverage daily weights, which in this case are equally weighted at 25%. Forecasters A & C each had a coverage of 100% since they participated every day. All else equal, the higher a forecaster’s coverage the larger the prize they will win.
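The coverage calculation for question 1 can be sketched as a weighted sum of daily participation flags. The participation values follow the description above; the function and variable names are illustrative.

```python
# 1 = actively forecasting that day, 0 = not forecasting (question 1).
participation = {
    "A": [1, 1, 1, 1],
    "B": [0, 1, 1, 1],
    "C": [1, 1, 1, 1],
    "bot": [0, 0, 1, 1],  # the bot only forecasts once the median is visible
}
coverage_weights = [0.25, 0.25, 0.25, 0.25]

def question_coverage(active, weights):
    """Weighted average of daily participation."""
    return sum(w * a for w, a in zip(weights, active))

coverage = {
    name: question_coverage(active, coverage_weights)
    for name, active in participation.items()
}
print(coverage)  # A and C: 1.0, B: 0.75, bot: 0.5
```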
Like the score weight, the daily coverage weight is determined by the parameter c_weight that is set by the tournament organizer and shared with forecasters. The section below titled “Battling the Bot” explains how it can be tuned to ensure the simple bot does not win any prize money.
Now that we’ve calculated the log score (s1) and coverage (c1) for question 1, a binary question, let’s consider how to calculate these for question 2, a continuous numeric question.
Question 2: Scoring a Continuous Question
To start, let’s review how log scoring works with a concrete example of a continuous question: What will GDP for the next quarter be? In this case, forecasters submit probability distribution functions (PDFs) rather than probabilities. The plot below shows the PDFs for forecasters A, B, and C on day 1. Forecaster A’s PDF is centered on 1.4. Forecaster B’s PDF is centered on 2.5 and is much wider than A’s PDF. Forecaster C’s PDF is centered on 2.4 and is much narrower than the other forecasters’ curves.
For a continuous question, the height (amplitude) of a forecaster’s PDF at the resolution value determines their forecast (f), and therefore their log score. Assuming the resolution value is 2, the forecasts (f) for forecasters A, B, and C are the amplitudes of their PDFs at that point: 0.18, 0.36, and 0.09 respectively. These values are shown in the table below for day 1.
The table shows that forecasters A & C did not change their forecasts, while forecaster B updated every day. The bot copied the median PDF, so it received the median forecast every day. Given the forecasts (f) for each day and the median (m), the log scores for question 2 (s2) are calculated exactly the same way as for question 1. The table below shows this calculation.
Interestingly, forecaster C’s 50th percentile forecast (2.4) was closer to the true outcome (2.0) than the 50th percentile for either forecaster A (1.4) or forecaster B (2.5), yet forecaster C got a worse score. How can that be the case? The reason is that C’s PDF was so narrow that its amplitude at the resolution value of 2.0 was the lowest of the three. Said another way, C assigned a lower probability to the outcome than either A or B, even though the center of C’s PDF was closer to the outcome.
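To make the effect of a narrow PDF concrete, here is a small Python sketch that models each forecast as a Gaussian. The centers come from the example above, but the Gaussian shape and the standard deviations (0.3, 1.0, and 0.155) are assumptions chosen only so that the amplitudes land near the values quoted above; real Metaculus PDFs need not be Gaussian.

```python
from statistics import NormalDist

resolution = 2.0

# Centers are from the example; the widths (sigma) are assumed for illustration.
pdfs = {
    "A": NormalDist(mu=1.4, sigma=0.3),
    "B": NormalDist(mu=2.5, sigma=1.0),
    "C": NormalDist(mu=2.4, sigma=0.155),
}

# The forecast f is the PDF amplitude at the resolution value.
density = {name: dist.pdf(resolution) for name, dist in pdfs.items()}
# Distance from the center of each PDF to the outcome.
distance = {name: abs(dist.mean - resolution) for name, dist in pdfs.items()}

for name in pdfs:
    print(name, "distance:", round(distance[name], 1), "f:", round(density[name], 2))
```

With these assumed widths the amplitudes come out at roughly 0.18, 0.35, and 0.09, so C has the lowest f even though C’s center is closest to the outcome.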
Coverage is calculated exactly the same for a numeric question. For completeness, below is the table for question 2.
Question 3: Scoring a Question that Closes Early
Our final example considers a binary question that closes early. On Tuesday, the question launches asking: Will the S&P 500 close above 4500 by the end of the week? On Thursday (day 3), the S&P closes above 4500. The question closes early and no forecasts are collected on Friday (day 4).
The following table shows that Forecaster A forecast 30% on days 1 & 2, then withdrew from the question on day 3. Like a forecaster who joins a question late, a forecaster who has withdrawn from a question is imputed the median forecast, but receives no credit for coverage on those days. (We plan to add the ability to withdraw in the future.) Forecaster B starts forecasting 10% on day 2, while forecaster C makes no forecasts. Consistent with joining a question late or withdrawing, forecaster C is imputed the median forecast for all days and receives 0 coverage and 0 log score for this question.
To calculate the log score (s3), we take the weighted average of log(f/m) for each day including day 4 when all forecasters receive a 0 value. (Averaging over only the first 3 days would make the scores improper mathematically.)
One way to think about scoring for a question that resolves early is to imagine that the question remained open for day 4 and all forecasters were imputed the outcome (100% in this case) since it was known on day 4. In this thought experiment, the scores would be exactly as above.
Finally, the table below shows the coverage for question 3. Note that forecaster A receives 50% coverage because they withdrew halfway through the question. Forecaster C receives 0% coverage since they didn’t participate at all. Day 4 is treated as the known outcome being imputed to all forecasters, so they all receive 0 participation for that day.
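The score and coverage rules for question 3 can be sketched together in Python. Forecaster B’s day 3 value is assumed to be carried forward at 10%; because B is the only active forecaster that day, the median equals B’s forecast and the assumption does not change the result. Names are illustrative.

```python
import math
from statistics import median

# Question 3 (resolved Yes on day 3); None means not actively forecasting.
forecasts = {
    "A": [0.30, 0.30, None, None],  # withdraws on day 3
    "B": [None, 0.10, 0.10, None],  # day 3 value assumed carried forward
    "C": [None, None, None, None],  # never forecasts
}
weights = [0.25, 0.25, 0.25, 0.25]  # day 4 keeps its weight despite the early close

def daily_median(day):
    active = [f[day] for f in forecasts.values() if f[day] is not None]
    return median(active) if active else None

medians = [daily_median(d) for d in range(4)]

def score_and_coverage(daily):
    score, coverage = 0.0, 0.0
    for f, m, w in zip(daily, medians, weights):
        # Inactive days (and day 4, after the early close) add 0 to both.
        if f is not None and m is not None:
            score += w * math.log(f / m)
            coverage += w
    return score, coverage

results = {name: score_and_coverage(daily) for name, daily in forecasts.items()}
for name, (s, c) in results.items():
    print(name, "s3:", round(s, 3), "c3:", c)
```

Under these inputs A and B each end up with 50% coverage and C with 0%, and C’s score is exactly 0.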
We are now ready to see how the question scores and coverage are combined in the tournament leaderboard and how to calculate prizes.
The table below shows the log scores (s1, s2, s3) and coverages (c1, c2, c3) for each forecaster on each question. A forecaster’s log score (s) is the sum of their scores on each question. A forecaster’s coverage (c) is the average of their coverage over all questions. If a forecaster doesn’t participate on a question, they receive 0% coverage like forecaster C on question 3. It is also worth noting that the bot always gets a score of 0, since it never deviates from the median.
The leaderboard below shows how S is calculated for each forecaster where S = c * exp(s). A forecaster’s S is a combination of their forecast accuracy (s) and their participation (c). By taking the exponential of s, we are giving the most accurate forecasters more of the pie. S is what determines the relative prizes among forecasters. Because forecaster B was both more accurate than any other forecaster and had a high coverage, forecaster B will receive the highest proportion of the prize pool.
The leaderboard also shows each forecaster’s Take, or the percentage of the prize pool they win. Take is calculated as a forecaster’s S divided by the sum of all forecasters’ S values. Because S = c * exp(s), this is a coverage-weighted version of what is called the softmax of the log scores (s) in a machine learning context. In the table we see that B has an S value of 4.78 and the sum of all forecasters’ S values is 6.14, so B’s Take is 4.78 / 6.14 ≈ 78%.
Finally, the last column shows how big each forecaster’s prize is for the tournament. In this case, the prize pool is $1,000 and B wins $779, more than half. This makes sense because B was the most accurate forecaster in the tournament and participated for most of the time. Annoyingly, the bot received a prize of $68. In the next section, we’ll explore how this might be reduced.
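The leaderboard arithmetic can be sketched as follows. The function name `leaderboard` and the two-player inputs are hypothetical and chosen for easy mental arithmetic; they are not the values from the example tournament.

```python
import math

def leaderboard(scores, coverages, prize_pool):
    """Return {name: (S, take, prize)} using S = c * exp(s)."""
    big_s = {name: coverages[name] * math.exp(scores[name]) for name in scores}
    total = sum(big_s.values())
    return {
        name: (big_s[name], big_s[name] / total, prize_pool * big_s[name] / total)
        for name in big_s
    }

# Hypothetical inputs: forecaster X beats the median, the bot matches it.
scores = {"X": math.log(2), "bot": 0.0}
coverages = {"X": 1.0, "bot": 0.5}

board = leaderboard(scores, coverages, prize_pool=1000)
for name, (s_value, take, prize) in board.items():
    print(f"{name}: S={s_value:.2f} take={take:.0%} prize=${prize:.0f}")
# X: S=2.00 take=80% prize=$800
# bot: S=0.50 take=20% prize=$200

# The Take quoted in the text for forecaster B:
print(round(4.78 / 6.14, 2))  # 0.78
```

Note that a forecaster with a score of 0 still earns a prize proportional to their coverage, which is why the bot is not shut out here.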
Battling the Bot
First, it is worth noting that the original Metaculus tournament scoring system showed the community median at all times. Had that been the case in our example above, the bot would have received the same score (s) = 0, but its coverage would be around twice as big so it would win around twice as much.
For the recently completed Trade Signal Tournament, where the community median was visible at all times, the bot would have finished in first place and won around 15% of the prize pool, so this is not just a theoretical exercise. Simply put, the median forecast combines the wisdom of many individuals and is a difficult benchmark to outperform, which is exactly what Metaculus is trying to achieve! A simple bot can take advantage of this powerful signal.
By hiding the median for half of the time in our example above, we have already reduced the bot’s prize. Is there more that we can do? There is, and that is the motivation behind the introduction of the coverage weight (c_weight) and score weight (s_weight). Let’s explore how tuning those parameters affects the bot.
The table below shows two parameters, s_weight (score weight) and c_weight (coverage weight), that are chosen by the tournament organizer. The s_weight parameter determines the daily weights used when we calculate scores for each question. In the example above we set s_weight = 50%. This means that 50% of a forecaster’s score is determined by their forecasts when the median is hidden. If we set s_weight = 100%, then only the forecasts made while the median is hidden would factor into a forecaster’s score. Similarly, c_weight determines how much of a forecaster’s coverage is determined by the period when the median is hidden.
In the example above all days were equally weighted for the question score and question coverage. However, if we set c_weight = 100% then a forecaster’s coverage would be entirely determined by the period when the median is hidden. In this case, the simple bot above will get 0 coverage since it never forecasts when the median is hidden. And, it is impossible to win any prize money with 0 coverage. So, setting c_weight = 100% ensures that the simple bot above would not win any money! Here is what our example leaderboard looks like when we set c_weight = 100%.
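The effect of c_weight on the simple bot can be sketched in Python. The mapping from a weight parameter to daily weights used here (the hidden-period share split evenly across hidden days, the remainder split evenly across visible days) is an assumption; it is consistent with the example, where 50% with two hidden and two visible days gives 25% per day.

```python
def daily_weights(hidden_weight, n_hidden, n_visible):
    """Assumed mapping: split hidden_weight evenly over hidden days,
    and the remaining weight evenly over visible days."""
    return (
        [hidden_weight / n_hidden] * n_hidden
        + [(1.0 - hidden_weight) / n_visible] * n_visible
    )

def coverage(active, weights):
    return sum(w * a for w, a in zip(weights, active))

# The simple bot is inactive on the 2 hidden days and active on the 2 visible days.
bot_active = [0, 0, 1, 1]

print(daily_weights(0.5, 2, 2))                        # [0.25, 0.25, 0.25, 0.25]
print(coverage(bot_active, daily_weights(0.5, 2, 2)))  # 0.5
print(coverage(bot_active, daily_weights(1.0, 2, 2)))  # 0.0
```

With zero coverage the bot’s S = c * exp(s) is zero, so its share of the prize pool is zero as well.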
Of course, a more sophisticated bot might adapt to our new scoring system by making a 50% forecast on binary questions when the median is hidden (or a Gaussian PDF centered in the range of a numeric question) in order to boost its coverage, and then copy the median once it is visible. While this bot would indeed win some prize money, we believe that good forecasters should be able to get a sufficient scoring lead during the hidden period to rise above the bot.
In fact, we’ve done some preliminary analysis using data from the recently completed VA Lightning Round and Trade Signal tournaments to estimate how a more sophisticated bot would have done. Our analysis shows that by hiding the median for 20% to 30% of each question’s duration and setting c_weight = 100%, the bot would have finished below the 50th percentile in those tournaments and earned very little prize money. (In this analysis, we kept the daily score weight equal for all days.)
We plan to adjust the parameters as we learn more about what empirically works the best, continually experimenting and continually improving. We have some more ideas in our development queue that we hope to introduce in the coming months, including forecasting teams and assigning different weights to different questions. These ideas will be explained in an upcoming Part 2 of this post.
Readers are encouraged to play around with our simple model to see the trade-offs of different parameter settings by downloading the spreadsheet here. Any number in blue can be edited by the user to see how the prizes change. Numbers in black or other colors are usually calculations and should not be edited.
We hope that this post helps the community understand how the new Metaculus tournament scoring system works. In the example above, we simplified by using only 1 forecast per day. In reality, forecasters can join a question or change their forecasts at any time, so the log scores and coverage are calculated by integrating over time (rather than summing over days). The concept is exactly the same, but the implementation is slightly more complex than our example. A real tournament will have more questions and more forecasters than our simple example, but it will use the same aggregation explained above. Also, in practice questions will have different hidden periods and overall durations.
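As a rough illustration of the move from daily sums to integration over time, the sketch below treats forecasts as piecewise constant and weights each segment by its share of the question’s duration. The uniform weighting over time, the segment representation, and the function name are all assumptions for illustration and not a description of the production implementation.

```python
import math

def time_averaged_log_score(segments, open_time, close_time):
    """Integrate log(f/m) over time for piecewise-constant forecasts.

    segments: list of (start, end, forecast, median) tuples, with times in days.
    Time not covered by any segment contributes 0 (median imputed).
    Assumes uniform weight over the life of the question.
    """
    duration = close_time - open_time
    return sum(
        (end - start) / duration * math.log(f / m)
        for start, end, f, m in segments
    )

# Hypothetical forecaster: joins 1 day in at 80% against a 40% median,
# then matches the median from day 2.5 until the close on day 4.
segments = [
    (1.0, 2.5, 0.80, 0.40),
    (2.5, 4.0, 0.40, 0.40),
]
score = time_averaged_log_score(segments, open_time=0.0, close_time=4.0)
print(round(score, 3))  # 0.26, i.e. (1.5 / 4) * log(2)
```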
We look forward to hearing your feedback and questions in the discussion below!