The recent scoring and medals update is the result of an extensive process of reflection, analysis and testing. We considered many trade-offs, and made many decisions, big and small.
Ah, that makes sense. Not sure what I'd prefer, but it seems like a problem with no easy solution. — Metaculus Beta Tester
This document is intended for users who want to dig deeper into the trade-offs we considered. It is not required reading for those who just want to understand how scoring and medals work.
We will briefly go over some of the alternatives we considered, and give a quick summary of our reasoning for each decision. At the end we consider some ideas for how to iterate and build on the new foundations, including how we might attempt to score forecasts in a way that does not require frequent updates.
One-off vs. time-averaged scores
Scoring one-off predictions works well for time-limited forecasting events where all forecasters predict simultaneously. Metaculus uses time-averaged scores because they incentivize making updates, which serves our mission to provide accurate real-time forecasts to consumers.
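To illustrate why time-averaging rewards updates, here is a minimal sketch for a binary question. The function name, the piecewise-constant forecast representation, and the use of a plain natural-log score are assumptions made for illustration, not a description of the production scoring code.

```python
import math

def time_averaged_log_score(forecasts, open_time, close_time, outcome):
    """Average the log score of the standing forecast over the question lifetime.

    forecasts: list of (time, probability_of_yes) tuples, sorted by time.
    Any time before the first forecast contributes 0 (an assumption).
    """
    total = 0.0
    for i, (start, p_yes) in enumerate(forecasts):
        # A forecast stands until the next update, or until the question closes.
        end = forecasts[i + 1][0] if i + 1 < len(forecasts) else close_time
        p_outcome = p_yes if outcome else 1.0 - p_yes
        total += (end - start) * math.log(p_outcome)
    return total / (close_time - open_time)

# Same initial forecast; one forecaster updates halfway through, the other does not.
stale = time_averaged_log_score([(0, 0.5)], 0, 10, outcome=True)
updated = time_averaged_log_score([(0, 0.5), (5, 0.9)], 0, 10, outcome=True)
print(stale, updated)
```

In this sketch the forecaster who moved toward the correct answer halfway through ends up with the better (less negative) average.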
Brier vs. log scores
Metaculus has historically used log scores rather than Brier scores, although both are proper. Log scores are more sensitive to tail events and extreme outcomes, where probabilities are less than 10% or greater than 90%. One of the key goals of Metaculus is to forecast these kinds of consequential tail events, and we believe log scores are better for us. Also, log scores are pretty cool.
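The difference in tail sensitivity can be seen with a small comparison. This is an illustrative sketch using the textbook definitions of the two scores for a binary question; the function names are hypothetical.

```python
import math

def log_score(p_yes, outcome):
    """Natural log of the probability assigned to what happened (higher is better)."""
    return math.log(p_yes if outcome else 1.0 - p_yes)

def brier_score(p_yes, outcome):
    """Squared error against the 0/1 outcome (lower is better)."""
    return (p_yes - (1.0 if outcome else 0.0)) ** 2

# The event happens; compare forecasters who gave it 10%, 1% and 0.1%.
for p in (0.1, 0.01, 0.001):
    print(p, round(log_score(p, True), 3), round(brier_score(p, True), 3))
```

The Brier score barely distinguishes a 10% forecast from a 0.1% forecast when the event occurs, while the log score keeps penalizing the more extreme miss.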
The Baseline score compares a forecast to an impartial baseline: a flat prior. Because the flat prior is pretty easy to beat, many of your Baseline scores will be solidly positive, even for difficult questions. Though they will still be negative if you get things wrong!
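As a rough sketch of the idea for a binary question: compare the log of your probability on the actual outcome with the log of the flat prior (50%). The normalization used here (dividing by ln 2 so that a perfect forecast scores +100) and the function name are assumptions made for illustration.

```python
import math

def baseline_score_binary(p_outcome):
    """Log score relative to a flat 50% prior, scaled so certainty earns +100.

    p_outcome: probability the forecaster assigned to what actually happened.
    The scaling by ln(2) is an illustrative assumption.
    """
    flat_prior = 0.5
    return 100 * (math.log(p_outcome) - math.log(flat_prior)) / math.log(2)

print(baseline_score_binary(0.5))   # matching the flat prior
print(baseline_score_binary(0.8))   # leaning the right way
print(baseline_score_binary(0.25))  # leaning the wrong way
```

Under these assumptions, any forecast that puts more than 50% on the eventual outcome scores above zero, and any forecast below 50% scores negative.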
We like having a score that users can reliably increase with time, effort and skill. This score is more geared towards forecasters who like to predict all questions on Metaculus, and new forecasters who are not yet comfortable with scores. It should replace the Metaculus Points in spirit, while being dramatically simpler.
The Peer score compares a forecaster to all other forecasters on the same question. Because it compensates for question difficulty, it is a natural way to compare time-constrained forecasters who may not answer the same questions.
The Peer score conceptually replaces the Relative score. The Relative score compares a prediction to the Community prediction.
Because the Community Prediction is very hard to beat, most forecasters had negative Relative scores. Many found this demoralizing, and it disincentivized participation in Tournaments. By contrast, the Peer score guarantees an average of 0, and many users have positive average Peer scores. We are investigating the possibility of using the Peer score in future Tournaments; stay tuned for news on that front!
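The zero-average property can be checked with a small sketch. It assumes a binary question, a single forecast per forecaster, and a Peer score defined as 100 times the average difference between your log score and each other forecaster's log score; the function name and this exact form are illustrative assumptions.

```python
import math

def peer_scores(probs_on_outcome):
    """Peer score of each forecaster against all the others on one question.

    probs_on_outcome: probability each forecaster assigned to what happened.
    Requires at least two forecasters.
    """
    n = len(probs_on_outcome)
    logs = [math.log(p) for p in probs_on_outcome]
    total = sum(logs)
    # For forecaster i: mean over j != i of (log_i - log_j).
    return [100 * (log_i - (total - log_i) / (n - 1)) for log_i in logs]

scores = peer_scores([0.9, 0.7, 0.5, 0.2])
print([round(s, 1) for s in scores])
print(round(sum(scores), 9))
```

Every pairwise difference appears once with each sign, so the scores on a question always sum to zero, no matter how easy or hard the question was.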
We also like that the Peer score is independent of the Community Prediction. In the past we’ve considered upgrading the Community Prediction aggregation algorithm, but we were unhappy with the negative impact that would have had on Relative scores. We will now have more freedom to change our aggregation methods in the future.
We would like to thank Nuño Sempere for first suggesting the idea behind Peer scores!
The old Metaculus Points, and the new Baseline Accuracy medals, reward predicting many questions. We also wanted to reward forecasters who like to focus more intensely on fewer questions. This is where the Peer Accuracy medals come in.
The first naive thing we tried was ranking users by their average Peer score. It was immediately apparent that the “winners” had only predicted a few questions (fewer than 5) but got extremely lucky. The results were dominated by luck, not skill.
The second thing we tried was setting a threshold of 30 questions. This time all the “winners” had predicted barely more than the threshold (between 30 and 40). As we adjusted the threshold up or down, the top ranks were dominated by those who predicted 1-20 questions more than the threshold. We didn’t like that the results were so sensitive to the choice of threshold. We also didn’t like that someone who missed the threshold by 1 question would be excluded entirely.
We settled on the final system: if you have predicted fewer than 40 questions, we impute you enough virtual questions with a score of 0 to bring your total to 40 questions. This solves all the problems above: there is no hard threshold, lucky forecasters do not rank well, and you only need to predict about 40 questions to rank well.
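A minimal sketch of this rule follows. Padding with zero-score virtual questions is the same as dividing the sum of scores by at least 40; the function name and the example scores are hypothetical.

```python
def peer_accuracy(peer_scores, min_questions=40):
    """Average Peer score after padding with zero-score virtual questions."""
    # Adding zeros does not change the sum, only the divisor.
    return sum(peer_scores) / max(len(peer_scores), min_questions)

lucky = peer_accuracy([80, 90, 70])    # 3 very good questions
steady = peer_accuracy([20] * 40)      # 40 solid questions
just_short = peer_accuracy([20] * 39)  # one question short of 40
print(lucky, steady, just_short)
```

The lucky forecaster's raw average of 80 shrinks to 6, while the forecaster who is one question short loses only half a point rather than being excluded.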
The perfect metric
We don’t think a single perfect metric exists. Any metric will only capture some facets of what we call “forecasting skill”. However, we think Baseline and Peer Accuracy each cover an important part of it. We also have a few more ideas for extending this new system so that we can measure even more facets of forecasting skill.
Combined vs. separate scores
Metaculus Points were (roughly speaking) a combination of a Baseline and a Peer score, with some extra weights depending on the number of forecasters. This made them much harder to understand, use, and interpret.
By separating the Baseline and Peer scores, and making them simpler, we now have metrics that anyone can compare and weigh, and arrive at their own conclusions.
Scaling to 100
The usual formulations of scores (like Relative scores) yield numbers around the [0, 1] range. Our scoring survey showed that most people prefer scores that go to 100. So we decided to multiply scores by 100! As an added benefit, it makes decimal points superfluous, which looks nicer.
Global vs. local scores
We considered at length the benefits and drawbacks of global and local scores for continuous questions.
Global scores use your entire probability distribution to determine your score. A nice benefit is that they are sensitive to location: a prediction near the actual outcome receives a better score than a prediction further away. Two main downsides are that they are a bit more complex, and many people find them unintuitive for bimodal distributions (as we observed in our Scoring survey).
Local scores only use the height of your probability distribution at the resolution value to determine your score. This is what Metaculus has been using historically. They don’t reward predictions that are closer to the outcome (which many people found counterintuitive in our Scoring survey), but they are a bit simpler.
Both global and local scores are proper, and both have their strengths and weaknesses. We hoped that the user survey would show a strong preference, but alas results were fairly split. In the end we decided to keep using local scores.
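The difference can be made concrete on a discretized distribution. The sketch below uses the log of the probability in the outcome bin as the local score and the ranked probability score (a discrete analogue of CRPS) as an example of a global score. The choice of global score, the bin layout, and the function names are all assumptions for illustration; the text above does not say which global score was evaluated.

```python
import math
from itertools import accumulate

def local_log_score(pmf, outcome_bin):
    """Only the probability placed on the outcome bin matters (higher is better)."""
    return math.log(pmf[outcome_bin])

def ranked_probability_score(pmf, outcome_bin):
    """Compares the whole forecast CDF with the outcome's step CDF (lower is better)."""
    cdf = list(accumulate(pmf))
    return sum(
        (c - (1.0 if k >= outcome_bin else 0.0)) ** 2 for k, c in enumerate(cdf)
    )

outcome = 4
# Both forecasts put 20% on the outcome bin; they differ in where the rest goes.
near = [0, 0, 0, 0.4, 0.2, 0.4, 0, 0, 0, 0]
far = [0, 0, 0, 0, 0.2, 0, 0, 0, 0, 0.8]

print(local_log_score(near, outcome), local_log_score(far, outcome))
print(ranked_probability_score(near, outcome), ranked_probability_score(far, outcome))
```

The local score treats the two forecasts identically, while the global score favors the one whose remaining probability sits close to the outcome.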
Medals and leaderboards
Goal of medals
Our primary goal with medals is to reward excellence. We want forecasters to find them meaningful and motivating. We also want non-forecasters to understand and see the value of medals and of the forecasters who earned them. We want forecasters to get credit for their accomplishments, and maybe even, who knows, include their medals in their resume.
Fixed numbers vs. percentiles
The number of forecasters on Metaculus has grown over time and varies a lot between Tournaments and Time Periods. This prevents us from giving the same number of medals on each leaderboard.
So we decided to base medals on percentile rank. This scales nicely with the number of forecasters. When looking at past tournaments and leaderboards, it seemed that some of the top 10% were getting lucky or not trying hard enough, so we set our threshold at 5%. This makes receiving a medal achievable, and also something to be proud of.
What is more valuable: a single comment with 100 upvotes or 50 comments with 2 upvotes each? Or 10 comments with 10 upvotes each? Counting total upvotes would incentivize writing many low quality comments. Rewarding average upvotes would disincentivize comment writing. We want to reward a balance of quality and quantity, so we decided to use h-indexes.
Also h-indexes are cool. Less cool than log scores, obviously, but still cool.
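For the three cases in the question above, a standard integer h-index (the largest h such that h comments have at least h upvotes each) can be computed as follows. This is a sketch of the textbook definition with a hypothetical function name; the exact variant used on the leaderboards may differ.

```python
def h_index(upvotes):
    """Largest h such that at least h comments have h or more upvotes each."""
    ranked = sorted(upvotes, reverse=True)
    h = 0
    for rank, votes in enumerate(ranked, start=1):
        if votes >= rank:
            h = rank
        else:
            break
    return h

print(h_index([100]))      # a single comment with 100 upvotes
print(h_index([2] * 50))   # 50 comments with 2 upvotes each
print(h_index([10] * 10))  # 10 comments with 10 upvotes each
```

The balanced commenter comes out well ahead of both the single viral comment and the many low-engagement comments.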
Time periods and durations
With Time Periods and durations, we aimed to achieve these goals:
- Stay simple - they should be as simple and easy to understand as possible
- Provide a fair starting line - a new forecaster should not be at a disadvantage to veterans. If questions that started before the time period were included in it, it would be very hard for newcomers to compete.
- Include all questions - all forecasts should count in exactly 1 Time Period. None are ignored, and none are double-counted.
- Don’t reward optimism - many questions are of the form “Will event X happen by year Y?”. Notice that they can only resolve Yes early, and No at the end. At first only the Yes questions resolve: optimists have better scores and pessimists worse ones. Then the No questions come in, and everything equalises. If we counted questions when they resolve, optimists and pessimists would at first not be correctly punished for their bias. Instead, we have to count questions when they were scheduled to end.
- Measure long-term accuracy - there are more short- than long-term questions. If we mixed them in the same leaderboard, most of the long-term accuracy signal would be drowned out.
1 year vs. 6 months durations
We considered awarding medals on shorter time periods, including 3- and 6-month medals. Unfortunately the shorter durations had too few questions, and luck played far too strong a role in the rankings.
Time periods: resolution vs. closing dates
A common form of question is: Will X happen in 2022? Often it will open in early 2022 and close on Dec 31, 2022. It would feel natural to include it in the 2022 accuracy leaderboards.
Unfortunately, there is often a delay between a question closing and the resolution being known. For example, US GDP numbers are usually only available in March of the next year. We wanted those questions to still count for the year they ask about, so we added a resolution buffer of 100 days after the end of each Time Period. Some questions won’t make the cut, but we didn’t want to wait months or years before we could award medals (3 months is already a lot!).
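A sketch of how such a rule might be checked is shown below. The membership test (the question opens and is scheduled to close inside the period, and resolves within the buffer) is an assumption pieced together from the goals listed earlier; only the 100-day buffer itself comes from the text, and all names are hypothetical.

```python
from datetime import date, timedelta

RESOLUTION_BUFFER = timedelta(days=100)

def counts_for_period(open_date, close_date, resolve_date, period_start, period_end):
    """True if a question would count toward the given Time Period (illustrative rule)."""
    opens_in_period = open_date >= period_start
    closes_in_period = close_date <= period_end
    resolves_in_time = resolve_date <= period_end + RESOLUTION_BUFFER
    return opens_in_period and closes_in_period and resolves_in_time

start, end = date(2022, 1, 1), date(2022, 12, 31)

# A 2022 question whose data only becomes available the following March.
gdp = counts_for_period(date(2022, 1, 15), end, date(2023, 3, 30), start, end)
# A 2022 question that is only resolved in June 2023.
late = counts_for_period(date(2022, 1, 15), end, date(2023, 6, 1), start, end)
print(gdp, late)
```

With a period ending on Dec 31, 2022, the buffer in this sketch runs until Apr 10, 2023, so the March resolution makes the cut and the June one does not.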
Inclusive vs. exclusive durations
Each question only counts toward 1 Baseline Accuracy & 1 Peer Accuracy Duration. This keeps long-term questions together, and rewards performance at long-range forecasting.
The Questions Writing and Comments leaderboards work differently. We wanted to reward questions that receive a lot of engagement year after year, so we chose to count the number of unique forecasters on a question in each year. This means that 1 great question can contribute to many medals.
Some Ideas for the Future
A key goal of ours is to reward those forecasters who have the most accurate judgment on important real-world events.
A big challenge is that some forecasters would like to be able to make a single forecast on a key event and not need to update in order to get a good score. Most forecasting sites (including Metaculus) use time-averaging over a question lifetime to calculate question scores. In order to do well, a forecaster must frequently update their forecast in response to news or simply the passage of time. If a user forgets to update then they may receive a poor score, not because of their judgment but because they didn’t have time to update. This makes it harder to measure and reward superior judgment.
Markets are nice in that they don’t require updating - seemingly solving this problem. However, markets can fail to achieve our goal in a couple of other ways:
- When there is important breaking news, a trader who sees the news first will be able to make lucrative trades. This does not reflect their superior judgment, but their ability to track news faster than others.
- A trader can identify a stale market or a small inefficient forgotten market, and make good profits. Again, this does not reflect superior judgment, but the ability (and willingness) to find bad markets.
- A trader can anticipate how others will react, and trade profitably. This also certainly requires skill, but it does not achieve our goal of rewarding superior judgment about real-world events.
To date, we do not believe that either forecasting platforms or markets have adequately addressed the goal above. We have a few ideas for how to do better by building on the new Peer scores and Medals.
Idea # 1 - Only tally scores at the end of the hidden period
Many questions on Metaculus have a hidden period, during which the Community Prediction is not visible. Users must rely on their own judgment and research to forecast. The hidden period is usually quite short: by default 2 to 3 days for most questions, though it can last weeks for some Tournament questions.
We could have a leaderboard that only considers the last prediction made before the hidden period ends to calculate Peer scores. This largely achieves the goal above: it rewards judgment, and it does not require updates or constant tracking of the news. It also does not reward finding stale questions.
Note: This is similar in spirit to Scott Alexander’s annual forecast challenge where forecasters are scored based on their forecasts at one point in time, shortly after the contest opens.
Idea # 2 - Rank based on Peer scores at 10% of a question’s lifetime
Similar to Idea #1, users would be scored based on their standing forecast when a question is 10% of the way through its lifetime. This allows more time for forecasters to make their initial forecasts. Unlike the idea above, the Community Prediction may be visible. This idea also does not require updating, and it scores forecasts made long before the outcome is known, so we believe it would reward superior judgment.
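Selecting the forecast that would be scored under this idea could look like the sketch below. The function name, the numeric time representation, and treating a forecast made exactly at the cutoff as standing are assumptions for illustration.

```python
def standing_forecast_at(forecasts, open_time, close_time, fraction=0.10):
    """Return the forecast standing when `fraction` of the lifetime has elapsed.

    forecasts: list of (time, probability) tuples, sorted by time.
    Returns None if the user had not yet forecast at the cutoff.
    """
    cutoff = open_time + fraction * (close_time - open_time)
    standing = None
    for time, probability in forecasts:
        if time <= cutoff:
            standing = probability  # a later forecast replaces an earlier one
        else:
            break
    return standing

# Question open from t=0 to t=100, so the cutoff is t=10.
history = [(1, 0.6), (5, 0.7), (30, 0.95)]
print(standing_forecast_at(history, 0, 100))
print(standing_forecast_at([(30, 0.95)], 0, 100))
```

Only the forecasts made at t=1 and t=5 are eligible here, and the later of the two is the one that stands; the update at t=30 has no effect on this leaderboard.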
Because Peer Accuracy leaderboards use an average score, strategic users may be discouraged from participating in some questions if they are solely focused on Peer Accuracy. We have a couple other ideas that might address this, albeit at the expense of a bit more complexity:
Idea # 3 - Allow users to not include a question in their Peer Accuracy
Before they make their first prediction on a question, each user would choose whether they want this question to count in their Peer Accuracy average. This selection would be irreversible, but it would mean that a forecaster could strategically select the questions that they want to count for Peer Accuracy. This would eliminate the disincentive to participate on other questions that they don’t want included.
Idea # 4 - Weight peer scores by how long their predictions stood
Instead of using the naive average of Peer scores over all your questions, the Peer leaderboards could use a weighted average based on what proportion of the question lifetime you had predictions standing.
For example, say you’ve already predicted 40 questions with an average Peer score of +20. You predict one more starting at 50% of its lifetime, and over that second half your instantaneous Peer score is consistently +20 again, so that your time-averaged Peer score is +10. With the current Peer leaderboard, your unweighted average would be $(40 \times 20 + 10) / 41 \approx 19.76$. Under this idea, your new weighted average would be $(40 \times 20 + 10) / (40 + 0.5) = 20$.
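The arithmetic in this example can be checked with a short sketch. It assumes that the weighted average divides the sum of time-averaged Peer scores by the total coverage (the fraction of each question's lifetime with a standing prediction), and it ignores the 40-question padding; both function names are hypothetical.

```python
def unweighted_average(scores):
    """Current approach: every question counts equally."""
    return sum(scores) / len(scores)

def coverage_weighted_average(scores, coverages):
    """Assumed form of the idea: divide by total coverage instead of question count."""
    return sum(scores) / sum(coverages)

# 40 fully covered questions at +20, plus one question joined at 50% of its
# lifetime: instantaneous score +20 over half the lifetime, time-averaged +10.
scores = [20.0] * 40 + [10.0]
coverages = [1.0] * 40 + [0.5]

print(round(unweighted_average(scores), 2))
print(coverage_weighted_average(scores, coverages))
```

Under this assumption, joining a question late no longer drags the average down when the forecaster performs as well as usual while their prediction is standing.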
Simplicity was a key goal in our decisions regarding scoring & medals. We think we landed on a system that is a significant improvement. All of the ideas above add complexity to the scoring system, but we think they have the potential to improve how we measure forecasting accuracy in the future.
Metaculus’ new scoring and medal system provides a great foundation to expand into more advanced measures and future medal categories. We’re excited to hear what you think about the proposals above! Please let us know your thoughts in the comment section below.