
Cryptocurrency competition analysis

The results are in, and big congratulations go to our winners jzima and Barbarossa and our runners-up AndrewMcKnight and ElliotOlds. But how did everyone else do? In short, not so well.

Our winner received 748 points, but the median score amongst all cryptocurrency players was only 7 points and the average score was way down at −73 points. Only 25 players broke 100 points. Player overconfidence combined with our log scoring system really hurt here: just one very overconfident prediction could easily send someone's score tumbling down from the top of the ranks. That said, there was a top tier of eight players who scored at least 590 points (with a large gap separating them from the player in 9th place), and, as we'll see below, they used different confidence strategies to get there.

Community performance on binary predictions

Predictions on the binary questions in the cryptocurrency competition were especially poor. We have a long track record of resolving binary predictions, so we can readily compare player performance here with player performance on non-cryptocurrency questions.

Each of the above graphs shows the community prediction distributions averaged over a set of questions. Any question that resolved negatively has been inverted, so the “best” prediction in each case is 99%. A community of excellent predictors will have many more predictions on the right side of each graph than on the left. This is, in fact, what we see with the non-cryptocurrency questions. There is some notable overconfidence at the low end (a number of players were 99% sure of the wrong result), but overall the distribution is weighted toward the right. The cryptocurrency contest, on the other hand, shows an almost flat prediction distribution, indicating that the average player did no better than random guessing. Even worse, more players were 99% confident in the wrong answer than were 99% confident in the correct one.
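
For concreteness, here's a minimal sketch of that inversion step, using hypothetical values rather than the actual contest predictions: a prediction on a question that resolved negatively is simply flipped so it can be read as confidence in the actual outcome.

```python
# A minimal sketch of the inversion used in the plots above (hypothetical data,
# not real contest predictions).

def confidence_in_outcome(prediction, resolved_positively):
    """Express a prediction as confidence in whatever actually happened."""
    return prediction if resolved_positively else 1.0 - prediction

# A 10% prediction on a question that resolved negatively counts as 90%
# confidence in the correct outcome.
print(confidence_in_outcome(0.10, resolved_positively=False))  # 0.9
```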

Might the community have simply been unlucky? It's possible that the community prediction was more or less correct in divining the questions' “true” probabilities, but, probabilities being probabilities, merely ended up on the losing side of a sequence of coin flips. We can do a simple Monte Carlo simulation to see how likely this is.

We simulated the Brier score for both the Metaculus prediction and the community prediction, averaged across all questions, assuming that the predicted probabilities were correct and that the questions resolved independently. Each simulation randomly resolves each question positively or negatively according to those probabilities, and the average Brier score is added to the distribution shown in the above graph. There is a long tail of bad luck in these distributions, where even guessing the correct probabilities results in a high (bad) Brier score. However, it's not long enough. We can say with greater than 95% confidence that bad luck alone cannot account for the poor performance of the community prediction (Brier score = 0.278) or the Metaculus prediction (Brier score = 0.273). Either the question outcomes were (plausibly) very highly correlated, or we just got these questions wrong.
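
As a rough sketch of what that simulation looks like (the probabilities below are placeholders, not the actual community predictions):

```python
# A rough sketch of the Monte Carlo check described above. The predictions here
# are placeholders; the real analysis used the final community (or Metaculus)
# probability for each binary question.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical final probabilities for the 13 binary questions.
predictions = np.array([0.6, 0.3, 0.8, 0.45, 0.7, 0.55, 0.2, 0.9,
                        0.35, 0.65, 0.5, 0.75, 0.4])
observed_brier = 0.278  # the community prediction's actual average Brier score

n_sims = 100_000
simulated = np.empty(n_sims)
for i in range(n_sims):
    # Resolve each question independently according to its predicted
    # probability, then record the average Brier score for that simulated world.
    outcomes = rng.random(predictions.size) < predictions
    simulated[i] = np.mean((predictions - outcomes) ** 2)

# Fraction of simulated worlds at least as unlucky as what we actually observed.
print("P(average Brier >= observed | predictions correct):",
      np.mean(simulated >= observed_brier))
```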

Individual performance on binary predictions

There were 13 non-ambiguous binary questions, and no one correctly predicted all (or even a majority of) the outcomes with high confidence. Almost everyone lost some points on some questions. ElliotOlds earned the most points on the binary questions, with 288. Our overall winner, jzima, actually lost 7 points on the binary predictions, making up the deficit with a stellar performance on the non-binary questions. The following plot shows scores of the top ten point earners for these binary questions along with their final predictions. Again, questions that resolved negatively have been inverted such that 99% is always the best prediction.

There is a variety of strategies on display here. Some players, like ElliotOlds, conservatively made most of their predictions in the 25%–75% range with a few predictions at higher confidence. Others chose a much more aggressive strategy and ended up with nearly as many points. The most cautious player here received the fewest points (amongst the top ten), but ended up with a relatively good Brier score and could have earned many more points by predicting earlier.

A few of our users have noted that the competition structure incentivizes overconfidence. Strictly speaking, this is true. A player could gamble and predict 99% on a question which they think has only a 50% chance of resolving positively. If they lose the gamble, they'll effectively be out of the competition, but if they win they'll have an advantage over those who made the “correct” prediction, and be more likely to make it to the top of the rankings. Several players tried this strategy without great success. The log score is too punishing for overconfidence, and there were too few players and too many questions for a single player to get very lucky.
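
To see why the gamble doesn't pay on average, here's a back-of-the-envelope comparison using a plain log score as a stand-in for the actual Metaculus point formula (which also depends on the community prediction, so treat this as an illustration only):

```python
# Expected log score of an honest 50% prediction versus a 99% gamble on a
# question the player believes is a coin flip. Plain log scoring is used here
# as a stand-in for the actual Metaculus point formula.
import math

def log_score(p, resolved_positively):
    return math.log(p) if resolved_positively else math.log(1.0 - p)

true_p = 0.5  # the player's honest belief

for label, pred in [("honest 50%", 0.5), ("gamble 99%", 0.99)]:
    expected = (true_p * log_score(pred, True)
                + (1 - true_p) * log_score(pred, False))
    print(f"{label}: right {log_score(pred, True):+.2f}, "
          f"wrong {log_score(pred, False):+.2f}, expected {expected:+.2f}")

# honest 50%: right -0.69, wrong -0.69, expected -0.69
# gamble 99%: right -0.01, wrong -4.61, expected -2.31
```

The gamble gains a little when it pays off and loses a lot when it doesn't, and across a dozen questions that asymmetry adds up quickly.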

Non-binary predictions

The non-binary questions in the cryptocurrency contest are amongst the first non-binary questions that Metaculus has resolved. The good news is: players actually did quite well on these! The worst total point losses were about the same for both the binary and non-binary questions, but the best point gains were much better in the non-binary case. The winning strategy was the obvious (but difficult) one: make predictions that are accurate, not too broad, and not too precise. The following shows all of jzima's predictions, where the x-axes have been translated and rescaled such that each resolution is at zero and the community prediction distributions have a standard deviation of one.

jzima had consistently accurate predictions, and, unlike some other players, didn't make them overly narrow.

We can also look at the calibration of all users' predictions. For example, if users as a whole are perfectly calibrated, then the question resolutions ought to fall outside of the 50% prediction intervals about half the time. And, since the predictions are Gaussian, they ought to fall outside of the 2σ range about 5% of the time. Indeed, that is just what we see happen.

The above plot shows a histogram of where the resolutions landed within individual players' predictions. This distribution looks remarkably close to a normal distribution. The solid yellow line shows a standard normal distribution (zero mean and unit variance), whereas the dashed line shows the best fit to the data. What this means is that, on the whole, our players' predictions seem to be an unbiased estimate of a question's true probability distribution.
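
That check is easy to reproduce: standardize each resolution by the mean and standard deviation of the player's Gaussian prediction, then count how often it lands outside the relevant intervals. The numbers below are made-up placeholders, not contest data.

```python
# A minimal calibration check for Gaussian predictions. Each entry is
# (predicted mean, predicted standard deviation, actual resolution); the values
# are hypothetical placeholders rather than real contest data.
import numpy as np

predictions = [
    (10_000.0, 2_000.0, 11_200.0),
    (0.35, 0.10, 0.30),
    (150.0, 40.0, 95.0),
]

z = np.array([(resolution - mu) / sigma for mu, sigma, resolution in predictions])

# Perfectly calibrated Gaussian predictions should have roughly half of the
# |z| values beyond ~0.674 (the 50% interval) and about 5% beyond 2 sigma.
print("outside 50% interval:", np.mean(np.abs(z) > 0.674))
print("outside 2-sigma range:", np.mean(np.abs(z) > 2.0))
```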

Takeaways

Cryptocurrencies seem to be very difficult to predict, and the Metaculus community didn't perform quite as well as usual. These were hard questions; don't feel bad if you didn't make it to the top of the ranks!

The competition was a fun experiment and we're excited to do more in the future. If you have ideas for a competition that you'd like to see Metaculus run, or thoughts on how competitions can be improved, please let us know in the comments!