# Project Summary

• The Keep Virginia Safe Tournament was a joint effort between Metaculus and the Virginia Department of Health (VDH)
• Questions were developed collaboratively between the partners in order to obtain insights that would maximize usefulness to policy makers
• Questions spanned 4 focus areas, centered around issues most relevant to Virginia’s pandemic response
• Dates: April 29 2021 until April 30 2022.
• 3 tournament rounds (starting April 29 2021, August 3 2021, November 24 2021)
• There were 86 questions, 224 forecasters, and 14937 forecasts
• 80 questions are already closed, and 6 long-term questions will close after the tournament has ended

## Metaculus forecasts

The forecasts generated by Metaculus are particularly suited to the needs of public policy makers such as those in the Virginia Department of Health. Metaculus provides full predictive distributions, so policy makers obtain not just a single best guess but an explicit quantification of the uncertainty in outcomes. This gives them a better picture of possible outcomes, and especially of the likelihood of extreme events. Forecasters continuously update their forecasts based on new information, so predictions reflect the information currently available. Human forecasters on Metaculus also have several advantages over computer-based modeling when informing public health policy in real time. They can provide an early signal because they can incorporate incomplete data and qualitative information, such as changes in trends that are hard to capture in numbers. Many forecasters explain their reasoning in the comment section of the Metaculus website, providing additional information to policy makers. Compared to mathematical models, human forecasters are also more flexible and can quickly answer new questions as they arise, without requiring time to develop and tune models. The Metaculus algorithm, which is optimized using machine learning, aggregates individual predictions with the aim of providing the most accurate forecasts.

## Forecast targets

Forecasting targets were chosen to maximize the usefulness of the forecasts to policy makers. Together, VDH and Metaculus identified four key areas of questions:

1) The COVID-19 Epidemiological Trajectory

Information on the future trajectory of COVID-19 is central to a variety of aspects of Virginia’s pandemic response. Forecasting questions were designed to help answer questions such as: Are existing measures and resources sufficient? Can schools be opened safely, and if so would that require measures such as mask use or increased testing efforts? How likely is it that future variants that escape immunity will emerge and what preparations need to be made? Is it safe to plan gatherings for holidays?

2) The Path to Population Immunity

Forecasts about population immunity helped VDH navigate and anticipate the impact of the still new vaccination campaign. They helped answer questions like: Will vaccination succeed in curbing infections enough to suppress transmission? Which vaccine distribution channels would have the most demand, and how would that change as the campaign matured? How would vaccine uptake vary? Which groups (e.g., age, race, ethnicity) would reach vaccination benchmarks first, and which would require more effort?

3) Return to Normal

Forecasts about when and how Virginia would return to normal provided insight into the changing social, political, and economic effects of the pandemic. They helped to track the public’s reaction to the virus and the response, and tested assumptions about the “end game” of the pandemic. Relevant questions included: How will demand for testing strategies need to be adapted to accommodate return to school and work? Will pandemic response efforts successfully facilitate a return to normalcy, and will the public trust that the pandemic has been successfully mitigated? If it is not mitigated, will there be political and social will to continue to comply with public health guidance?

4) Health Equity

The COVID-19 pandemic has hit different parts of Virginia’s population differently. Forecasts about health equity helped the Virginia Department of Health mitigate the effects of COVID-19 on the most vulnerable. Relevant policy questions were, for example: Will efforts to close equity gaps be successful, and among which populations? Which populations may need additional effort or focus? Will additional resources be necessary to reduce health inequality? Are current outreach policies working, or do they need to be adapted?

## Number of questions, forecasts and forecasters

Throughout the tournament, there were 86 questions, 11 of them in a binary format (asking for a yes or no outcome) and 75 of them asking for a date or a specific quantity (such as hospitalizations).

In total, there were 224 forecasters, and 14937 forecasts, meaning that on average, every forecaster made 66.7 predictions. Many forecasters made over 100 forecasts and the most active over 1000. Most questions received well over one hundred forecasts (see Figure 1). Out of 86 questions, 56 questions have currently resolved within the bounds of the original question. 32 of these asked for a non-date quantity, 13 asked for a date, and 11 for a binary forecast.

Figure 1

# Accuracy Summary

Continuous (non-date) questions

The median of the final forecast was within 10 percent of the observed value for 19 out of 30 questions, and within 50 percent of the observed value for 27 questions.

Date questions

The final forecast was within 7 days of the observed date for 7 out of 13 questions, and within 14 days for 9 questions. All differences between predicted and actual dates are smaller than (or equal to) zero, meaning that events tended to happen earlier than predicted on average.

Binary questions

Forecasters assigned more than 50% probability to the outcome that was eventually observed on 5 out of the 7 questions that have already resolved. The average Brier score of the Metaculus prediction was 0.15.

Forecast updating

Forecasts were continuously updated. Peaks in activity occurred whenever new tournament rounds were launched.

Figure 2: Distribution of errors for continuous questions. A: Accuracy of the continuous (non-date) forecasts compared to the observed values. The percentage error indicates how much larger or smaller the forecast was in percentage terms relative to the corresponding observed value. B: Difference between the median final forecast and the observed date for date questions.

# Forecast Accuracy

## Non-date questions

Non-date (discrete or continuous) questions asked for a quantity such as the percentage of Virginia’s population that is vaccinated at a given time or the number of communities that will experience high community transmission. Out of the 30 of these questions that have been resolved, the median of the final forecast was within 10 percent of the observed value for 19 and within 50 percent of the observed value for 27 questions. While these numbers have to be interpreted with some caution (for example, forecasts were updated over time and questions varied in difficulty), they do give a good intuition for the overall accuracy of the Metaculus forecasts. Table 1 shows the questions for which Metaculus forecasts were least accurate, i.e. questions for which the median final forecast was more than 50 percent higher or lower than the observed value.
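The percentage-error measure described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the sample values are hypothetical and are not tournament data.

```python
def percentage_error(forecast_median: float, observed: float) -> float:
    """Signed percentage error of the median forecast relative to the observed value."""
    if observed == 0:
        raise ValueError("percentage error is undefined for an observed value of 0")
    return 100.0 * (forecast_median - observed) / observed


def count_within(errors: list[float], tolerance_pct: float) -> int:
    """Count forecasts whose absolute percentage error is at most tolerance_pct."""
    return sum(abs(e) <= tolerance_pct for e in errors)


# Hypothetical (median forecast, observed value) pairs, NOT tournament data.
pairs = [(105.0, 100.0), (40.0, 50.0), (320.0, 200.0)]
errors = [percentage_error(f, o) for f, o in pairs]

within_10 = count_within(errors, 10.0)  # forecasts within 10 percent
within_50 = count_within(errors, 50.0)  # forecasts within 50 percent
```

Applied to the resolved non-date questions, the same two counts would correspond to the "within 10 percent" and "within 50 percent" figures reported above.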

Table 1: Continuous questions for which the median of the final forecast was more than 50 percent away from the observed value.

## Date questions

Date questions asked for the exact time of an event, for example "When will the CDC eliminate quarantine restrictions for close contacts of COVID-19 cases?" or "When will a SARS-CoV-2 vaccine be granted emergency use authorization by the US FDA for children under 12 years old?". Out of 13 of these date questions, the final forecast was within 7 days of the observed date for 7 and within 14 days of the observed date for 9 questions. Again, these numbers need to be interpreted with care, but overall Metaculus forecasts seemed to be reasonably accurate for most questions. Interestingly, all differences between predicted and actual dates are smaller than (or equal to) zero, meaning that events tended to happen earlier than predicted on average. Table 2 gives an overview of the questions for which Metaculus forecasts were least accurate, showing questions for which final forecasts were more than 14 days away from the observed date.
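As a rough sketch of how the 7-day and 14-day counts can be computed, the following Python snippet works with absolute errors only, since the sign convention of the differences is not spelled out here. The dates are hypothetical placeholders, not tournament data.

```python
from datetime import date


def abs_error_in_days(forecast: date, observed: date) -> int:
    """Absolute difference in days between the forecast date and the observed date."""
    return abs((forecast - observed).days)


# Hypothetical (final forecast date, observed date) pairs, NOT tournament data.
pairs = [
    (date(2021, 10, 1), date(2021, 10, 5)),
    (date(2021, 11, 20), date(2021, 12, 1)),
    (date(2022, 1, 1), date(2022, 2, 15)),
]
abs_errors = [abs_error_in_days(f, o) for f, o in pairs]

within_7_days = sum(e <= 7 for e in abs_errors)
within_14_days = sum(e <= 14 for e in abs_errors)
```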

Table 2: Date questions for which final forecasts were more than 14 days away from the observed date.

## Binary questions

Binary questions asked for a probability of a given event, for example "Will Virginia announce a vaccine mandate for its state workforce before 1 October 2021?" Out of 11 binary questions, 7 have already resolved. Forecasters had assigned more than 50% probability to the outcome that was eventually observed on 5 of these 7 questions. For the other 2, however, forecasters assigned less than 50% probability to the outcome that was finally observed (see Table 3).

The quality of binary forecasts can easily be evaluated using the Brier score, a scoring rule which ensures that forecasters cannot win points by 'cheating' and are incentivized to report their true best belief. The Brier score is measured as the squared distance between prediction and outcome (e.g. for a prediction of 80% probability for an event that happens, the Brier score would be (1 − 0.8)² = 0.04). The average Brier score of the community prediction was 0.15, where 0 is perfect omniscience, 1 is the worst possible score and 0.25 is the score assigned to a forecaster who doesn’t know anything and always predicts a probability of 0.5. From a score of 0.15 we can infer that the Metaculus forecasts performed reasonably well and clearly outperformed the agnostic baseline of an unknowing forecaster.
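A minimal Python sketch of the Brier score as described above follows. The function names are illustrative; only the 80% example and the always-0.5 baseline come from the text, and the list of forecasts is hypothetical.

```python
def brier_score(probability: float, outcome: bool) -> float:
    """Squared distance between the predicted probability and the outcome (1 or 0)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return (probability - float(outcome)) ** 2


def mean_brier_score(forecasts: list[tuple[float, bool]]) -> float:
    """Average Brier score over (probability, outcome) pairs."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)


# Example from the text: 80% assigned to an event that happens.
example = brier_score(0.8, True)  # approximately 0.04

# Agnostic baseline: always predicting 0.5 scores 0.25 whatever the outcome.
baseline = mean_brier_score([(0.5, True), (0.5, False)])

# Hypothetical forecasts, NOT tournament data.
hypothetical = mean_brier_score([(0.9, True), (0.3, False), (0.6, False)])
```

Lower scores are better, so an average below the 0.25 baseline indicates that the forecasts carried more information than always answering 0.5.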

Table 3: Questions for which forecasters assigned more probability to the outcome that was not observed.

## Calibration

In addition to accuracy, calibration is an important feature of good forecasts. Good calibration implies that forecasters are able to correctly assess their own uncertainty, instead of being overly cautious or assigning high probability to events that do not occur. One intuitive way to assess calibration is to ask "how often were forecasters right for a given level of certainty (probability assigned to an outcome)"? For binary questions, certainty is simply the predicted probability. For quantities and date questions, uncertainty is represented by the width of the predictive distribution. The most intuitive way to assess it is by looking at the central 50% prediction intervals, i.e. the range of possible outcomes for which forecasters believe that there is a 50% probability that the observed value will fall within that range. For the 43 date and non-date questions combined, the observed value was within the 50% prediction intervals of the Metaculus prediction 21 times (48.8% of times), indicating that forecasts were well calibrated and provided a reasonable quantification of uncertainty. For the binary questions, assessing calibration is harder, as there are only 7 resolved questions. For these questions, forecasters were never wrong when they expressed more than 70% certainty that an event would either happen or not happen.
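The coverage check for central 50% prediction intervals can be sketched as below. The interval bounds and observed values are hypothetical and only illustrate the calculation; the final line simply reproduces the 21-out-of-43 percentage quoted above.

```python
def interval_coverage(records: list[tuple[float, float, float]]) -> float:
    """Fraction of (lower, upper, observed) records where observed lies in [lower, upper]."""
    if not records:
        raise ValueError("need at least one record")
    hits = sum(lower <= observed <= upper for lower, upper, observed in records)
    return hits / len(records)


# Hypothetical 50% prediction intervals and observed values, NOT tournament data.
records = [
    (10.0, 20.0, 15.0),     # observed inside the interval
    (100.0, 150.0, 180.0),  # observed above the interval
    (0.4, 0.6, 0.45),       # observed inside the interval
    (30.0, 45.0, 29.0),     # observed below the interval
]
coverage = interval_coverage(records)

# Arithmetic behind the figure quoted in the text: 21 of 43 questions.
reported_coverage_pct = round(100 * 21 / 43, 1)
```

For well-calibrated 50% intervals, the coverage should be close to 0.5 once enough questions have resolved.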

## Forecast updating

While the final predictions are important, a lot of the value of the forecasts comes from the signal they provide over time. Forecasts for all questions were continuously updated as new information became available, providing decision makers at VDH with the best available knowledge in real time. Figure 3 includes a timeline of forecasting activity, showing continuous engagement over the whole period and spikes in activity when new questions were released. On average (but not always), updating should mean that forecasts also improve over time. This is indeed true in general for the Metaculus forecasts, although not for every individual question. The remaining panels of Figure 3 show the evolution over time of the Brier score (binary questions), the percentage error (non-date questions) and the absolute error in days (date questions) of the median forecast.

Figure 3: Forecast updates over time. A: Brier score of binary questions. A Brier score of 0.25 (dashed line) represents the score for an agnostic forecaster who always predicts a probability of 0.5. B: Difference between median predicted date and observed date for date questions over time. C: Percentage difference between median predicted value and observed value over time. D: Histogram with timings of forecasts.

## Discussion

The Keep Virginia Safe Tournament was a first-of-its-kind collaboration between a prediction platform and a public health agency. It helped turn insight into action by making distributed forecasting expertise available to decision makers at the Virginia Department of Health. Importantly, VDH was not only a forecast consumer, but was able to set key priorities for forecast elicitation and was actively involved in the question development process. This setup not only allowed for a timely provision of the most relevant and up-to-date forecasts, but it also led to a high level of satisfaction from both sides. The collaboration was able to successfully combine deep subject matter expertise with forecasting experience to deliver actionable insights.

Anchoring Metaculus forecasts among more traditional information sources, including surveillance reporting and quantitative models, improved acceptance and allowed busy decision makers to gain some familiarity with Metaculus products. In this context, Metaculus forecasts acted as useful benchmarks and filled gaps. For instance, VDH tracked bed capacity in hospitals, and UVA Biocomplexity Institute models projected bed use. However, there was little information available on staffing, a major limitation as the Omicron wave approached. Metaculus forecasts on average travel nurse salaries in Virginia filled the gap, contributing to key decisions on hospital and workforce flexibility. In addition to Virginia-specific questions, VDH staff were able to pull from the wide array of forecasts available on the Metaculus platform.

## Areas for improvement

The Keep Virginia Safe Tournament was a pilot project implemented during the height of a fast-moving public health emergency. While there were some notable successes, there are still lessons to be learned and areas for improvement. While some questions proved to be useful for policy discussions, few were linked directly to policy decisions. Further work and research is needed in this area.

Communication of Metaculus forecasts to lay audiences could be challenging at times, especially if forecasts were multi-modal, there were significant “off-scale” forecasts, or forecasts did not translate well to simple numeric presentation. Forecasts were sometimes refigured to create consumable results. For instance, questions about which month would experience peaks in cases and hospitalizations were converted into probabilities of whether “the worst was behind us”. (Stale forecasts could make this fraught, as they often included probabilities covering multiple past months, when only one past month could possibly be the peak.) Linking Metaculus forecasts to policy decisions and improving communications are key goals of future tournaments.

## Areas for further collaboration

VDH and Metaculus are planning a second tournament, addressing the areas of improvement noted above. COVID-19 and its effects will necessarily be featured, along with its long-term physical and mental health impacts. Areas outside of COVID-19, such as infant mortality, substance use disorder, and sexually transmitted infections will also be addressed.

VDH has also launched a new Foresight and Analytics unit within its Office of Emergency Preparedness. Over the long term, Metaculus forecasts may contribute to threat assessments and benchmarking. Short-term questions or tournaments informing on emerging threats are also a key area for collaboration. Metaculus, the UVA Biocomplexity Institute, and VDH have also pursued an innovative collaboration combining the benefits of traditional quantitative models and aggregate human forecasts. Quantitative models are highly dependent on assumptions for key parameters such as the characteristics of new variants, behavioral or policy responses, vaccine effectiveness, or waning immunity. Parameters for uncertain quantities are often selected by the modelers themselves. These may be used in scenario projections, or for making forecasts about future quantities. This collaboration is exploring using Metaculus forecasts to select parameters for quantitative models. These can be used to better define potential scenario arrays, select the most likely scenario among an array, or to create a Metaculus-informed forecasting model. We are certain to find other areas to explore, and look forward to further collaboration.

## Acknowledgements

We are grateful for the contributions of Caroline Holsinger and Justin Crow of the Virginia Department of Health, who made this collaboration a success. We would also like to thank Gaia Dempsey, Tom Liptay, and Juan Cambeiro for making the tournament possible from Metaculus’s end and Nikos Bosse for writing this report.

# Appendix

## Example questions for the four core key areas

The COVID-19 Epidemiological Trajectory

• Will variants of concern thought to partially escape immunity make up more than 50% of samples sequenced in Virginia on 29 August - 11 September 2021?
• How many new COVID-19 outbreaks will occur in Virginian long-term care facilities before 1 August 2022?
• In Virginia, which month between May 2021 and March 2022 (inclusive) will have the highest number of new COVID-19 hospitalizations?
• When will Virginia’s weekly total of new confirmed and probable COVID-19 hospitalizations fall below 49?
• How many of Virginia’s 133 communities will be experiencing moderate or higher levels of community transmission as of 7 March 2022?

The Path to Population Immunity

• What will the percent of Virginia’s population vaccinated with at least one dose be on 1 August 2021?
• When will the percent of Virginia’s population vaccinated with at least one dose reach 75%?
• Which age group will have the highest share of Virginia’s new COVID-19 cases during the week ending 1 August 2021?