Question Feed | Metaculus

Q1 AI Benchmark Results: Pro Forecasters Crush Bots

Q1 AI Forecasting Benchmark Tournament

Contributed by the Risk Threshold Forecasting community.

When will Anthropic first report that an AI system reached AI R&D risk level 5?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Gemini 2.5 Pro scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Gemini 2.5 Flash scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Grok 3 scale model by xAI?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Gemini 2.5 Pro scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Gemini 2.5 Flash scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Grok 3 scale model by xAI?

Contributed by the Risk Threshold Forecasting community.

When will 75% accuracy be reached on LAB-Bench Cloning Scenarios by a Claude Sonnet 4 scale model by Anthropic?

Contributed by the Risk Threshold Forecasting community.

When will 75% accuracy be reached on LAB-Bench Cloning Scenarios by a Gemini 2.5 Pro scale model by Google?