
Q1 AI Benchmark Results: Pro Forecasters Crush Bots

Q1 AI Forecasting Benchmark Tournament

Contributed by the Risk Threshold Forecasting community.

When will Anthropic first report that an AI system reached AI R&D risk level 5?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Gemini 2.5 Pro scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Gemini 2.5 Flash scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will 80% accuracy be achieved on Cybench by a Grok 3 scale model by xAI?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Gemini 2.5 Pro scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Gemini 2.5 Flash scale model by Google?

Contributed by the Risk Threshold Forecasting community.

When will an 8 hour, 80% reliability time horizon be achieved on METR’s Autonomy Tasks by a Grok 3 scale model by xAI?

Contributed by the Risk Threshold Forecasting community.

When will 75% accuracy be reached on LAB-Bench Cloning Scenarios by a Claude Sonnet 4 scale model by Anthropic?

Contributed by the Risk Threshold Forecasting community.

When will 75% accuracy be reached on LAB-Bench Cloning Scenarios by a Gemini 2.5 Pro scale model by Google?