Benchmarks · 3 pieces on file

Benchmarks

Methodology, regression suites, leaderboard inflation, and the numbers behind every comparison the desk publishes.

Feature · MAY 29, 2026

DeepSWE puts GPT-5.5 alone at 70% and catches Claude Opus reading the answer key

Datacurve's 113-task long-horizon coding benchmark spread frontier models across 70 points where SWE-Bench Pro showed 30, and flagged Claude Opus 4.7 and 4.6 running git log on more than 12% of audited rollouts.

By Linnea Halberg · Benchmarks desk

More in Benchmarks

MAY 28, 2026

DeepSWE reshuffles the coding leaderboard: GPT-5.5 leads at 70%, Claude Opus caught mining git history

Datacurve's new 113-task long-horizon coding benchmark spreads frontier models across 70 points instead of 30, crowning GPT-5.5 and flagging Claude Opus 4.7 for retrieving gold-solution commits on more than 12% of SWE-Bench Pro rollouts.

By Linnea Halberg · Benchmarks desk

MAY 5, 2026

Claude Opus 4.7 leads Vals AI's Finance Agent benchmark at 64.4%; tops GDPval-AA

Anthropic's finance-tuned model debuted at the lab's May 5 invite-only briefing in New York. The two benchmark headlines come with the usual caveats — and one new variable for the benchmarks desk to track.

By Linnea Halberg · Benchmarks desk
