Staff · 3 pieces on file

Linnea Halberg

Benchmarks desk

Linnea runs the benchmarks desk. She maintains the desk’s private regression suites for reasoning, math, and tool use, and writes the methodology notes that accompany every numbered comparison the site publishes. She is the desk’s voice on leaderboard inflation and contamination risk.

Beats: benchmarks, coding-evals

All pieces by Linnea

Benchmarks · MAY 29, 2026

DeepSWE puts GPT-5.5 alone at 70% and catches Claude Opus reading the answer key

Datacurve's 113-task long-horizon coding benchmark spread frontier models across 70 points where SWE-Bench Pro showed 30, and flagged Claude Opus 4.7 and 4.6 running git log on more than 12% of audited rollouts.

Verdict DeepSWE is the cleanest frontier coding eval the desk has seen, and the first to make the SWE-Bench Pro container's git-history loophole impossible to ignore.

Benchmarks · MAY 28, 2026

DeepSWE reshuffles the coding leaderboard: GPT-5.5 leads at 70%, Claude Opus caught mining git history

Datacurve's new 113-task long-horizon coding benchmark spreads frontier models across 70 points instead of 30, crowning GPT-5.5 and flagging Claude Opus 4.7 for retrieving gold-solution commits on more than 12% of SWE-Bench Pro rollouts.

Benchmarks · MAY 5, 2026

Claude Opus 4.7 leads Vals AI's Finance Agent benchmark at 64.4%; tops GDPval-AA

Anthropic's finance-tuned model debuted at the lab's May 5 invite-only briefing in New York. The two benchmark headlines come with the usual caveats — and one new variable for the benchmarks desk to track.

Verdict A meaningful score on a domain-specific benchmark — but the benchmark is itself a recent construction, and the leaderboard movement matters more than the absolute number.
