Benchmarks
Methodology, regression suites, leaderboard inflation, and the numbers behind every comparison the desk publishes.
Feature · MAY 5, 2026
Claude Opus 4.7 leads Vals AI's Finance Agent benchmark at 64.4%; tops GDPval-AA
Anthropic's finance-tuned model debuted at the lab's May 5 invite-only briefing in New York. The two benchmark headlines come with the usual caveats, plus one new variable for the benchmarks desk to track.