Benchmarks · MAY 5, 2026
Claude Opus 4.7 leads Vals AI's Finance Agent benchmark at 64.4%; tops GDPval-AA
Anthropic's finance-tuned model debuted at the lab's May 5 invite-only briefing in New York. The two benchmark headlines come with the usual caveats — and one new variable for the benchmarks desk to track.
At its invite-only financial-services briefing in New York on May 5, Anthropic debuted Claude Opus 4.7 and disclosed two benchmark results that anchor the new model's launch positioning: Opus 4.7 currently leads Vals AI's Finance Agent benchmark at 64.4% and tops the GDPval-AA evaluation for economically valuable knowledge work.
For a finance-tuned model release in 2026, these are the two most relevant frontier benchmarks. Both are also relatively recent constructions: Vals AI's Finance Agent leaderboard was assembled to evaluate exactly this category of model, and GDPval-AA grew out of the broader GDPval line of evaluations, which has gained traction over the past twelve months.
### The benchmarks, in shape
Vals AI Finance Agent. The leaderboard evaluates frontier models on multi-step financial-analyst tasks: SEC filing analysis, ratio computation across statements, cross-period earnings comparisons, M&A scenario modeling, and structured-data parsing from filings. Models are scored on agentic completion of a full task chain, not on single-prompt accuracy.
GDPval-AA. A separate but adjacent line of evaluation, focused on tasks that map to economic value in real-world work. The eval is constructed to test model behavior in contexts where the cost to the operator of getting it wrong is non-trivial.
### What 64.4% means
The Vals AI score of 64.4% means Opus 4.7 completes roughly 64% of the benchmark's task chains end-to-end. That is well clear of both prior-generation Anthropic models and competing-lab models on the same leaderboard. The benchmark's difficulty floor, however, is higher than that of most prior-generation finance evals: easier tasks have been deliberately filtered out as models have improved, so 64.4% should not be read against scores on those older evals.
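To make the scoring distinction concrete, here is a minimal sketch of end-to-end chain scoring next to per-step accuracy. It assumes a chain counts as complete only when every step succeeds, which is our reading of "agentic completion of a full task chain" and not Vals AI's published scoring code. The `TaskChain` type, the task labels, and the results are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskChain:
    """One multi-step task; each entry records whether that step succeeded."""
    name: str            # hypothetical label, not a real benchmark task ID
    steps: list[bool]


def chain_completion_rate(chains: list[TaskChain]) -> float:
    """Fraction of chains in which every step succeeded (end-to-end scoring)."""
    completed = sum(1 for chain in chains if all(chain.steps))
    return completed / len(chains)


def step_accuracy(chains: list[TaskChain]) -> float:
    """Fraction of individual steps that succeeded, ignoring chain boundaries."""
    total = sum(len(chain.steps) for chain in chains)
    passed = sum(sum(chain.steps) for chain in chains)
    return passed / total


# Made-up results for illustration only; not Vals AI data.
chains = [
    TaskChain("ratio-computation", [True, True, True]),
    TaskChain("filing-analysis", [True, True, False]),
    TaskChain("earnings-comparison", [True, True, True, True]),
    TaskChain("scenario-modeling", [True, False, True]),
]

print(f"end-to-end completion: {chain_completion_rate(chains):.1%}")  # 50.0%
print(f"per-step accuracy:     {step_accuracy(chains):.1%}")          # 84.6%
```

In this toy set, 11 of 13 steps pass but only 2 of 4 chains finish cleanly. A single failed step sinks its whole chain, which is why an end-to-end number sits well below the step-level accuracy a reader might intuitively expect.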
### What we don't have
- Per-task-category breakdowns inside the 64.4% headline (which task types Opus 4.7 wins and which it doesn't).
- Public comparison points for GPT-5.5 Instant, GPT-5.5 Pro, or Gemini 3.5 Pro on the same Vals AI tasks at launch — those models have not been formally submitted to the leaderboard yet.
- The exact Opus 4.7 configuration used for the benchmark. It is not stated whether the public API version matches the benchmark configuration.
### The customer signal
The benchmark headline ships alongside a deployment headline: Anthropic disclosed JPMorgan Chase, Goldman Sachs, Citi, AIG, and Visa as Claude users in production. In operator-impact terms, the deployment claims are the more load-bearing signal, but the benchmark gives the benchmarks desk something it can actually replicate.
The right next move for any benchmarks-focused operator is to run an internal version of the Vals AI Finance Agent task set against the public Opus 4.7 endpoint and compare the result to the launch headline. If the reproduced score falls below 64.4% by more than run-to-run noise can explain, the launch number is a configuration claim, not a portable benchmark.
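One way to decide whether a reproduced score really is below the headline is to put a confidence interval around it. The sketch below uses a Wilson score interval built from the standard library. The function names, the 200-chain task count, and the pass counts are hypothetical; the benchmark's actual size is not stated here. It also treats 64.4% as a fixed target and each chain as an independent pass/fail trial, both simplifying assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


def compare_to_headline(successes: int, trials: int, headline: float = 0.644) -> str:
    """Classify a reproduced score relative to the launch headline."""
    low, high = wilson_interval(successes, trials)
    if high < headline:
        return "below headline"
    if low > headline:
        return "above headline"
    return "consistent with headline"


# Hypothetical reproduction runs: completed chains out of an assumed 200.
for completed in (110, 125):
    low, high = wilson_interval(completed, 200)
    verdict = compare_to_headline(completed, 200)
    print(f"{completed}/200 -> [{low:.3f}, {high:.3f}] {verdict}")
```

Under these assumptions, 110 of 200 (55.0%) gives an interval that sits entirely under 64.4%, while 125 of 200 (62.5%) gives one that still contains it. A gap of a couple of points on a task set this small is not, by itself, evidence of a configuration mismatch.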