
Your AI Can't Close a Deal: The Brutal BankerToolBench Results

A new benchmark finds top models like GPT-5.4 and Claude Opus fail to produce client-ready investment banking deliverables, exposing deep flaws in business logic and code generation, along with outright data fabrication.

April 2026 · 4 min read

When 500 investment bankers from Goldman Sachs, JPMorgan, and Morgan Stanley review AI-generated financial models and slide decks and deem every single one unfit for a client, the industry should pay close attention. That’s the blunt verdict from BankerToolBench, a new open-source benchmark that put nine top models to work on the kind of tasks junior bankers grind through daily. Not a single output was rated ready to send as is. GPT-5.4, the top scorer, managed a meager 58.1 out of 100, with only 16% of its outputs judged useful as a starting point. Claude Opus 4.6 looked polished but was hollow inside. The test goes beyond standard language tasks and exposes a fundamental gap: today’s most advanced AI agents can’t handle complex, multi-step professional work with the precision and reliability the real world demands.

The benchmark’s design is what makes these results so damning. BankerToolBench doesn’t ask for text suggestions; it demands the actual deliverables a junior banker would submit to a supervisor: working Excel models with dynamic formulas, PowerPoint decks compliant with bank style guides, and written memos. Agents must navigate data rooms, pull from platforms like FactSet and Capital IQ, and parse SEC filings, with a single task triggering up to 539 model calls. The grading rubrics were built by practicing bankers and average 150 criteria per task, covering technical correctness, client readiness, compliance, and internal consistency. The automated verifier, dubbed Gandalf, matches human reviewers 88% of the time, which lends the scores statistical credibility. This isn’t a toy problem; it’s a high-fidelity simulation of real investment banking workflows.
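To make that 88% figure concrete, here is a minimal sketch of how per-criterion agreement between an automated verifier and human graders might be computed. Everything here (`CriterionResult`, `agreement_rate`, the criterion IDs) is a hypothetical illustration, not BankerToolBench’s actual code.

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    """One rubric criterion, graded by both the verifier and a human."""
    criterion_id: str
    verifier_pass: bool
    human_pass: bool

def agreement_rate(results: list[CriterionResult]) -> float:
    """Fraction of rubric criteria where verifier and human agree."""
    if not results:
        return 0.0
    matches = sum(r.verifier_pass == r.human_pass for r in results)
    return matches / len(results)

# Example: three of a task's ~150 criteria
results = [
    CriterionResult("formula_links_live", verifier_pass=True, human_pass=True),
    CriterionResult("style_guide_fonts", verifier_pass=False, human_pass=False),
    CriterionResult("figures_properly_sourced", verifier_pass=True, human_pass=False),
]
print(f"Agreement: {agreement_rate(results):.0%}")  # Agreement: 67%
```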

And the failures are instructive. The most common category, at 41% of errors, is bugs in code and formula generation: agents call non-existent Python functions and then delete the offending line rather than fix the logic. Claude Opus 4.6, despite topping the client-readiness scoring, hardcodes key financial figures instead of building formulas, making scenario analysis impossible, a dealbreaker in any serious financial model. In 27% of errors, the business logic breaks down entirely, such as adding cost synergies to revenue. In 13% of cases, agents fabricate data and present it as sourced. Even careful polish can’t mask a foundational failure: the systems reproduce surface patterns but not the causal structure that underpins financial reasoning.
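The hardcoding failure is easiest to see in a spreadsheet. Below is a small illustration using openpyxl (an assumption on my part; the benchmark doesn’t specify the agents’ tooling) of the difference between baking a computed number into a cell and writing a live formula that survives scenario changes.

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

ws["B1"] = "Revenue (Y1)"
ws["B2"] = 1_000_000   # input assumption

# The reported failure mode: the computed result is baked into the cell.
# Changing B2 now has no effect on this value.
ws["C2"] = 1_050_000   # hardcoded

# What a usable model needs: a live formula, so a scenario change to B2
# (or to the growth rate) flows through the rest of the workbook.
ws["D2"] = "=B2*1.05"

wb.save("model.xlsx")
```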

These findings should sober up anyone building for high-stakes professional domains. The AI community has sprinted ahead on coding benchmarks and chatbot fluency, but real economic work (management, law, finance) remains woefully underserved. BankerToolBench joins a growing chorus of research showing that even the best models flounder when autonomy is combined with complexity. One might argue that GPT-5.4’s 58% is a glass half full, but when a model’s output requires major rework 41% of the time and is completely unusable another 27%, it can’t be deployed in production without heavy guardrails. The fact that reinforcement learning boosted a small Qwen model’s score thirteenfold from a very low baseline is encouraging, but that’s like celebrating a D student’s improvement to a C+. The ceiling is still far below the professional threshold.

The implications for builders are clear. First, we need benchmarks that measure end-to-end task completion with real-world tools, not just text outputs in a sandbox. Second, model architects must address the hallucination and logic gaps directly through better tool integration and domain-specific fine-tuning. Anthropic’s recent work on seamless tool switching and data platform plugins points in the right direction, but the journey is long. Third, the industry must recalibrate expectations: AI that “assists” is not the same as AI that “delivers.” BankerToolBench proves that these agents are best treated as overconfident interns who need constant supervision, not as autonomous junior bankers. Until that changes, the only thing clients would reject faster than a flawed model is the idea that it’s ready to replace their human advisor.
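If agents really are overconfident interns, deployments need a hard review gate between agent output and the client. A minimal sketch of that supervision pattern, with entirely hypothetical names, might look like this:

```python
from enum import Enum, auto

class ReviewStatus(Enum):
    PENDING = auto()
    APPROVED = auto()
    NEEDS_REWORK = auto()

class SupervisedDeliverable:
    """Agent output that cannot be released until a human approves it."""

    def __init__(self, content: str):
        self.content = content
        self.status = ReviewStatus.PENDING
        self.notes = ""

    def review(self, approved: bool, notes: str = "") -> None:
        """A human reviewer records their verdict."""
        self.status = ReviewStatus.APPROVED if approved else ReviewStatus.NEEDS_REWORK
        self.notes = notes

    def release(self) -> str:
        """Only approved work ever leaves the building."""
        if self.status is not ReviewStatus.APPROVED:
            raise PermissionError("Deliverable has not passed human review.")
        return self.content

draft = SupervisedDeliverable("DCF summary memo (agent-generated) ...")
draft.review(approved=False, notes="Cost synergies added to revenue; rebuild.")
# draft.release() raises PermissionError until a reviewer approves the rework.
```

The point is structural: the approval state lives outside the model, so no amount of agent confidence can route around the human check.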

Toni Soriano
Principal AI Engineer at Cloudstudio. 18+ years building production systems. Creator of Ollama Laravel (87K+ downloads).
LinkedIn →

