30 days.
That's how long it took two of the most powerful AI models on the planet to make the exact same bet: move to Microsoft Excel and fight for financial workflows.
Claude Opus 4.6 shipped on February 5 with Excel integration, enterprise plugins for investment banking and FP&A, and PwC as an implementation partner. GPT-5.4 followed on March 5 with a ChatGPT for Excel add-in and live data connections to Moody's, S&P Global, FactSet, and LSEG. Same month. Same application. Different architectures, different ecosystems. Both aimed squarely at your team.
Meanwhile, Gemini 3.1 Pro quietly expanded to a 2-million-token context window. DeepSeek V3.2 kept driving prices down to a fraction of a penny per token. And the launch of Anthropic's enterprise agents sparked what traders called a "SaaSpocalypse": on February 3, approximately $285 billion of market capitalization disappeared in a single trading day across software and services stocks. Thomson Reuters recorded its biggest one-day decline ever. Salesforce and ServiceNow each fell about 7%. Intuit and Equifax lost more than 10% apiece.
The market wasn't betting on which model would win. It was pricing in a world where AI copilots built into existing tools replace the standalone software those companies sell.
For financial leaders, the question has changed. It is no longer "Should we adopt AI?" It's "Which models fit which workflows, and how do we manage the portfolio?"
What's actually inside the box
Both Excel copilots work from a side panel inside the workbook. Describe what you need in plain language, and the tool builds, updates, or debugs models using the formulas and structures already in your files. Both ask for permission before making edits. Both tie every proposed change to a specific cell. And both perform calculations natively in Excel rather than inside the model's black box. That last part is what makes both tools auditable.
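The approval-then-edit pattern described above can be sketched in a few lines. This is an illustrative mock, not either vendor's actual API: the function names, cell references, and formula are assumptions. The key idea is that the assistant proposes a native Excel formula, not a precomputed number, so the logic stays inspectable in the sheet.

```python
# Hypothetical sketch of the "auditable edit" flow: propose a formula
# tied to a specific cell, require explicit approval before applying.
# All names here are illustrative, not a real copilot API.

def propose_edit(cell: str, formula: str, rationale: str) -> dict:
    """Bundle a proposed change with the cell it touches and the reasoning."""
    return {"cell": cell, "formula": formula,
            "rationale": rationale, "approved": False}

def approve(edit: dict) -> dict:
    """Both tools require user sign-off before any edit lands in the workbook."""
    return {**edit, "approved": True}

# The assistant writes the logic; Excel performs the calculation natively,
# so a reviewer can trace the valuation instead of trusting a pasted number.
edit = propose_edit(
    cell="B7",
    formula="=NPV(B2, C5:G5) + C4",
    rationale="Replace hard-coded valuation with a live NPV formula",
)
edit = approve(edit)
print(edit["cell"], edit["formula"], edit["approved"])
```

Writing `=NPV(...)` rather than the evaluated result is the design choice the article calls auditable: the spreadsheet, not the model, remains the system of record for the math.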
The difference lies in the ecosystem around spreadsheets.
OpenAI went deep on data. The GPT-5.4 Excel add-in connects directly to Moody's, S&P Global, Dow Jones Factiva, LSEG, MSCI, and FactSet. Analysts can pull credit metrics, returns, transcripts, and market data without switching windows. The company also shipped reusable "skills" for routine financial tasks: earnings previews, DCF analysis, comparables, and draft investment notes. The positioning is clear: OpenAI is building a financial terminal inside the chatbot.
Anthropic went broad on workflow. Claude in Excel is part of a wider enterprise push that began with Claude for Financial Services. In late February, Anthropic launched pre-built agents for financial analysis, equity research, private equity, and asset management. PwC, Accenture, and Deloitte have signed on as implementation partners. Last week the company launched a marketplace where enterprise customers can apply existing committed spend to Claude-powered tools from partners like Snowflake and Harvey. The positioning is equally clear: Anthropic is building an operating system for enterprise knowledge work.
Two architectures. Two strategies. The spreadsheet is just an entry point.
The 72/14 problem
Here is the statistic that should reframe every model comparison you read this year: in an RGP survey of 200 US CFOs, 72% are already using AI tools. Only 14% report clear, measurable ROI.
The barrier is not model intelligence. Only 10% of surveyed CFOs fully trust their company's data. 86% say legacy systems limit their AI readiness. 68% cite the skills gap as their biggest challenge.
Closing that gap is a question of workflow fit, not headline benchmarks.
There is a common pattern among the organizations that close it: they route specific models to specific workflows rather than standardizing on a single vendor. Brex uses an Opus-based system to automate 75 percent of expense transactions, hit a 94 percent policy-compliance rate, and save roughly 169,000 hours per month. TELUS runs over 13,000 internal AI tools built on Claude, saving 500,000 hours and realizing approximately $90 million in benefits. Lloyds Banking Group expects around £100m in value from agentic AI this year. BNY Mellon runs 117 agentic tools in production across operations and risk.
These are not "ChatGPT vs. Claude" stories. They are architecture stories. The model is one variable of four; the other three are integration fit, governance posture, and data readiness.
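The routing pattern behind these deployments can be made concrete with a small sketch. The routing table, model names, and governance rule below are illustrative assumptions, not any organization's actual configuration; the point is that model choice is a per-workflow policy decision, with governance able to override cost.

```python
# Minimal sketch of per-workflow model routing with a governance override.
# Routes and policies are hypothetical examples, not vendor recommendations.

ROUTES = {
    "financial_modeling": "claude-opus-4.6",  # deep Excel integration
    "market_diligence":   "gpt-5.4",          # live data-provider connections
    "bulk_extraction":    "deepseek-v3.2",    # cheapest per token
}

# Governance posture: restrict models with poor injection resistance
# to workflows that never touch external documents.
POLICY = {"deepseek-v3.2": {"external_data_allowed": False}}

def route(task_type: str, touches_external_data: bool) -> str:
    """Pick a model by workflow, letting governance veto the cheap default."""
    model = ROUTES[task_type]
    allowed = POLICY.get(model, {"external_data_allowed": True})
    if touches_external_data and not allowed["external_data_allowed"]:
        return "gpt-5.4"  # fall back to a hardened model for external data
    return model

print(route("bulk_extraction", touches_external_data=False))  # cheap path
print(route("bulk_extraction", touches_external_data=True))   # governed fallback
```

In practice the routing table lives in infrastructure, not application code, but the shape is the same: the model is one field in a policy record, not a company-wide standard.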
The governance layer you cannot skip
Choosing a model for finance is not just a capability question. It is a vendor-risk question. Two developments over the past few weeks illustrate why.
The same week Anthropic confirmed its ARR had nearly doubled to $19 billion since late 2025, Secretary of Defense Hegseth designated the company a supply-chain risk, a label historically reserved for foreign adversaries. The dispute centers on restrictions Anthropic has pushed for on military uses of its AI. For finance teams evaluating long-term vendor commitments, this introduces a variable no benchmark measures.
On cost efficiency, DeepSeek V3.2 is attractive: at $0.55 per million tokens, it is 10-25x cheaper than proprietary alternatives. But a September 2025 NIST assessment found that DeepSeek's model complied with 94 percent of malicious jailbreak attempts, versus 8 percent for a US reference model. Agents built on DeepSeek were 12 times more likely to fall for hijacking attacks, in which malicious instructions embedded in an external document redirect the agent away from its task. For workflows that touch external data, that is a serious governance problem.
And then there is cost architecture. GPT-5.4 runs at $2.50/$15 per million tokens at standard rates, but pricing doubles once input exceeds 272,000 tokens. Opus 4.6 starts at $5/$25, but the moment a request exceeds 200,000 input tokens it hits the "200K cliff," where the entire request reprices to $10/$37.50. Both models penalize sloppy context management. Architectural decisions about how prompts are structured and token budgets are managed can swing annual costs by orders of magnitude.
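The cliff arithmetic is worth seeing in numbers. The sketch below uses the per-million-token rates quoted in this article; the whole-request repricing behavior and the doubled GPT-5.4 over-threshold rate are taken from the description above and should be treated as assumptions, not official pricing.

```python
# Rough cost model for the repricing cliffs described above.
# Rates are $ per million tokens as quoted in the article; the
# "entire request reprices" behavior is an assumption from the text.

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    if model == "opus-4.6":
        # Past 200K input tokens, the WHOLE request reprices, not just the excess.
        over = input_tokens > 200_000
        in_rate, out_rate = (10.0, 37.50) if over else (5.0, 25.0)
    elif model == "gpt-5.4":
        # Article says the price doubles past 272K input tokens.
        over = input_tokens > 272_000
        in_rate, out_rate = (5.0, 30.0) if over else (2.50, 15.0)
    else:
        raise ValueError(f"unknown model: {model}")
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# One token over the cliff roughly doubles the bill for the whole request:
under = request_cost("opus-4.6", 200_000, 5_000)
over = request_cost("opus-4.6", 200_001, 5_000)
print(f"${under:.2f} under the cliff vs ${over:.2f} one token over")
```

Multiply that per-request delta across thousands of daily calls and the case for aggressive context trimming, summarization, and caching makes itself.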
Three takeaways
- The competition is the story. Two frontier labs shipped Excel copilots within the same 30-day window. Gemini is expanding context. DeepSeek is compressing cost. That competitive speed benefits finance teams, but only if models are evaluated on workflow fit rather than headline benchmarks.
- The 14 percent with demonstrated ROI match models to processes. They don't standardize on one vendor. They route modeling to one tool, diligence to another, and bulk extraction to a third, with governance spanning all of it.
- Governance is the real selection criterion. Data integration, token economics, vendor-risk posture, and enterprise controls will determine which models remain in the stack long after today's benchmark scores are forgotten.
