salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls, for 450 evaluations in total. Each call has hidden ground truth. A judge scores every coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

Calls: 25 · Models: 18 · Evaluations: 450 · Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026
Ranked by overall

Leaderboard

Average 89.8 across 450 evaluations.

1. GPT-5.4 · high (Leader) · 92.0 · Cost / call $0.07 · Range 87.0–96.0 · n=25
2. GPT-5.5 · none · 92.0 · Cost / call $0.17 · Range 76.0–96.0 · n=25
3. GPT-5.5 · xhigh · 92.0 · Cost / call $0.18 · Range 76.0–97.0 · n=25
4. GPT-5.4 · xhigh · 92.0 · Cost / call $0.07 · Range 82.0–96.0 · n=25
5. GPT-5.5 · high · 91.7 · Cost / call $0.18 · Range 72.0–97.0 · n=25
6. GPT-5.5 · medium · 91.7 · Cost / call $0.17 · Range 82.0–97.0 · n=25
7. GPT-5.4 · medium · 90.9 · Cost / call $0.07 · Range 81.0–96.0 · n=25
8. GPT-5.4 · none · 90.8 · Cost / call $0.07 · Range 82.0–96.0 · n=25
9. GPT-5.5 · low · 90.8 · Cost / call $0.17 · Range 84.0–96.0 · n=25
10. GPT-5.4 · low · 90.3 · Cost / call $0.07 · Range 78.0–96.0 · n=25
11. Claude Opus 4.7 · max · 90.2 · Cost / call $0.15 · Range 72.0–96.0 · n=25
12. Claude Opus 4.7 · high · 89.6 · Cost / call $0.12 · Range 71.0–95.0 · n=25
13. Claude Opus 4.7 · xhigh · 89.4 · Cost / call $0.13 · Range 75.0–95.0 · n=25
14. Claude Opus 4.7 · medium · 89.0 · Cost / call $0.10 · Range 72.0–95.0 · n=25
15. Claude Sonnet 4.6 · default · 88.8 · Cost / call $0.10 · Range 84.0–95.0 · n=25
16. Claude Opus 4.7 · low · 87.6 · Cost / call $0.09 · Range 70.0–96.0 · n=25
17. DeepSeek V4 Pro · default · 85.8 · Cost / call $0.0047 · Range 70.0–95.0 · n=25
18. Gemini 3.1 Pro Preview · default (Trailing) · 82.4 · Cost / call $0.03 · Range 67.0–93.0 · n=25
Filtered mean: 89.8
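The headline figures can be cross-checked from the leaderboard alone. The sketch below is only an illustration, not the site's own code: it takes the 18 displayed model averages, multiplies the row count by the per-model n=25 to recover the evaluation total, and averages the model means. Because every model has the same n, the unweighted mean of model means equals the mean over all evaluations, up to rounding of the displayed values. The variable names are assumptions made for this example.

```python
# Average scores as displayed on the leaderboard, in rank order (18 configurations).
model_means = [
    92.0, 92.0, 92.0, 92.0, 91.7, 91.7, 90.9, 90.8, 90.8,
    90.3, 90.2, 89.6, 89.4, 89.0, 88.8, 87.6, 85.8, 82.4,
]
CALLS_PER_MODEL = 25  # every leaderboard row reports n=25

# 18 configurations x 25 calls each.
total_evaluations = len(model_means) * CALLS_PER_MODEL

# With equal n per model, the unweighted mean of the model means
# matches the mean over all individual evaluations.
overall_mean = sum(model_means) / len(model_means)

print(total_evaluations)       # 450
print(round(overall_mean, 1))  # 89.8
```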
Top model
92.0 avg

gpt-5.4 high

87.0–96.0 across 25 calls

Hardest call
79.7 avg
Microsoft · Costco Wholesale

Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft

Product demo · mixed
Reasoning effort
+2.6 pts

Claude Opus 4.7: low → max

87.6 → 90.2 as effort scales

How this benchmark was made

Methodology

The calls are synthetic, immutable benchmark cases with hidden coaching ground truth.

01

Generate cases

Each case starts with two companies, a call type, duration, quality target, web research, and role notes.

02

Write turn by turn

The transcript is produced through many structured LLM calls, one speaker turn at a time, using persona behavior and hidden coaching signals.

03

Judge semantically

Coach models see only the visible case. A judge model sees hidden ground truth and scores grounded coaching quality.
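The separation between what the coach sees and what the judge sees can be sketched as a data structure. This is a hypothetical illustration, not the benchmark's actual schema: `BenchmarkCase`, `coach_view`, and `toy_judge` are names invented here, and the substring-matching judge is a deliberately crude stand-in for the LLM judge, which scores semantically rather than by keyword.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)  # frozen mirrors the "immutable benchmark cases" wording
class BenchmarkCase:
    """Hypothetical case record; field names are assumptions for this sketch."""
    case_id: str
    transcript: str                  # visible to coach models
    hidden_signals: Tuple[str, ...]  # ground truth, visible to the judge only


def coach_view(case: BenchmarkCase) -> Dict[str, str]:
    """Return only the fields a coach model is allowed to see."""
    return {"case_id": case.case_id, "transcript": case.transcript}


def toy_judge(note: str, case: BenchmarkCase) -> float:
    """Toy stand-in for the LLM judge: percentage of hidden signals
    that appear verbatim in the coaching note (0-100)."""
    if not case.hidden_signals:
        return 0.0
    text = note.lower()
    hits = sum(1 for signal in case.hidden_signals if signal.lower() in text)
    return 100.0 * hits / len(case.hidden_signals)


case = BenchmarkCase(
    case_id="demo-001",
    transcript="Rep: Happy to send pricing. Buyer: Sure, thanks.",
    hidden_signals=("budget owner", "next step"),
)
note = "The rep never identified the budget owner before sending pricing."

print(sorted(coach_view(case)))  # ['case_id', 'transcript']
print(toy_judge(note, case))     # 50.0
```

The point of the structure is that the hidden fields never enter the coach's input, so a high score has to come from reading the transcript rather than from leaked ground truth.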

Calls: 25 · Model configs: 18 · Judged runs: 450 · Score axes: 8
25 sales calls in the benchmark

Browse the calls

Pick a call to see the hidden answer key and how each model scored on it.

Discovery · 95.4
Collibra · Berkshire Hathaway
Berkshire Hathaway Data governance discovery across decentralized business units with Collibra
flawed · 33m · Easiest

Competitive displacement · 94.3
Stripe · Pave
Pave Pricing and packaging objection call with Stripe
flawed · 18m

Discovery · 94.1
Vercel · Mercury
Mercury First discovery for frontend platform consolidation with Vercel
flawed · 22m

Discovery · 94.0
Atlassian · Delta Air Lines
Delta Air Lines Enterprise discovery for service management modernization with Atlassian
flawed · 31m

Product demo · 93.7
MongoDB · Wayfair
Wayfair Integration deep dive for catalog modernization with MongoDB
excellent · 58m

Renewal save · 93.7
Twilio · The Home Depot
The Home Depot Renewal save call after usage and support concerns with Twilio
flawed · 42m

Product demo · 93.2
Palo Alto Networks · Apple
Apple Technical security review for zero trust architecture with Palo Alto Networks
excellent · 66m

QBR · 92.4
Amplitude · Duolingo
Duolingo Renewal QBR and expansion planning with Amplitude
excellent · 52m

Discovery · 92.0
OpenAI · CVS Health
CVS Health AI contact-center transformation discovery with OpenAI
excellent · 61m

Discovery · 91.8
GitHub · Rippling
Rippling Product-led expansion discovery for developer workflow with GitHub
excellent · 41m

Discovery · 91.1
Workday · McKesson
McKesson HR transformation qualification and stakeholder mapping with Workday
flawed · 27m

Product demo · 90.9
Anthropic · ExxonMobil
ExxonMobil AI governance and safety review for energy operations with Anthropic
mixed · 39m

Product demo · 90.8
CrowdStrike · Target
Target Security architecture review for endpoint consolidation with CrowdStrike
excellent · 63m

Product demo · 90.4
Datadog · Linear
Linear Technical demo for observability and incident response with Datadog
excellent · 34m

Product demo · 90.4
Elastic · JPMorgan Chase
JPMorgan Chase Technical workshop for search and observability consolidation with Elastic
excellent · 74m

Discovery · 89.3
NVIDIA · Walmart
Walmart Executive discovery for AI infrastructure and store operations with NVIDIA
excellent · 57m

Discovery · 89.1
HashiCorp · Amazon
Amazon Cloud operating model discussion for internal platform teams with HashiCorp
flawed · 26m

Competitive displacement · 88.6
ServiceNow · Ford Motor Company
Ford Motor Company Procurement negotiation for workflow automation with ServiceNow
mixed · 35m

Product demo · 87.0
Snowflake · Toast
Toast Data platform proof-of-concept kickoff with Snowflake
flawed · 44m

Competitive displacement · 85.8
Cloudflare · Canva
Canva Competitive displacement discovery for edge security with Cloudflare
flawed · 47m

Product demo · 85.8
Figma · The Walt Disney Company
The Walt Disney Company Design collaboration demo with brand and asset workflow discussion with Figma
mixed · 49m

QBR · 85.2
Okta · Sweetgreen
Sweetgreen Executive alignment for identity modernization with Okta
mixed · 38m

Renewal save · 84.9
Salesforce · UnitedHealth Group
UnitedHealth Group Healthcare CRM expansion objection handling with Salesforce
mixed · 46m

Product demo · 82.5
Snyk · Runway
Runway Security review before developer-tool rollout with Snyk
mixed · 29m

Product demo · 79.7
Microsoft · Costco Wholesale
Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft
mixed · 55m · Hardest
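One way to read the call list is to group the per-call averages by the quality label each call carries (flawed, mixed, excellent), which presumably corresponds to the quality target mentioned in the methodology. The sketch below is an illustration over the displayed values only. The grouping is an assumption about how one might slice the data, not an analysis the site itself reports.

```python
from collections import defaultdict

# (quality label, average score) for the 25 calls, in the order listed above.
calls = [
    ("flawed", 95.4), ("flawed", 94.3), ("flawed", 94.1), ("flawed", 94.0),
    ("excellent", 93.7), ("flawed", 93.7), ("excellent", 93.2),
    ("excellent", 92.4), ("excellent", 92.0), ("excellent", 91.8),
    ("flawed", 91.1), ("mixed", 90.9), ("excellent", 90.8),
    ("excellent", 90.4), ("excellent", 90.4), ("excellent", 89.3),
    ("flawed", 89.1), ("mixed", 88.6), ("flawed", 87.0), ("flawed", 85.8),
    ("mixed", 85.8), ("mixed", 85.2), ("mixed", 84.9), ("mixed", 82.5),
    ("mixed", 79.7),
]

# Collect scores per label.
by_label = defaultdict(list)
for label, score in calls:
    by_label[label].append(score)

# Map each label to (number of calls, mean of the displayed averages).
summary = {
    label: (len(scores), round(sum(scores) / len(scores), 1))
    for label, scores in by_label.items()
}

for label in ("flawed", "excellent", "mixed"):
    print(label, summary[label])
# flawed (9, 91.6)
# excellent (9, 91.6)
# mixed (7, 85.4)
```

On these displayed averages, the calls labeled mixed score about six points lower than the other two groups. With only seven to nine calls per group, that is a pattern to examine, not a firm conclusion.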
By model family

Does spending more on reasoning help?

For families with multiple reasoning settings, this shows the average score at each effort level and its change from the family's lowest setting (the baseline).

Claude Opus 4.7 · +2.6 pts

low · 87.6 · baseline
medium · 89.0 · +1.3 pts
high · 89.6 · +1.9 pts
xhigh · 89.4 · +1.8 pts
max · 90.2 · +2.6 pts

GPT-5.4 · +1.1 pts

none · 90.8 · baseline
low · 90.3 · -0.5 pts
medium · 90.9 · +0.1 pts
high · 92.0 · +1.2 pts
xhigh · 92.0 · +1.1 pts

GPT-5.5 · no change

none · 92.0 · baseline
low · 90.8 · -1.2 pts
medium · 91.7 · -0.3 pts
high · 91.7 · -0.3 pts
xhigh · 92.0 · +0.0 pts
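The per-level deltas can be approximated from the displayed averages by subtracting each family's baseline (its lowest setting). The sketch below is illustrative and uses only the rounded values shown above. A few results differ by 0.1 from the displayed deltas (for example Claude Opus 4.7 medium comes out as +1.4, not +1.3), presumably because the page computes deltas from unrounded means. That explanation is an assumption.

```python
# Displayed average score per effort level; the first entry is the baseline.
families = {
    "Claude Opus 4.7": [("low", 87.6), ("medium", 89.0), ("high", 89.6),
                        ("xhigh", 89.4), ("max", 90.2)],
    "GPT-5.4": [("none", 90.8), ("low", 90.3), ("medium", 90.9),
                ("high", 92.0), ("xhigh", 92.0)],
    "GPT-5.5": [("none", 92.0), ("low", 90.8), ("medium", 91.7),
                ("high", 91.7), ("xhigh", 92.0)],
}


def deltas(levels):
    """Return (level, score - baseline) for every level after the first."""
    baseline = levels[0][1]
    return [(name, round(score - baseline, 1)) for name, score in levels[1:]]


for family, levels in families.items():
    print(family, deltas(levels))
# Claude Opus 4.7 [('medium', 1.4), ('high', 2.0), ('xhigh', 1.8), ('max', 2.6)]
# GPT-5.4 [('low', -0.5), ('medium', 0.1), ('high', 1.2), ('xhigh', 1.2)]
# GPT-5.5 [('low', -1.2), ('medium', -0.3), ('high', -0.3), ('xhigh', 0.0)]
```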