salesevals.com · Evaluated Jul 1, 2026

Which models know sales?

26 model configurations each coach 50 synthetic sales calls, generated by GPT and Sonnet, that carry hidden ground truth. A judge model scores each coaching note from 0 to 100 on whether it found the call's real strengths, flaws, and next moves.

Calls: 50 · Models: 26 · Evaluations: 1300 · Benchmark: 86.2
Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026
Ranked by benchmark score

Leaderboard

Weighted for needle recall, sales instinct, prioritization, technical accuracy, false-positive control, and raw score across 1300 evaluations.
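As a rough illustration of how a weighted benchmark score of this kind could be combined, here is a minimal sketch. The page names the components but does not publish their weights, so the `WEIGHTS` values and the `benchmark_score` function below are hypothetical and chosen only to show the mechanics of a weighted mean.

```python
# Hypothetical weights: the page lists these components but not their weights.
WEIGHTS = {
    "needle_recall": 25,
    "sales_instinct": 20,
    "prioritization": 15,
    "technical_accuracy": 15,
    "false_positive_control": 10,
    "raw_score": 15,
}


def benchmark_score(axis_scores, weights=WEIGHTS):
    """Return the weighted mean of 0-100 axis scores.

    Weights are normalized by their sum, so they do not need to add up to 1.
    """
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(axis_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight
```

With these illustrative weights, a model that scores 80 on every component gets a benchmark score of 80, and a component with a larger weight moves the result more than one with a smaller weight.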

Rank  Model · effort                    Benchmark  Raw avg  P10 floor  Cost / call  Calls  Note
1     GPT-5.4 · xhigh                   89.3       89.0     79.9       $0.07        n=50   Leader
2     GPT-5.4 · high                    89.2       89.0     81.9       $0.07        n=50
3     GPT-5.5 · medium                  89.1       88.8     82.0       $0.17        n=50
4     GPT-5.5 · xhigh                   89.1       89.0     78.5       $0.18        n=50
5     GPT-5.5 · high                    88.9       88.6     79.4       $0.18        n=50
6     GPT-5.4 · medium                  88.4       88.3     80.1       $0.07        n=50
7     GPT-5.5 · none                    88.3       88.1     76.6       $0.17        n=50
8     GPT-5.5 · low                     87.8       87.7     78.3       $0.17        n=50
9     Claude Fable 5 · high             87.7       87.5     77.0       $0.32        n=50
10    GPT-5.4 · low                     87.5       87.4     78.3       $0.07        n=50
11    GPT-5.4 · none                    87.5       87.4     78.2       $0.07        n=50
12    Claude Opus 4.7 · max             87.2       87.3     75.0       $0.16        n=50
13    Claude Opus 4.7 · high            86.6       86.8     77.8       $0.13        n=50
14    Claude Opus 4.8 · medium          85.6       85.8     72.1       $0.11        n=50
15    Claude Opus 4.7 · medium          85.5       85.6     74.0       $0.11        n=50
16    Claude Opus 4.7 · xhigh           85.5       85.6     73.1       $0.14        n=50
17    Claude Opus 4.7 · low             85.5       85.6     75.6       $0.10        n=50
18    Claude Opus 4.8 · max             85.4       85.4     72.9       $0.15        n=50
19    Claude Opus 4.8 · xhigh           85.3       85.2     73.0       $0.13        n=50
20    Claude Opus 4.8 · high            84.9       84.9     68.8       $0.11        n=50
21    Claude Sonnet 4.6 · default       84.5       84.6     71.9       $0.10        n=50
22    Claude Sonnet 5 · default         84.3       84.6     71.6       $0.05        n=50
23    Claude Opus 4.8 · low             83.7       84.0     67.2       $0.10        n=50
24    GLM 5.2 · default                 83.6       84.0     71.3       $0.03        n=50
25    DeepSeek V4 Pro · default         83.1       83.5     69.1       $0.0047      n=50
26    Gemini 3.1 Pro Preview · default  78.7       78.9     66.6       $0.03        n=50   Trailing
Filtered benchmark: 86.2

Top model: gpt-5.4 xhigh (89.3 benchmark)
89.0 raw avg · 79.9 p10 floor · $0.07/call

Hardest call: 65.5 avg
Anthropic / ExxonMobil
ExxonMobil AI governance and safety review for energy operations with Anthropic
Product demo · mixed · Sonnet-generated

Reasoning effort: +1.8 pts
GPT-5.4: none → xhigh
87.5 → 89.3 as effort scales

How this benchmark was made

Methodology

The calls are synthetic, immutable benchmark cases with hidden coaching ground truth.

01

Generate cases

Each case starts with two companies, a call type, duration, quality target, web research, and role notes.

02

Write turn by turn

Half the current calls come from the original GPT-based generator and half from a Claude Sonnet 4.6 generator. Both are written one speaker turn at a time.

03

Judge semantically

Coach models see only the visible case. A judge model sees hidden ground truth, scores eight dimensions, and rolls the sales-critical axes into the benchmark ranking.
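The separation described in the three steps above can be sketched as a small evaluation loop. This is an illustration under assumptions, not the benchmark's actual code: `Case`, `JudgedRun`, and `run_benchmark` are hypothetical names, the coach and judge are passed in as plain callables, and the judge's eight dimensions are collapsed into a single 0-100 score for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen mirrors the "immutable benchmark cases" wording
class Case:
    case_id: str
    transcript: str    # the visible case: all a coach model receives
    ground_truth: str  # hidden coaching answer key: only the judge receives it


@dataclass(frozen=True)
class JudgedRun:
    case_id: str
    config: str
    score: float


def run_benchmark(cases, configs, coach, judge):
    """Run every model config on every case and judge each coaching note.

    coach(config, transcript) -> coaching note (str)
    judge(note, ground_truth) -> score from 0 to 100 (float)
    """
    runs = []
    for case in cases:
        for config in configs:
            # The coach is never handed case.ground_truth.
            note = coach(config, case.transcript)
            score = judge(note, case.ground_truth)
            runs.append(JudgedRun(case.case_id, config, score))
    return runs
```

Because every configuration is run on every case, 50 calls and 26 configurations produce 50 × 26 = 1300 judged runs, which matches the counts reported on this page.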

Calls: 50 · Model configs: 26 · Judged runs: 1300 · Origins: 2 · Axes: 8
50 sales calls in the benchmark

Browse the calls

Each call has a hidden answer key and per-model scores. Calls are listed from highest to lowest average score.

Discovery · 95.6 · Collibra / Berkshire Hathaway (Easiest)
Berkshire Hathaway Data governance discovery across decentralized business units with Collibra
flawed · GPT-generated · 33m

Competitive displacement · 94.4 · Stripe / Pave
Pave Pricing and packaging objection call with Stripe
flawed · GPT-generated · 18m

Discovery · 94.0 · Atlassian / Delta Air Lines
Delta Air Lines Enterprise discovery for service management modernization with Atlassian
flawed · GPT-generated · 31m

Discovery · 93.9 · Vercel / Mercury
Mercury First discovery for frontend platform consolidation with Vercel
flawed · GPT-generated · 22m

Discovery · 93.9 · Workday / McKesson
McKesson HR transformation qualification and stakeholder mapping with Workday
flawed · Sonnet-generated · 27m

Renewal save · 93.8 · Twilio / The Home Depot
The Home Depot Renewal save call after usage and support concerns with Twilio
flawed · GPT-generated · 42m

Product demo · 93.3 · MongoDB / Wayfair
Wayfair Integration deep dive for catalog modernization with MongoDB
excellent · GPT-generated · 58m

Product demo · 92.9 · Palo Alto Networks / Apple
Apple Technical security review for zero trust architecture with Palo Alto Networks
excellent · GPT-generated · 66m

QBR · 92.5 · Amplitude / Duolingo
Duolingo Renewal QBR and expansion planning with Amplitude
excellent · GPT-generated · 52m

Discovery · 91.7 · Workday / McKesson
McKesson HR transformation qualification and stakeholder mapping with Workday
flawed · GPT-generated · 27m

Discovery · 91.7 · OpenAI / CVS Health
CVS Health AI contact-center transformation discovery with OpenAI
excellent · GPT-generated · 61m

Discovery · 91.7 · GitHub / Rippling
Rippling Product-led expansion discovery for developer workflow with GitHub
excellent · GPT-generated · 41m

Competitive displacement · 91.3 · Cloudflare / Canva
Canva Competitive displacement discovery for edge security with Cloudflare
flawed · Sonnet-generated · 47m

Discovery · 90.8 · Vercel / Mercury
Mercury First discovery for frontend platform consolidation with Vercel
flawed · Sonnet-generated · 22m

Product demo · 90.2 · CrowdStrike / Target
Target Security architecture review for endpoint consolidation with CrowdStrike
excellent · GPT-generated · 63m

Competitive displacement · 90.2 · Stripe / Pave
Pave Pricing and packaging objection call with Stripe
flawed · Sonnet-generated · 18m

Product demo · 90.0 · Datadog / Linear
Linear Technical demo for observability and incident response with Datadog
excellent · GPT-generated · 34m

Product demo · 89.9 · Anthropic / ExxonMobil
ExxonMobil AI governance and safety review for energy operations with Anthropic
mixed · GPT-generated · 39m

Product demo · 89.7 · Elastic / JPMorgan Chase
JPMorgan Chase Technical workshop for search and observability consolidation with Elastic
excellent · GPT-generated · 74m

Product demo · 89.3 · MongoDB / Wayfair
Wayfair Integration deep dive for catalog modernization with MongoDB
excellent · Sonnet-generated · 58m

Discovery · 89.3 · HashiCorp / Amazon
Amazon Cloud operating model discussion for internal platform teams with HashiCorp
flawed · GPT-generated · 26m

Product demo · 88.9 · Microsoft / Costco Wholesale
Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft
mixed · Sonnet-generated · 55m

Discovery · 88.6 · NVIDIA / Walmart
Walmart Executive discovery for AI infrastructure and store operations with NVIDIA
excellent · GPT-generated · 57m

Competitive displacement · 88.2 · ServiceNow / Ford Motor Company
Ford Motor Company Procurement negotiation for workflow automation with ServiceNow
mixed · GPT-generated · 35m

Product demo · 88.0 · CrowdStrike / Target
Target Security architecture review for endpoint consolidation with CrowdStrike
excellent · Sonnet-generated · 63m

Discovery · 88.0 · GitHub / Rippling
Rippling Product-led expansion discovery for developer workflow with GitHub
excellent · Sonnet-generated · 41m

Discovery · 88.0 · OpenAI / CVS Health
CVS Health AI contact-center transformation discovery with OpenAI
excellent · Sonnet-generated · 61m

Product demo · 86.7 · Snowflake / Toast
Toast Data platform proof-of-concept kickoff with Snowflake
flawed · GPT-generated · 44m

Discovery · 85.8 · NVIDIA / Walmart
Walmart Executive discovery for AI infrastructure and store operations with NVIDIA
excellent · Sonnet-generated · 57m

Competitive displacement · 85.2 · Cloudflare / Canva
Canva Competitive displacement discovery for edge security with Cloudflare
flawed · GPT-generated · 47m

Discovery · 84.8 · Atlassian / Delta Air Lines
Delta Air Lines Enterprise discovery for service management modernization with Atlassian
flawed · Sonnet-generated · 31m

Discovery · 84.8 · HashiCorp / Amazon
Amazon Cloud operating model discussion for internal platform teams with HashiCorp
flawed · Sonnet-generated · 26m

QBR · 84.7 · Okta / Sweetgreen
Sweetgreen Executive alignment for identity modernization with Okta
mixed · Sonnet-generated · 38m

QBR · 84.3 · Okta / Sweetgreen
Sweetgreen Executive alignment for identity modernization with Okta
mixed · GPT-generated · 38m

Product demo · 84.1 · Figma / The Walt Disney Company
The Walt Disney Company Design collaboration demo with brand and asset workflow discussion with Figma
mixed · GPT-generated · 49m

Renewal save · 83.9 · Salesforce / UnitedHealth Group
UnitedHealth Group Healthcare CRM expansion objection handling with Salesforce
mixed · GPT-generated · 46m

Product demo · 83.5 · Snyk / Runway
Runway Security review before developer-tool rollout with Snyk
mixed · Sonnet-generated · 29m

Product demo · 83.0 · Snyk / Runway
Runway Security review before developer-tool rollout with Snyk
mixed · GPT-generated · 29m

Renewal save · 81.8 · Twilio / The Home Depot
The Home Depot Renewal save call after usage and support concerns with Twilio
flawed · Sonnet-generated · 42m

Renewal save · 81.5 · Salesforce / UnitedHealth Group
UnitedHealth Group Healthcare CRM expansion objection handling with Salesforce
mixed · Sonnet-generated · 46m

Product demo · 81.0 · Datadog / Linear
Linear Technical demo for observability and incident response with Datadog
excellent · Sonnet-generated · 34m

QBR · 80.5 · Amplitude / Duolingo
Duolingo Renewal QBR and expansion planning with Amplitude
excellent · Sonnet-generated · 52m

Product demo · 80.1 · Figma / The Walt Disney Company
The Walt Disney Company Design collaboration demo with brand and asset workflow discussion with Figma
mixed · Sonnet-generated · 49m

Product demo · 79.1 · Palo Alto Networks / Apple
Apple Technical security review for zero trust architecture with Palo Alto Networks
excellent · Sonnet-generated · 66m

Competitive displacement · 77.3 · ServiceNow / Ford Motor Company
Ford Motor Company Procurement negotiation for workflow automation with ServiceNow
mixed · Sonnet-generated · 35m

Product demo · 76.7 · Microsoft / Costco Wholesale
Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft
mixed · GPT-generated · 55m

Product demo · 76.5 · Snowflake / Toast
Toast Data platform proof-of-concept kickoff with Snowflake
flawed · Sonnet-generated · 44m

Product demo · 71.3 · Elastic / JPMorgan Chase
JPMorgan Chase Technical workshop for search and observability consolidation with Elastic
excellent · Sonnet-generated · 74m

Discovery · 70.3 · Collibra / Berkshire Hathaway
Berkshire Hathaway Data governance discovery across decentralized business units with Collibra
flawed · Sonnet-generated · 33m

Product demo · 65.5 · Anthropic / ExxonMobil (Hardest)
ExxonMobil AI governance and safety review for energy operations with Anthropic
mixed · Sonnet-generated · 39m

By model family

Does spending more on reasoning help?

For families with multiple reasoning settings, this shows the benchmark score at each level.

Claude Opus 4.7: +1.7 pts

Effort   Score  Change
low      85.5   baseline
medium   85.5   +0.0 pts
high     86.6   +1.1 pts
xhigh    85.5   +0.0 pts
max      87.2   +1.7 pts

Claude Opus 4.8: +1.7 pts

Effort   Score  Change
low      83.7   baseline
medium   85.6   +1.8 pts
high     84.9   +1.1 pts
xhigh    85.3   +1.6 pts
max      85.4   +1.7 pts

GPT-5.4: +1.8 pts

Effort   Score  Change
none     87.5   baseline
low      87.5   +0.0 pts
medium   88.4   +1.0 pts
high     89.2   +1.7 pts
xhigh    89.3   +1.8 pts

GPT-5.5: +0.8 pts

Effort   Score  Change
none     88.3   baseline
low      87.8   -0.5 pts
medium   89.1   +0.8 pts
high     88.9   +0.6 pts
xhigh    89.1   +0.8 pts
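
The per-level changes above are each level's benchmark score minus the score at the family's lowest effort setting. A minimal sketch of that arithmetic, using the published GPT-5.5 scores; the function name `effort_deltas` is hypothetical, and the calculation works from the rounded scores shown on the page.

```python
def effort_deltas(scores):
    """Return each effort level's change versus the first (baseline) level.

    `scores` maps effort level to benchmark score, ordered from lowest
    to highest effort; the first entry is treated as the baseline.
    """
    levels = list(scores)
    baseline = scores[levels[0]]
    return {level: round(scores[level] - baseline, 1) for level in levels[1:]}


# Published GPT-5.5 benchmark scores by reasoning effort.
gpt_5_5 = {"none": 88.3, "low": 87.8, "medium": 89.1, "high": 88.9, "xhigh": 89.1}
print(effort_deltas(gpt_5_5))
# {'low': -0.5, 'medium': 0.8, 'high': 0.6, 'xhigh': 0.8}
```

For some other rows, recomputing from the rounded scores differs from the published change by 0.1 (for example GPT-5.4 medium: 88.4 − 87.5 = 0.9 against a published +1.0), presumably because the page computes changes from unrounded scores.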