salesevals.com · Evaluated Jul 1, 2026

Which models know sales?

26 model configurations coach GPT- and Sonnet-generated synthetic sales calls with hidden ground truth. A judge scores each coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

Calls: 50
Models: 26
Evaluations: 1300
Benchmark: 86.2

50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026

Ranked by benchmark score

Leaderboard

Weighted for needle recall, sales instinct, prioritization, technical accuracy, false-positive control, and raw score across 1300 evaluations.
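
The page names the six components of the benchmark score but does not publish their weights. As a rough illustration only, a weighted composite of this kind could look like the Python sketch below. The weight values, the axis key names, and the `benchmark_score` function are all hypothetical placeholders, not the site's actual formula.

```python
# HYPOTHETICAL weights: the page lists these six components but not how
# heavily each one counts. Replace with real values if they are published.
WEIGHTS = {
    "needle_recall": 0.25,
    "sales_instinct": 0.20,
    "prioritization": 0.15,
    "technical_accuracy": 0.15,
    "false_positive_control": 0.10,
    "raw_score": 0.15,
}


def benchmark_score(axis_scores, weights=WEIGHTS):
    """Weighted mean of 0-100 axis scores.

    Weights are normalized by their sum, so they do not need to add up to 1.
    """
    missing = set(weights) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(axis_scores[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up axis scores for one coaching note, for illustration only.
    note = {
        "needle_recall": 90.0,
        "sales_instinct": 85.0,
        "prioritization": 80.0,
        "technical_accuracy": 95.0,
        "false_positive_control": 70.0,
        "raw_score": 88.0,
    }
    print(round(benchmark_score(note), 1))
```

Normalizing by the sum of the weights keeps the result on the same 0–100 scale as the inputs, whatever weights are chosen.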


Model leaderboard ranked by sales coaching benchmark score, with raw average, downside floor, estimated coach-run cost, score distribution, and sample count.
#	Model	Benchmark	Raw avg	P10 floor	Cost / call	n
1	GPT-5.4 · xhigh (Leader)	89.3	89.0	79.9	$0.07	50
2	GPT-5.4 · high	89.2	89.0	81.9	$0.07	50
3	GPT-5.5 · medium	89.1	88.8	82.0	$0.17	50
4	GPT-5.5 · xhigh	89.1	89.0	78.5	$0.18	50
5	GPT-5.5 · high	88.9	88.6	79.4	$0.18	50
6	GPT-5.4 · medium	88.4	88.3	80.1	$0.07	50
7	GPT-5.5 · none	88.3	88.1	76.6	$0.17	50
8	GPT-5.5 · low	87.8	87.7	78.3	$0.17	50
9	Claude Fable 5 · high	87.7	87.5	77.0	$0.32	50
10	GPT-5.4 · low	87.5	87.4	78.3	$0.07	50
11	GPT-5.4 · none	87.5	87.4	78.2	$0.07	50
12	Claude Opus 4.7 · max	87.2	87.3	75.0	$0.16	50
13	Claude Opus 4.7 · high	86.6	86.8	77.8	$0.13	50
14	Claude Opus 4.8 · medium	85.6	85.8	72.1	$0.11	50
15	Claude Opus 4.7 · medium	85.5	85.6	74.0	$0.11	50
16	Claude Opus 4.7 · xhigh	85.5	85.6	73.1	$0.14	50
17	Claude Opus 4.7 · low	85.5	85.6	75.6	$0.10	50
18	Claude Opus 4.8 · max	85.4	85.4	72.9	$0.15	50
19	Claude Opus 4.8 · xhigh	85.3	85.2	73.0	$0.13	50
20	Claude Opus 4.8 · high	84.9	84.9	68.8	$0.11	50
21	Claude Sonnet 4.6 · default	84.5	84.6	71.9	$0.10	50
22	Claude Sonnet 5 · default	84.3	84.6	71.6	$0.05	50
23	Claude Opus 4.8 · low	83.7	84.0	67.2	$0.10	50
24	GLM 5.2 · default	83.6	84.0	71.3	$0.03	50
25	DeepSeek V4 Pro · default	83.1	83.5	69.1	$0.0047	50
26	Gemini 3.1 Pro Preview · default (Trailing)	78.7	78.9	66.6	$0.03	50
Filtered benchmark		86.2
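
The raw average and P10 floor columns summarize each configuration's 50 per-call scores. The sketch below shows one way such a summary could be computed, assuming the P10 floor is the 10th percentile of those scores with linear interpolation between neighbors. The page does not state which percentile method it uses, and `percentile` and `summarize` are illustrative names rather than anything from the site.

```python
from statistics import mean


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize(call_scores):
    """Raw average, assumed P10 floor, and sample count for one model config."""
    return {
        "raw_avg": round(mean(call_scores), 1),
        "p10_floor": round(percentile(call_scores, 10), 1),
        "n": len(call_scores),
    }


if __name__ == "__main__":
    # Placeholder scores, not real benchmark data.
    scores = [62.0, 71.5, 80.0, 84.0, 88.5, 90.0, 91.0, 93.5, 95.0, 97.0]
    print(summarize(scores))
```

A low-percentile floor like this is useful next to the average because two configurations with the same mean can differ a lot in how badly they do on their weakest calls.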

Top model

89.3 benchmark

gpt-5.4 xhigh

89.0 raw avg · 79.9 p10 floor · $0.07/call

Hardest call

65.5 avg

Anthropic · ExxonMobil

ExxonMobil AI governance and safety review for energy operations with Anthropic

Product demo · mixed · Sonnet-generated

Reasoning effort

+1.8 pts

GPT-5.4: none → xhigh

87.5 → 89.3 as effort scales

How this benchmark was made

Methodology

The calls are synthetic, immutable benchmark cases with hidden coaching ground truth.


Generate cases

Each case starts with two companies, a call type, duration, quality target, web research, and role notes.

Write turn by turn

Half the current calls come from the original GPT-based generator and half from a Claude Sonnet 4.6 generator. Both are written one speaker turn at a time.

Judge semantically

Coach models see only the visible case. A judge model sees hidden ground truth, scores eight dimensions, and rolls the sales-critical axes into the benchmark ranking.

Calls: 50
Model configs: 26
Judged runs: 1300
Origins: 2
Axes: 8
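
The separation between what the coach sees and what the judge sees, and the 50 × 26 = 1300 run count, can be sketched as a small data model. This is an illustrative sketch only: the `BenchmarkCase` class, its field names, and the helper functions are assumptions made for the example, not the benchmark's actual code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)  # frozen mirrors the "immutable benchmark cases" idea
class BenchmarkCase:
    case_id: str
    visible_transcript: str   # what the coach model is shown
    hidden_ground_truth: str  # answer key, shown only to the judge


def coach_input(case):
    """Payload for a coach model: the visible case only."""
    return {"case_id": case.case_id, "transcript": case.visible_transcript}


def judge_input(case, coaching_note):
    """Payload for the judge: visible case, hidden key, and the note to score."""
    return {
        "case_id": case.case_id,
        "transcript": case.visible_transcript,
        "ground_truth": case.hidden_ground_truth,
        "coaching_note": coaching_note,
    }


def evaluation_grid(case_ids, model_configs):
    """Every (case, model config) pair; each pair is one judged run."""
    return list(product(case_ids, model_configs))


if __name__ == "__main__":
    cases = [f"call-{i:02d}" for i in range(50)]
    configs = [f"config-{j:02d}" for j in range(26)]
    print(len(evaluation_grid(cases, configs)))  # 50 * 26 pairs
```

Keeping the answer key out of the coach payload by construction, rather than by convention, makes it harder to leak ground truth into a coaching run by accident.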

50 sales calls in the benchmark

Browse the calls

Pick a call to see the hidden answer key and model scores.


Call type	Avg score	Companies	Call	Quality	Generator	Duration	Note
Discovery	95.6	Collibra / Berkshire Hathaway	Berkshire Hathaway Data governance discovery across decentralized business units with Collibra	flawed	GPT-generated	33m	Easiest
Competitive displacement	94.4	Stripe / Pave	Pave Pricing and packaging objection call with Stripe	flawed	GPT-generated	18m	
Discovery	94.0	Atlassian / Delta Air Lines	Delta Air Lines Enterprise discovery for service management modernization with Atlassian	flawed	GPT-generated	31m	
Discovery	93.9	Vercel / Mercury	Mercury First discovery for frontend platform consolidation with Vercel	flawed	GPT-generated	22m	
Discovery	93.9	Workday / McKesson	McKesson HR transformation qualification and stakeholder mapping with Workday	flawed	Sonnet-generated	27m	
Renewal save	93.8	Twilio / The Home Depot	The Home Depot Renewal save call after usage and support concerns with Twilio	flawed	GPT-generated	42m	
Product demo	93.3	MongoDB / Wayfair	Wayfair Integration deep dive for catalog modernization with MongoDB	excellent	GPT-generated	58m	
Product demo	92.9	Palo Alto Networks / Apple	Apple Technical security review for zero trust architecture with Palo Alto Networks	excellent	GPT-generated	66m	
QBR	92.5	Amplitude / Duolingo	Duolingo Renewal QBR and expansion planning with Amplitude	excellent	GPT-generated	52m	
Discovery	91.7	Workday / McKesson	McKesson HR transformation qualification and stakeholder mapping with Workday	flawed	GPT-generated	27m	
Discovery	91.7	OpenAI / CVS Health	CVS Health AI contact-center transformation discovery with OpenAI	excellent	GPT-generated	61m	
Discovery	91.7	GitHub / Rippling	Rippling Product-led expansion discovery for developer workflow with GitHub	excellent	GPT-generated	41m	
Competitive displacement	91.3	Cloudflare / Canva	Canva Competitive displacement discovery for edge security with Cloudflare	flawed	Sonnet-generated	47m	
Discovery	90.8	Vercel / Mercury	Mercury First discovery for frontend platform consolidation with Vercel	flawed	Sonnet-generated	22m	
Product demo	90.2	CrowdStrike / Target	Target Security architecture review for endpoint consolidation with CrowdStrike	excellent	GPT-generated	63m	
Competitive displacement	90.2	Stripe / Pave	Pave Pricing and packaging objection call with Stripe	flawed	Sonnet-generated	18m	
Product demo	90.0	Datadog / Linear	Linear Technical demo for observability and incident response with Datadog	excellent	GPT-generated	34m	
Product demo	89.9	Anthropic / ExxonMobil	ExxonMobil AI governance and safety review for energy operations with Anthropic	mixed	GPT-generated	39m	
Product demo	89.7	Elastic / JPMorgan Chase	JPMorgan Chase Technical workshop for search and observability consolidation with Elastic	excellent	GPT-generated	74m	
Product demo	89.3	MongoDB / Wayfair	Wayfair Integration deep dive for catalog modernization with MongoDB	excellent	Sonnet-generated	58m	
Discovery	89.3	HashiCorp / Amazon	Amazon Cloud operating model discussion for internal platform teams with HashiCorp	flawed	GPT-generated	26m	
Product demo	88.9	Microsoft / Costco Wholesale	Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft	mixed	Sonnet-generated	55m	
Discovery	88.6	NVIDIA / Walmart	Walmart Executive discovery for AI infrastructure and store operations with NVIDIA	excellent	GPT-generated	57m	
Competitive displacement	88.2	ServiceNow / Ford Motor Company	Ford Motor Company Procurement negotiation for workflow automation with ServiceNow	mixed	GPT-generated	35m	
Product demo	88.0	CrowdStrike / Target	Target Security architecture review for endpoint consolidation with CrowdStrike	excellent	Sonnet-generated	63m	
Discovery	88.0	GitHub / Rippling	Rippling Product-led expansion discovery for developer workflow with GitHub	excellent	Sonnet-generated	41m	
Discovery	88.0	OpenAI / CVS Health	CVS Health AI contact-center transformation discovery with OpenAI	excellent	Sonnet-generated	61m	
Product demo	86.7	Snowflake / Toast	Toast Data platform proof-of-concept kickoff with Snowflake	flawed	GPT-generated	44m	
Discovery	85.8	NVIDIA / Walmart	Walmart Executive discovery for AI infrastructure and store operations with NVIDIA	excellent	Sonnet-generated	57m	
Competitive displacement	85.2	Cloudflare / Canva	Canva Competitive displacement discovery for edge security with Cloudflare	flawed	GPT-generated	47m	
Discovery	84.8	Atlassian / Delta Air Lines	Delta Air Lines Enterprise discovery for service management modernization with Atlassian	flawed	Sonnet-generated	31m	
Discovery	84.8	HashiCorp / Amazon	Amazon Cloud operating model discussion for internal platform teams with HashiCorp	flawed	Sonnet-generated	26m	
QBR	84.7	Okta / Sweetgreen	Sweetgreen Executive alignment for identity modernization with Okta	mixed	Sonnet-generated	38m	
QBR	84.3	Okta / Sweetgreen	Sweetgreen Executive alignment for identity modernization with Okta	mixed	GPT-generated	38m	
Product demo	84.1	Figma / The Walt Disney Company	The Walt Disney Company Design collaboration demo with brand and asset workflow discussion with Figma	mixed	GPT-generated	49m	
Renewal save	83.9	Salesforce / UnitedHealth Group	UnitedHealth Group Healthcare CRM expansion objection handling with Salesforce	mixed	GPT-generated	46m	
Product demo	83.5	Snyk / Runway	Runway Security review before developer-tool rollout with Snyk	mixed	Sonnet-generated	29m	
Product demo	83.0	Snyk / Runway	Runway Security review before developer-tool rollout with Snyk	mixed	GPT-generated	29m	
Renewal save	81.8	Twilio / The Home Depot	The Home Depot Renewal save call after usage and support concerns with Twilio	flawed	Sonnet-generated	42m	
Renewal save	81.5	Salesforce / UnitedHealth Group	UnitedHealth Group Healthcare CRM expansion objection handling with Salesforce	mixed	Sonnet-generated	46m	
Product demo	81.0	Datadog / Linear	Linear Technical demo for observability and incident response with Datadog	excellent	Sonnet-generated	34m	
QBR	80.5	Amplitude / Duolingo	Duolingo Renewal QBR and expansion planning with Amplitude	excellent	Sonnet-generated	52m	
Product demo	80.1	Figma / The Walt Disney Company	The Walt Disney Company Design collaboration demo with brand and asset workflow discussion with Figma	mixed	Sonnet-generated	49m	
Product demo	79.1	Palo Alto Networks / Apple	Apple Technical security review for zero trust architecture with Palo Alto Networks	excellent	Sonnet-generated	66m	
Competitive displacement	77.3	ServiceNow / Ford Motor Company	Ford Motor Company Procurement negotiation for workflow automation with ServiceNow	mixed	Sonnet-generated	35m	
Product demo	76.7	Microsoft / Costco Wholesale	Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft	mixed	GPT-generated	55m	
Product demo	76.5	Snowflake / Toast	Toast Data platform proof-of-concept kickoff with Snowflake	flawed	Sonnet-generated	44m	
Product demo	71.3	Elastic / JPMorgan Chase	JPMorgan Chase Technical workshop for search and observability consolidation with Elastic	excellent	Sonnet-generated	74m	
Discovery	70.3	Collibra / Berkshire Hathaway	Berkshire Hathaway Data governance discovery across decentralized business units with Collibra	flawed	Sonnet-generated	33m	
Product demo	65.5	Anthropic / ExxonMobil	ExxonMobil AI governance and safety review for energy operations with Anthropic	mixed	Sonnet-generated	39m	Hardest

By model family

Does spending more on reasoning help?

For families with multiple reasoning settings, this shows the benchmark score at each level.

Claude Opus 4.7: +1.7 pts
Effort	Benchmark	Change vs. baseline
low	85.5	baseline
medium	85.5	+0.0 pts
high	86.6	+1.1 pts
xhigh	85.5	+0.0 pts
max	87.2	+1.7 pts

Claude Opus 4.8: +1.7 pts
Effort	Benchmark	Change vs. baseline
low	83.7	baseline
medium	85.6	+1.8 pts
high	84.9	+1.1 pts
xhigh	85.3	+1.6 pts
max	85.4	+1.7 pts

GPT-5.4: +1.8 pts
Effort	Benchmark	Change vs. baseline
none	87.5	baseline
low	87.5	+0.0 pts
medium	88.4	+1.0 pts
high	89.2	+1.7 pts
xhigh	89.3	+1.8 pts

GPT-5.5: +0.8 pts
Effort	Benchmark	Change vs. baseline
none	88.3	baseline
low	87.8	-0.5 pts
medium	89.1	+0.8 pts
high	88.9	+0.6 pts
xhigh	89.1	+0.8 pts
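
The per-level changes above are each setting's benchmark score minus the score at the family's lowest setting. A minimal sketch that recomputes them from the published one-decimal scores follows; `EFFORT_SCORES` and `deltas_vs_baseline` are illustrative names. Because the inputs are already rounded, a recomputed change can differ from the displayed one by about 0.1 (for example, Claude Opus 4.8 at medium), presumably because the site subtracts unrounded scores.

```python
# Benchmark scores per reasoning-effort level, copied from the page.
# Levels are listed from lowest to highest effort within each family.
EFFORT_SCORES = {
    "Claude Opus 4.7": {"low": 85.5, "medium": 85.5, "high": 86.6, "xhigh": 85.5, "max": 87.2},
    "Claude Opus 4.8": {"low": 83.7, "medium": 85.6, "high": 84.9, "xhigh": 85.3, "max": 85.4},
    "GPT-5.4": {"none": 87.5, "low": 87.5, "medium": 88.4, "high": 89.2, "xhigh": 89.3},
    "GPT-5.5": {"none": 88.3, "low": 87.8, "medium": 89.1, "high": 88.9, "xhigh": 89.1},
}


def deltas_vs_baseline(levels):
    """Change in score at each level relative to the first (lowest) level.

    Relies on dicts preserving insertion order, so the first entry is the
    baseline setting.
    """
    baseline = next(iter(levels.values()))
    return {name: round(score - baseline, 1) for name, score in levels.items()}


if __name__ == "__main__":
    for family, levels in EFFORT_SCORES.items():
        changes = deltas_vs_baseline(levels)
        summary = ", ".join(f"{name} {change:+.1f}" for name, change in changes.items())
        print(f"{family}: {summary}")
```

Read this way, the headline figure for each family is the change at its highest effort setting, which is not always the best-scoring setting: GPT-5.5 at medium matches xhigh, and Claude Opus 4.8 peaks at medium.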