salesevals.com · Evaluated Jul 1, 2026

Which models know sales?

26 model configurations coach synthetic sales calls, generated by GPT and Sonnet, that carry hidden ground truth. A judge scores each coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

Calls: 50
Models: 26
Evaluations: 1300
Benchmark: 86.2
50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026
Change vs that model's overall average

Where models do better or worse: by call quality

Average overall score for each call-quality slice. Green means the model does better than its own baseline (its overall average) on that slice; red means worse.
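The delta in each cell can be read as the slice average minus the model's own overall average. The sketch below illustrates that arithmetic under stated assumptions: the function name `slice_deltas` and the sample scores are hypothetical, and the page's actual pipeline is not shown in the source.

```python
from statistics import mean


def slice_deltas(scores):
    """Return (baseline, {slice: (slice_avg, delta)}) for one model.

    `scores` is a list of (slice_label, score) pairs, one per evaluated call.
    The baseline is the model's average over all of its calls.
    """
    baseline = mean(score for _, score in scores)
    by_slice = {}
    for label, score in scores:
        by_slice.setdefault(label, []).append(score)
    deltas = {}
    for label, values in by_slice.items():
        slice_avg = mean(values)
        deltas[label] = (slice_avg, slice_avg - baseline)
    return baseline, deltas


# Hypothetical scores for illustration only; these are not benchmark data.
example = [
    ("Excellent", 90.0),
    ("Excellent", 88.0),
    ("Mixed", 84.0),
    ("Flawed", 92.0),
    ("Flawed", 91.0),
]
baseline, deltas = slice_deltas(example)
for label, (avg, delta) in deltas.items():
    print(f"{label}: {avg:.1f} ({delta:+.1f})")
```

A positive delta corresponds to a green cell and a negative delta to a red one. Because the published figures are rounded to one decimal, recomputing a delta from the displayed averages can differ from the displayed delta by 0.1.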

Model performance by benchmark segment, shown as average score and delta from each model baseline.
| Configuration | Model | Overall | Excellent (n=18) | Mixed (n=14) | Flawed (n=18) |
|---|---|---|---|---|---|
| gpt-5.4 xhigh | GPT-5.4 | 89.0 | 88.9 (-0.1) | 87.6 (-1.3) | 90.1 (+1.1) |
| gpt-5.4 high | GPT-5.4 | 89.0 | 88.2 (-0.7) | 87.7 (-1.2) | 90.7 (+1.7) |
| gpt-5.5 medium | GPT-5.5 | 88.8 | 90.1 (+1.2) | 86.7 (-2.1) | 89.3 (+0.4) |
| gpt-5.5 xhigh | GPT-5.5 | 89.0 | 89.8 (+0.8) | 87.7 (-1.3) | 89.2 (+0.2) |
| gpt-5.5 high | GPT-5.5 | 88.6 | 90.1 (+1.6) | 84.5 (-4.1) | 90.2 (+1.6) |
| gpt-5.4 medium | GPT-5.4 | 88.3 | 88.1 (-0.2) | 85.6 (-2.7) | 90.6 (+2.3) |
| gpt-5.5 none | GPT-5.5 | 88.1 | 89.6 (+1.5) | 85.0 (-3.1) | 89.1 (+1.0) |
| gpt-5.5 low | GPT-5.5 | 87.7 | 89.4 (+1.7) | 85.6 (-2.1) | 87.7 (0.0) |
| fable 5 high | Claude Fable 5 | 87.5 | 86.9 (-0.6) | 85.9 (-1.6) | 89.3 (+1.8) |
| gpt-5.4 low | GPT-5.4 | 87.4 | 87.9 (+0.6) | 84.8 (-2.6) | 88.8 (+1.4) |
| gpt-5.4 none | GPT-5.4 | 87.4 | 87.9 (+0.5) | 84.5 (-2.9) | 89.1 (+1.7) |
| opus 4.7 max | Claude Opus 4.7 | 87.3 | 88.8 (+1.6) | 83.6 (-3.6) | 88.5 (+1.2) |
| opus 4.7 high | Claude Opus 4.7 | 86.8 | 87.1 (+0.3) | 83.4 (-3.5) | 89.2 (+2.4) |
| opus 4.8 medium | Claude Opus 4.8 | 85.8 | 86.6 (+0.8) | 81.1 (-4.7) | 88.6 (+2.9) |
| opus 4.7 medium | Claude Opus 4.7 | 85.6 | 87.2 (+1.6) | 80.0 (-5.6) | 88.4 (+2.8) |
| opus 4.7 xhigh | Claude Opus 4.7 | 85.6 | 86.8 (+1.2) | 81.1 (-4.5) | 87.9 (+2.3) |
| opus 4.7 low | Claude Opus 4.7 | 85.6 | 86.6 (+1.0) | 80.8 (-4.8) | 88.4 (+2.8) |
| opus 4.8 max | Claude Opus 4.8 | 85.4 | 85.9 (+0.5) | 80.4 (-5.0) | 88.8 (+3.4) |
| opus 4.8 xhigh | Claude Opus 4.8 | 85.2 | 88.1 (+2.8) | 79.4 (-5.9) | 87.0 (+1.8) |
| opus 4.8 high | Claude Opus 4.8 | 84.9 | 87.3 (+2.4) | 77.4 (-7.5) | 88.4 (+3.5) |
| sonnet 4.6 | Claude Sonnet 4.6 | 84.6 | 84.7 (+0.1) | 80.1 (-4.4) | 87.9 (+3.3) |
| sonnet 5 | Claude Sonnet 5 | 84.6 | 84.0 (-0.6) | 83.4 (-1.2) | 86.2 (+1.6) |
| opus 4.8 low | Claude Opus 4.8 | 84.0 | 86.7 (+2.7) | 77.0 (-7.0) | 86.7 (+2.7) |
| glm 5.2 | GLM 5.2 | 84.0 | 85.8 (+1.8) | 79.2 (-4.8) | 85.9 (+1.9) |
| deepseek v4 pro | DeepSeek V4 Pro | 83.5 | 86.2 (+2.7) | 76.6 (-6.9) | 86.2 (+2.7) |
| gemini 3.1 pro preview | Gemini 3.1 Pro Preview | 78.9 | 81.6 (+2.7) | 69.8 (-9.1) | 83.3 (+4.4) |
Change vs that model's overall average

Where models do better or worse: by call type

Same view, sliced by what kind of call it was.

Model performance by benchmark segment, shown as average score and delta from each model baseline.
| Configuration | Model | Overall | Discovery (n=16) | Product demo (n=20) | Renewal save (n=4) | QBR (n=4) | Competitive displacement (n=6) |
|---|---|---|---|---|---|---|---|
| gpt-5.4 xhigh | GPT-5.4 | 89.0 | 89.7 (+0.7) | 88.5 (-0.5) | 86.1 (-2.9) | 88.0 (-1.0) | 91.5 (+2.5) |
| gpt-5.4 high | GPT-5.4 | 89.0 | 90.4 (+1.5) | 87.6 (-1.4) | 89.8 (+0.8) | 87.5 (-1.5) | 90.0 (+1.0) |
| gpt-5.5 medium | GPT-5.5 | 88.8 | 89.9 (+1.1) | 87.9 (-0.9) | 87.0 (-1.8) | 90.8 (+1.9) | 89.0 (+0.2) |
| gpt-5.5 xhigh | GPT-5.5 | 89.0 | 89.3 (+0.2) | 88.4 (-0.6) | 89.8 (+0.7) | 88.8 (-0.3) | 90.2 (+1.1) |
| gpt-5.5 high | GPT-5.5 | 88.6 | 90.4 (+1.8) | 87.8 (-0.8) | 80.8 (-7.8) | 87.8 (-0.8) | 92.0 (+3.4) |
| gpt-5.4 medium | GPT-5.4 | 88.3 | 90.1 (+1.8) | 87.3 (-0.9) | 84.3 (-4.0) | 86.8 (-1.5) | 90.2 (+1.9) |
| gpt-5.5 none | GPT-5.5 | 88.1 | 90.1 (+2.0) | 86.5 (-1.7) | 86.0 (-2.1) | 88.0 (-0.1) | 90.0 (+1.9) |
| gpt-5.5 low | GPT-5.5 | 87.7 | 89.0 (+1.3) | 86.5 (-1.2) | 87.0 (-0.7) | 86.5 (-1.2) | 89.3 (+1.6) |
| fable 5 high | Claude Fable 5 | 87.5 | 89.7 (+2.2) | 84.8 (-2.7) | 87.3 (-0.3) | 88.5 (+1.0) | 90.2 (+2.6) |
| gpt-5.4 low | GPT-5.4 | 87.4 | 89.3 (+1.9) | 86.5 (-0.9) | 86.3 (-1.1) | 87.0 (-0.4) | 86.2 (-1.2) |
| gpt-5.4 none | GPT-5.4 | 87.4 | 88.9 (+1.6) | 86.5 (-0.9) | 86.3 (-1.1) | 86.5 (-0.9) | 87.5 (+0.1) |
| opus 4.7 max | Claude Opus 4.7 | 87.3 | 91.3 (+4.1) | 84.5 (-2.8) | 89.3 (+2.0) | 85.8 (-1.5) | 85.5 (-1.8) |
| opus 4.7 high | Claude Opus 4.7 | 86.8 | 90.1 (+3.2) | 83.7 (-3.2) | 86.8 (-0.1) | 86.3 (-0.6) | 89.2 (+2.3) |
| opus 4.8 medium | Claude Opus 4.8 | 85.8 | 89.4 (+3.6) | 81.5 (-4.2) | 87.3 (+1.5) | 86.5 (+0.7) | 88.7 (+2.9) |
| opus 4.7 medium | Claude Opus 4.7 | 85.6 | 89.8 (+4.2) | 81.6 (-4.0) | 86.3 (+0.6) | 84.8 (-0.9) | 88.0 (+2.4) |
| opus 4.7 xhigh | Claude Opus 4.7 | 85.6 | 87.9 (+2.3) | 83.0 (-2.5) | 86.3 (+0.7) | 84.0 (-1.6) | 88.5 (+2.9) |
| opus 4.7 low | Claude Opus 4.7 | 85.6 | 90.2 (+4.6) | 82.0 (-3.6) | 86.8 (+1.1) | 84.5 (-1.1) | 85.5 (-0.1) |
| opus 4.8 max | Claude Opus 4.8 | 85.4 | 88.8 (+3.4) | 81.3 (-4.1) | 84.8 (-0.7) | 87.8 (+2.3) | 88.8 (+3.4) |
| opus 4.8 xhigh | Claude Opus 4.8 | 85.2 | 88.5 (+3.3) | 81.5 (-3.8) | 82.8 (-2.5) | 86.8 (+1.5) | 89.8 (+4.6) |
| opus 4.8 high | Claude Opus 4.8 | 84.9 | 89.9 (+5.0) | 81.4 (-3.5) | 78.5 (-6.4) | 86.0 (+1.1) | 86.8 (+1.9) |
| sonnet 4.6 | Claude Sonnet 4.6 | 84.6 | 86.9 (+2.3) | 83.3 (-1.2) | 82.0 (-2.6) | 82.0 (-2.6) | 85.8 (+1.3) |
| sonnet 5 | Claude Sonnet 5 | 84.6 | 86.6 (+2.0) | 82.0 (-2.6) | 88.0 (+3.4) | 82.8 (-1.8) | 87.2 (+2.6) |
| opus 4.8 low | Claude Opus 4.8 | 84.0 | 88.1 (+4.1) | 80.5 (-3.5) | 85.8 (+1.8) | 81.8 (-2.2) | 85.0 (+1.0) |
| glm 5.2 | GLM 5.2 | 84.0 | 86.3 (+2.3) | 83.0 (-1.0) | 85.5 (+1.5) | 79.5 (-4.5) | 83.2 (-0.8) |
| deepseek v4 pro | DeepSeek V4 Pro | 83.5 | 87.4 (+3.9) | 81.5 (-2.0) | 82.0 (-1.5) | 82.3 (-1.3) | 81.8 (-1.7) |
| gemini 3.1 pro preview | Gemini 3.1 Pro Preview | 78.9 | 84.4 (+5.5) | 74.7 (-4.2) | 74.3 (-4.7) | 77.5 (-1.4) | 82.3 (+3.4) |
GPT-generated vs Sonnet-generated calls

Where models do better or worse: by transcript generator

This slice keeps the default leaderboard behavior: all available runs are included, and each cell shows the available score for that transcript-origin bucket.
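With two buckets of equal size (n=25 each), a model's overall average is the midpoint of its two bucket averages, so the two deltas come out equal and opposite. The sketch below checks this using the gpt-5.5 none averages from this page (92.0 and 84.3); the function name `bucket_deltas` is hypothetical, and the inputs are the rounded published figures, so the results may differ from the displayed values by rounding.

```python
def bucket_deltas(avg_a, n_a, avg_b, n_b):
    """Return (overall, delta_a, delta_b) for two buckets of calls.

    The overall average is the call-count-weighted mean of the two
    bucket averages; each delta is the bucket average minus the overall.
    """
    overall = (avg_a * n_a + avg_b * n_b) / (n_a + n_b)
    return overall, avg_a - overall, avg_b - overall


# gpt-5.5 none: GPT-generated 92.0 (n=25), Sonnet-generated 84.3 (n=25).
overall, d_gpt, d_sonnet = bucket_deltas(92.0, 25, 84.3, 25)
print(f"overall={overall:.2f}, GPT delta={d_gpt:+.2f}, Sonnet delta={d_sonnet:+.2f}")
```

The same weighting applies to slices with unequal counts, such as the call-quality and call-type views, where the deltas no longer mirror each other.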

Model performance by benchmark segment, shown as average score and delta from each model baseline.
| Configuration | Model | Overall | GPT-generated (n=25) | Sonnet-generated (n=25) |
|---|---|---|---|---|
| gpt-5.4 xhigh | GPT-5.4 | 89.0 | 92.0 (+3.0) | 86.0 (-3.0) |
| gpt-5.4 high | GPT-5.4 | 89.0 | 92.0 (+3.1) | 85.9 (-3.1) |
| gpt-5.5 medium | GPT-5.5 | 88.8 | 91.7 (+2.8) | 86.0 (-2.8) |
| gpt-5.5 xhigh | GPT-5.5 | 89.0 | 92.0 (+3.0) | 86.0 (-3.0) |
| gpt-5.5 high | GPT-5.5 | 88.6 | 91.7 (+3.2) | 85.4 (-3.2) |
| gpt-5.4 medium | GPT-5.4 | 88.3 | 90.9 (+2.6) | 85.6 (-2.6) |
| gpt-5.5 none | GPT-5.5 | 88.1 | 92.0 (+3.9) | 84.3 (-3.9) |
| gpt-5.5 low | GPT-5.5 | 87.7 | 90.8 (+3.1) | 84.6 (-3.1) |
| fable 5 high | Claude Fable 5 | 87.5 | 91.0 (+3.4) | 84.1 (-3.4) |
| gpt-5.4 low | GPT-5.4 | 87.4 | 90.3 (+3.0) | 84.4 (-3.0) |
| gpt-5.4 none | GPT-5.4 | 87.4 | 90.8 (+3.5) | 83.9 (-3.5) |
| opus 4.7 max | Claude Opus 4.7 | 87.3 | 90.2 (+3.0) | 84.3 (-3.0) |
| opus 4.7 high | Claude Opus 4.7 | 86.8 | 89.6 (+2.7) | 84.1 (-2.7) |
| opus 4.8 medium | Claude Opus 4.8 | 85.8 | 89.2 (+3.4) | 82.4 (-3.4) |
| opus 4.7 medium | Claude Opus 4.7 | 85.6 | 89.0 (+3.3) | 82.3 (-3.3) |
| opus 4.7 xhigh | Claude Opus 4.7 | 85.6 | 89.4 (+3.8) | 81.8 (-3.8) |
| opus 4.7 low | Claude Opus 4.7 | 85.6 | 87.6 (+2.0) | 83.6 (-2.0) |
| opus 4.8 max | Claude Opus 4.8 | 85.4 | 88.6 (+3.1) | 82.3 (-3.1) |
| opus 4.8 xhigh | Claude Opus 4.8 | 85.2 | 88.1 (+2.9) | 82.4 (-2.9) |
| opus 4.8 high | Claude Opus 4.8 | 84.9 | 89.4 (+4.5) | 80.4 (-4.5) |
| sonnet 4.6 | Claude Sonnet 4.6 | 84.6 | 88.8 (+4.3) | 80.3 (-4.3) |
| sonnet 5 | Claude Sonnet 5 | 84.6 | 87.4 (+2.8) | 81.8 (-2.8) |
| opus 4.8 low | Claude Opus 4.8 | 84.0 | 88.0 (+4.0) | 80.0 (-4.0) |
| glm 5.2 | GLM 5.2 | 84.0 | 86.1 (+2.1) | 81.8 (-2.1) |
| deepseek v4 pro | DeepSeek V4 Pro | 83.5 | 85.8 (+2.3) | 81.2 (-2.3) |
| gemini 3.1 pro preview | Gemini 3.1 Pro Preview | 78.9 | 82.4 (+3.5) | 75.4 (-3.5) |