salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.

Calls: 25 · Models: 18 · Evaluations: 450 · Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026
Change vs that model's overall average

Where models do better or worse: by call quality

Average overall score by call quality. Green (a positive delta) means the model does better than its own baseline on that slice; red (a negative delta) means worse.

Model performance by benchmark segment, shown as average score and delta from each model baseline.

| Configuration | Model | Overall | Excellent (n=9) | Mixed (n=7) | Flawed (n=9) |
|---|---|---|---|---|---|
| gpt-5.4 high | GPT-5.4 | 92.0 | 92.7 (+0.6) | 90.1 (-1.9) | 92.9 (+0.8) |
| gpt-5.5 none | GPT-5.5 | 92.0 | 94.0 (+2.0) | 87.3 (-4.7) | 93.7 (+1.7) |
| gpt-5.5 xhigh | GPT-5.5 | 92.0 | 94.2 (+2.2) | 88.9 (-3.1) | 92.2 (+0.2) |
| gpt-5.4 xhigh | GPT-5.4 | 92.0 | 92.3 (+0.4) | 91.1 (-0.8) | 92.2 (+0.3) |
| gpt-5.5 high | GPT-5.5 | 91.7 | 93.9 (+2.2) | 86.9 (-4.9) | 93.3 (+1.6) |
| gpt-5.5 medium | GPT-5.5 | 91.7 | 93.8 (+2.1) | 88.0 (-3.7) | 92.4 (+0.8) |
| gpt-5.4 medium | GPT-5.4 | 90.9 | 90.1 (-0.8) | 88.1 (-2.8) | 93.9 (+3.0) |
| gpt-5.4 none | GPT-5.4 | 90.8 | 91.3 (+0.5) | 89.0 (-1.8) | 91.8 (+0.9) |
| gpt-5.5 low | GPT-5.5 | 90.8 | 93.8 (+3.0) | 86.9 (-3.9) | 90.9 (+0.1) |
| gpt-5.4 low | GPT-5.4 | 90.3 | 91.8 (+1.5) | 88.3 (-2.0) | 90.4 (+0.1) |
| opus 4.7 max | Claude Opus 4.7 | 90.2 | 91.2 (+1.0) | 85.4 (-4.8) | 93.0 (+2.8) |
| opus 4.7 high | Claude Opus 4.7 | 89.6 | 91.3 (+1.8) | 83.9 (-5.7) | 92.2 (+2.7) |
| opus 4.7 xhigh | Claude Opus 4.7 | 89.4 | 91.4 (+2.0) | 83.3 (-6.2) | 92.2 (+2.8) |
| opus 4.7 medium | Claude Opus 4.7 | 89.0 | 90.9 (+1.9) | 82.6 (-6.4) | 92.0 (+3.0) |
| sonnet 4.6 | Claude Sonnet 4.6 | 88.8 | 88.9 (0.0) | 86.1 (-2.7) | 90.9 (+2.0) |
| opus 4.7 low | Claude Opus 4.7 | 87.6 | 88.9 (+1.2) | 82.0 (-5.6) | 90.8 (+3.1) |
| deepseek v4 pro | DeepSeek V4 Pro | 85.8 | 90.2 (+4.5) | 76.4 (-9.3) | 88.6 (+2.8) |
| gemini 3.1 pro preview | Gemini 3.1 Pro Preview | 82.4 | 87.2 (+4.8) | 72.4 (-10.0) | 85.4 (+3.0) |
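
Each delta in the table above is the slice average minus that model's overall average. The sketch below recomputes this for the gpt-5.4 high row. The function names (`weighted_overall`, `slice_deltas`) are hypothetical, and weighting each slice by its call count is an assumption, although it does reproduce the 92.0 overall shown for this row. Because the inputs here are the rounded values from the table, a recomputed delta can differ from the displayed one by 0.1; the page presumably works from unrounded scores.

```python
from typing import Dict, Tuple

# (slice mean, number of calls) copied from the "gpt-5.4 high" row above.
SLICES: Dict[str, Tuple[float, int]] = {
    "Excellent": (92.7, 9),
    "Mixed": (90.1, 7),
    "Flawed": (92.9, 9),
}


def weighted_overall(slices: Dict[str, Tuple[float, int]]) -> float:
    """Call-count-weighted mean of the slice averages (assumed baseline)."""
    total_calls = sum(n for _, n in slices.values())
    return sum(mean * n for mean, n in slices.values()) / total_calls


def slice_deltas(
    slices: Dict[str, Tuple[float, int]], baseline: float
) -> Dict[str, float]:
    """Delta per slice: slice mean minus the model's own baseline."""
    return {name: mean - baseline for name, (mean, _) in slices.items()}


overall = weighted_overall(SLICES)
deltas = slice_deltas(SLICES, overall)

print(f"overall: {overall:.1f}")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

A positive delta corresponds to a green cell and a negative delta to a red one.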

Where models do better or worse: by call type

Same view, sliced by what kind of call it was.

Model performance by benchmark segment, shown as average score and delta from each model baseline.

| Configuration | Model | Overall | Discovery (n=8) | Product demo (n=10) | Renewal save (n=2) | QBR (n=2) | Competitive displacement (n=3) |
|---|---|---|---|---|---|---|---|
| gpt-5.4 high | GPT-5.4 | 92.0 | 93.5 (+1.5) | 91.2 (-0.8) | 93.0 (+1.0) | 90.5 (-1.5) | 91.3 (-0.7) |
| gpt-5.5 none | GPT-5.5 | 92.0 | 93.9 (+1.9) | 90.8 (-1.2) | 91.0 (-1.0) | 89.5 (-2.5) | 93.3 (+1.3) |
| gpt-5.5 xhigh | GPT-5.5 | 92.0 | 93.1 (+1.1) | 91.5 (-0.5) | 91.5 (-0.5) | 92.0 (0.0) | 91.0 (-1.0) |
| gpt-5.4 xhigh | GPT-5.4 | 92.0 | 91.8 (-0.2) | 91.8 (-0.2) | 92.0 (0.0) | 92.0 (0.0) | 93.0 (+1.0) |
| gpt-5.5 high | GPT-5.5 | 91.7 | 93.5 (+1.8) | 90.6 (-1.1) | 88.5 (-3.2) | 90.5 (-1.2) | 93.7 (+1.9) |
| gpt-5.5 medium | GPT-5.5 | 91.7 | 93.4 (+1.7) | 91.0 (-0.7) | 89.0 (-2.7) | 93.0 (+1.3) | 90.3 (-1.3) |
| gpt-5.4 medium | GPT-5.4 | 90.9 | 92.9 (+2.0) | 89.3 (-1.6) | 90.5 (-0.4) | 91.0 (+0.1) | 91.3 (+0.4) |
| gpt-5.4 none | GPT-5.4 | 90.8 | 92.5 (+1.7) | 89.5 (-1.3) | 93.5 (+2.7) | 91.0 (+0.2) | 89.0 (-1.8) |
| gpt-5.5 low | GPT-5.5 | 90.8 | 92.0 (+1.2) | 90.5 (-0.3) | 89.0 (-1.8) | 89.0 (-1.8) | 91.0 (+0.2) |
| gpt-5.4 low | GPT-5.4 | 90.3 | 91.3 (+0.9) | 90.4 (+0.1) | 86.5 (-3.8) | 91.5 (+1.2) | 89.3 (-1.0) |
| opus 4.7 max | Claude Opus 4.7 | 90.2 | 93.6 (+3.4) | 88.0 (-2.2) | 92.0 (+1.8) | 88.5 (-1.7) | 88.7 (-1.6) |
| opus 4.7 high | Claude Opus 4.7 | 89.6 | 93.1 (+3.6) | 86.9 (-2.7) | 91.5 (+1.9) | 86.5 (-3.1) | 89.7 (+0.1) |
| opus 4.7 xhigh | Claude Opus 4.7 | 89.4 | 92.4 (+2.9) | 87.5 (-1.9) | 88.5 (-0.9) | 86.0 (-3.4) | 91.0 (+1.6) |
| opus 4.7 medium | Claude Opus 4.7 | 89.0 | 92.5 (+3.5) | 85.4 (-3.6) | 91.5 (+2.5) | 90.0 (+1.0) | 89.0 (0.0) |
| sonnet 4.6 | Claude Sonnet 4.6 | 88.8 | 89.8 (+0.9) | 88.2 (-0.6) | 90.0 (+1.2) | 86.0 (-2.8) | 89.7 (+0.8) |
| opus 4.7 low | Claude Opus 4.7 | 87.6 | 91.6 (+4.0) | 84.6 (-3.0) | 90.0 (+2.4) | 86.0 (-1.6) | 86.7 (-1.0) |
| deepseek v4 pro | DeepSeek V4 Pro | 85.8 | 91.1 (+5.4) | 84.6 (-1.2) | 79.0 (-6.8) | 83.5 (-2.3) | 81.3 (-4.4) |
| gemini 3.1 pro preview | Gemini 3.1 Pro Preview | 82.4 | 86.0 (+3.6) | 80.1 (-2.3) | 80.5 (-1.9) | 82.0 (-0.4) | 82.3 (-0.1) |