salesevals.com · Evaluated Apr 30, 2026
Which models know sales?
Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.
- Calls: 25
- Models: 18
- Evaluations: 450
- Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026
All eight scoring axes
How each model scores on every dimension
Cells are shaded by score (greener is better). Models are sorted by overall average.
| Model | Overall | Needle recall | Evidence grounding | False-positive control | Prioritization | Actionability | Sales instinct | Technical accuracy |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 · high | 92.0 | 93.4 | 94.6 | 91.6 | 90.9 | 94.2 | 93.0 | 93.7 |
| GPT-5.5 · none | 92.0 | 93.8 | 94.9 | 90.2 | 90.1 | 94.1 | 92.3 | 93.4 |
| GPT-5.5 · xhigh | 92.0 | 94.1 | 95.2 | 90.6 | 90.5 | 94.4 | 92.0 | 93.8 |
| GPT-5.4 · xhigh | 92.0 | 93.2 | 95.0 | 91.5 | 90.8 | 94.2 | 92.9 | 93.3 |
| GPT-5.5 · high | 91.7 | 93.8 | 94.5 | 90.9 | 89.2 | 94.5 | 92.1 | 93.4 |
| GPT-5.5 · medium | 91.7 | 93.4 | 94.4 | 88.9 | 90.6 | 94.6 | 92.5 | 93.2 |
| GPT-5.4 · medium | 90.9 | 91.7 | 94.3 | 89.8 | 88.9 | 93.5 | 91.7 | 92.9 |
| GPT-5.4 · none | 90.8 | 91.9 | 93.9 | 88.8 | 89.4 | 92.9 | 91.1 | 93.1 |
| GPT-5.5 · low | 90.8 | 92.6 | 94.0 | 88.9 | 89.6 | 93.8 | 91.6 | 92.3 |
| GPT-5.4 · low | 90.3 | 91.0 | 93.2 | 89.1 | 88.8 | 93.4 | 90.9 | 92.4 |
| Claude Opus 4.7 · max | 90.2 | 92.1 | 92.6 | 86.0 | 88.5 | 93.6 | 91.1 | 91.8 |
| Claude Opus 4.7 · high | 89.6 | 90.7 | 91.9 | 85.8 | 87.6 | 92.9 | 90.3 | 91.5 |
| Claude Opus 4.7 · xhigh | 89.4 | 91.2 | 91.8 | 86.2 | 87.3 | 92.8 | 90.2 | 91.6 |
| Claude Opus 4.7 · medium | 89.0 | 90.1 | 91.7 | 86.2 | 85.8 | 92.3 | 89.7 | 90.9 |
| Claude Sonnet 4.6 · default | 88.8 | 90.8 | 89.8 | 83.1 | 87.1 | 92.5 | 90.4 | 89.7 |
| Claude Opus 4.7 · low | 87.6 | 88.0 | 90.8 | 84.0 | 84.8 | 90.8 | 87.8 | 90.6 |
| DeepSeek V4 Pro · default | 85.8 | 86.1 | 88.9 | 81.2 | 83.3 | 88.9 | 86.3 | 88.0 |
| Gemini 3.1 Pro Preview · default | 82.4 | 79.6 | 88.1 | 79.8 | 80.2 | 85.6 | 84.0 | 86.2 |
| Mean | 89.8 | 91.0 | 92.8 | 87.4 | 88.0 | 92.7 | 90.5 | 91.8 |
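The headline counts and the Overall mean can be checked from the table with a few lines of arithmetic. The sketch below is only a sanity check, not the site's own pipeline: it assumes the Mean row is an unweighted average of the per-model Overall scores and that every model coached all 25 calls. The variable names are illustrative.

```python
from statistics import mean

# Overall scores copied from the table, one per model configuration.
overall_scores = [
    92.0, 92.0, 92.0, 92.0, 91.7, 91.7, 90.9, 90.8, 90.8,
    90.3, 90.2, 89.6, 89.4, 89.0, 88.8, 87.6, 85.8, 82.4,
]
CALLS = 25  # synthetic sales calls per model configuration

n_models = len(overall_scores)
# Assumption: each configuration is evaluated once on every call.
n_evaluations = n_models * CALLS
# Assumption: the Mean row is an unweighted average across configurations.
overall_mean = round(mean(overall_scores), 1)

print(n_models, n_evaluations, overall_mean)
```

Because the inputs are already rounded to one decimal, the recomputed mean can differ from the true figure in the last digit for other columns.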
Coach model only
Estimated coach-run cost
Costs are estimated from the saved visible prompt, the saved coach output, and the input/output rates listed by AI Gateway. Generation and judging costs are excluded.
- Est. coach spend: $49.30
- Median / call: $0.10
- Rows: 18
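The summary figures above can be roughly reproduced from the per-model rows that follow. This is a hedged sketch: it sums the displayed (already rounded) 25-call costs and takes the median of the displayed per-call costs, so a cent of drift from the published total is expected. The list names are illustrative.

```python
from statistics import median

# Displayed "Cost / call" values, one per row, in table order.
cost_per_call = [
    0.0047, 0.03, 0.07, 0.07, 0.07, 0.07, 0.07, 0.09, 0.10,
    0.10, 0.12, 0.13, 0.15, 0.17, 0.17, 0.17, 0.18, 0.18,
]
# Displayed "25-call cost" values, one per row, in table order.
cost_25_calls = [
    0.12, 0.80, 1.73, 1.81, 1.82, 1.85, 1.85, 2.33, 2.59,
    2.59, 3.03, 3.25, 3.82, 4.16, 4.33, 4.34, 4.42, 4.47,
]

# Total estimated coach spend across all 18 rows.
total_spend = sum(cost_25_calls)
# Median of an even-length list is the mean of the two middle values.
median_per_call = median(cost_per_call)

print(f"total ~ ${total_spend:.2f}, median / call ~ ${median_per_call:.2f}")
```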
| Model | Score | Cost / call | 25-call cost | Avg input | Avg output | Rate in / out |
|---|---|---|---|---|---|---|
| DeepSeek V4 Pro · default | 85.8 | $0.0047 | $0.12 | 4,521 | 3,187 | $0.43 / $0.87 |
| Gemini 3.1 Pro Preview · default | 82.4 | $0.03 | $0.80 | 4,521 | 1,918 | $2.00 / $12.00 |
| GPT-5.4 · none | 90.8 | $0.07 | $1.73 | 4,521 | 3,865 | $2.50 / $15.00 |
| GPT-5.4 · xhigh | 92.0 | $0.07 | $1.81 | 4,521 | 4,073 | $2.50 / $15.00 |
| GPT-5.4 · medium | 90.9 | $0.07 | $1.82 | 4,521 | 4,106 | $2.50 / $15.00 |
| GPT-5.4 · high | 92.0 | $0.07 | $1.85 | 4,521 | 4,177 | $2.50 / $15.00 |
| GPT-5.4 · low | 90.3 | $0.07 | $1.85 | 4,521 | 4,180 | $2.50 / $15.00 |
| Claude Opus 4.7 · low | 87.6 | $0.09 | $2.33 | 4,521 | 2,820 | $5.00 / $25.00 |
| Claude Sonnet 4.6 · default | 88.8 | $0.10 | $2.59 | 4,521 | 5,993 | $3.00 / $15.00 |
| Claude Opus 4.7 · medium | 89.0 | $0.10 | $2.59 | 4,521 | 3,244 | $5.00 / $25.00 |
| Claude Opus 4.7 · high | 89.6 | $0.12 | $3.03 | 4,521 | 3,946 | $5.00 / $25.00 |
| Claude Opus 4.7 · xhigh | 89.4 | $0.13 | $3.25 | 4,521 | 4,293 | $5.00 / $25.00 |
| Claude Opus 4.7 · max | 90.2 | $0.15 | $3.82 | 4,521 | 5,210 | $5.00 / $25.00 |
| GPT-5.5 · none | 92.0 | $0.17 | $4.16 | 4,521 | 4,791 | $5.00 / $30.00 |
| GPT-5.5 · low | 90.8 | $0.17 | $4.33 | 4,521 | 5,014 | $5.00 / $30.00 |
| GPT-5.5 · medium | 91.7 | $0.17 | $4.34 | 4,521 | 5,027 | $5.00 / $30.00 |
| GPT-5.5 · high | 91.7 | $0.18 | $4.42 | 4,521 | 5,138 | $5.00 / $30.00 |
| GPT-5.5 · xhigh | 92.0 | $0.18 | $4.47 | 4,521 | 5,206 | $5.00 / $30.00 |
Token counts are estimated from character counts of the saved coach prompt and the saved structured output. Exact Gateway usage, cache reads, reasoning tokens, and any repair retries were not persisted, so treat these figures as a directional cost per coaching result.
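The estimate described here can be sketched as follows. Two things in this sketch are assumptions rather than facts from the page: the rates are treated as dollars per one million tokens (an interpretation that reproduces the listed figures), and the 4-characters-per-token ratio in `estimate_tokens` is a common rule of thumb, not the ratio the site used. Both function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate a token count from character length.

    Assumption: ~4 characters per token; the page does not state its ratio.
    """
    return round(len(text) / chars_per_token)


def cost_per_call(input_tokens: float, output_tokens: float,
                  rate_in: float, rate_out: float) -> float:
    """Estimated cost in dollars for one coaching call.

    Assumption: rate_in and rate_out are dollars per 1,000,000 tokens.
    Reasoning tokens, cache reads, and retries are not included.
    """
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# Averages and rates taken from the GPT-5.4 · none and DeepSeek V4 Pro rows.
gpt54_none = cost_per_call(4521, 3865, 2.50, 15.00)
deepseek = cost_per_call(4521, 3187, 0.43, 0.87)

print(f"GPT-5.4 none: ${gpt54_none:.4f} per call, ${gpt54_none * 25:.2f} per 25 calls")
print(f"DeepSeek V4 Pro: ${deepseek:.4f} per call, ${deepseek * 25:.2f} per 25 calls")
```

Under these assumptions, both rows round to the per-call and 25-call figures shown in the table. That supports the per-million-token reading but does not confirm how the site computed them.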