salesevals.com · Evaluated Jul 1, 2026
Which models know sales?
26 model configurations each coach 50 synthetic sales calls, generated by GPT and Sonnet models with hidden ground truth. A judge scores each coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.
- Calls: 50
- Models: 26
- Evaluations: 1300
- Benchmark: 86.2
50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026
All eight scoring axes
How each model scores on every dimension
Cells shaded by score — greener is better. Models sorted by benchmark score.
| Model | Overall | Needle recall | Evidence grounding | False-positive control | Prioritization | Actionability | Sales instinct | Technical accuracy |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 · xhigh | 89.0 | 88.1 | 93.9 | 89.6 | 88.2 | 92.9 | 90.6 | 91.5 |
| GPT-5.4 · high | 89.0 | 87.6 | 93.4 | 89.9 | 88.0 | 92.8 | 90.6 | 91.3 |
| GPT-5.5 · medium | 88.8 | 88.5 | 93.7 | 88.0 | 87.8 | 93.3 | 90.6 | 91.2 |
| GPT-5.5 · xhigh | 89.0 | 88.5 | 93.9 | 89.2 | 87.6 | 93.3 | 89.9 | 91.3 |
| GPT-5.5 · high | 88.6 | 88.4 | 93.6 | 89.0 | 86.7 | 93.4 | 90.2 | 91.1 |
| GPT-5.4 · medium | 88.3 | 86.5 | 93.3 | 88.8 | 87.2 | 92.3 | 90.0 | 90.9 |
| GPT-5.5 · none | 88.1 | 86.8 | 93.5 | 87.9 | 86.9 | 92.9 | 90.0 | 91.1 |
| GPT-5.5 · low | 87.7 | 87.0 | 92.9 | 87.1 | 86.3 | 92.4 | 89.2 | 90.4 |
| Claude Fable 5 · high | 87.5 | 86.9 | 90.1 | 83.5 | 86.7 | 93.0 | 90.4 | 89.6 |
| GPT-5.4 · low | 87.4 | 86.0 | 92.1 | 86.8 | 86.3 | 91.8 | 89.0 | 90.4 |
| GPT-5.4 · none | 87.4 | 85.8 | 92.7 | 86.7 | 86.7 | 91.4 | 88.7 | 90.5 |
| Claude Opus 4.7 · max | 87.3 | 86.9 | 90.6 | 83.5 | 85.9 | 92.7 | 89.1 | 89.3 |
| Claude Opus 4.7 · high | 86.8 | 86.1 | 89.0 | 82.4 | 85.3 | 92.1 | 88.8 | 88.8 |
| Claude Opus 4.8 · medium | 85.8 | 84.7 | 89.7 | 81.8 | 83.8 | 90.5 | 88.1 | 88.2 |
| Claude Opus 4.7 · medium | 85.6 | 84.2 | 89.1 | 82.6 | 83.9 | 91.0 | 88.2 | 87.8 |
| Claude Opus 4.7 · xhigh | 85.6 | 84.6 | 89.0 | 82.3 | 83.9 | 91.6 | 87.8 | 88.0 |
| Claude Opus 4.7 · low | 85.6 | 84.1 | 89.6 | 82.7 | 84.1 | 90.9 | 87.6 | 88.5 |
| Claude Opus 4.8 · max | 85.4 | 85.4 | 88.6 | 80.9 | 83.8 | 91.4 | 87.6 | 88.2 |
| Claude Opus 4.8 · xhigh | 85.2 | 84.8 | 88.7 | 81.3 | 83.8 | 90.4 | 87.6 | 88.3 |
| Claude Opus 4.8 · high | 84.9 | 83.6 | 89.2 | 81.1 | 83.7 | 90.5 | 87.1 | 88.6 |
| Claude Sonnet 4.6 · default | 84.6 | 83.8 | 87.0 | 79.6 | 83.2 | 91.2 | 87.6 | 86.6 |
| Claude Sonnet 5 · default | 84.6 | 83.8 | 88.5 | 81.0 | 82.3 | 89.3 | 85.9 | 87.8 |
| Claude Opus 4.8 · low | 84.0 | 82.8 | 88.5 | 80.4 | 81.9 | 89.3 | 85.5 | 87.5 |
| GLM 5.2 · default | 84.0 | 82.2 | 88.4 | 80.8 | 81.6 | 89.4 | 85.7 | 87.3 |
| DeepSeek V4 Pro · default | 83.5 | 81.9 | 86.9 | 79.5 | 81.9 | 88.5 | 84.9 | 86.6 |
| Gemini 3.1 Pro Preview · default | 78.9 | 74.5 | 86.2 | 78.0 | 76.9 | 84.1 | 81.4 | 84.1 |
| Mean | 86.3 | 85.1 | 90.5 | 84.0 | 84.8 | 91.2 | 88.2 | 89.0 |
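
The Mean row appears to be the unweighted average of each column across the 26 configurations. As a quick check, the sketch below averages the Overall column from the table above. The variable names are illustrative and not part of the site, and equal weighting per configuration is an assumption.

```python
from statistics import mean

# Overall scores copied from the table above, in table order (26 configurations).
overall_scores = [
    89.0, 89.0, 88.8, 89.0, 88.6, 88.3, 88.1, 87.7, 87.5, 87.4, 87.4, 87.3, 86.8,
    85.8, 85.6, 85.6, 85.6, 85.4, 85.2, 84.9, 84.6, 84.6, 84.0, 84.0, 83.5, 78.9,
]

# Assumption: every configuration is weighted equally.
overall_mean = round(mean(overall_scores), 1)
print(len(overall_scores), overall_mean)  # prints "26 86.3"
```

The result matches the Mean row's 86.3. The headline Benchmark figure of 86.2 differs slightly; it may be computed from unrounded scores, but the page does not say.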
Coach model only
Estimated coach-run cost
Estimated from the saved visible prompt, saved coach output, and AI Gateway listed input/output rates. Generation and judging are excluded.
- Est. coach spend: $150.89
- Median / call: $0.11
- Rows: 26
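
These summary figures can be roughly reproduced from the per-configuration rows listed below. The sketch uses the displayed, already rounded values, so the total can differ from the page's figure by about a cent. It also assumes the median is taken across the 26 per-configuration costs rather than across all 1300 individual calls, which the page does not specify.

```python
from statistics import median

# Displayed cost per call (USD) for the 26 configurations, cheapest first.
cost_per_call = [
    0.0047, 0.03, 0.03, 0.05, 0.07, 0.07, 0.07, 0.07, 0.07,
    0.10, 0.10, 0.10, 0.11, 0.11, 0.11, 0.13, 0.13, 0.14,
    0.15, 0.16, 0.17, 0.17, 0.17, 0.18, 0.18, 0.32,
]

# Displayed total cost (USD) per configuration over its 50 calls.
total_cost = [
    0.24, 1.30, 1.60, 2.50, 3.47, 3.65, 3.66, 3.67, 3.68,
    4.76, 4.93, 5.21, 5.38, 5.52, 5.59, 6.48, 6.63, 7.07,
    7.55, 8.11, 8.45, 8.66, 8.67, 8.98, 9.02, 16.12,
]

est_spend = sum(total_cost)               # close to the reported $150.89
median_per_call = median(cost_per_call)   # middle of the 26 sorted values

print(f"rows={len(total_cost)} spend=${est_spend:.2f} median=${median_per_call:.2f}")
```

The sum of the rounded totals lands within a cent of $150.89, and the median of the per-call costs is $0.11.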
| Model | Score | Cost / call | Total cost | Avg input (tokens) | Avg output (tokens) | Rate in / out |
|---|---|---|---|---|---|---|
| DeepSeek V4 Pro · default | 83.5 | $0.0047 | $0.24 | 4,559 | 3,180 | $0.43 / $0.87 |
| GLM 5.2 · default | 84.0 | $0.03 | $1.30 | 4,559 | 4,455 | $1.40 / $4.40 |
| Gemini 3.1 Pro Preview · default | 78.9 | $0.03 | $1.60 | 4,559 | 1,900 | $2.00 / $12.00 |
| Claude Sonnet 5 · default | 84.6 | $0.05 | $2.50 | 4,559 | 4,079 | $2.00 / $10.00 |
| GPT-5.4 · none | 87.4 | $0.07 | $3.47 | 4,559 | 3,865 | $2.50 / $15.00 |
| GPT-5.4 · xhigh | 89.0 | $0.07 | $3.65 | 4,559 | 4,110 | $2.50 / $15.00 |
| GPT-5.4 · medium | 88.3 | $0.07 | $3.66 | 4,559 | 4,115 | $2.50 / $15.00 |
| GPT-5.4 · high | 89.0 | $0.07 | $3.67 | 4,559 | 4,133 | $2.50 / $15.00 |
| GPT-5.4 · low | 87.4 | $0.07 | $3.68 | 4,559 | 4,142 | $2.50 / $15.00 |
| Claude Opus 4.8 · low | 84.0 | $0.10 | $4.76 | 4,559 | 2,893 | $5.00 / $25.00 |
| Claude Opus 4.7 · low | 85.6 | $0.10 | $4.93 | 4,559 | 3,034 | $5.00 / $25.00 |
| Claude Sonnet 4.6 · default | 84.6 | $0.10 | $5.21 | 4,559 | 6,034 | $3.00 / $15.00 |
| Claude Opus 4.8 · medium | 85.8 | $0.11 | $5.38 | 4,559 | 3,389 | $5.00 / $25.00 |
| Claude Opus 4.7 · medium | 85.6 | $0.11 | $5.52 | 4,559 | 3,508 | $5.00 / $25.00 |
| Claude Opus 4.8 · high | 84.9 | $0.11 | $5.59 | 4,559 | 3,563 | $5.00 / $25.00 |
| Claude Opus 4.7 · high | 86.8 | $0.13 | $6.48 | 4,559 | 4,273 | $5.00 / $25.00 |
| Claude Opus 4.8 · xhigh | 85.2 | $0.13 | $6.63 | 4,559 | 4,393 | $5.00 / $25.00 |
| Claude Opus 4.7 · xhigh | 85.6 | $0.14 | $7.07 | 4,559 | 4,747 | $5.00 / $25.00 |
| Claude Opus 4.8 · max | 85.4 | $0.15 | $7.55 | 4,559 | 5,127 | $5.00 / $25.00 |
| Claude Opus 4.7 · max | 87.3 | $0.16 | $8.11 | 4,559 | 5,574 | $5.00 / $25.00 |
| GPT-5.5 · none | 88.1 | $0.17 | $8.45 | 4,559 | 4,876 | $5.00 / $30.00 |
| GPT-5.5 · low | 87.7 | $0.17 | $8.66 | 4,559 | 5,015 | $5.00 / $30.00 |
| GPT-5.5 · medium | 88.8 | $0.17 | $8.67 | 4,559 | 5,021 | $5.00 / $30.00 |
| GPT-5.5 · xhigh | 89.0 | $0.18 | $8.98 | 4,559 | 5,228 | $5.00 / $30.00 |
| GPT-5.5 · high | 88.6 | $0.18 | $9.02 | 4,559 | 5,253 | $5.00 / $30.00 |
| Claude Fable 5 · high | 87.5 | $0.32 | $16.12 | 4,559 | 5,535 | $10.00 / $50.00 |
Token counts are character-count estimates based on the saved coach prompt and saved structured output. Exact Gateway usage, cache reads, reasoning tokens, and any repair retries were not persisted, so these figures should be read as a directional cost per coaching result.
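
The per-call arithmetic can be sketched as follows. This is a minimal illustration, not the site's code: the function names are hypothetical, the rates are assumed to be USD per 1,000,000 tokens (which is consistent with the DeepSeek V4 Pro row), and the 4-characters-per-token ratio is only a common rule of thumb, since the page does not state which ratio its character-count estimate uses.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count.

    Assumption: ~4 characters per token is a rule of thumb; the page
    does not state the ratio it actually uses.
    """
    return round(len(text) / chars_per_token)


def cost_per_call(input_tokens: int, output_tokens: int,
                  rate_in: float, rate_out: float) -> float:
    """Estimated USD cost of one coaching call.

    Assumption: rates are USD per 1,000,000 tokens.
    """
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


# DeepSeek V4 Pro row: 4,559 avg input, 3,180 avg output, $0.43 / $0.87.
per_call = cost_per_call(4_559, 3_180, rate_in=0.43, rate_out=0.87)
total = per_call * 50  # 50 calls per configuration

print(f"per call: ${per_call:.4f}, total: ${total:.2f}")
# prints "per call: $0.0047, total: $0.24"
```

Both values match the DeepSeek V4 Pro row. Because reasoning tokens, cache reads, and retries are not counted, real spend for the higher reasoning settings could differ noticeably from these estimates.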