salesevals.com · Evaluated Jul 1, 2026
Which models know sales?
26 model configurations coach GPT- and Sonnet-generated synthetic sales calls with hidden ground truth. A judge scores each coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.
- Calls: 50
- Models: 26
- Evaluations: 1300
- Benchmark: 86.2
50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026
Change vs that model's overall average
Where models do better or worse: by call quality
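Average overall score per call-quality tier. Each cell shows the slice average followed by its change from that model's overall average. A positive change (green on the page) means the model does better than its own baseline on that slice; a negative change (red) means worse.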
| Model | Overall | Excellent (n=18) | Mixed (n=14) | Flawed (n=18) |
|---|---|---|---|---|
| gpt-5.4 xhigh (GPT-5.4) | 89.0 | 88.9 (-0.1) | 87.6 (-1.3) | 90.1 (+1.1) |
| gpt-5.4 high (GPT-5.4) | 89.0 | 88.2 (-0.7) | 87.7 (-1.2) | 90.7 (+1.7) |
| gpt-5.5 medium (GPT-5.5) | 88.8 | 90.1 (+1.2) | 86.7 (-2.1) | 89.3 (+0.4) |
| gpt-5.5 xhigh (GPT-5.5) | 89.0 | 89.8 (+0.8) | 87.7 (-1.3) | 89.2 (+0.2) |
| gpt-5.5 high (GPT-5.5) | 88.6 | 90.1 (+1.6) | 84.5 (-4.1) | 90.2 (+1.6) |
| gpt-5.4 medium (GPT-5.4) | 88.3 | 88.1 (-0.2) | 85.6 (-2.7) | 90.6 (+2.3) |
| gpt-5.5 none (GPT-5.5) | 88.1 | 89.6 (+1.5) | 85.0 (-3.1) | 89.1 (+1.0) |
| gpt-5.5 low (GPT-5.5) | 87.7 | 89.4 (+1.7) | 85.6 (-2.1) | 87.7 (0.0) |
| fable 5 high (Claude Fable 5) | 87.5 | 86.9 (-0.6) | 85.9 (-1.6) | 89.3 (+1.8) |
| gpt-5.4 low (GPT-5.4) | 87.4 | 87.9 (+0.6) | 84.8 (-2.6) | 88.8 (+1.4) |
| gpt-5.4 none (GPT-5.4) | 87.4 | 87.9 (+0.5) | 84.5 (-2.9) | 89.1 (+1.7) |
| opus 4.7 max (Claude Opus 4.7) | 87.3 | 88.8 (+1.6) | 83.6 (-3.6) | 88.5 (+1.2) |
| opus 4.7 high (Claude Opus 4.7) | 86.8 | 87.1 (+0.3) | 83.4 (-3.5) | 89.2 (+2.4) |
| opus 4.8 medium (Claude Opus 4.8) | 85.8 | 86.6 (+0.8) | 81.1 (-4.7) | 88.6 (+2.9) |
| opus 4.7 medium (Claude Opus 4.7) | 85.6 | 87.2 (+1.6) | 80.0 (-5.6) | 88.4 (+2.8) |
| opus 4.7 xhigh (Claude Opus 4.7) | 85.6 | 86.8 (+1.2) | 81.1 (-4.5) | 87.9 (+2.3) |
| opus 4.7 low (Claude Opus 4.7) | 85.6 | 86.6 (+1.0) | 80.8 (-4.8) | 88.4 (+2.8) |
| opus 4.8 max (Claude Opus 4.8) | 85.4 | 85.9 (+0.5) | 80.4 (-5.0) | 88.8 (+3.4) |
| opus 4.8 xhigh (Claude Opus 4.8) | 85.2 | 88.1 (+2.8) | 79.4 (-5.9) | 87.0 (+1.8) |
| opus 4.8 high (Claude Opus 4.8) | 84.9 | 87.3 (+2.4) | 77.4 (-7.5) | 88.4 (+3.5) |
| sonnet 4.6 (Claude Sonnet 4.6) | 84.6 | 84.7 (+0.1) | 80.1 (-4.4) | 87.9 (+3.3) |
| sonnet 5 (Claude Sonnet 5) | 84.6 | 84.0 (-0.6) | 83.4 (-1.2) | 86.2 (+1.6) |
| opus 4.8 low (Claude Opus 4.8) | 84.0 | 86.7 (+2.7) | 77.0 (-7.0) | 86.7 (+2.7) |
| glm 5.2 (GLM 5.2) | 84.0 | 85.8 (+1.8) | 79.2 (-4.8) | 85.9 (+1.9) |
| deepseek v4 pro (DeepSeek V4 Pro) | 83.5 | 86.2 (+2.7) | 76.6 (-6.9) | 86.2 (+2.7) |
| gemini 3.1 pro preview (Gemini 3.1 Pro Preview) | 78.9 | 81.6 (+2.7) | 69.8 (-9.1) | 83.3 (+4.4) |
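The change values in the table above can be read as "slice average minus that model's overall average". Below is a minimal sketch of that calculation, assuming the overall score is a plain mean over all of a model's per-call judge scores (the page does not state this). The function name `slice_report` and the toy scores are hypothetical and are not benchmark data.

```python
from statistics import mean


def slice_report(scores_by_slice):
    """Return (overall, {slice: (slice_avg, delta)}) for one model.

    scores_by_slice maps a slice name to that model's per-call judge
    scores (0-100). Assumption: overall is the mean over every call.
    """
    all_scores = [s for scores in scores_by_slice.values() for s in scores]
    overall = mean(all_scores)
    report = {}
    for name, scores in scores_by_slice.items():
        avg = mean(scores)
        # Subtract before rounding, then round both values for display.
        report[name] = (round(avg, 1), round(avg - overall, 1))
    return round(overall, 1), report


# Made-up per-call scores for illustration only.
toy_scores = {
    "excellent": [90.0, 92.0],
    "mixed": [80.0, 84.0],
    "flawed": [88.0, 94.0],
}

overall, report = slice_report(toy_scores)
print(f"overall: {overall:.1f}")
for name, (avg, delta) in report.items():
    print(f"{name}: {avg:.1f} ({delta:+.1f})")
```

Subtracting before rounding would also be consistent with rows where the displayed change differs by 0.1 from the difference of the two displayed scores, such as 88.2 and 89.0 shown with a change of -0.7.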
Change vs that model's overall average
Where models do better or worse: by call type
Same view, sliced by what kind of call it was.
| Model | Overall | Discovery (n=16) | Product demo (n=20) | Renewal save (n=4) | QBR (n=4) | Competitive displacement (n=6) |
|---|---|---|---|---|---|---|
| gpt-5.4 xhigh (GPT-5.4) | 89.0 | 89.7 (+0.7) | 88.5 (-0.5) | 86.1 (-2.9) | 88.0 (-1.0) | 91.5 (+2.5) |
| gpt-5.4 high (GPT-5.4) | 89.0 | 90.4 (+1.5) | 87.6 (-1.4) | 89.8 (+0.8) | 87.5 (-1.5) | 90.0 (+1.0) |
| gpt-5.5 medium (GPT-5.5) | 88.8 | 89.9 (+1.1) | 87.9 (-0.9) | 87.0 (-1.8) | 90.8 (+1.9) | 89.0 (+0.2) |
| gpt-5.5 xhigh (GPT-5.5) | 89.0 | 89.3 (+0.2) | 88.4 (-0.6) | 89.8 (+0.7) | 88.8 (-0.3) | 90.2 (+1.1) |
| gpt-5.5 high (GPT-5.5) | 88.6 | 90.4 (+1.8) | 87.8 (-0.8) | 80.8 (-7.8) | 87.8 (-0.8) | 92.0 (+3.4) |
| gpt-5.4 medium (GPT-5.4) | 88.3 | 90.1 (+1.8) | 87.3 (-0.9) | 84.3 (-4.0) | 86.8 (-1.5) | 90.2 (+1.9) |
| gpt-5.5 none (GPT-5.5) | 88.1 | 90.1 (+2.0) | 86.5 (-1.7) | 86.0 (-2.1) | 88.0 (-0.1) | 90.0 (+1.9) |
| gpt-5.5 low (GPT-5.5) | 87.7 | 89.0 (+1.3) | 86.5 (-1.2) | 87.0 (-0.7) | 86.5 (-1.2) | 89.3 (+1.6) |
| fable 5 high (Claude Fable 5) | 87.5 | 89.7 (+2.2) | 84.8 (-2.7) | 87.3 (-0.3) | 88.5 (+1.0) | 90.2 (+2.6) |
| gpt-5.4 low (GPT-5.4) | 87.4 | 89.3 (+1.9) | 86.5 (-0.9) | 86.3 (-1.1) | 87.0 (-0.4) | 86.2 (-1.2) |
| gpt-5.4 none (GPT-5.4) | 87.4 | 88.9 (+1.6) | 86.5 (-0.9) | 86.3 (-1.1) | 86.5 (-0.9) | 87.5 (+0.1) |
| opus 4.7 max (Claude Opus 4.7) | 87.3 | 91.3 (+4.1) | 84.5 (-2.8) | 89.3 (+2.0) | 85.8 (-1.5) | 85.5 (-1.8) |
| opus 4.7 high (Claude Opus 4.7) | 86.8 | 90.1 (+3.2) | 83.7 (-3.2) | 86.8 (-0.1) | 86.3 (-0.6) | 89.2 (+2.3) |
| opus 4.8 medium (Claude Opus 4.8) | 85.8 | 89.4 (+3.6) | 81.5 (-4.2) | 87.3 (+1.5) | 86.5 (+0.7) | 88.7 (+2.9) |
| opus 4.7 medium (Claude Opus 4.7) | 85.6 | 89.8 (+4.2) | 81.6 (-4.0) | 86.3 (+0.6) | 84.8 (-0.9) | 88.0 (+2.4) |
| opus 4.7 xhigh (Claude Opus 4.7) | 85.6 | 87.9 (+2.3) | 83.0 (-2.5) | 86.3 (+0.7) | 84.0 (-1.6) | 88.5 (+2.9) |
| opus 4.7 low (Claude Opus 4.7) | 85.6 | 90.2 (+4.6) | 82.0 (-3.6) | 86.8 (+1.1) | 84.5 (-1.1) | 85.5 (-0.1) |
| opus 4.8 max (Claude Opus 4.8) | 85.4 | 88.8 (+3.4) | 81.3 (-4.1) | 84.8 (-0.7) | 87.8 (+2.3) | 88.8 (+3.4) |
| opus 4.8 xhigh (Claude Opus 4.8) | 85.2 | 88.5 (+3.3) | 81.5 (-3.8) | 82.8 (-2.5) | 86.8 (+1.5) | 89.8 (+4.6) |
| opus 4.8 high (Claude Opus 4.8) | 84.9 | 89.9 (+5.0) | 81.4 (-3.5) | 78.5 (-6.4) | 86.0 (+1.1) | 86.8 (+1.9) |
| sonnet 4.6 (Claude Sonnet 4.6) | 84.6 | 86.9 (+2.3) | 83.3 (-1.2) | 82.0 (-2.6) | 82.0 (-2.6) | 85.8 (+1.3) |
| sonnet 5 (Claude Sonnet 5) | 84.6 | 86.6 (+2.0) | 82.0 (-2.6) | 88.0 (+3.4) | 82.8 (-1.8) | 87.2 (+2.6) |
| opus 4.8 low (Claude Opus 4.8) | 84.0 | 88.1 (+4.1) | 80.5 (-3.5) | 85.8 (+1.8) | 81.8 (-2.2) | 85.0 (+1.0) |
| glm 5.2 (GLM 5.2) | 84.0 | 86.3 (+2.3) | 83.0 (-1.0) | 85.5 (+1.5) | 79.5 (-4.5) | 83.2 (-0.8) |
| deepseek v4 pro (DeepSeek V4 Pro) | 83.5 | 87.4 (+3.9) | 81.5 (-2.0) | 82.0 (-1.5) | 82.3 (-1.3) | 81.8 (-1.7) |
| gemini 3.1 pro preview (Gemini 3.1 Pro Preview) | 78.9 | 84.4 (+5.5) | 74.7 (-4.2) | 74.3 (-4.7) | 77.5 (-1.4) | 82.3 (+3.4) |
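As a sanity check on how the call-type slices relate to the Overall column, the sketch below recombines one row's slice averages, weighted by the slice sizes in the header. It assumes the overall score is the mean over all 50 calls, which the page does not state explicitly. The helper `weighted_overall` is a hypothetical name, and the numbers are the gpt-5.4 xhigh row from the table above.

```python
def weighted_overall(slices):
    """Combine per-slice averages into one mean, weighting by slice size.

    slices maps a slice name to (n_calls, slice_average).
    """
    total_n = sum(n for n, _ in slices.values())
    return sum(n * avg for n, avg in slices.values()) / total_n


# gpt-5.4 xhigh, by call type: (n, slice average) taken from the table.
gpt54_xhigh_by_call_type = {
    "Discovery": (16, 89.7),
    "Product demo": (20, 88.5),
    "Renewal save": (4, 86.1),
    "QBR": (4, 88.0),
    "Competitive displacement": (6, 91.5),
}

overall = weighted_overall(gpt54_xhigh_by_call_type)
print(round(overall, 1))  # 89.0, matching the Overall column for this row
```

The weights also show why the small slices deserve caution: with n=4, each Renewal save or QBR call contributes a quarter of its slice average, so one unusual call can move that cell by several points.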
GPT-generated vs Sonnet-generated calls
Where models do better or worse: by transcript generator
This slice keeps the default leaderboard behavior: all available runs are included, and each cell shows the available score for that transcript-origin bucket.
| Model | Overall | GPT-generated (n=25) | Sonnet-generated (n=25) |
|---|---|---|---|
| gpt-5.4 xhigh (GPT-5.4) | 89.0 | 92.0 (+3.0) | 86.0 (-3.0) |
| gpt-5.4 high (GPT-5.4) | 89.0 | 92.0 (+3.1) | 85.9 (-3.1) |
| gpt-5.5 medium (GPT-5.5) | 88.8 | 91.7 (+2.8) | 86.0 (-2.8) |
| gpt-5.5 xhigh (GPT-5.5) | 89.0 | 92.0 (+3.0) | 86.0 (-3.0) |
| gpt-5.5 high (GPT-5.5) | 88.6 | 91.7 (+3.2) | 85.4 (-3.2) |
| gpt-5.4 medium (GPT-5.4) | 88.3 | 90.9 (+2.6) | 85.6 (-2.6) |
| gpt-5.5 none (GPT-5.5) | 88.1 | 92.0 (+3.9) | 84.3 (-3.9) |
| gpt-5.5 low (GPT-5.5) | 87.7 | 90.8 (+3.1) | 84.6 (-3.1) |
| fable 5 high (Claude Fable 5) | 87.5 | 91.0 (+3.4) | 84.1 (-3.4) |
| gpt-5.4 low (GPT-5.4) | 87.4 | 90.3 (+3.0) | 84.4 (-3.0) |
| gpt-5.4 none (GPT-5.4) | 87.4 | 90.8 (+3.5) | 83.9 (-3.5) |
| opus 4.7 max (Claude Opus 4.7) | 87.3 | 90.2 (+3.0) | 84.3 (-3.0) |
| opus 4.7 high (Claude Opus 4.7) | 86.8 | 89.6 (+2.7) | 84.1 (-2.7) |
| opus 4.8 medium (Claude Opus 4.8) | 85.8 | 89.2 (+3.4) | 82.4 (-3.4) |
| opus 4.7 medium (Claude Opus 4.7) | 85.6 | 89.0 (+3.3) | 82.3 (-3.3) |
| opus 4.7 xhigh (Claude Opus 4.7) | 85.6 | 89.4 (+3.8) | 81.8 (-3.8) |
| opus 4.7 low (Claude Opus 4.7) | 85.6 | 87.6 (+2.0) | 83.6 (-2.0) |
| opus 4.8 max (Claude Opus 4.8) | 85.4 | 88.6 (+3.1) | 82.3 (-3.1) |
| opus 4.8 xhigh (Claude Opus 4.8) | 85.2 | 88.1 (+2.9) | 82.4 (-2.9) |
| opus 4.8 high (Claude Opus 4.8) | 84.9 | 89.4 (+4.5) | 80.4 (-4.5) |
| sonnet 4.6 (Claude Sonnet 4.6) | 84.6 | 88.8 (+4.3) | 80.3 (-4.3) |
| sonnet 5 (Claude Sonnet 5) | 84.6 | 87.4 (+2.8) | 81.8 (-2.8) |
| opus 4.8 low (Claude Opus 4.8) | 84.0 | 88.0 (+4.0) | 80.0 (-4.0) |
| glm 5.2 (GLM 5.2) | 84.0 | 86.1 (+2.1) | 81.8 (-2.1) |
| deepseek v4 pro (DeepSeek V4 Pro) | 83.5 | 85.8 (+2.3) | 81.2 (-2.3) |
| gemini 3.1 pro preview (Gemini 3.1 Pro Preview) | 78.9 | 82.4 (+3.5) | 75.4 (-3.5) |
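Because both transcript buckets hold 25 calls, a model's overall score sits at the midpoint of its two bucket averages, so the two changes in each row are equal and opposite. Every row above scores higher on the GPT-generated bucket. The sketch below computes the gap between buckets for three rows copied from the table; `generator_gap` and `midpoint` are hypothetical helper names, and the midpoint check assumes the overall score is a plain mean over all calls.

```python
def generator_gap(gpt_avg, sonnet_avg):
    """Points by which the GPT-generated bucket exceeds the Sonnet one."""
    return round(gpt_avg - sonnet_avg, 1)


def midpoint(gpt_avg, sonnet_avg):
    """Overall score implied by two equally sized buckets."""
    return round((gpt_avg + sonnet_avg) / 2, 1)


# (GPT-generated average, Sonnet-generated average) from the table.
rows = {
    "gpt-5.4 xhigh": (92.0, 86.0),
    "opus 4.7 low": (87.6, 83.6),
    "gemini 3.1 pro preview": (82.4, 75.4),
}

gaps = {name: generator_gap(g, s) for name, (g, s) in rows.items()}
mids = {name: midpoint(g, s) for name, (g, s) in rows.items()}

for name in rows:
    print(f"{name}: gap {gaps[name]:+.1f}, implied overall {mids[name]:.1f}")
```

The gap is twice the change shown in either cell, so it can be read straight off the table without the per-call scores.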