salesevals.com · Evaluated Apr 30, 2026
Which models know sales?
Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.
- Calls: 25
- Models: 18
- Evaluations: 450
- Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026
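The headline numbers can be cross-checked with a few lines of Python. This is a minimal sketch, not the site's own pipeline: it assumes every configuration scored every call, and that the headline mean is the plain average of the 18 per-configuration Overall scores (which equals the average over all 450 evaluations under that assumption). The scores are copied from the Overall column of the tables on this page; the variable names are illustrative.

```python
from statistics import mean

# Overall score per model configuration, copied from the Overall column.
overall_by_config = {
    "gpt-5.4 high": 92.0, "gpt-5.5 none": 92.0, "gpt-5.5 xhigh": 92.0,
    "gpt-5.4 xhigh": 92.0, "gpt-5.5 high": 91.7, "gpt-5.5 medium": 91.7,
    "gpt-5.4 medium": 90.9, "gpt-5.4 none": 90.8, "gpt-5.5 low": 90.8,
    "gpt-5.4 low": 90.3, "opus 4.7 max": 90.2, "opus 4.7 high": 89.6,
    "opus 4.7 xhigh": 89.4, "opus 4.7 medium": 89.0, "sonnet 4.6": 88.8,
    "opus 4.7 low": 87.6, "deepseek v4 pro": 85.8,
    "gemini 3.1 pro preview": 82.4,
}

N_CALLS = 25  # every configuration coaches the same 25 calls

# Assumption: one evaluation per (configuration, call) pair.
n_evaluations = len(overall_by_config) * N_CALLS

# Assumption: the headline mean is the unweighted mean across configurations.
headline_mean = round(mean(overall_by_config.values()), 1)

print(n_evaluations, headline_mean)
```

Under these assumptions the sketch reproduces the 450 evaluations and the 89.8 mean shown above.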
Change vs that model's overall average
Where models do better or worse: by call quality
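Average overall score per call-quality slice. Each cell shows the slice average followed by its change from that model's own overall score. A positive change (green on the site) means the model does better than its own baseline on that slice; a negative change (red) means worse.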
| Model | Overall | Excellent (n=9) | Mixed (n=7) | Flawed (n=9) |
|---|---|---|---|---|
| gpt-5.4 high (GPT-5.4) | 92.0 | 92.7 (+0.6) | 90.1 (-1.9) | 92.9 (+0.8) |
| gpt-5.5 none (GPT-5.5) | 92.0 | 94.0 (+2.0) | 87.3 (-4.7) | 93.7 (+1.7) |
| gpt-5.5 xhigh (GPT-5.5) | 92.0 | 94.2 (+2.2) | 88.9 (-3.1) | 92.2 (+0.2) |
| gpt-5.4 xhigh (GPT-5.4) | 92.0 | 92.3 (+0.4) | 91.1 (-0.8) | 92.2 (+0.3) |
| gpt-5.5 high (GPT-5.5) | 91.7 | 93.9 (+2.2) | 86.9 (-4.9) | 93.3 (+1.6) |
| gpt-5.5 medium (GPT-5.5) | 91.7 | 93.8 (+2.1) | 88.0 (-3.7) | 92.4 (+0.8) |
| gpt-5.4 medium (GPT-5.4) | 90.9 | 90.1 (-0.8) | 88.1 (-2.8) | 93.9 (+3.0) |
| gpt-5.4 none (GPT-5.4) | 90.8 | 91.3 (+0.5) | 89.0 (-1.8) | 91.8 (+0.9) |
| gpt-5.5 low (GPT-5.5) | 90.8 | 93.8 (+3.0) | 86.9 (-3.9) | 90.9 (+0.1) |
| gpt-5.4 low (GPT-5.4) | 90.3 | 91.8 (+1.5) | 88.3 (-2.0) | 90.4 (+0.1) |
| opus 4.7 max (Claude Opus 4.7) | 90.2 | 91.2 (+1.0) | 85.4 (-4.8) | 93.0 (+2.8) |
| opus 4.7 high (Claude Opus 4.7) | 89.6 | 91.3 (+1.8) | 83.9 (-5.7) | 92.2 (+2.7) |
| opus 4.7 xhigh (Claude Opus 4.7) | 89.4 | 91.4 (+2.0) | 83.3 (-6.2) | 92.2 (+2.8) |
| opus 4.7 medium (Claude Opus 4.7) | 89.0 | 90.9 (+1.9) | 82.6 (-6.4) | 92.0 (+3.0) |
| sonnet 4.6 (Claude Sonnet 4.6) | 88.8 | 88.9 (0.0) | 86.1 (-2.7) | 90.9 (+2.0) |
| opus 4.7 low (Claude Opus 4.7) | 87.6 | 88.9 (+1.2) | 82.0 (-5.6) | 90.8 (+3.1) |
| deepseek v4 pro (DeepSeek V4 Pro) | 85.8 | 90.2 (+4.5) | 76.4 (-9.3) | 88.6 (+2.8) |
| gemini 3.1 pro preview (Gemini 3.1 Pro Preview) | 82.4 | 87.2 (+4.8) | 72.4 (-10.0) | 85.4 (+3.0) |
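The relationship between the slice averages, the Overall column, and the per-slice change can be sketched as follows. This is an illustration under stated assumptions, not the site's implementation: it assumes Overall is the mean of the 25 per-call scores, so it equals the slice averages weighted by slice size, and that each change is the slice average minus Overall. The function names are hypothetical. Because the inputs are the rounded values from the table, recomputed changes can differ from the published ones by up to about 0.1.

```python
# Number of calls in each call-quality slice, from the table header.
SLICE_SIZES = {"Excellent": 9, "Mixed": 7, "Flawed": 9}


def weighted_overall(slice_means, sizes=SLICE_SIZES):
    """Recombine slice averages into an overall mean, weighting by call count."""
    total_calls = sum(sizes.values())
    return sum(slice_means[name] * sizes[name] for name in sizes) / total_calls


def slice_deltas(slice_means, overall):
    """Change of each slice average relative to the model's own overall score."""
    return {name: value - overall for name, value in slice_means.items()}


# Slice averages for the "gpt-5.4 high" row of the table.
gpt54_high = {"Excellent": 92.7, "Mixed": 90.1, "Flawed": 92.9}

overall = weighted_overall(gpt54_high)
deltas = slice_deltas(gpt54_high, overall)

print(round(overall, 1))
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

For this row the recombined overall rounds to the published 92.0, which is consistent with the equal-weight-per-call assumption, though the page itself does not state how Overall is aggregated.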
Change vs that model's overall average
Where models do better or worse: by call type
Same view, sliced by what kind of call it was.
| Model | Overall | Discovery (n=8) | Product demo (n=10) | Renewal save (n=2) | QBR (n=2) | Competitive displacement (n=3) |
|---|---|---|---|---|---|---|
| gpt-5.4 high (GPT-5.4) | 92.0 | 93.5 (+1.5) | 91.2 (-0.8) | 93.0 (+1.0) | 90.5 (-1.5) | 91.3 (-0.7) |
| gpt-5.5 none (GPT-5.5) | 92.0 | 93.9 (+1.9) | 90.8 (-1.2) | 91.0 (-1.0) | 89.5 (-2.5) | 93.3 (+1.3) |
| gpt-5.5 xhigh (GPT-5.5) | 92.0 | 93.1 (+1.1) | 91.5 (-0.5) | 91.5 (-0.5) | 92.0 (0.0) | 91.0 (-1.0) |
| gpt-5.4 xhigh (GPT-5.4) | 92.0 | 91.8 (-0.2) | 91.8 (-0.2) | 92.0 (0.0) | 92.0 (0.0) | 93.0 (+1.0) |
| gpt-5.5 high (GPT-5.5) | 91.7 | 93.5 (+1.8) | 90.6 (-1.1) | 88.5 (-3.2) | 90.5 (-1.2) | 93.7 (+1.9) |
| gpt-5.5 medium (GPT-5.5) | 91.7 | 93.4 (+1.7) | 91.0 (-0.7) | 89.0 (-2.7) | 93.0 (+1.3) | 90.3 (-1.3) |
| gpt-5.4 medium (GPT-5.4) | 90.9 | 92.9 (+2.0) | 89.3 (-1.6) | 90.5 (-0.4) | 91.0 (+0.1) | 91.3 (+0.4) |
| gpt-5.4 none (GPT-5.4) | 90.8 | 92.5 (+1.7) | 89.5 (-1.3) | 93.5 (+2.7) | 91.0 (+0.2) | 89.0 (-1.8) |
| gpt-5.5 low (GPT-5.5) | 90.8 | 92.0 (+1.2) | 90.5 (-0.3) | 89.0 (-1.8) | 89.0 (-1.8) | 91.0 (+0.2) |
| gpt-5.4 low (GPT-5.4) | 90.3 | 91.3 (+0.9) | 90.4 (+0.1) | 86.5 (-3.8) | 91.5 (+1.2) | 89.3 (-1.0) |
| opus 4.7 max (Claude Opus 4.7) | 90.2 | 93.6 (+3.4) | 88.0 (-2.2) | 92.0 (+1.8) | 88.5 (-1.7) | 88.7 (-1.6) |
| opus 4.7 high (Claude Opus 4.7) | 89.6 | 93.1 (+3.6) | 86.9 (-2.7) | 91.5 (+1.9) | 86.5 (-3.1) | 89.7 (+0.1) |
| opus 4.7 xhigh (Claude Opus 4.7) | 89.4 | 92.4 (+2.9) | 87.5 (-1.9) | 88.5 (-0.9) | 86.0 (-3.4) | 91.0 (+1.6) |
| opus 4.7 medium (Claude Opus 4.7) | 89.0 | 92.5 (+3.5) | 85.4 (-3.6) | 91.5 (+2.5) | 90.0 (+1.0) | 89.0 (0.0) |
| sonnet 4.6 (Claude Sonnet 4.6) | 88.8 | 89.8 (+0.9) | 88.2 (-0.6) | 90.0 (+1.2) | 86.0 (-2.8) | 89.7 (+0.8) |
| opus 4.7 low (Claude Opus 4.7) | 87.6 | 91.6 (+4.0) | 84.6 (-3.0) | 90.0 (+2.4) | 86.0 (-1.6) | 86.7 (-1.0) |
| deepseek v4 pro (DeepSeek V4 Pro) | 85.8 | 91.1 (+5.4) | 84.6 (-1.2) | 79.0 (-6.8) | 83.5 (-2.3) | 81.3 (-4.4) |
| gemini 3.1 pro preview (Gemini 3.1 Pro Preview) | 82.4 | 86.0 (+3.6) | 80.1 (-2.3) | 80.5 (-1.9) | 82.0 (-0.4) | 82.3 (-0.1) |