salesevals.com · Evaluated Jul 1, 2026

Which models know sales?

26 model configurations coach synthetic sales calls, generated by GPT and Sonnet, that carry hidden ground truth. A judge scores each coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

Calls: 50
Models: 26
Evaluations: 1300
Benchmark: 86.2

50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026

Change vs that model's overall average

Where models do better or worse: by call quality

Average overall score for each call-quality tier. Green means the model does better than its own baseline on that slice; red means worse.
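The delta next to each slice score can be sketched as a simple subtraction from the model's own overall average. This is a minimal illustration, not the site's actual code: the function names and the "neutral" label for a zero delta are assumptions. The published deltas also appear to be computed from unrounded averages, so recomputing them from the rounded figures shown on the page can differ by 0.1.

```python
def slice_delta(slice_avg: float, overall_avg: float) -> float:
    """Signed difference between a slice average and the model's own overall average."""
    return round(slice_avg - overall_avg, 1)


def delta_color(delta: float) -> str:
    """Map a delta to the legend color; 'neutral' for zero is an assumption."""
    if delta > 0:
        return "green"
    if delta < 0:
        return "red"
    return "neutral"


# Example: a model with overall 89.0 that scores 90.1 on flawed calls.
delta = slice_delta(90.1, 89.0)
print(delta, delta_color(delta))  # 1.1 green
```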

Model performance by benchmark segment, shown as average score and delta from each model baseline.
Model	Overall	Excellent (n=18)	Mixed (n=14)	Flawed (n=18)
gpt-5.4 xhigh (GPT-5.4)	89.0	88.9 (-0.1)	87.6 (-1.3)	90.1 (+1.1)
gpt-5.4 high (GPT-5.4)	89.0	88.2 (-0.7)	87.7 (-1.2)	90.7 (+1.7)
gpt-5.5 medium (GPT-5.5)	88.8	90.1 (+1.2)	86.7 (-2.1)	89.3 (+0.4)
gpt-5.5 xhigh (GPT-5.5)	89.0	89.8 (+0.8)	87.7 (-1.3)	89.2 (+0.2)
gpt-5.5 high (GPT-5.5)	88.6	90.1 (+1.6)	84.5 (-4.1)	90.2 (+1.6)
gpt-5.4 medium (GPT-5.4)	88.3	88.1 (-0.2)	85.6 (-2.7)	90.6 (+2.3)
gpt-5.5 none (GPT-5.5)	88.1	89.6 (+1.5)	85.0 (-3.1)	89.1 (+1.0)
gpt-5.5 low (GPT-5.5)	87.7	89.4 (+1.7)	85.6 (-2.1)	87.7 (0.0)
fable 5 high (Claude Fable 5)	87.5	86.9 (-0.6)	85.9 (-1.6)	89.3 (+1.8)
gpt-5.4 low (GPT-5.4)	87.4	87.9 (+0.6)	84.8 (-2.6)	88.8 (+1.4)
gpt-5.4 none (GPT-5.4)	87.4	87.9 (+0.5)	84.5 (-2.9)	89.1 (+1.7)
opus 4.7 max (Claude Opus 4.7)	87.3	88.8 (+1.6)	83.6 (-3.6)	88.5 (+1.2)
opus 4.7 high (Claude Opus 4.7)	86.8	87.1 (+0.3)	83.4 (-3.5)	89.2 (+2.4)
opus 4.8 medium (Claude Opus 4.8)	85.8	86.6 (+0.8)	81.1 (-4.7)	88.6 (+2.9)
opus 4.7 medium (Claude Opus 4.7)	85.6	87.2 (+1.6)	80.0 (-5.6)	88.4 (+2.8)
opus 4.7 xhigh (Claude Opus 4.7)	85.6	86.8 (+1.2)	81.1 (-4.5)	87.9 (+2.3)
opus 4.7 low (Claude Opus 4.7)	85.6	86.6 (+1.0)	80.8 (-4.8)	88.4 (+2.8)
opus 4.8 max (Claude Opus 4.8)	85.4	85.9 (+0.5)	80.4 (-5.0)	88.8 (+3.4)
opus 4.8 xhigh (Claude Opus 4.8)	85.2	88.1 (+2.8)	79.4 (-5.9)	87.0 (+1.8)
opus 4.8 high (Claude Opus 4.8)	84.9	87.3 (+2.4)	77.4 (-7.5)	88.4 (+3.5)
sonnet 4.6 (Claude Sonnet 4.6)	84.6	84.7 (+0.1)	80.1 (-4.4)	87.9 (+3.3)
sonnet 5 (Claude Sonnet 5)	84.6	84.0 (-0.6)	83.4 (-1.2)	86.2 (+1.6)
opus 4.8 low (Claude Opus 4.8)	84.0	86.7 (+2.7)	77.0 (-7.0)	86.7 (+2.7)
glm 5.2 (GLM 5.2)	84.0	85.8 (+1.8)	79.2 (-4.8)	85.9 (+1.9)
deepseek v4 pro (DeepSeek V4 Pro)	83.5	86.2 (+2.7)	76.6 (-6.9)	86.2 (+2.7)
gemini 3.1 pro preview (Gemini 3.1 Pro Preview)	78.9	81.6 (+2.7)	69.8 (-9.1)	83.3 (+4.4)
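Because the three quality tiers differ in size (18, 14, and 18 calls), a model's overall score is not the plain mean of its three tier averages. The sketch below assumes the overall score is the per-call mean across all 50 calls, which makes it the call-weighted mean of the tier averages. That assumption is not stated on the page, but it reproduces the published overall figures for the two rows checked here. The function name is illustrative.

```python
def weighted_overall(segments: list[tuple[int, float]]) -> float:
    """Call-weighted mean of segment averages; each item is (n_calls, segment_avg)."""
    total_calls = sum(n for n, _ in segments)
    return sum(n * avg for n, avg in segments) / total_calls


# (n, average) for Excellent, Mixed, Flawed, taken from the table above.
gpt_54_xhigh = [(18, 88.9), (14, 87.6), (18, 90.1)]
gemini_31_pro = [(18, 81.6), (14, 69.8), (18, 83.3)]

print(round(weighted_overall(gpt_54_xhigh), 1))   # 89.0
print(round(weighted_overall(gemini_31_pro), 1))  # 78.9
```

The Mixed tier has the fewest calls, so a large negative delta there pulls the overall score down less than the same delta would in the Excellent or Flawed tiers.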

Change vs that model's overall average

Where models do better or worse: by call type

The same view, sliced by call type.

Model performance by benchmark segment, shown as average score and delta from each model baseline.
Model	Overall	Discovery (n=16)	Product demo (n=20)	Renewal save (n=4)	QBR (n=4)	Competitive displacement (n=6)
gpt-5.4 xhigh (GPT-5.4)	89.0	89.7 (+0.7)	88.5 (-0.5)	86.1 (-2.9)	88.0 (-1.0)	91.5 (+2.5)
gpt-5.4 high (GPT-5.4)	89.0	90.4 (+1.5)	87.6 (-1.4)	89.8 (+0.8)	87.5 (-1.5)	90.0 (+1.0)
gpt-5.5 medium (GPT-5.5)	88.8	89.9 (+1.1)	87.9 (-0.9)	87.0 (-1.8)	90.8 (+1.9)	89.0 (+0.2)
gpt-5.5 xhigh (GPT-5.5)	89.0	89.3 (+0.2)	88.4 (-0.6)	89.8 (+0.7)	88.8 (-0.3)	90.2 (+1.1)
gpt-5.5 high (GPT-5.5)	88.6	90.4 (+1.8)	87.8 (-0.8)	80.8 (-7.8)	87.8 (-0.8)	92.0 (+3.4)
gpt-5.4 medium (GPT-5.4)	88.3	90.1 (+1.8)	87.3 (-0.9)	84.3 (-4.0)	86.8 (-1.5)	90.2 (+1.9)
gpt-5.5 none (GPT-5.5)	88.1	90.1 (+2.0)	86.5 (-1.7)	86.0 (-2.1)	88.0 (-0.1)	90.0 (+1.9)
gpt-5.5 low (GPT-5.5)	87.7	89.0 (+1.3)	86.5 (-1.2)	87.0 (-0.7)	86.5 (-1.2)	89.3 (+1.6)
fable 5 high (Claude Fable 5)	87.5	89.7 (+2.2)	84.8 (-2.7)	87.3 (-0.3)	88.5 (+1.0)	90.2 (+2.6)
gpt-5.4 low (GPT-5.4)	87.4	89.3 (+1.9)	86.5 (-0.9)	86.3 (-1.1)	87.0 (-0.4)	86.2 (-1.2)
gpt-5.4 none (GPT-5.4)	87.4	88.9 (+1.6)	86.5 (-0.9)	86.3 (-1.1)	86.5 (-0.9)	87.5 (+0.1)
opus 4.7 max (Claude Opus 4.7)	87.3	91.3 (+4.1)	84.5 (-2.8)	89.3 (+2.0)	85.8 (-1.5)	85.5 (-1.8)
opus 4.7 high (Claude Opus 4.7)	86.8	90.1 (+3.2)	83.7 (-3.2)	86.8 (-0.1)	86.3 (-0.6)	89.2 (+2.3)
opus 4.8 medium (Claude Opus 4.8)	85.8	89.4 (+3.6)	81.5 (-4.2)	87.3 (+1.5)	86.5 (+0.7)	88.7 (+2.9)
opus 4.7 medium (Claude Opus 4.7)	85.6	89.8 (+4.2)	81.6 (-4.0)	86.3 (+0.6)	84.8 (-0.9)	88.0 (+2.4)
opus 4.7 xhigh (Claude Opus 4.7)	85.6	87.9 (+2.3)	83.0 (-2.5)	86.3 (+0.7)	84.0 (-1.6)	88.5 (+2.9)
opus 4.7 low (Claude Opus 4.7)	85.6	90.2 (+4.6)	82.0 (-3.6)	86.8 (+1.1)	84.5 (-1.1)	85.5 (-0.1)
opus 4.8 max (Claude Opus 4.8)	85.4	88.8 (+3.4)	81.3 (-4.1)	84.8 (-0.7)	87.8 (+2.3)	88.8 (+3.4)
opus 4.8 xhigh (Claude Opus 4.8)	85.2	88.5 (+3.3)	81.5 (-3.8)	82.8 (-2.5)	86.8 (+1.5)	89.8 (+4.6)
opus 4.8 high (Claude Opus 4.8)	84.9	89.9 (+5.0)	81.4 (-3.5)	78.5 (-6.4)	86.0 (+1.1)	86.8 (+1.9)
sonnet 4.6 (Claude Sonnet 4.6)	84.6	86.9 (+2.3)	83.3 (-1.2)	82.0 (-2.6)	82.0 (-2.6)	85.8 (+1.3)
sonnet 5 (Claude Sonnet 5)	84.6	86.6 (+2.0)	82.0 (-2.6)	88.0 (+3.4)	82.8 (-1.8)	87.2 (+2.6)
opus 4.8 low (Claude Opus 4.8)	84.0	88.1 (+4.1)	80.5 (-3.5)	85.8 (+1.8)	81.8 (-2.2)	85.0 (+1.0)
glm 5.2 (GLM 5.2)	84.0	86.3 (+2.3)	83.0 (-1.0)	85.5 (+1.5)	79.5 (-4.5)	83.2 (-0.8)
deepseek v4 pro (DeepSeek V4 Pro)	83.5	87.4 (+3.9)	81.5 (-2.0)	82.0 (-1.5)	82.3 (-1.3)	81.8 (-1.7)
gemini 3.1 pro preview (Gemini 3.1 Pro Preview)	78.9	84.4 (+5.5)	74.7 (-4.2)	74.3 (-4.7)	77.5 (-1.4)	82.3 (+3.4)

GPT-generated vs Sonnet-generated calls

Where models do better or worse: by transcript generator

This slice keeps the default leaderboard behavior: all available runs are included, and each cell shows the available score for that transcript-origin bucket.

Model performance by benchmark segment, shown as average score and delta from each model baseline.
Model	Overall	GPT-generated (n=25)	Sonnet-generated (n=25)
gpt-5.4 xhigh (GPT-5.4)	89.0	92.0 (+3.0)	86.0 (-3.0)
gpt-5.4 high (GPT-5.4)	89.0	92.0 (+3.1)	85.9 (-3.1)
gpt-5.5 medium (GPT-5.5)	88.8	91.7 (+2.8)	86.0 (-2.8)
gpt-5.5 xhigh (GPT-5.5)	89.0	92.0 (+3.0)	86.0 (-3.0)
gpt-5.5 high (GPT-5.5)	88.6	91.7 (+3.2)	85.4 (-3.2)
gpt-5.4 medium (GPT-5.4)	88.3	90.9 (+2.6)	85.6 (-2.6)
gpt-5.5 none (GPT-5.5)	88.1	92.0 (+3.9)	84.3 (-3.9)
gpt-5.5 low (GPT-5.5)	87.7	90.8 (+3.1)	84.6 (-3.1)
fable 5 high (Claude Fable 5)	87.5	91.0 (+3.4)	84.1 (-3.4)
gpt-5.4 low (GPT-5.4)	87.4	90.3 (+3.0)	84.4 (-3.0)
gpt-5.4 none (GPT-5.4)	87.4	90.8 (+3.5)	83.9 (-3.5)
opus 4.7 max (Claude Opus 4.7)	87.3	90.2 (+3.0)	84.3 (-3.0)
opus 4.7 high (Claude Opus 4.7)	86.8	89.6 (+2.7)	84.1 (-2.7)
opus 4.8 medium (Claude Opus 4.8)	85.8	89.2 (+3.4)	82.4 (-3.4)
opus 4.7 medium (Claude Opus 4.7)	85.6	89.0 (+3.3)	82.3 (-3.3)
opus 4.7 xhigh (Claude Opus 4.7)	85.6	89.4 (+3.8)	81.8 (-3.8)
opus 4.7 low (Claude Opus 4.7)	85.6	87.6 (+2.0)	83.6 (-2.0)
opus 4.8 max (Claude Opus 4.8)	85.4	88.6 (+3.1)	82.3 (-3.1)
opus 4.8 xhigh (Claude Opus 4.8)	85.2	88.1 (+2.9)	82.4 (-2.9)
opus 4.8 high (Claude Opus 4.8)	84.9	89.4 (+4.5)	80.4 (-4.5)
sonnet 4.6 (Claude Sonnet 4.6)	84.6	88.8 (+4.3)	80.3 (-4.3)
sonnet 5 (Claude Sonnet 5)	84.6	87.4 (+2.8)	81.8 (-2.8)
opus 4.8 low (Claude Opus 4.8)	84.0	88.0 (+4.0)	80.0 (-4.0)
glm 5.2 (GLM 5.2)	84.0	86.1 (+2.1)	81.8 (-2.1)
deepseek v4 pro (DeepSeek V4 Pro)	83.5	85.8 (+2.3)	81.2 (-2.3)
gemini 3.1 pro preview (Gemini 3.1 Pro Preview)	78.9	82.4 (+3.5)	75.4 (-3.5)
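Every row in this table shows two deltas of equal size and opposite sign. That follows from the two buckets being the same size (25 calls each): if the overall score is the per-call mean over all 50 calls, which is an assumption here, it sits exactly midway between the two bucket averages. A minimal sketch, with an illustrative function name:

```python
def two_bucket_summary(avg_a: float, avg_b: float) -> tuple[float, float, float]:
    """For two equal-sized buckets, return (overall, delta_a, delta_b)."""
    overall = (avg_a + avg_b) / 2  # midpoint, valid only when bucket sizes match
    return overall, avg_a - overall, avg_b - overall


# GPT-generated 92.0 vs Sonnet-generated 86.0 (the gpt-5.4 xhigh row).
print(two_bucket_summary(92.0, 86.0))  # (89.0, 3.0, -3.0)
```

The gap between the two bucket averages is therefore twice the delta shown in either column.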