salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.

Calls: 25
Models: 18
Evaluations: 450
Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026
All eight scoring axes

How each model scores on every dimension

Cells shaded by score — greener is better. Models sorted by overall average.

Average score by model and scorecard dimension.
| Model | Overall | Needle recall | Evidence grounding | False-positive control | Prioritization | Actionability | Sales instinct | Technical accuracy |
|---|---|---|---|---|---|---|---|---|
| GPT-5.4 · high | 92.0 | 93.4 | 94.6 | 91.6 | 90.9 | 94.2 | 93.0 | 93.7 |
| GPT-5.5 · none | 92.0 | 93.8 | 94.9 | 90.2 | 90.1 | 94.1 | 92.3 | 93.4 |
| GPT-5.5 · xhigh | 92.0 | 94.1 | 95.2 | 90.6 | 90.5 | 94.4 | 92.0 | 93.8 |
| GPT-5.4 · xhigh | 92.0 | 93.2 | 95.0 | 91.5 | 90.8 | 94.2 | 92.9 | 93.3 |
| GPT-5.5 · high | 91.7 | 93.8 | 94.5 | 90.9 | 89.2 | 94.5 | 92.1 | 93.4 |
| GPT-5.5 · medium | 91.7 | 93.4 | 94.4 | 88.9 | 90.6 | 94.6 | 92.5 | 93.2 |
| GPT-5.4 · medium | 90.9 | 91.7 | 94.3 | 89.8 | 88.9 | 93.5 | 91.7 | 92.9 |
| GPT-5.4 · none | 90.8 | 91.9 | 93.9 | 88.8 | 89.4 | 92.9 | 91.1 | 93.1 |
| GPT-5.5 · low | 90.8 | 92.6 | 94.0 | 88.9 | 89.6 | 93.8 | 91.6 | 92.3 |
| GPT-5.4 · low | 90.3 | 91.0 | 93.2 | 89.1 | 88.8 | 93.4 | 90.9 | 92.4 |
| Claude Opus 4.7 · max | 90.2 | 92.1 | 92.6 | 86.0 | 88.5 | 93.6 | 91.1 | 91.8 |
| Claude Opus 4.7 · high | 89.6 | 90.7 | 91.9 | 85.8 | 87.6 | 92.9 | 90.3 | 91.5 |
| Claude Opus 4.7 · xhigh | 89.4 | 91.2 | 91.8 | 86.2 | 87.3 | 92.8 | 90.2 | 91.6 |
| Claude Opus 4.7 · medium | 89.0 | 90.1 | 91.7 | 86.2 | 85.8 | 92.3 | 89.7 | 90.9 |
| Claude Sonnet 4.6 · default | 88.8 | 90.8 | 89.8 | 83.1 | 87.1 | 92.5 | 90.4 | 89.7 |
| Claude Opus 4.7 · low | 87.6 | 88.0 | 90.8 | 84.0 | 84.8 | 90.8 | 87.8 | 90.6 |
| DeepSeek V4 Pro · default | 85.8 | 86.1 | 88.9 | 81.2 | 83.3 | 88.9 | 86.3 | 88.0 |
| Gemini 3.1 Pro Preview · default | 82.4 | 79.6 | 88.1 | 79.8 | 80.2 | 85.6 | 84.0 | 86.2 |
| Mean | 89.8 | 91.0 | 92.8 | 87.4 | 88.0 | 92.7 | 90.5 | 91.8 |
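
As a sanity check on the headline numbers, the sketch below recomputes the evaluation count and the Overall mean from the per-model Overall scores in the table above. It assumes the published mean is the unweighted average across the 18 configurations (each covers the same 25 calls); the page does not state how the mean is computed, and the variable names are illustrative only.

```python
from statistics import mean

# Overall scores copied from the table above, one per model configuration.
overall = {
    "GPT-5.4 · high": 92.0,
    "GPT-5.5 · none": 92.0,
    "GPT-5.5 · xhigh": 92.0,
    "GPT-5.4 · xhigh": 92.0,
    "GPT-5.5 · high": 91.7,
    "GPT-5.5 · medium": 91.7,
    "GPT-5.4 · medium": 90.9,
    "GPT-5.4 · none": 90.8,
    "GPT-5.5 · low": 90.8,
    "GPT-5.4 · low": 90.3,
    "Claude Opus 4.7 · max": 90.2,
    "Claude Opus 4.7 · high": 89.6,
    "Claude Opus 4.7 · xhigh": 89.4,
    "Claude Opus 4.7 · medium": 89.0,
    "Claude Sonnet 4.6 · default": 88.8,
    "Claude Opus 4.7 · low": 87.6,
    "DeepSeek V4 Pro · default": 85.8,
    "Gemini 3.1 Pro Preview · default": 82.4,
}

CALLS = 25  # every configuration coaches the same 25 calls

# One evaluation per (configuration, call) pair.
evaluations = len(overall) * CALLS

# Assumption: the headline mean is the unweighted average over configurations.
overall_mean = round(mean(overall.values()), 1)

print(f"{len(overall)} models x {CALLS} calls = {evaluations} evaluations")
print(f"Mean overall score: {overall_mean}")
```

Under that assumption the recomputed figures agree with the 450 evaluations and 89.8 mean reported at the top of the page.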
Coach model only

Estimated coach-run cost

Costs are estimated from the saved visible prompt, the saved coach output, and the AI Gateway's listed input/output rates. Generation and judging are excluded.
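
The estimate described above can be sketched as a simple token-times-rate calculation. This is an illustration, not the site's actual code: `estimate_cost` is a hypothetical helper, and the rates are assumed to be USD per one million tokens, an interpretation that is consistent with the per-call figures listed on this page but is not stated in it. Inputs are the averaged token counts, so results match the published values only after rounding.

```python
def estimate_cost(input_tokens: float, output_tokens: float,
                  rate_in: float, rate_out: float) -> float:
    """Estimated USD cost of one coach call.

    Assumption: rate_in and rate_out are USD per 1,000,000 tokens.
    """
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


CALLS = 25

# Average token counts and listed rates taken from the cost rows on this page.
deepseek = estimate_cost(4521, 3187, rate_in=0.43, rate_out=0.87)
gpt54_none = estimate_cost(4521, 3865, rate_in=2.50, rate_out=15.00)

print(f"DeepSeek V4 Pro: ${deepseek:.4f} per call, ${deepseek * CALLS:.2f} for {CALLS} calls")
print(f"GPT-5.4 none:    ${gpt54_none:.2f} per call, ${gpt54_none * CALLS:.2f} for {CALLS} calls")
```

Because output rates are several times the input rates, output length dominates the estimate for most rows.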

Est. coach spend: $49.30
Median / call: $0.10
Rows: 18
| Model | Score | Cost / call | 25-call cost | Avg input | Avg output | Rate in / out |
|---|---|---|---|---|---|---|
| DeepSeek V4 Pro · default | 85.8 | $0.0047 | $0.12 | 4,521 | 3,187 | $0.43 / $0.87 |
| Gemini 3.1 Pro Preview · default | 82.4 | $0.03 | $0.80 | 4,521 | 1,918 | $2.00 / $12.00 |
| GPT-5.4 · none | 90.8 | $0.07 | $1.73 | 4,521 | 3,865 | $2.50 / $15.00 |
| GPT-5.4 · xhigh | 92.0 | $0.07 | $1.81 | 4,521 | 4,073 | $2.50 / $15.00 |
| GPT-5.4 · medium | 90.9 | $0.07 | $1.82 | 4,521 | 4,106 | $2.50 / $15.00 |
| GPT-5.4 · high | 92.0 | $0.07 | $1.85 | 4,521 | 4,177 | $2.50 / $15.00 |
| GPT-5.4 · low | 90.3 | $0.07 | $1.85 | 4,521 | 4,180 | $2.50 / $15.00 |
| Claude Opus 4.7 · low | 87.6 | $0.09 | $2.33 | 4,521 | 2,820 | $5.00 / $25.00 |
| Claude Sonnet 4.6 · default | 88.8 | $0.10 | $2.59 | 4,521 | 5,993 | $3.00 / $15.00 |
| Claude Opus 4.7 · medium | 89.0 | $0.10 | $2.59 | 4,521 | 3,244 | $5.00 / $25.00 |
| Claude Opus 4.7 · high | 89.6 | $0.12 | $3.03 | 4,521 | 3,946 | $5.00 / $25.00 |
| Claude Opus 4.7 · xhigh | 89.4 | $0.13 | $3.25 | 4,521 | 4,293 | $5.00 / $25.00 |
| Claude Opus 4.7 · max | 90.2 | $0.15 | $3.82 | 4,521 | 5,210 | $5.00 / $25.00 |
| GPT-5.5 · none | 92.0 | $0.17 | $4.16 | 4,521 | 4,791 | $5.00 / $30.00 |
| GPT-5.5 · low | 90.8 | $0.17 | $4.33 | 4,521 | 5,014 | $5.00 / $30.00 |
| GPT-5.5 · medium | 91.7 | $0.17 | $4.34 | 4,521 | 5,027 | $5.00 / $30.00 |
| GPT-5.5 · high | 91.7 | $0.18 | $4.42 | 4,521 | 5,138 | $5.00 / $30.00 |
| GPT-5.5 · xhigh | 92.0 | $0.18 | $4.47 | 4,521 | 5,206 | $5.00 / $30.00 |

Token counts are estimated by character count from the saved coach prompt and the saved structured output. Exact Gateway usage, cache reads, reasoning tokens, and any repair retries were not persisted, so these figures should be read as directional cost per coaching result.
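
The note above says token counts come from a character-count estimate but does not give the ratio used. A minimal sketch of that kind of estimate follows, assuming roughly four characters per token, which is a common rule of thumb for English text and not a figure taken from this page. `estimate_tokens` and `CHARS_PER_TOKEN` are hypothetical names.

```python
import math

# Assumption: ~4 characters per token. The page does not state the ratio
# it actually used; real tokenizers vary by model and by text.
CHARS_PER_TOKEN = 4.0


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate a token count from character length, rounding up."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


sample = "Ask about the renewal timeline before quoting a discount."
print(estimate_tokens(sample))
```

An estimate like this ignores anything the provider bills but does not return in the saved text, such as hidden reasoning tokens, which is one reason the page labels its costs as directional.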