Which models know sales?
Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.
- Calls: 25
- Models: 18
- Evaluations: 450
- Mean: 89.8
Leaderboard
Average 89.8 across 450 evaluations.
| # | Model | Avg | Cost / call | Min | Max | n |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 · high (leader) | 92.0 | $0.07 | 87.0 | 96.0 | 25 |
| 2 | GPT-5.5 · none | 92.0 | $0.17 | 76.0 | 96.0 | 25 |
| 3 | GPT-5.5 · xhigh | 92.0 | $0.18 | 76.0 | 97.0 | 25 |
| 4 | GPT-5.4 · xhigh | 92.0 | $0.07 | 82.0 | 96.0 | 25 |
| 5 | GPT-5.5 · high | 91.7 | $0.18 | 72.0 | 97.0 | 25 |
| 6 | GPT-5.5 · medium | 91.7 | $0.17 | 82.0 | 97.0 | 25 |
| 7 | GPT-5.4 · medium | 90.9 | $0.07 | 81.0 | 96.0 | 25 |
| 8 | GPT-5.4 · none | 90.8 | $0.07 | 82.0 | 96.0 | 25 |
| 9 | GPT-5.5 · low | 90.8 | $0.17 | 84.0 | 96.0 | 25 |
| 10 | GPT-5.4 · low | 90.3 | $0.07 | 78.0 | 96.0 | 25 |
| 11 | Claude Opus 4.7 · max | 90.2 | $0.15 | 72.0 | 96.0 | 25 |
| 12 | Claude Opus 4.7 · high | 89.6 | $0.12 | 71.0 | 95.0 | 25 |
| 13 | Claude Opus 4.7 · xhigh | 89.4 | $0.13 | 75.0 | 95.0 | 25 |
| 14 | Claude Opus 4.7 · medium | 89.0 | $0.10 | 72.0 | 95.0 | 25 |
| 15 | Claude Sonnet 4.6 · default | 88.8 | $0.10 | 84.0 | 95.0 | 25 |
| 16 | Claude Opus 4.7 · low | 87.6 | $0.09 | 70.0 | 96.0 | 25 |
| 17 | DeepSeek V4 Pro · default | 85.8 | $0.0047 | 70.0 | 95.0 | 25 |
| 18 | Gemini 3.1 Pro Preview · default (trailing) | 82.4 | $0.03 | 67.0 | 93.0 | 25 |
| | Filtered mean | 89.8 | | | | |
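As a quick sanity check on the headline figures, the sketch below recomputes the evaluation count and the overall mean from the per-configuration averages in the leaderboard. It assumes every configuration was scored on all 25 calls (the `n` column), so the mean of the averages matches the mean over all evaluations up to rounding. The variable names are illustrative and not part of the benchmark.

```python
# Per-configuration averages copied from the leaderboard, in rank order.
averages = [
    92.0, 92.0, 92.0, 92.0, 91.7, 91.7, 90.9, 90.8, 90.8,
    90.3, 90.2, 89.6, 89.4, 89.0, 88.8, 87.6, 85.8, 82.4,
]
calls_per_config = 25  # the n column is 25 for every row

# Total judged runs: one evaluation per (configuration, call) pair.
evaluations = len(averages) * calls_per_config

# With equal n per row, the mean of row averages approximates the overall mean.
overall_mean = sum(averages) / len(averages)

print(evaluations)             # 450
print(round(overall_mean, 1))  # 89.8
```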
gpt-5.4 high: 87.0–96.0 across 25 calls
Costco Wholesale Proof-of-concept readout for analytics and productivity workflow with Microsoft
Claude Opus 4.7, low → max: 87.6 → 90.2 as effort scales
Methodology
The calls are synthetic, immutable benchmark cases with hidden coaching ground truth.
Generate cases
Each case starts with two companies, a call type, duration, quality target, web research, and role notes.
Write turn by turn
The transcript is produced through many structured LLM calls, one speaker turn at a time, using persona behavior and hidden coaching signals.
Judge semantically
Coach models see only the visible case. A judge model sees hidden ground truth and scores grounded coaching quality.
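The split between what the coach sees and what the judge sees can be sketched as a small data structure. This is a minimal illustration and not the benchmark's actual code: `BenchmarkCase`, `coach_view`, `judge_view`, and all field names and placeholder values are hypothetical. It shows only the information boundary. The judge's scoring is done by a model and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)  # frozen mirrors the "immutable benchmark cases" idea
class BenchmarkCase:
    case_id: str
    transcript: str                       # visible to coach and judge
    hidden_ground_truth: Tuple[str, ...]  # visible to the judge only


def coach_view(case: BenchmarkCase) -> Dict[str, object]:
    """What a coach model receives: the visible case only."""
    return {"case_id": case.case_id, "transcript": case.transcript}


def judge_view(case: BenchmarkCase, coaching_note: str) -> Dict[str, object]:
    """What the judge receives: the visible case, the answer key, and the note to score."""
    return {
        **coach_view(case),
        "hidden_ground_truth": list(case.hidden_ground_truth),
        "coaching_note": coaching_note,
    }


# Placeholder values for illustration; none of these come from the benchmark.
case = BenchmarkCase(
    case_id="example-001",
    transcript="Rep: ... Buyer: ...",
    hidden_ground_truth=("placeholder strength", "placeholder flaw"),
)
coach_input = coach_view(case)
judge_input = judge_view(case, coaching_note="placeholder coaching note")
```

Keeping the answer key out of `coach_view` by construction, instead of filtering it later, makes it harder to leak ground truth into a coach prompt by accident.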
- Calls: 25
- Model configs: 18
- Judged runs: 450
- Score axes: 8
Browse the calls
Pick a call to see the hidden answer key and how each model scored on it.
- Berkshire Hathaway: Data governance discovery across decentralized business units with Collibra
- Pave: Pricing and packaging objection call with Stripe
- Mercury: First discovery for frontend platform consolidation with Vercel
- Delta Air Lines: Enterprise discovery for service management modernization with Atlassian
- Wayfair: Integration deep dive for catalog modernization with MongoDB
- The Home Depot: Renewal save call after usage and support concerns with Twilio
- Apple: Technical security review for zero trust architecture with Palo Alto Networks
- Duolingo: Renewal QBR and expansion planning with Amplitude
- CVS Health: AI contact-center transformation discovery with OpenAI
- Rippling: Product-led expansion discovery for developer workflow with GitHub
- McKesson: HR transformation qualification and stakeholder mapping with Workday
- ExxonMobil: AI governance and safety review for energy operations with Anthropic
- Target: Security architecture review for endpoint consolidation with CrowdStrike
- Linear: Technical demo for observability and incident response with Datadog
- JPMorgan Chase: Technical workshop for search and observability consolidation with Elastic
- Walmart: Executive discovery for AI infrastructure and store operations with NVIDIA
- Amazon: Cloud operating model discussion for internal platform teams with HashiCorp
- Ford Motor Company: Procurement negotiation for workflow automation with ServiceNow
- Toast: Data platform proof-of-concept kickoff with Snowflake
- Canva: Competitive displacement discovery for edge security with Cloudflare
- The Walt Disney Company: Design collaboration demo with brand and asset workflow discussion with Figma
- Sweetgreen: Executive alignment for identity modernization with Okta
- UnitedHealth Group: Healthcare CRM expansion objection handling with Salesforce
- Runway: Security review before developer-tool rollout with Snyk
- Costco Wholesale: Proof-of-concept readout for analytics and productivity workflow with Microsoft
Does spending more on reasoning help?
For families with multiple reasoning settings, this shows the average score at each level.
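The per-level averages for the three families with multiple settings can be read straight off the leaderboard. The sketch below collects them and computes the gain from each family's lowest to highest setting. The ordering none < low < medium < high < xhigh < max is an assumption about how the settings rank, and the dictionary layout is illustrative.

```python
# Average scores by family and reasoning setting, copied from the leaderboard.
# Settings are listed from lowest to highest effort (ordering is assumed).
scores = {
    "GPT-5.4": {"none": 90.8, "low": 90.3, "medium": 90.9, "high": 92.0, "xhigh": 92.0},
    "GPT-5.5": {"none": 92.0, "low": 90.8, "medium": 91.7, "high": 91.7, "xhigh": 92.0},
    "Claude Opus 4.7": {"low": 87.6, "medium": 89.0, "high": 89.6, "xhigh": 89.4, "max": 90.2},
}

gains = {}
for family, by_effort in scores.items():
    levels = list(by_effort)  # dicts preserve insertion order
    lowest, highest = levels[0], levels[-1]
    gains[family] = round(by_effort[highest] - by_effort[lowest], 1)
    print(f"{family}: {lowest} {by_effort[lowest]} -> {highest} {by_effort[highest]} "
          f"(gain {gains[family]:+.1f})")
```

On these numbers the gain from lowest to highest setting is 2.6 points for Claude Opus 4.7, 1.2 for GPT-5.4, and 0.0 for GPT-5.5. The curves are not monotonic: GPT-5.5 at low scores below GPT-5.5 at none, and Opus 4.7 at xhigh scores below Opus 4.7 at high. With 25 calls per configuration, differences of a few tenths of a point may be noise.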