Which models know sales?
26 model configurations coach GPT- and Sonnet-generated synthetic sales calls with hidden ground truth. A judge scores each coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.
- Calls: 50
- Models: 26
- Evaluations: 1300
- Benchmark: 86.2
Benchmark methodology
How the calls were generated, coached, and judged.
This benchmark tests sales coaching instincts, not transcript summarization. Each case is a synthetic B2B sales conversation generated from company research, persona design, and a hidden coaching answer key.
Coach models receive the setup, research, participants, and speaker-labeled transcript. They do not receive the ground-truth labels, hidden needles, evaluator notes, or transcript-generator label. The judge receives both the coach output and the hidden ground truth, then scores semantically.
The current static dataset includes 25 calls from the original GPT-based generation suite and 25 matching calls generated with Claude Sonnet 4.6. Rankings default to all available runs, so rows with broader coverage carry a larger n.
- Generated calls: 50
- Judged runs: 1300
- Models: 26
- Needles: 266
- Duration range: 18–74 min
- Avg duration: 43 min
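The coverage note above (rows with broader coverage carry a larger n) can be illustrated with a minimal sketch. The record fields (`config`, `call_id`, `score`) and the sample values are hypothetical, not taken from the export; only the 50 calls and 26 configurations come from the figures above.

```python
from collections import defaultdict
from statistics import mean

def summarize_runs(runs):
    """Group judged runs by model configuration and return {config: (n, mean_score)}."""
    by_config = defaultdict(list)
    for run in runs:
        by_config[run["config"]].append(run["score"])
    return {config: (len(scores), mean(scores)) for config, scores in by_config.items()}

# Hypothetical judged runs; field names and scores are illustrative only.
runs = [
    {"config": "model-a/high", "call_id": "call-01", "score": 80},
    {"config": "model-a/high", "call_id": "call-02", "score": 90},
    {"config": "model-b/default", "call_id": "call-01", "score": 70},
]
summary = summarize_runs(runs)

# With full coverage, 50 calls x 26 configurations gives the judged-run total.
EXPECTED_RUNS = 50 * 26
```

Here `model-a/high` has n = 2 and `model-b/default` has n = 1, so the first row's average rests on more evidence.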
Generation pipeline
The mocked-zoom app uses Workflow DevKit steps and Vercel AI Gateway for structured generation.
Scenario input
A suite scenario defines seller company, buyer company, call type, duration, turn count, and the intended quality profile.
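As a rough sketch, a scenario of this shape could be modeled as a small immutable record. The class name, field names, and example values are assumptions for illustration; the actual app's schema is not shown in this document.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record mirroring the fields named in the text."""
    seller_company: str
    buyer_company: str
    call_type: str
    duration_minutes: int
    turn_count: int
    quality_profile: str

    def __post_init__(self):
        # Reject obviously invalid inputs before any generation step runs.
        if self.duration_minutes <= 0:
            raise ValueError("duration_minutes must be positive")
        if self.turn_count <= 0:
            raise ValueError("turn_count must be positive")

# Illustrative values only; not a scenario from the benchmark.
example_scenario = Scenario(
    seller_company="Example Seller Inc.",
    buyer_company="Example Buyer LLC",
    call_type="discovery",
    duration_minutes=43,
    turn_count=120,
    quality_profile="mixed",
)
```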
Research brief
The generator runs web research for both companies and asks an LLM to produce a concise, source-grounded sales-call brief.
Hidden eval design
Before any transcript is written, an LLM designs 2 to 6 coaching needles: strengths, flaws, expected evidence, anti-evidence, and coaching implications.
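One way to picture a needle is as a record with a kind, the evidence that should appear, and the evidence that should not. The sketch below is an assumption about the shape, not the app's real schema; only the 2-to-6 count and the listed ingredients come from the text.

```python
from dataclasses import dataclass, field
from typing import List

NEEDLE_KINDS = ("strength", "flaw")

@dataclass
class Needle:
    """Hypothetical hidden coaching needle."""
    kind: str                      # "strength" or "flaw"
    summary: str                   # what the coach should notice
    coaching_implication: str      # what the coach should recommend
    expected_evidence: List[str] = field(default_factory=list)
    anti_evidence: List[str] = field(default_factory=list)

def validate_needles(needles):
    """Enforce the 2-to-6 needle range and the allowed kinds."""
    if not 2 <= len(needles) <= 6:
        raise ValueError("a call needs between 2 and 6 needles")
    for needle in needles:
        if needle.kind not in NEEDLE_KINDS:
            raise ValueError(f"unknown needle kind: {needle.kind}")
    return needles

# Illustrative needles; the wording is invented for this example.
example_needles = validate_needles([
    Needle("strength", "Seller quantified the buyer's pain", "Reinforce the habit"),
    Needle("flaw", "Seller skipped the decision process", "Ask who signs off"),
])
```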
Personas
Seller and buyer personas inherit the hidden coaching signals so their behavior can naturally create or pressure-test those needles.
Transcript turns
The conversation is generated one turn at a time. Each turn chooses the next speaker and writes only that speaker's next spoken contribution.
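The turn-by-turn loop can be sketched as follows. In the real pipeline both decisions would presumably be LLM calls; here `choose_speaker` and `write_turn` are hypothetical plug-in functions, and the two stand-ins below exist only so the loop runs without a model.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, spoken text)

def generate_transcript(
    turn_count: int,
    choose_speaker: Callable[[List[Turn]], str],
    write_turn: Callable[[str, List[Turn]], str],
) -> List[Turn]:
    """Build a transcript one turn at a time from the history so far."""
    transcript: List[Turn] = []
    for _ in range(turn_count):
        speaker = choose_speaker(transcript)    # step 1: pick who talks next
        text = write_turn(speaker, transcript)  # step 2: write only that speaker's turn
        transcript.append((speaker, text))
    return transcript

# Stand-ins for the model calls (assumptions, for illustration only).
def alternate_speakers(transcript: List[Turn]) -> str:
    return "Seller" if len(transcript) % 2 == 0 else "Buyer"

def canned_line(speaker: str, transcript: List[Turn]) -> str:
    return f"{speaker} line {len(transcript) + 1}"

demo_transcript = generate_transcript(3, alternate_speakers, canned_line)
```

Keeping speaker selection separate from turn writing means each step sees the full history but produces only one small output.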
Artifacts
The completed call is rendered into a VTT transcript, replay HTML, audio when available, a video placeholder, Zoom-like recording files, and a manifest JSON.
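For the VTT artifact, a minimal rendering sketch might look like this. It assumes each turn already has start and end times in seconds, which the source does not describe, and it uses the standard WebVTT voice tag (`<v Speaker>`) to carry speaker labels.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def render_vtt(cues) -> str:
    """Render (start, end, speaker, text) tuples as a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)

# Illustrative cue; the timing and wording are invented.
demo_vtt = render_vtt([(0, 2.5, "Seller", "Hi")])
```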
Coach run
Each coach model gets the visible setup, research, participants, and transcript with speaker labels. Hidden ground truth is excluded.
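A simple way to guarantee that exclusion is an allowlist: copy only the visible fields into the coach payload, so nothing hidden leaks through by default. The key names below are hypothetical; the source names the categories but not the schema.

```python
# Fields the coach may see, per the description above (key names are assumptions).
VISIBLE_FIELDS = ("setup", "research", "participants", "transcript")

def build_coach_payload(case: dict) -> dict:
    """Return only allowlisted fields; anything else in the case is dropped."""
    missing = [key for key in VISIBLE_FIELDS if key not in case]
    if missing:
        raise KeyError(f"case is missing visible fields: {missing}")
    return {key: case[key] for key in VISIBLE_FIELDS}

# Illustrative case record with hidden material mixed in.
demo_case = {
    "setup": "Discovery call",
    "research": "Company brief",
    "participants": ["Seller", "Buyer"],
    "transcript": "Seller: Hi\nBuyer: Hello",
    "needles": ["hidden"],
    "evaluator_notes": "hidden",
    "transcript_generator": "hidden",
}
demo_payload = build_coach_payload(demo_case)
```

An allowlist fails safe: a hidden field added to the case record later stays out of the payload unless someone explicitly adds it to `VISIBLE_FIELDS`.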
Judge run
The judge compares the coach output to hidden ground truth, credits semantic matches, penalizes unsupported claims, and returns an eight-axis scorecard.
Benchmark coverage
Calls are bucketed by the input call type and quality profile rather than by inferred labels.
Coverage breakdowns: transcript generator, call type, quality profile.
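Bucketing by input fields rather than inferred labels amounts to counting calls by the values the scenario was created with. A minimal sketch, with hypothetical key names and bucket values:

```python
from collections import Counter

def bucket_calls(calls):
    """Count calls per (call_type, quality_profile) pair taken from scenario inputs."""
    return Counter((call["call_type"], call["quality_profile"]) for call in calls)

# Illustrative calls; the type and profile names are invented.
demo_calls = [
    {"call_type": "discovery", "quality_profile": "strong"},
    {"call_type": "discovery", "quality_profile": "weak"},
    {"call_type": "demo", "quality_profile": "strong"},
    {"call_type": "discovery", "quality_profile": "strong"},
]
demo_buckets = bucket_calls(demo_calls)
```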
Ground truth
The hidden answer key is designed to create needle-in-a-haystack problems for sales coaching.
- Total needles: 266
- Flaws: 129
- Strengths: 137
Models and scoring
Every visible coach output is judged against the same hidden case material.
- Claude Fable 5: 1 configuration (high)
- Claude Opus 4.7: 5 configurations (low, medium, high, xhigh, max)
- Claude Opus 4.8: 5 configurations (low, medium, high, xhigh, max)
- Claude Sonnet 4.6: 1 configuration (default)
- Claude Sonnet 5: 1 configuration (default)
- DeepSeek V4 Pro: 1 configuration (default)
- Gemini 3.1 Pro Preview: 1 configuration (default)
- GLM 5.2: 1 configuration (default)
- GPT-5.4: 5 configurations (none, low, medium, high, xhigh)
- GPT-5.5: 5 configurations (none, low, medium, high, xhigh)
The exported run set includes GPT-5.4, GPT-5.5, Claude Opus 4.8, Claude Opus 4.7, Claude Fable 5, Claude Sonnet 5, Claude Sonnet 4.6, DeepSeek V4 Pro, Gemini 3.1 Pro Preview, and GLM 5.2 configurations. The judge is GPT-5.5 with high reasoning.
The main leaderboard is ranked by a sales coaching benchmark score: 25% overall score, 20% needle recall, 20% sales instinct, 15% prioritization, 10% technical accuracy, and 10% false-positive control. Raw average score and downside floor stay visible as supporting evidence.
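The weighting can be written out as a short sketch. It assumes each component is reported on a 0–100 scale and combined as a plain weighted sum; the axis key names are hypothetical, and only the percentages come from the text.

```python
# Weights in percent, as listed above; they sum to 100.
WEIGHTS = {
    "overall_score": 25,
    "needle_recall": 20,
    "sales_instinct": 20,
    "prioritization": 15,
    "technical_accuracy": 10,
    "false_positive_control": 10,
}

def benchmark_score(axes: dict) -> float:
    """Weighted sum of the six components, each assumed to be on a 0-100 scale."""
    missing = [name for name in WEIGHTS if name not in axes]
    if missing:
        raise KeyError(f"missing axes: {missing}")
    return sum(WEIGHTS[name] * axes[name] for name in WEIGHTS) / 100

# Illustrative component scores, not real benchmark results.
demo_axes = {
    "overall_score": 90,
    "needle_recall": 80,
    "sales_instinct": 70,
    "prioritization": 60,
    "technical_accuracy": 100,
    "false_positive_control": 50,
}
demo_score = benchmark_score(demo_axes)  # 76.5
```

Because needle recall and sales instinct together carry 40% of the weight, a coach note that reads well but misses the hidden needles cannot rank highly on overall score alone.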
Guardrails
The results site is intentionally a static, shareable view over the benchmark output.
- Call pages include the speaker-labeled transcript shown to the coach models, plus setup, hidden needles, and model scores.
- No real customer calls are included. The cases are synthetic, but they are generated from research and realistic persona dynamics. The export includes both GPT-generated and Sonnet-generated transcript suites.
- The judge scores semantically and awards partial credit. It does not use string matching.
- Audio and video artifacts exist in the generation app, but the current score tables evaluate transcript-grounded coaching output.
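The source does not say how partial credit is aggregated. As one possible reading, offered purely as an assumption, the judge could assign each needle a credit between 0 and 1 and report recall as the average:

```python
def needle_recall(credits):
    """Average per-needle credit, scaled to 0-100.

    credits: one value per hidden needle, e.g. 1.0 for a full semantic match,
    0.5 for a partial match, 0.0 for a miss. This scale is an assumption.
    """
    if not credits:
        raise ValueError("at least one needle is required")
    if any(not 0.0 <= credit <= 1.0 for credit in credits):
        raise ValueError("each credit must be between 0 and 1")
    return 100.0 * sum(credits) / len(credits)

# Illustrative case: two full matches, one partial, one miss.
demo_recall = needle_recall([1.0, 0.5, 0.0, 1.0])  # 62.5
```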