Which models know sales?
Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.
- Calls: 25
- Models: 18
- Evaluations: 450
- Mean score: 89.8
Benchmark methodology
How the calls were generated, coached, and judged.
This benchmark tests sales coaching instincts, not transcript summarization. Each case is a synthetic B2B sales conversation generated from company research, persona design, and a hidden coaching answer key.
Coach models receive the setup, research, participants, and speaker-labeled transcript. They do not receive the ground-truth labels, hidden needles, or evaluator notes. The judge receives both the coach output and the hidden ground truth, then scores semantically.
- Generated calls: 25
- Judged runs: 450
- Models: 18
- Needles: 135
- Duration range: 18–74 min
- Average duration: 43 min
Generation pipeline
The mocked-zoom app uses Workflow DevKit steps and Vercel AI Gateway for structured generation.
Scenario input
A suite scenario defines seller company, buyer company, call type, duration, turn count, and the intended quality profile.
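As an illustration, a scenario record with the six inputs named above could be modelled as follows. This is a minimal sketch: the field names, the example values, and the validation rules are assumptions, since the source lists the concepts but not the schema.

```python
from dataclasses import dataclass


# Hypothetical schema: field names and checks are illustrative assumptions.
@dataclass(frozen=True)
class Scenario:
    seller_company: str
    buyer_company: str
    call_type: str         # illustrative value: "discovery"
    duration_minutes: int
    turn_count: int
    quality_profile: str   # illustrative value: "mixed"

    def __post_init__(self) -> None:
        # Reject obviously invalid inputs before any generation work starts.
        if self.duration_minutes <= 0:
            raise ValueError("duration_minutes must be positive")
        if self.turn_count <= 0:
            raise ValueError("turn_count must be positive")


# Made-up example values, not taken from the benchmark.
example = Scenario(
    seller_company="Example Seller Inc.",
    buyer_company="Example Buyer LLC",
    call_type="discovery",
    duration_minutes=43,
    turn_count=120,
    quality_profile="mixed",
)
```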
Research brief
The generator runs web research for both companies and asks an LLM to produce a concise, source-grounded sales-call brief.
Hidden eval design
Before any transcript is written, an LLM designs 2 to 6 coaching needles: strengths, flaws, expected evidence, anti-evidence, and coaching implications.
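One way to picture a needle is as a small record carrying the five elements listed above, with a check that each call has between 2 and 6 of them. The sketch below is an assumption about shape only: the class name, field names, and `validate_needles` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record; the real needle schema is not shown in the source.
@dataclass
class Needle:
    kind: str                                  # "strength" or "flaw"
    summary: str
    expected_evidence: List[str] = field(default_factory=list)
    anti_evidence: List[str] = field(default_factory=list)
    coaching_implication: str = ""


def validate_needles(needles: List[Needle]) -> List[Needle]:
    """Enforce the 2-to-6 needle range and the two allowed kinds."""
    if not 2 <= len(needles) <= 6:
        raise ValueError(f"expected 2 to 6 needles, got {len(needles)}")
    for needle in needles:
        if needle.kind not in ("strength", "flaw"):
            raise ValueError(f"unknown needle kind: {needle.kind!r}")
    return needles
```

Validating the design before the transcript exists mirrors the ordering described above: the answer key is fixed first, and the conversation is written to contain it.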
Personas
Seller and buyer personas inherit the hidden coaching signals so their behavior can naturally create or pressure-test those needles.
Transcript turns
The conversation is generated one turn at a time. Each turn chooses the next speaker and writes only that speaker's next spoken contribution.
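The turn-at-a-time loop can be sketched as below. The two callables stand in for the model calls that pick the next speaker and write that speaker's line; their names, signatures, and the canned stand-ins are assumptions for illustration, not the app's actual code.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def generate_transcript(
    speakers: List[str],
    turn_count: int,
    choose_speaker: Callable[[List[Turn], List[str]], str],
    write_turn: Callable[[str, List[Turn]], str],
) -> List[Turn]:
    """Build a transcript one turn at a time."""
    transcript: List[Turn] = []
    for _ in range(turn_count):
        # Step 1: decide who speaks next, given the conversation so far.
        speaker = choose_speaker(transcript, speakers)
        # Step 2: write only that speaker's next contribution.
        utterance = write_turn(speaker, transcript)
        transcript.append((speaker, utterance))
    return transcript


# Deterministic stand-ins for the model calls (illustration only).
def alternate(transcript: List[Turn], speakers: List[str]) -> str:
    return speakers[len(transcript) % len(speakers)]


def canned_line(speaker: str, transcript: List[Turn]) -> str:
    return f"{speaker} says line {len(transcript) + 1}"


demo = generate_transcript(["Seller", "Buyer"], 4, alternate, canned_line)
```

Passing the full transcript so far into both callables is the point of the design: each new turn can react to everything already said.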
Artifacts
The completed call is rendered into a VTT transcript, replay HTML, audio when available, a video placeholder, Zoom-like recording files, and a manifest JSON file.
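For the VTT artifact, a minimal rendering sketch might look like this. It assumes each turn already has start and end times in seconds and that speakers are marked with WebVTT voice tags (`<v Name>`); whether the app uses voice tags is not stated in the source, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str, str]  # (start_s, end_s, speaker, text)


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def escape_cue_text(text: str) -> str:
    """Escape characters that have special meaning in cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def render_vtt(cues: Iterable[Cue]) -> str:
    """Render cues as a WebVTT document with one voice-tagged cue per turn."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(f"<v {speaker}>{escape_cue_text(text)}")
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```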
Coach run
Each coach model gets the visible setup, research, participants, and transcript with speaker labels. Hidden ground truth is excluded.
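Keeping hidden material out of the coach prompt is easiest to reason about with an allowlist: copy only the fields the coach may see, instead of deleting the ones it may not. The sketch below assumes each case is a dictionary; the key names are hypothetical.

```python
from typing import Any, Dict

# Hypothetical key names for the four visible parts of a case.
VISIBLE_KEYS = ("setup", "research", "participants", "transcript")


def build_coach_input(case: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the coach-visible fields of a case."""
    missing = [key for key in VISIBLE_KEYS if key not in case]
    if missing:
        raise KeyError(f"case is missing visible fields: {missing}")
    # Allowlist copy: any field not named above is dropped, including
    # hidden fields that are added to the case format later.
    return {key: case[key] for key in VISIBLE_KEYS}
```

With a denylist, a newly added hidden field would leak by default; with an allowlist, it stays hidden until someone deliberately exposes it.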
Judge run
The judge compares the coach output to hidden ground truth, credits semantic matches, penalizes unsupported claims, and returns an eight-axis scorecard.
Benchmark coverage
Calls are bucketed by the input call type and quality profile rather than by inferred labels.
[Charts: calls bucketed by call type and by quality profile]
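Bucketing by input labels amounts to counting cases by the values recorded in the scenario input, with no classification of the finished transcript. A minimal sketch, assuming each case is a dictionary with hypothetical `call_type` and `quality_profile` keys and made-up label values:

```python
from collections import Counter
from typing import Dict, Iterable


def bucket_counts(cases: Iterable[Dict[str, str]], key: str) -> Counter:
    """Count cases by the label stored under `key` in the scenario input."""
    return Counter(case[key] for case in cases)


# Made-up labels for illustration; the real label values are not listed here.
sample_cases = [
    {"call_type": "discovery", "quality_profile": "strong"},
    {"call_type": "discovery", "quality_profile": "mixed"},
    {"call_type": "demo", "quality_profile": "mixed"},
]

by_call_type = bucket_counts(sample_cases, "call_type")
by_quality = bucket_counts(sample_cases, "quality_profile")
```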
Ground truth
The hidden answer key is designed to turn each call into a needle-in-a-haystack problem for sales coaching.
- Total needles: 135
- Flaws: 64
- Strengths: 71
Models and scoring
Every visible coach output is judged against the same hidden case material.
- Claude Opus 4.7: 5 configurations (low, medium, high, xhigh, max)
- Claude Sonnet 4.6: 1 configuration (default)
- DeepSeek V4 Pro: 1 configuration (default)
- Gemini 3.1 Pro Preview: 1 configuration (default)
- GPT-5.4: 5 configurations (none, low, medium, high, xhigh)
- GPT-5.5: 5 configurations (none, low, medium, high, xhigh)
The exported run set includes GPT-5.4, GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, and Gemini 3.1 Pro Preview configurations. The judge configuration is GPT-5.5 with high reasoning.
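The headline counts follow directly from this configuration matrix. The sketch below uses the model names and settings from this page; the dictionary layout and variable names are illustrative assumptions.

```python
# Model configurations as listed on this page (layout is illustrative).
CONFIGS = {
    "Claude Opus 4.7": ["low", "medium", "high", "xhigh", "max"],
    "Claude Sonnet 4.6": ["default"],
    "DeepSeek V4 Pro": ["default"],
    "Gemini 3.1 Pro Preview": ["default"],
    "GPT-5.4": ["none", "low", "medium", "high", "xhigh"],
    "GPT-5.5": ["none", "low", "medium", "high", "xhigh"],
}
CALLS = 25

# One coach configuration per (model, setting) pair.
coach_configs = [
    (model, setting)
    for model, settings in CONFIGS.items()
    for setting in settings
]

# Every configuration coaches every call once, and each output is judged once.
judged_runs = len(coach_configs) * CALLS

print(len(coach_configs))  # 18
print(judged_runs)         # 450
```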
Guardrails
The results site is intentionally a static, shareable view over the benchmark output.
- Call pages include the speaker-labeled transcript shown to the coach models, plus setup, hidden needles, and model scores.
- No real customer calls are included. The cases are synthetic, but they are generated from research and realistic persona dynamics.
- The judge scores semantically and awards partial credit. It does not use string matching.
- Audio and video artifacts exist in the generation app, but the current score tables evaluate transcript-grounded coaching output.