salesevals.com · Evaluated Jul 1, 2026

Which models know sales?

26 model configurations coach synthetic sales calls, generated by GPT and Claude Sonnet, each with hidden ground truth. A judge scores each coaching note from 0 to 100 on whether it found the real strengths, flaws, and next moves.

Calls: 50
Models: 26
Evaluations: 1300
Benchmark: 86.2

50 calls · 1300 evaluations · Rank: Sales coaching benchmark · All available runs · Build-time static data · Evals completed Jul 1, 2026

Synthetic cases, hidden answer keys, semantic judging

Benchmark methodology

How the calls were generated, coached, and judged.

This benchmark tests sales coaching instincts, not transcript summarization. Each case is a synthetic B2B sales conversation generated from company research, persona design, and a hidden coaching answer key.

Coach models receive the setup, research, participants, and speaker-labeled transcript. They do not receive the ground-truth labels, hidden needles, evaluator notes, or transcript-generator label. The judge receives both the coach output and the hidden ground truth, then scores semantically.

The current static dataset includes 25 calls from the original GPT-based generation suite and 25 matching calls generated with Claude Sonnet 4.6. Rankings default to all available runs, so rows with broader coverage carry a larger n.
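As a minimal sketch of what "all available runs" implies, the snippet below groups judged runs by model configuration and reports the mean score together with the run count n. The record layout, model names, and scores are hypothetical placeholders, not the site's export format. With full coverage, each row would have n = 50, which matches 50 calls × 26 configurations = 1300 judged runs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judged-run records: (model configuration, call id, score 0-100).
runs = [
    ("model-a", "call-01", 80.0),
    ("model-a", "call-02", 90.0),
    ("model-b", "call-01", 70.0),
]

def summarize(runs):
    """Group judged runs by model and report the mean score and run count n."""
    by_model = defaultdict(list)
    for model, _call_id, score in runs:
        by_model[model].append(score)
    return {m: {"n": len(s), "mean": mean(s)} for m, s in by_model.items()}

summary = summarize(runs)
```

Showing n next to each mean makes it visible when one row is averaged over fewer calls than another.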

Generated calls: 50
Judged runs: 1300
Models: 26
Needles: 266
Duration range: 18-74m
Avg duration: 43m

From scenario to Zoom-like case

Generation pipeline

The mocked-zoom app uses Workflow DevKit steps and Vercel AI Gateway for structured generation.

Scenario input

A suite scenario defines seller company, buyer company, call type, duration, turn count, and the intended quality profile.
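One way to picture the scenario input is as a small validated record. The sketch below is an assumption about shape only: the class name, field names, and example companies are hypothetical, while the call types and quality profiles are the labels reported in the coverage section of this page.

```python
from dataclasses import dataclass

# Labels taken from the benchmark coverage listing on this page.
CALL_TYPES = {"Discovery", "Product demo", "Renewal save", "QBR", "Competitive displacement"}
QUALITY_PROFILES = {"Excellent", "Mixed", "Flawed"}

@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; field names are illustrative."""
    seller_company: str
    buyer_company: str
    call_type: str
    duration_minutes: int
    turn_count: int
    quality_profile: str

    def __post_init__(self):
        # Reject values outside the known buckets and non-positive sizes.
        if self.call_type not in CALL_TYPES:
            raise ValueError(f"unknown call type: {self.call_type!r}")
        if self.quality_profile not in QUALITY_PROFILES:
            raise ValueError(f"unknown quality profile: {self.quality_profile!r}")
        if self.duration_minutes <= 0 or self.turn_count <= 0:
            raise ValueError("duration and turn count must be positive")

# Placeholder companies, not cases from the dataset.
example = Scenario("Acme Analytics", "Globex Retail", "Discovery", 30, 120, "Mixed")
```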

Research brief

The generator runs web research for both companies and asks an LLM to produce a concise, source-grounded sales-call brief.

Hidden eval design

Before any transcript is written, an LLM designs 2 to 6 coaching needles: strengths, flaws, expected evidence, anti-evidence, and coaching implications.
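A hidden answer key of this kind could be represented roughly as follows. This is a sketch under stated assumptions: the `Needle` class, its field names, and the validation helper are hypothetical, and only the listed ingredients (strength or flaw, expected evidence, anti-evidence, coaching implication) and the 2 to 6 range come from the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Needle:
    """Hypothetical coaching needle; field names are illustrative."""
    kind: str                  # "strength" or "flaw"
    category: str              # e.g. "Discovery" or "Next Steps"
    expected_evidence: str     # what the transcript should contain
    anti_evidence: str         # what would contradict the needle
    coaching_implication: str  # what a good coach should recommend

def validate_answer_key(needles: List[Needle]) -> bool:
    """Check the 2 to 6 needle range and the allowed needle kinds."""
    if not 2 <= len(needles) <= 6:
        raise ValueError("an answer key should hold 2 to 6 needles")
    for needle in needles:
        if needle.kind not in {"strength", "flaw"}:
            raise ValueError(f"unknown needle kind: {needle.kind!r}")
    return True

# Invented example content, not taken from the dataset.
key = [
    Needle("flaw", "Next Steps", "call ends without a dated follow-up",
           "seller books a specific meeting", "close with a concrete next step"),
    Needle("strength", "Research", "seller cites the buyer's recent expansion",
           "seller asks what the company does", "keep opening with account research"),
]
```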

Personas

Seller and buyer personas inherit the hidden coaching signals so their behavior can naturally create or pressure-test those needles.

Transcript turns

The conversation is generated one turn at a time. Each turn chooses the next speaker and writes only that speaker's next spoken contribution.
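The turn-by-turn loop can be sketched as below. The two callables stand in for LLM calls, and their names and signatures are hypothetical; the deterministic stubs exist only so the example runs without a model.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, spoken text)

def generate_transcript(
    speakers: List[str],
    turn_count: int,
    choose_speaker: Callable[[List[Turn], List[str]], str],
    write_turn: Callable[[List[Turn], str], str],
) -> List[Turn]:
    """Build a transcript one turn at a time."""
    transcript: List[Turn] = []
    for _ in range(turn_count):
        # Step 1: pick who speaks next, given the conversation so far.
        speaker = choose_speaker(transcript, speakers)
        # Step 2: write only that speaker's next contribution.
        text = write_turn(transcript, speaker)
        transcript.append((speaker, text))
    return transcript

# Deterministic stand-ins for the model calls (assumptions for illustration).
def alternate(transcript: List[Turn], speakers: List[str]) -> str:
    return speakers[len(transcript) % len(speakers)]

def canned(transcript: List[Turn], speaker: str) -> str:
    return f"{speaker} turn {len(transcript) + 1}"

demo = generate_transcript(["Seller", "Buyer"], 4, alternate, canned)
```

Generating one turn per call keeps each speaker's contribution conditioned on the full history so far, at the cost of one model call per turn.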

Artifacts

The completed call is rendered into VTT transcript, replay HTML, audio when available, video placeholder, Zoom-like recording files, and manifest JSON.
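For the VTT artifact, a minimal rendering sketch might look like this. The cue tuple layout and function names are assumptions; the output follows the standard WebVTT layout of a `WEBVTT` header, `start --> end` timing lines, and `<v Speaker>` voice tags.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str, str]  # (start seconds, end seconds, speaker, text)

def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def render_vtt(cues: List[Cue]) -> str:
    """Render speaker-labeled cues as a WebVTT document.

    Note: this sketch does not escape '&' or '<' inside cue text.
    """
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)

vtt = render_vtt([(0.0, 4.2, "Seller", "Thanks for joining.")])
```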

Coach run

Each coach model gets the visible setup, research, participants, and transcript with speaker labels. Hidden ground truth is excluded.
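A simple way to guarantee that hidden material never reaches a coach model is to build its input from an allow-list rather than by deleting known hidden keys. The key names below are hypothetical; only the split between visible and hidden material comes from the description above.

```python
# Fields the coach is allowed to see (names are illustrative assumptions).
VISIBLE_FIELDS = ("setup", "research", "participants", "transcript")

def build_coach_input(case: dict) -> dict:
    """Copy only allow-listed fields, so any new hidden field is excluded by default."""
    return {key: case[key] for key in VISIBLE_FIELDS if key in case}

# Placeholder case record with both visible and hidden material.
case = {
    "setup": "Discovery call, 30 minutes",
    "research": "Brief on both companies",
    "participants": ["Seller", "Buyer"],
    "transcript": [("Seller", "Thanks for joining.")],
    "ground_truth": {"quality_profile": "Mixed"},
    "needles": ["hidden needle"],
    "evaluator_notes": "hidden",
    "transcript_generator": "hidden",
}

coach_input = build_coach_input(case)
```

An allow-list fails safe: a hidden field added later stays hidden unless someone explicitly exposes it.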

Judge run

The judge compares the coach output to hidden ground truth, credits semantic matches, penalizes unsupported claims, and returns an eight-axis scorecard.
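This page does not spell out how the judge's decisions are aggregated, so the following is only one plausible reading, not the benchmark's actual formulas. It assumes the judge assigns each hidden needle a credit between 0 and 1 (allowing partial credit for a semantic near-match) and counts how many coach claims lack support in the transcript. Both function names and both formulas are hypothetical.

```python
from typing import List

def needle_recall(credits: List[float]) -> float:
    """Assumed recall: average per-needle credit (each in [0, 1]), scaled to 0-100."""
    if not credits:
        raise ValueError("at least one needle is required")
    if any(not 0.0 <= c <= 1.0 for c in credits):
        raise ValueError("credits must lie in [0, 1]")
    return 100.0 * sum(credits) / len(credits)

def false_positive_control(unsupported_claims: int, total_claims: int) -> float:
    """Assumed penalty: share of coach claims that are supported, scaled to 0-100."""
    if total_claims <= 0 or not 0 <= unsupported_claims <= total_claims:
        raise ValueError("claim counts are inconsistent")
    return 100.0 * (1 - unsupported_claims / total_claims)

# Four needles: one full match, two partial matches, one miss.
recall = needle_recall([1.0, 0.5, 0.0, 0.5])
# Four coach claims, one of them unsupported by the transcript.
control = false_positive_control(unsupported_claims=1, total_claims=4)
```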

What is in the static dataset

Benchmark coverage

Calls are bucketed by the input call type and quality profile rather than by inferred labels.

Transcript generator

GPT-generated: 25
Sonnet-generated: 25

Call type

Discovery: 16
Product demo: 20
Renewal save: 4
QBR: 4
Competitive displacement: 6

Quality profile

Excellent: 18
Mixed: 14
Flawed: 18

What the judge knows

Ground truth

The hidden answer key is intended to create sales-coaching needle-in-the-haystack problems.

Total needles: 266
Flaws: 129
Strengths: 137

Discovery: 47
Next Steps: 47
Technical Knowledge: 35
Qualification: 31
Value Alignment: 26
Research: 23
Objection Handling: 22
Communication Style: 16
Executive Alignment: 12
Customer Enablement: 7
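The reported breakdowns can be cross-checked against their totals. The short script below copies the counts from this page and confirms that each breakdown sums as stated; the variable names are illustrative.

```python
# Counts copied from the coverage and ground-truth listings on this page.
calls_by_type = {
    "Discovery": 16, "Product demo": 20, "Renewal save": 4,
    "QBR": 4, "Competitive displacement": 6,
}
calls_by_quality = {"Excellent": 18, "Mixed": 14, "Flawed": 18}
needles_by_kind = {"Flaws": 129, "Strengths": 137}
needles_by_category = {
    "Discovery": 47, "Next Steps": 47, "Technical Knowledge": 35,
    "Qualification": 31, "Value Alignment": 26, "Research": 23,
    "Objection Handling": 22, "Communication Style": 16,
    "Executive Alignment": 12, "Customer Enablement": 7,
}

total_calls = sum(calls_by_type.values())
total_needles = sum(needles_by_category.values())

# Each breakdown should agree with the headline totals (50 calls, 266 needles).
assert total_calls == sum(calls_by_quality.values()) == 50
assert total_needles == sum(needles_by_kind.values()) == 266

needles_per_call = total_needles / total_calls  # a little over 5 needles per call
```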

Coach models plus one judge configuration

Models and scoring

Every visible coach output is judged against the same hidden case material.

Claude Fable 5: 1 configuration (high)
Claude Opus 4.7: 5 configurations (low, medium, high, xhigh, max)
Claude Opus 4.8: 5 configurations (low, medium, high, xhigh, max)
Claude Sonnet 4.6: 1 configuration (default)
Claude Sonnet 5: 1 configuration (default)
DeepSeek V4 Pro: 1 configuration (default)
Gemini 3.1 Pro Preview: 1 configuration (default)
GLM 5.2: 1 configuration (default)
GPT-5.4: 5 configurations (none, low, medium, high, xhigh)
GPT-5.5: 5 configurations (none, low, medium, high, xhigh)

The exported run set includes configurations of GPT-5.4, GPT-5.5, Claude Opus 4.8, Claude Opus 4.7, Claude Sonnet 4.6, Claude Sonnet 5, Claude Fable 5, DeepSeek V4 Pro, Gemini 3.1 Pro Preview, and GLM 5.2, for 26 configurations in total. The judge configuration is GPT-5.5 with high reasoning.

The main leaderboard is ranked by a sales coaching benchmark score: 25% overall score, 20% needle recall, 20% sales instinct, 15% prioritization, 10% technical accuracy, and 10% false-positive control. Raw average score and downside floor stay visible as supporting evidence.
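The weighting above can be written as a short weighted sum. The weights come from the text; the key names, the assumption that every component is on a 0 to 100 scale, and the example values are illustrative. Whether the site combines components per run or after averaging across runs is not stated here.

```python
# Weights as stated in the text; key names are illustrative.
WEIGHTS = {
    "overall_score": 0.25,
    "needle_recall": 0.20,
    "sales_instinct": 0.20,
    "prioritization": 0.15,
    "technical_accuracy": 0.10,
    "false_positive_control": 0.10,
}

def benchmark_score(components: dict) -> float:
    """Weighted sum of the six components, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in WEIGHTS.items())

# Invented component values for one model row.
example_components = {
    "overall_score": 90,
    "needle_recall": 80,
    "sales_instinct": 85,
    "prioritization": 70,
    "technical_accuracy": 95,
    "false_positive_control": 60,
}
score = benchmark_score(example_components)  # 22.5 + 16 + 17 + 10.5 + 9.5 + 6, about 81.5
```

Because the weights sum to 100%, the result stays on the same 0 to 100 scale as its inputs.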

What this site does and does not show

Guardrails

The results site is intentionally a static, shareable view over the benchmark output.

Call pages include the speaker-labeled transcript shown to the coach models, plus setup, hidden needles, and model scores.
No real customer calls are included. The cases are synthetic, but they are generated from research and realistic persona dynamics. The export includes both GPT-generated and Sonnet-generated transcript suites.
The judge scores semantically and awards partial credit. It does not use string matching.
Audio and video artifacts exist in the generation app, but the current score tables evaluate transcript-grounded coaching output.