salesevals.com / Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

  • Calls: 25
  • Models: 18
  • Evaluations: 450
  • Mean: 89.8

25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026

Synthetic cases, hidden answer keys, semantic judging

Benchmark methodology

How the calls were generated, coached, and judged.

This benchmark tests sales coaching instincts, not transcript summarization. Each case is a synthetic B2B sales conversation generated from company research, persona design, and a hidden coaching answer key.

Coach models receive the setup, research, participants, and speaker-labeled transcript. They do not receive the ground-truth labels, hidden needles, or evaluator notes. The judge receives both the coach output and the hidden ground truth, then scores semantically.

  • Generated calls: 25
  • Judged runs: 450
  • Models: 18
  • Needles: 135
  • Duration range: 18–74 min
  • Avg duration: 43 min

From scenario to Zoom-like case

Generation pipeline

The mocked-zoom app uses Workflow DevKit steps and Vercel AI Gateway for structured generation.

01. Scenario input

A suite scenario defines seller company, buyer company, call type, duration, turn count, and the intended quality profile.

02. Research brief

The generator runs web research for both companies and asks an LLM to produce a concise, source-grounded sales-call brief.

03. Hidden eval design

Before any transcript is written, an LLM designs 2 to 6 coaching needles: strengths, flaws, expected evidence, anti-evidence, and coaching implications.
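The needle fields listed in this step can be sketched as a small record type. This is a minimal illustration, not the generator's actual schema: the class names `Needle` and `HiddenEvalDesign`, the field names, and the `validate` helper are all assumptions. Only the field concepts and the 2 to 6 range come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Needle:
    """One hidden coaching signal (hypothetical field names)."""
    kind: str                  # "strength" or "flaw"
    expected_evidence: str     # what the transcript should contain
    anti_evidence: str         # what would contradict the needle
    coaching_implication: str  # what a good coach should recommend


@dataclass
class HiddenEvalDesign:
    """The hidden answer key for one call."""
    needles: List[Needle] = field(default_factory=list)

    def validate(self) -> None:
        # The methodology states 2 to 6 needles per call.
        if not 2 <= len(self.needles) <= 6:
            raise ValueError(f"expected 2 to 6 needles, got {len(self.needles)}")
        for needle in self.needles:
            if needle.kind not in ("strength", "flaw"):
                raise ValueError(f"unknown needle kind: {needle.kind!r}")
```

Validating the design before any transcript exists keeps the answer key fixed, so later steps cannot quietly reshape it to fit the generated conversation.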

04. Personas

Seller and buyer personas inherit the hidden coaching signals so their behavior can naturally create or pressure-test those needles.

05. Transcript turns

The conversation is generated one turn at a time. Each turn chooses the next speaker and writes only that speaker's next spoken contribution.
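The turn-by-turn loop described here can be sketched as follows. It is a minimal outline under stated assumptions: `generate_transcript`, `choose_speaker`, and `write_turn` are hypothetical names, and the two stub functions stand in for the model calls the real app makes.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


def generate_transcript(
    speakers: List[str],
    turn_count: int,
    choose_speaker: Callable[[List[Turn]], str],
    write_turn: Callable[[str, List[Turn]], str],
) -> List[Turn]:
    """Build a transcript one turn at a time."""
    transcript: List[Turn] = []
    for _ in range(turn_count):
        # Step 1: pick who speaks next, given the conversation so far.
        speaker = choose_speaker(transcript)
        if speaker not in speakers:
            raise ValueError(f"unknown speaker: {speaker}")
        # Step 2: write only that speaker's next contribution.
        text = write_turn(speaker, transcript)
        transcript.append({"speaker": speaker, "text": text})
    return transcript


# Stubs standing in for LLM calls (assumptions for illustration only).
def alternate(history: List[Turn]) -> str:
    return "Seller" if len(history) % 2 == 0 else "Buyer"


def canned(speaker: str, history: List[Turn]) -> str:
    return f"{speaker} line {len(history) + 1}"
```

Calling `generate_transcript(["Seller", "Buyer"], 3, alternate, canned)` yields three turns that alternate between the two speakers. Separating speaker choice from turn writing means each generation call only ever produces one speaker's words.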

06. Artifacts

The completed call is rendered into VTT transcript, replay HTML, audio when available, video placeholder, Zoom-like recording files, and manifest JSON.
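As an illustration of the VTT artifact, the sketch below renders speaker-labeled cues into WebVTT text using voice tags. The function names and the cue tuple layout are assumptions; the app's real renderer and its timing source are not described on this page.

```python
from typing import List, Tuple

# (start_seconds, end_seconds, speaker, text): a hypothetical cue layout.
Cue = Tuple[float, float, str, str]


def format_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def render_vtt(cues: List[Cue]) -> str:
    """Render cues as WebVTT, labeling each speaker with a <v> voice tag."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)
```

This sketch does not escape special characters in speaker names or cue text, which a production renderer would need to handle.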

07. Coach run

Each coach model gets the visible setup, research, participants, and transcript with speaker labels. Hidden ground truth is excluded.
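One way to enforce this separation is an allowlist over the case record, sketched below. The key names (`setup`, `research`, `participants`, `transcript`) mirror the four inputs named above, but the dictionary layout and the `coach_view` helper are assumptions, not the benchmark's actual code.

```python
from typing import Any, Dict

# Only the four inputs the methodology says coaches receive.
VISIBLE_FIELDS = ("setup", "research", "participants", "transcript")


def coach_view(case: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the case holding only coach-visible fields."""
    return {key: case[key] for key in VISIBLE_FIELDS if key in case}
```

An allowlist fails closed: a hidden field added to the case record later stays hidden unless someone explicitly adds it to `VISIBLE_FIELDS`, whereas a denylist would leak it by default.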

08. Judge run

The judge compares the coach output to hidden ground truth, credits semantic matches, penalizes unsupported claims, and returns an eight-axis scorecard.
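This page does not publish the judge's scoring formula, and the judge is an LLM rather than a fixed function. Purely to make the idea of "credit matches, penalize unsupported claims" concrete, here is a hypothetical aggregation: the function name, the per-needle credit scale, and the penalty size are all invented for illustration, and the sketch produces a single number rather than the eight-axis scorecard.

```python
from typing import List


def overall_score(
    needle_credit: List[float],
    unsupported_claims: int,
    penalty_per_claim: float = 5.0,  # assumed value, not from the benchmark
) -> float:
    """Hypothetical 0-100 score: mean needle credit minus a claim penalty."""
    if not needle_credit:
        raise ValueError("at least one needle is required")
    if any(credit < 0.0 or credit > 1.0 for credit in needle_credit):
        raise ValueError("each credit must be between 0 and 1")
    # Partial credit: 1.0 = fully found, 0.5 = partly found, 0.0 = missed.
    recall = sum(needle_credit) / len(needle_credit)
    raw = 100.0 * recall - penalty_per_claim * unsupported_claims
    # Clamp so heavy penalties cannot push the score below zero.
    return max(0.0, min(100.0, raw))
```

For example, credits of 1.0, 0.5, 1.0, and 0.0 with one unsupported claim give 62.5 minus 5, or 57.5 under these assumed weights.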

What is in the static dataset

Benchmark coverage

Calls are bucketed by the input call type and quality profile rather than by inferred labels.

Call type

  • Discovery: 8
  • Product demo: 10
  • Renewal save: 2
  • QBR: 2
  • Competitive displacement: 3

Quality profile

  • Excellent: 9
  • Mixed: 7
  • Flawed: 9

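Bucketing by input labels amounts to counting on fields taken straight from the scenario definition. The sketch below shows that idea and checks that the published counts each partition the same 25 calls. The `bucket_counts` helper and the idea of scenarios as mappings are assumptions; the counts are the ones listed above.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def bucket_counts(scenarios: Iterable[Mapping[str, str]], key: str) -> Dict[str, int]:
    """Count scenarios by a field read directly from the scenario input."""
    return dict(Counter(scenario[key] for scenario in scenarios))


# Published totals from the coverage section.
CALL_TYPE_COUNTS = {
    "Discovery": 8,
    "Product demo": 10,
    "Renewal save": 2,
    "QBR": 2,
    "Competitive displacement": 3,
}
QUALITY_PROFILE_COUNTS = {"Excellent": 9, "Mixed": 7, "Flawed": 9}

# Each breakdown should account for all 25 calls exactly once.
assert sum(CALL_TYPE_COUNTS.values()) == 25
assert sum(QUALITY_PROFILE_COUNTS.values()) == 25
```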
What the judge knows

Ground truth

The hidden answer key is intended to create sales-coaching needle-in-a-haystack problems.

  • Total needles: 135
  • Flaws: 64
  • Strengths: 71

  • Discovery: 23
  • Next Steps: 23
  • Technical Knowledge: 20
  • Value Alignment: 17
  • Qualification: 15
  • Objection Handling: 11
  • Executive Alignment: 9
  • Research: 7
  • Communication Style: 6
  • Customer Enablement: 4

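The needle tallies can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures published on this page; the variable names are arbitrary.

```python
NEEDLES_BY_CATEGORY = {
    "Discovery": 23,
    "Next Steps": 23,
    "Technical Knowledge": 20,
    "Value Alignment": 17,
    "Qualification": 15,
    "Objection Handling": 11,
    "Executive Alignment": 9,
    "Research": 7,
    "Communication Style": 6,
    "Customer Enablement": 4,
}
STRENGTHS = 71
FLAWS = 64
CALLS = 25

total_needles = sum(NEEDLES_BY_CATEGORY.values())

# The category split and the strength/flaw split describe the same needles.
assert total_needles == STRENGTHS + FLAWS == 135

# Average needles per call; it should sit inside the stated 2 to 6 range.
needles_per_call = total_needles / CALLS
assert 2 <= needles_per_call <= 6
```

With 135 needles across 25 calls, the average is 5.4 needles per call, so most calls sit near the top of the 2 to 6 design range.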
Coach models plus one judge configuration

Models and scoring

For each call, every coach output is judged against the same hidden case material.

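  • Claude Opus 4.7: 5 configurations (low, medium, high, xhigh, max)
  • Claude Sonnet 4.6: 1 configuration (default)
  • DeepSeek V4 Pro: 1 configuration (default)
  • Gemini 3.1 Pro Preview: 1 configuration (default)
  • GPT-5.4: 5 configurations (none, low, medium, high, xhigh)
  • GPT-5.5: 5 configurations (none, low, medium, high, xhigh)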

The exported run set includes GPT-5.4, GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, and Gemini 3.1 Pro Preview configurations. The judge configuration is GPT-5.5 with high reasoning.
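The headline counts follow from the configuration list: 18 coach configurations, each run on all 25 calls, gives 450 judged runs. The sketch below enumerates that run matrix. The model names and settings are the ones listed above, while the `call-01` style identifiers are placeholders, not the benchmark's real call IDs.

```python
from itertools import product

COACH_CONFIGS = {
    "Claude Opus 4.7": ["low", "medium", "high", "xhigh", "max"],
    "Claude Sonnet 4.6": ["default"],
    "DeepSeek V4 Pro": ["default"],
    "Gemini 3.1 Pro Preview": ["default"],
    "GPT-5.4": ["none", "low", "medium", "high", "xhigh"],
    "GPT-5.5": ["none", "low", "medium", "high", "xhigh"],
}

# Flatten to (model, setting) pairs: one entry per coach configuration.
configurations = [
    (model, setting)
    for model, settings in COACH_CONFIGS.items()
    for setting in settings
]

# Placeholder call identifiers (hypothetical naming).
CALL_IDS = [f"call-{i:02d}" for i in range(1, 26)]

# Every configuration coaches every call exactly once.
runs = list(product(CALL_IDS, configurations))

assert len(configurations) == 18
assert len(runs) == 450
```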

What this site does and does not show

Guardrails

The results site is intentionally a static, shareable view over the benchmark output.

  • Call pages include the speaker-labeled transcript shown to the coach models, plus setup, hidden needles, and model scores.
  • No real customer calls are included. The cases are synthetic, but they are generated from research and realistic persona dynamics.
  • The judge scores semantically and awards partial credit. It does not use string matching.
  • Audio and video artifacts exist in the generation app, but the current score tables evaluate transcript-grounded coaching output.