salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has a hidden ground truth. A judge scores every coaching note on a 0–100 scale for whether it found the real strengths, flaws, and next moves.

Calls: 25
Models: 18
Evaluations: 450
Mean: 89.8
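
The evaluation count follows from the grid of calls and model configurations. A minimal sketch of that arithmetic, assuming each configuration coaches each call exactly once (the variable names are illustrative, not taken from the site):

```python
# Figures come from the page summary; names are illustrative only.
calls = 25
model_configs = 18

# Assumption: every model configuration coaches every call exactly once.
evaluations = calls * model_configs
print(evaluations)  # 450
```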

25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026


The 25 calls

Open a call to read its answer key and how every model did on it.

Canva: Competitive displacement discovery for edge security with Cloudflare

Competitive displacement · flawed · 47m · 36 turns

Seller: Cloudflare
Buyer: Canva

Design the call as a plausible but underperforming competitive displacement discovery. The Cloudflare seller is category-literate and can name relevant edge security/performance capabilities, but repeatedly converts buyer signals into a broad platform pitch. The biggest coaching opportunities are listening discipline, deeper current-state discovery, regional latency nuance, and turning competitive displacement into a measured migration plan rather than a feature tour.

Profile: Flawed
Flaws / Strengths: 4 / 1
Duration: 47m · 36 turns

What this call should surface

− flaw

Turns incumbent discussion into a Cloudflare feature monologue

Discovery · moderate

− flaw

Interrupts or flattens buyer nuance about regional latency

Communication Style · subtle

− flaw

Does not adequately qualify current architecture, ownership, or switching threshold

Qualification · moderate

− flaw

Proposes vague technical follow-up without a concrete migration-safe evaluation plan

Next Steps · moderate

+ strength

Shows credible edge security and performance category fluency

Technical Knowledge · obvious
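
One way to picture the answer key above is as a small list of records. This is only an illustrative sketch: the `Needle` class and its field names are hypothetical (the page does not publish a schema), and `difficulty` is an assumed label for the obvious/moderate/subtle tag shown next to each category.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Needle:
    """One hidden finding a coaching note is expected to surface (hypothetical schema)."""
    kind: str        # "flaw" or "strength"
    finding: str     # text copied from the answer key
    category: str    # e.g. "Discovery"
    difficulty: str  # assumed name for the obvious / moderate / subtle tag


ANSWER_KEY = [
    Needle("flaw", "Turns incumbent discussion into a Cloudflare feature monologue",
           "Discovery", "moderate"),
    Needle("flaw", "Interrupts or flattens buyer nuance about regional latency",
           "Communication Style", "subtle"),
    Needle("flaw", "Does not adequately qualify current architecture, ownership, or switching threshold",
           "Qualification", "moderate"),
    Needle("flaw", "Proposes vague technical follow-up without a concrete migration-safe evaluation plan",
           "Next Steps", "moderate"),
    Needle("strength", "Shows credible edge security and performance category fluency",
           "Technical Knowledge", "obvious"),
]

# Tally by kind; this should match the "Flaws / Strengths: 4 / 1" line in the call summary.
counts = Counter(n.kind for n in ANSWER_KEY)
print(counts["flaw"], counts["strength"])  # 4 1
```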

36 speaker turns · 47m timeline

Transcript

The exact speaker-labeled transcript the coach models saw.

Marcus Chen (Seller) · Leah Morrison (Buyer) · Priya Raman (Seller) · Daniel Okafor (Buyer)

0:00
MC
Marcus Chen
Seller
Hi everyone, thanks for making the time. I’m Marcus Chen, I lead our strategic relationship efforts for Cloudflare with digital-native companies like Canva. Priya’s joined from our solutions engineering team as well. I know we’ve only got 30 minutes, so my goal isn’t to walk through every Cloudflare product. It’s really to understand, at a high level, how your current edge performance and security setup is working, where it’s solid, and where there might be gaps around global latency, bot activity, WAF, DDoS, API protection—those sorts of things. Then if there’s a fit, we can decide whether a deeper technical session makes sense. Does that work as a rough agenda?
2:38
LM
Leah Morrison
Buyer
Yeah, that works. I’m Leah, I look after parts of our edge platform and traffic routing. Mostly I’m here to sanity-check whether there’s anything materially different we should be looking at, especially around performance outside our strongest regions.
3:35
PR
Priya Raman
Seller
Hi, everyone — Priya Raman, solutions engineering at Cloudflare. I’ll mostly listen for the technical shape here: traffic patterns, cache behavior, security controls, and where a follow-up would need more depth.
4:22
DO
Daniel Okafor
Buyer
And I’m Daniel, security engineering. I’m mostly listening for the WAF, bot, API protection side — and how any change would avoid creating new operational risk.
5:02
MC
Marcus Chen
Seller
Great, thanks both. Leah, what does the current edge stack look like at a high level?
5:28
LM
Leah Morrison
Buyer
At the highest level, it’s a mix. We use a primary CDN and edge security provider for most public traffic, with some cloud-native controls closer to origin, and then a few specialized pieces around bot and abuse that Daniel’s team owns more directly. It’s not broken, to be clear. The incumbent setup is pretty embedded in our deploy tooling, cache purge flows, observability, incident runbooks — all the boring but important stuff. Where we still spend time is performance consistency by region and by journey. Static assets are one thing; the design editor, collaboration paths, exports, login, sharing links — those behave differently. So when someone says “CDN performance,” we usually have to unpack which part of the experience they mean.
8:22
MC
Marcus Chen
Seller
Yeah, that makes sense — and honestly that’s exactly where we see teams start to look at Cloudflare. Not because the current stack is on fire, but because there are a lot of moving parts between CDN, WAF, bot tooling, cloud controls, API protection, and then all the routing logic around it. Where Cloudflare tends to be different is that we’re running those controls on one global edge network, with CDN, DDoS, WAF, Bot Management, API Shield, Load Balancing, even Workers if you need edge logic closer to the user, all in the same control plane. So instead of tuning one provider for cache, another for bot decisions, another for app-layer rules, you can consolidate policy and get more consistent behavior globally. For a platform like Canva, with heavy media, collaboration, login and sharing flows, that can matter a lot for both latency and operational overhead.
11:52
LM
Leah Morrison
Buyer
Yeah, I get the appeal of a single control plane. I’d just be careful not to assume consolidation is the main problem. Some of the pain is very specific — like cache miss behavior in certain regions versus dynamic calls from the editor — and our current stack is predictable even when it’s not perfect.
13:13
MC
Marcus Chen
Seller
Totally, and I’m not saying rip-and-replace just for simplicity. I guess the point is, when those cache miss and dynamic-path issues show up regionally, having the security, caching, routing, and edge compute layers on the same network gives you more levers to tune. We see that a lot where teams start with one painful geography and then expand from there. Which regions are most visible for you right now?
14:53
LM
Leah Morrison
Buyer
The noisy ones move around, but India and parts of Southeast Asia come up a lot in our RUM views. LATAM is a bit different, and Australia/New Zealand is different again. And it’s not just “edge is far away,” right. For static template previews or exported assets, cache hit ratio and purge behavior matter more. For the editor, especially collaborative sessions, we’re looking at dynamic API paths, WebSocket-ish collaboration traffic, origin routing, and sometimes just mobile network variability. So p95 can look fine globally while p99 for a specific journey in, say, Mumbai or Jakarta is pretty ugly.
17:15
MC
Marcus Chen
Seller
Yeah, totally — so the common thread is the edge path isn’t consistent enough in those growth regions. That’s a place where our global network and routing layer can make a pretty material difference, because you’re not just relying on one CDN POP pattern; you can combine caching, load balancing, Workers for lightweight logic, and security decisions at the same edge. Priya can go deeper there too.
18:53
PR
Priya Raman
Seller
Yeah — maybe just to separate two things, Leah. For the Mumbai/Jakarta p99 cases, do you typically see that as cache miss and origin fetch time, or more on the dynamic API/collab path after the page is already warm? Those would lead us to very different test shapes.
20:04
LM
Leah Morrison
Buyer
Mostly the latter for the editor. The initial shell and a lot of media assets are usually warm enough, though cache misses still hurt for template previews and exports after purges. For the ugly p99s, it’s more often the dynamic path — API calls that end up taking a less ideal route back to origin, plus collaboration traffic where a small amount of jitter is really visible to the user. India is a good example where mobile network variability muddies the picture. Jakarta looks different. So we’d need to be pretty specific about what we were testing, otherwise the averages can make everything look fine.
22:36
MC
Marcus Chen
Seller
Yeah, that’s helpful. And I think that’s where Cloudflare can give you a broader set of knobs than a traditional CDN path — smart routing, load balancing, Workers at the edge for lightweight request handling, plus the WAF and bot decisions happening in the same place. So you’re not treating performance and security as separate hops in the chain.
24:02
DO
Daniel Okafor
Buyer
Can I jump in on the security side for a second? The single-place-to-enforce part is interesting, but for us the hard bit is proving we don’t introduce false positives or policy gaps during a cutover. Login, signup, sharing, API traffic — those behave very differently. So I’d want to understand what you’d actually measure there.
25:23
MC
Marcus Chen
Seller
Yeah, no, that’s exactly the right concern. Typically we’d baseline false positives, challenge rates, allowed versus blocked bot traffic, WAF rule matches, API anomalies — and then Cloudflare gives you Bot Management, managed WAF rules, API Shield, rate limiting, DDoS protection, logs into your SIEM, all from the same edge policy layer. So you can tune policies centrally rather than stitching decisions across tools.
26:57
DO
Daniel Okafor
Buyer
Right. The product coverage makes sense. I’m more asking how we’d prove parity before we trust it in front of users.
27:30
PR
Priya Raman
Seller
Yeah, fair distinction. I’d separate policy translation from enforcement. The usual safer pattern is log-only first: mirror the important WAF and bot policies, compare rule matches, challenge decisions, bot scores, and allowed/blocked outcomes against what your current stack is doing for login, signup, sharing, and API traffic. Then you only promote the narrow bits that look clean. We’d need enough event history and labels from your side to make that comparison meaningful, though.
29:16
DO
Daniel Okafor
Buyer
That’s closer, yeah. We’d need to see that against our real labels, not just synthetic traffic, but log-only is the right direction.
29:51
MC
Marcus Chen
Seller
Yeah, and the nice thing is you don’t have to think of that as a totally separate security exercise from the performance side Leah was describing. In a lot of these environments, Cloudflare is sitting in front for CDN, WAF, bot, API protection, DDoS, and then you can add Workers or load balancing where you need more control. So the evaluation can show both: do we improve edge posture and can we simplify the operating model versus the current mix.
31:47
LM
Leah Morrison
Buyer
Simplification is useful, but our current setup is pretty embedded in runbooks and observability. So for us it wouldn’t be “can Cloudflare do these things,” it’s more whether there’s a specific slice where the upside justifies the disruption.
32:44
MC
Marcus Chen
Seller
Yeah, that makes sense. We definitely wouldn’t be advocating a big-bang replacement. What I’ve seen work is we pick a lane where there’s already some pressure — maybe regional performance, maybe bot abuse, maybe API protection — and have our technical team walk you through how Cloudflare would sit alongside the current stack first. Then if the architecture feels right, you can decide whether it’s worth going deeper.
34:24
LM
Leah Morrison
Buyer
Yeah. I’d be hesitant to call it a lane until we’ve pinned down the workload. Regional performance, for example, means very different things for image delivery versus collaboration APIs.
35:08
MC
Marcus Chen
Seller
Totally, and I don’t want to overcomplicate it either. Whether it’s image delivery or collaboration APIs, the common thread is getting decisions and content closer to the user and reducing trips back to origin. That’s where Cloudflare’s network, cache controls, Workers, and load balancing can give you more knobs than a traditional CDN path. We can unpack which workload makes sense in the tech session.
36:43
LM
Leah Morrison
Buyer
Sort of, but that’s not quite the whole issue. For collaboration APIs, proximity helps less if the bottleneck is consistency, origin routing, or mobile network variability. That’s why I’m cautious about treating those as the same test.
37:39
PR
Priya Raman
Seller
Yeah, Leah, I agree with that distinction. For collaboration paths I’d be careful calling it a CDN test. We’d want to separate cacheable asset delivery from dynamic API behavior and look at p95 by region and endpoint class. Maybe the follow-up is just us looking at a couple of representative flows, not assuming one pattern covers both.
39:03
DO
Daniel Okafor
Buyer
That’s the minimum bar from my side. If we do a follow-up, I’d want it anchored on one or two real flows, not a general platform demo.
39:44
MC
Marcus Chen
Seller
Absolutely. We can keep it practical, not a generic demo. I’ll have Priya and one of our edge specialists come prepared to map a couple of flows against how Cloudflare would handle CDN, WAF, bot signals, API protection, logs, the whole thing. Then you can sanity-check whether there’s enough there to justify a deeper evaluation.
41:05
DO
Daniel Okafor
Buyer
Okay. I’d just be careful with “the whole thing.” If we take the time, I’d rather pick, say, login abuse or a sharing flow and understand what we’d measure.
41:50
MC
Marcus Chen
Seller
Fair point — yeah, let’s not boil the ocean. Login abuse is probably a good starting example, and we can also touch the sharing flow if time allows. I’ll send over a proposed agenda and we’ll make it more of a working session than a deck walk-through.
42:59
LM
Leah Morrison
Buyer
Okay, that’s fine. Send the agenda around and we’ll react to it. If it’s login abuse, Daniel’s team is obviously relevant; if you pull in performance claims, I’ll need to decide whether someone from SRE joins as well.
43:56
MC
Marcus Chen
Seller
Yep, that makes sense. I’ll keep the agenda centered on login abuse, with sharing as optional, and we’ll flag any performance-related pieces separately so you can decide who needs to be there.
44:45
DO
Daniel Okafor
Buyer
Okay. If you can send that with a couple of proposed measurement points, we’ll see who on our side should join.
45:18
MC
Marcus Chen
Seller
Will do. Priya and I’ll put a lightweight agenda together with a few suggested metrics around bot activity, false positives, and user friction, and I’ll send it over later today. Appreciate the time, both of you.
46:12
LM
Leah Morrison
Buyer
Thanks, Marcus. Send it through and we’ll take a look on our side. Talk soon.
46:37
MC
Marcus Chen
Seller
Thanks both. I’ll follow up by email later today. Have a good rest of your day.

Sorted by overall

How each model scored this call


Rank 1 · Score 93 · gpt-5.5 high · Best · Excellent alignment with the hidden ground truth

Overall: 93
Needle recall: 96
Evidence grounding: 95
False-positive control: 93
Prioritization: 94
Actionability: 95
Sales instinct: 94
Technical accuracy: 94
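
The page does not say how Overall is derived from the seven sub-scores. As a quick check under that uncertainty, the sketch below computes the unweighted mean of this row's sub-scores (the dictionary and variable names are illustrative). The mean lands above the reported Overall, so Overall is presumably not a plain average of the sub-scores.

```python
# Sub-scores copied from the gpt-5.5 high row; names are illustrative only.
sub_scores = {
    "Needle recall": 96,
    "Evidence grounding": 95,
    "False-positive control": 93,
    "Prioritization": 94,
    "Actionability": 95,
    "Sales instinct": 94,
    "Technical accuracy": 94,
}
reported_overall = 93

# Unweighted mean of the seven sub-scores: 661 / 7.
unweighted_mean = sum(sub_scores.values()) / len(sub_scores)
print(round(unweighted_mean, 1))           # 94.4
print(unweighted_mean > reported_overall)  # True
```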

How this model did

The coach output accurately identifies the call as credible but underperforming: Marcus is category-fluent, opens well, and secures a soft follow-up, but repeatedly converts Canva’s nuanced current-state signals into broad Cloudflare platform positioning. The coach strongly captured the core flaws around feature monologuing, flattening latency nuance, shallow displacement qualification, and insufficiently concrete migration-safe next steps. It also correctly credited Priya’s technical precision and the team’s category fluency. Minor issue: the coach slightly over-rewards the next step by calling it a useful earned follow-up and scoring it a 7, though it still clearly notes the lack of success criteria, owners, data requirements, and rollback path.

Strongest findings

Correctly identified the broad Cloudflare platform pitch after Leah’s incumbent/current-state description as the main discovery failure.
Accurately captured Marcus’s tendency to collapse Canva’s regional latency nuance into a generic edge-proximity narrative, with Leah’s correction as evidence.
Strongly diagnosed the displacement qualification gap: no deep exploration of incumbent strengths, pain sufficient to switch, ownership, timing, process, or success criteria.
Correctly distinguished Marcus’s product-list answer from Priya’s stronger validation-method answer on security parity and false positives.
Provided highly actionable coaching: reflect exact buyer nuance, ask one diagnostic question, narrow the wedge, define metrics, and answer proof questions with proof design.

Biggest misses

The coach slightly overstates the quality of the follow-up by calling it a useful earned next step and assigning a 7, when the hidden ground truth treats it as soft and low-commitment.
The coach could have been even sharper that the buyer, not Marcus, largely forced the narrowing of scope to login abuse and one or two real flows.
The coach did not explicitly frame the outcome as mixed-negative, though its substance strongly implies that view.

Rank 2 · Score 92 · gpt-5.4 xhigh · Strong pass: the coach accurately identified the core flawed-call pattern and grounded the feedback well in the transcript.

Overall: 92
Needle recall: 94
Evidence grounding: 96
False-positive control: 94
Prioritization: 91
Actionability: 94
Sales instinct: 93
Technical accuracy: 95

How this model did

The coach output is closely aligned with the hidden ground truth. It correctly frames the call as mixed: credible Cloudflare category knowledge and a decent opening, but undercut by Marcus repeatedly turning buyer signals into a broad platform narrative, flattening regional/workload nuance, and failing to fully qualify the incumbent displacement case. It also captures that Priya’s more precise technical interventions were stronger than Marcus’s broad positioning. The main calibration issue is that the coach is slightly generous on the next step, calling it a productive advancement and scoring it a 7, when the hidden benchmark emphasizes that the follow-up remained soft and insufficiently mutual-action-planned. Still, the coach also notes that the buyer forced the narrowing and that no date, inputs, owners, or full evaluation guardrails were secured, so this is a minor over-optimism rather than a miss.

Strongest findings

Correctly identifies capability stacking as the central sales flaw: Marcus repeatedly turns nuanced buyer cues into broad Cloudflare platform lists.
Accurately distinguishes Marcus’s weaker broad framing from Priya’s stronger technical precision around p99, cacheable assets, dynamic APIs, and log-only security validation.
Strongly captures the competitive displacement gap: Canva’s incumbent is embedded and predictable, yet Marcus never uncovers the outcome or proof threshold that would justify disruption.
Grounds claims in specific buyer corrections, especially Leah’s “consolidation is not the main problem” and “that’s not quite the whole issue,” and Daniel’s insistence on proving parity before trusting Cloudflare in front of users.
Provides actionable coaching: diagnose before differentiating, ask incumbent-displacement questions, narrow to one use-case hypothesis, and convert the next step into a mutual action plan.

Biggest misses

The coach is slightly generous on next-step quality. The benchmark treats the follow-up as soft and under-scoped; the coach calls it a productive advancement, though it does acknowledge missing date, inputs, confirmed attendees, and full evaluation criteria.
The coach could have emphasized migration safety even more explicitly around phased rollout, rollback, shadow/canary testing, and coexistence with the incumbent. It mentions rollback logic in the coaching plan, but the critique is less prominent than the hidden ground truth.
The coach credits the team’s security risk handling as a high strength because Priya proposed log-only policy comparison. This is transcript-grounded and fair, but it may slightly soften the broader benchmark point that the seller did not create a complete migration-safe POC plan.

Rank 3 · Score 91 · gpt-5.4 medium · Strong pass

Overall: 91
Needle recall: 93
Evidence grounding: 95
False-positive control: 91
Prioritization: 89
Actionability: 93
Sales instinct: 91
Technical accuracy: 92

How this model did

The coach output is highly aligned with the hidden ground truth. It correctly frames the call as credible but underperforming: Marcus shows category fluency yet repeatedly turns buyer signals into broad Cloudflare platform positioning, flattens Leah’s regional/workload latency nuance, under-qualifies the incumbent/switching threshold, and leaves with a follow-up that is useful but not a rigorous displacement evaluation plan. The coach is especially strong on evidence grounding and actionability. The main calibration issue is that it slightly over-rewards the quality of the next step and migration-risk handling because Priya did provide a strong log-only concept, but the final mutual action plan still lacked firm success criteria, stakeholders, data commitments, timeline, and rollback structure.

Strongest findings

Correctly identifies platform capability stacking as the central discovery failure in a competitive displacement call.
Accurately highlights Marcus flattening Leah’s nuanced regional/workload latency explanation into a generic edge-network narrative.
Strongly distinguishes Marcus’s broad framing from Priya’s better technical diagnosis around cacheable assets versus dynamic API/collaboration paths.
Correctly flags missing incumbent/switching-threshold discovery: what works today, what is painful enough to change, and what would justify migration disruption.
Provides highly actionable coaching: mirror buyer nuance, map one pain to one capability, ask switch-criteria questions, and build next steps around scoped evaluation metrics.

Biggest misses

The coach is a bit too generous on next-step quality and migration-risk handling relative to the hidden benchmark’s mixed-negative outcome bias.
It could have more sharply emphasized absent timing, renewal, commercial, procurement, and formal evaluation-process discovery.
It labels the call “ultimately productive,” which is directionally acceptable but slightly more positive than the benchmark’s intended soft, low-commitment follow-up.

Rank 4 · Score 91 · gpt-5.5 none · Strong pass with minor calibration issues

Overall: 91
Needle recall: 94
Evidence grounding: 95
False-positive control: 86
Prioritization: 91
Actionability: 93
Sales instinct: 89
Technical accuracy: 94

How this model did

The coach output correctly identifies the core hidden ground truth: this was a credible but under-disciplined competitive displacement call where Marcus repeatedly converted nuanced buyer signals into broad Cloudflare platform positioning, while Priya rescued several moments with sharper technical diagnosis. It captures all major flaws: feature stacking, flattened latency nuance, shallow qualification of incumbent/switching criteria, and an under-specified next step. The main weakness is calibration: the coach sometimes frames the outcome as a “good discovery outcome” and scores the next step more generously than the benchmark’s mixed-negative/soft-follow-up intent. Still, the coaching is highly transcript-grounded, technically accurate, and actionable.

Strongest findings

Correctly identifies product stacking and broad consolidation messaging as the central sales flaw, with strong transcript evidence from Marcus’s repeated Cloudflare capability lists.
Strongly captures the regional latency nuance problem, especially Leah’s distinction between cacheable assets, dynamic collaboration/API paths, p99 by geography, origin routing, and mobile network variability.
Accurately highlights Priya’s value: she asks the more diagnostic questions and proposes safer validation patterns, especially log-only comparison for security policy parity.
Correctly diagnoses missing displacement qualification: what works with the incumbent, what is painful enough to change, who owns decisions, and what measured outcome would justify disruption.
Provides actionable next-step coaching: scope login abuse, define metrics, identify data prerequisites and stakeholders, and avoid turning the follow-up into a broad platform demo.

Biggest misses

The coach could have been more explicitly negative about the follow-up quality as a low-commitment soft next step rather than a fairly successful close.
It did not emphasize shadow traffic/canary/rollback/coexistence as much as the benchmark’s migration-safe evaluation plan calls for, though it does mention log-only comparison and rollback in recommendations.
The overall scoring tone is somewhat generous relative to the hidden ground truth’s “flawed” and “mixed-negative” profile.

Rank 5 · Score 89 · opus 4.7 xhigh · Strong coach output with minor over-crediting

Overall: 89
Needle recall: 92
Evidence grounding: 86
False-positive control: 84
Prioritization: 91
Actionability: 93
Sales instinct: 90
Technical accuracy: 90

How this model did

The coach correctly diagnosed the call as credible but underperforming: Marcus repeatedly converted Canva’s nuanced incumbent/performance/security signals into broad Cloudflare capability positioning, while Priya provided the more disciplined technical discovery and migration-risk framing. The output hits nearly all hidden benchmark needles, especially feature monologuing, flattened latency nuance, and shallow displacement qualification. The main weakness is that it somewhat overstates the strength/concreteness of the next step and contains one unsupported invented detail about Daniel hinting at policy cutover ownership.

Strongest findings

Correctly identified Marcus’s broad Cloudflare platform pitching after incumbent/current-state cues as the central discovery failure.
Strongly captured Leah’s regional latency nuance and Marcus’s tendency to collapse static assets, dynamic APIs, collaboration traffic, origin routing, and mobile variability into a generic proximity/global-edge argument.
Correctly flagged missing displacement qualification: incumbent strengths/gaps, switching threshold, ownership, decision process, and pain magnitude.
Accurately praised Priya’s diagnostic interventions, especially separating cacheable asset delivery from dynamic API/collaboration paths and proposing log-only comparison for security policy parity.
Provided highly actionable coaching: reflect before pitching, use incumbent-gap questions, let the SE lead proof/migration-risk moments, and tighten the follow-up artifact with metrics and rollback criteria.

Biggest misses

The coach somewhat over-rewarded the next step. Hidden ground truth frames the outcome as mixed-negative and low-commitment; the coach called the next step relatively concrete and gave it a 7.
One piece of evidence was invented: Daniel never raised or hinted at “who owns the policy cutover.”
The coach could have been even sharper that Marcus never explored timing, renewal windows, contractual constraints, commercial drivers, or procurement process, though it did mention decision process and budget generally.
The migration/POC critique was present but partially softened by crediting Priya’s log-only answer as if it compensated for the lack of a full mutual action plan.

Rank 6 · Score 88 · gpt-5.5 low · Strong, mostly aligned with the hidden benchmark, with mild over-credit for the next step.

Overall: 88
Needle recall: 90
Evidence grounding: 94
False-positive control: 88
Prioritization: 89
Actionability: 92
Sales instinct: 90
Technical accuracy: 92

How this model did

The coach correctly diagnosed the central pattern: Marcus was credible and category-fluent but repeatedly converted Canva’s nuanced displacement signals into broad Cloudflare platform positioning. It captured the key buyer corrections around regional latency, the security measurement/parity issue, shallow displacement qualification, and Priya’s stronger diagnostic contributions. The main weakness is that the coach was somewhat too generous about the call outcome and next-step quality, describing it as a “qualified next step” and scoring next steps fairly high even though Canva only agreed to review a lightweight agenda and no concrete POC, timeline, data plan, rollback approach, or decision criteria were secured.

Strongest findings

Correctly identifies Marcus’s broad Cloudflare platform/consolidation reflex after Canva describes an embedded incumbent stack.
Strongly grounds the regional-latency coaching in Leah’s correction that collaboration APIs are not solved simply by edge proximity.
Accurately highlights Daniel’s measurement/parity concern and Marcus’s initial mistake of answering with product coverage instead of measurement design.
Recognizes Priya’s value in separating cacheable asset delivery from dynamic API/collaboration paths and policy translation from enforcement.
Provides actionable coaching drills and follow-up questions that map closely to the missed discovery and POC-planning gaps.

Biggest misses

The coach should have been harsher on the softness of the next step; it was an agenda-review commitment, not a true mutual action plan or qualified pilot.
It under-emphasized missing commercial and process qualification such as renewal timing, procurement, contractual constraints, decision process, and risk tolerance.
It somewhat over-framed the discovery as “solid,” even though much of the useful detail came from buyer corrections and Priya rather than Marcus-led discovery.

Rank 7 · Score 88 · sonnet 4.6 · Strong, mostly aligned with the benchmark, with one notable over-credit on next-step quality.

Overall: 88
Needle recall: 86
Evidence grounding: 93
False-positive control: 84
Prioritization: 88
Actionability: 92
Sales instinct: 89
Technical accuracy: 90

How this model did

The coach correctly diagnosed the central pattern: Marcus was credible and category-literate but repeatedly converted buyer nuance into Cloudflare platform positioning before fully qualifying Canva’s current state, incumbent gaps, latency specifics, and switching threshold. The output is well grounded in transcript evidence and gives actionable coaching. The main miss is that it treats the final follow-up as relatively strong and concrete, whereas the benchmark frames it as a soft, low-commitment next step that still lacks a migration-safe evaluation plan, success criteria, stakeholder commitment, timeline, and rollback/coexistence thinking.

Strongest findings

Correctly identified Marcus’s repeated capability stacking and consolidation pitch as the core discovery failure.
Strongly diagnosed the lack of incumbent-vendor discovery, switching-threshold qualification, why-now exploration, and stakeholder mapping.
Accurately captured Leah’s repeated corrections when Marcus flattened distinctions between cacheable media delivery, dynamic APIs, collaboration traffic, routing, and mobile variability.
Correctly highlighted Priya’s diagnostic questions and log-only security framing as the best examples of technical credibility and buyer-centered response.
Provided practical coaching drills and follow-up questions that are directly tied to transcript misses.

Biggest misses

The coach underweighted the benchmark’s weak-next-step/migration-plan critique by portraying the close as meaningfully concrete and scoring it too positively.
It did not sufficiently emphasize missing migration-safety elements such as phased rollout, shadow/canary approach, rollback plan, coexistence with the incumbent, data prerequisites, owners, and timeline.
The overall tone, “qualified success,” is slightly more positive than the benchmark’s mixed-negative outcome bias, where Canva earns only a soft exploratory follow-up and not confidence in a serious displacement path.

Rank 8 · Score 87 · gpt-5.4 high · Strong / mostly aligned with minor calibration issues

Overall: 87
Needle recall: 92
Evidence grounding: 91
False-positive control: 84
Prioritization: 84
Actionability: 90
Sales instinct: 88
Technical accuracy: 89

How this model did

The coach correctly diagnosed the central pattern in the hidden ground truth: Marcus is credible and category-literate, but repeatedly converts nuanced Canva signals into a broad Cloudflare platform/consolidation pitch. The output is well grounded in transcript evidence and captures the main coaching needs around precision before pitch, incumbent-displacement discovery, latency nuance, and risk-proof evaluation. The main weakness is calibration: the coach is somewhat too generous on the call outcome and next-step quality, scoring next-step alignment and migration risk management higher than the benchmark intent warrants. Still, the coach identified every hidden needle at least partially and most of them strongly.

Strongest findings

Correctly identifies Marcus's core habit of converting buyer nuance into broad Cloudflare platform/consolidation messaging.
Excellent grounding on the latency nuance issue, especially the distinction between static asset caching and dynamic collaboration/API paths.
Strong diagnosis of weak displacement discovery: no clear incumbent gap, switching trigger, business impact, or ownership map.
Accurately credits Priya's diagnostic contributions without letting them erase Marcus's discovery and positioning issues.
Actionable coaching plan: precision-before-pitch drills, competitor discovery sequence, proof-plan answers, and mutual action planning.

Biggest misses

The coach should have weighted the weak migration/POC next-step flaw more heavily and scored next-step alignment lower.
The overall call-outcome framing is slightly too favorable versus the benchmark's mixed-negative intent.
The coach could have more explicitly called out the lack of renewal/commercial/procurement/timing qualification, though it did mention these gaps generally.

9 · 86 · gpt-5.5 xhigh · Strong, mostly benchmark-aligned coaching with mild over-optimism on outcome and next-step quality.

Overall: 86

Needle recall: 88

Evidence grounding: 91

False-positive control: 82

Prioritization: 87

Actionability: 90

Sales instinct: 84

Technical accuracy: 91

How this model did

The coach correctly recognized the core pattern in the hidden ground truth: Marcus was credible and category-literate, but repeatedly converted nuanced discovery signals into broad Cloudflare platform positioning. The output is well grounded in the transcript and identifies the most important issues: feature stacking, flattened latency nuance, incomplete incumbent-displacement qualification, and the need for a narrower measurement-led evaluation. The main weakness is calibration: the coach describes the call as a “good call with a positive outcome” and gives relatively high marks to next steps and risk handling, whereas the benchmark frames the outcome as mixed-negative and only a soft, low-commitment follow-up. Still, the substance of the coaching is largely accurate and actionable.

Strongest findings

Accurately identified feature stacking as the dominant sales behavior problem, especially Marcus listing many Cloudflare capabilities after buyer discovery cues.
Correctly highlighted Leah’s correction that collaboration API performance cannot be treated the same as cacheable asset delivery or generic edge proximity.
Strongly captured Daniel’s core concern: proof of parity and false-positive control mattered more than product coverage.
Well-grounded recognition that Priya improved the call by asking sharper technical diagnostic questions and proposing log-only comparison.
Actionable coaching plan: reflect buyer nuance, ask a diagnostic question, limit product mention, qualify displacement threshold, and build measurement-led next steps.

Biggest misses

The coach did not fully align with the benchmark’s mixed-negative outcome calibration; it treated the follow-up as more credible and positive than the transcript supports.
The weakness of the POC/migration plan should have been more central. The coach identified it, but its scoring and summary softened the issue.
The coach could have been more explicit about missing commercial and process qualification: renewal windows, timeline, procurement, contractual constraints, decision process, and switching economics.
The coach praised “good initial current-state discovery” appropriately as an opening move, but should have stressed that Marcus failed to follow Leah’s rich answer with the necessary second- and third-level probes.

10 · 86 · opus 4.7 max · Mostly accurate with one material calibration issue

Overall: 86

Needle recall: 84

Evidence grounding: 88

False-positive control: 78

Prioritization: 87

Actionability: 91

Sales instinct: 88

Technical accuracy: 86

How this model did

The coach correctly diagnosed the core pattern in the benchmark: Marcus is credible and category-literate, but repeatedly spends that knowledge too early through platform/feature stacking, flattens Canva’s regional latency nuance, and fails to do enough displacement qualification. The output is well grounded in transcript evidence and offers actionable coaching. The main miss is that it over-rewards the end-of-call next step as relatively concrete and credible; the hidden benchmark treats that follow-up as still soft, buyer-forced, and lacking a real migration-safe evaluation plan with agreed success criteria, owners, data, timeline, rollout guardrails, or rollback path.

Strongest findings

Accurately identifies Marcus’s recurring feature stacking after buyer-specific signals, with strong transcript evidence.
Correctly highlights that Leah’s regional latency nuance was flattened into generic edge-network/proximity messaging.
Strongly diagnoses the missed discovery around incumbent embeddedness: deploy tooling, purge flows, observability, and runbooks.
Recognizes the important contrast between Marcus’s capability answers and Priya’s methodology-based technical discovery.
Provides highly actionable coaching drills, especially single-pain/single-capability responses and consequence-ladder questions.

Biggest misses

The coach under-penalizes the weak end-of-call follow-up and treats it as more concrete than the benchmark supports.
It does not emphasize enough that there was no real migration-safe POC plan: no shadow/canary, phased rollout, rollback, data prerequisites, or agreed success thresholds.
It somewhat overstates buyer commitment to a log-only login-abuse evaluation; the transcript supports interest, not commitment.
It occasionally uses deal-impact language such as 'rescued' or 'saved' that is plausible but more speculative than strictly transcript-grounded.

11 · 84 · gpt-5.4 none · Strong evaluation with one important calibration issue: the coach correctly found the main listening/product-pitching flaws and grounded them well, but over-rewarded the quality of the follow-up and migration-safe evaluation plan.

Overall: 84

Needle recall: 87

Evidence grounding: 91

False-positive control: 76

Prioritization: 82

Actionability: 88

Sales instinct: 79

Technical accuracy: 90

How this model did

The coach output aligns closely with the hidden ground truth on the core pattern: Marcus is category-literate but repeatedly turns buyer nuance into a broad Cloudflare platform/consolidation pitch, forcing Leah and Daniel to clarify what they actually meant. The coach also accurately credits Priya for the best consultative moments. The main weakness is outcome calibration. The hidden benchmark frames the call as mixed-negative with only a soft, low-commitment follow-up; the coach describes it as a more positive/credible next step and gives high scores for next-step control and handling migration risk. The coach did note missing metrics and current-state gaps, but did not weight those as heavily as the benchmark intended.

Strongest findings

Correctly identified Marcus’s repeated tendency to convert buyer current-state signals into broad Cloudflare platform positioning.
Strongly grounded the latency-nuance issue with Leah’s corrections about cacheable assets versus dynamic APIs/collaboration traffic.
Accurately called out Daniel’s proof/parity question being answered first with product inventory rather than methodology.
Appropriately credited Priya for the call’s best technical discovery and risk-reduction moments.
Useful actionable coaching plan around precision listening, methodology before features, sharper wedge creation, and current-state discovery.

Biggest misses

Underweighted the weak qualification of switching threshold, timeline, renewal/commercial constraints, evaluation process, and internal decision path.
Over-rewarded the follow-up despite the absence of a concrete POC, success criteria, stakeholder commitment, timeline, rollback plan, or data-sharing agreement.
Treated Priya’s log-only security parity suggestion as stronger migration-risk handling than the overall displacement context supports.
Discovery and next-step scores were too generous for a call the benchmark intended as flawed and mixed-negative.

12 · 84 · gpt-5.5 medium · Pass with caveats

Overall: 84

Needle recall: 88

Evidence grounding: 92

False-positive control: 78

Prioritization: 86

Actionability: 90

Sales instinct: 84

Technical accuracy: 90

How this model did

The coach output is strongly grounded and identifies the main behavioral pattern: Marcus is credible but repeatedly turns Canva’s nuanced displacement discovery into broad Cloudflare positioning, while Priya’s more precise interventions rescue parts of the call. It hits the feature-monologue, latency-nuance, qualification, and category-fluency needles well. The main weakness is that it slightly over-rewards the ending as a “qualified” or “concrete” next step; the transcript supports only a soft follow-up agenda, not a real migration-safe evaluation plan with owners, baselines, success criteria, data prerequisites, and rollback/canary mechanics.

Strongest findings

Correctly identifies Marcus’s broad platform stacking after Leah’s incumbent/current-state cue.
Correctly catches Leah’s repeated corrections that Canva’s latency problems differ by workload, region, endpoint class, and cause.
Correctly contrasts Marcus’s product-coverage answer to Daniel with Priya’s stronger proof-oriented log-only validation answer.
Correctly notes missing displacement qualification around incumbent strengths, operational disruption, success criteria, ownership, and business justification.
Provides actionable coaching drills that map well to the transcript: reflect nuance before positioning, design proof plans before naming SKUs, and scope POCs around one real flow.

Biggest misses

The coach is too positive on the call outcome and next step. Hidden ground truth expects mixed-negative, soft follow-up, not a clearly qualified next step.
The weak migration-safe evaluation plan should have been treated as a primary flaw, not mostly as an area to tighten after a good recovery.
The coach could have more explicitly called out missing timeline, renewal/commercial constraints, procurement/security review, and formal evaluation process.

13 · 84 · opus 4.7 medium · Strong judge pass with one material calibration issue

Overall: 84

Needle recall: 87

Evidence grounding: 86

False-positive control: 80

Prioritization: 82

Actionability: 91

Sales instinct: 84

Technical accuracy: 88

How this model did

The coach output largely identifies the hidden benchmark: Marcus is knowledgeable but over-pitches Cloudflare’s platform, flattens Canva’s technical nuance, misses deeper displacement qualification, and relies on Priya/buyer discipline to keep the call specific. The main weakness is that the coach over-rewards the next step as a strong commitment, whereas the benchmark treats it as a soft, buyer-enforced, under-scoped follow-up lacking a migration-safe POC plan, agreed success metrics, timeline, owners, or rollback path.

Strongest findings

Correctly identified that Marcus continued the consolidation/single-control-plane thesis after Leah explicitly warned not to assume consolidation was the main problem.
Strongly captured the feature-stacking pattern: multiple Cloudflare capabilities named in sequence instead of one buyer pain tied to one measurable outcome.
Accurately highlighted Leah’s correction that image delivery and collaboration APIs are not the same performance problem.
Correctly praised Priya’s log-only mirroring answer as the clearest response to Daniel’s parity/false-positive concern.
Useful coaching recommendations: reflect region/journey/metric before pitching, ask threshold questions, map buying group, and ask what the incumbent does well.

Biggest misses

The coach overrated the outcome as “salvaged” and the next step as strong; the benchmark is more mixed-negative and views the follow-up as soft and under-scoped.
The weak migration-safe evaluation plan should have been a central risk, not a low-severity missed opportunity. There was no co-created POC with success criteria, data requirements, phased rollout, or rollback plan.
The coach could have more explicitly emphasized that Marcus never established a real switching threshold: what pain is severe enough to justify displacement of an embedded incumbent?
The coach gave Priya deserved credit, but at times that partially softened the assessment of Marcus’s discovery performance in a seller-coaching context.

14 · 84 · opus 4.7 high · Mostly strong, but materially over-rewards the close/next step.

Overall: 84

Needle recall: 86

Evidence grounding: 87

False-positive control: 76

Prioritization: 80

Actionability: 91

Sales instinct: 82

Technical accuracy: 89

How this model did

The coach output accurately identifies the main benchmark pattern: Marcus is category-literate but repeatedly converts Canva’s nuanced current-state signals into broad Cloudflare platform/consolidation messaging. It is well grounded on feature-stacking, shallow qualification, latency nuance, and Priya’s stronger technical discovery. The main weakness is calibration on the late-call next step: the hidden ground truth treats the follow-up as soft and under-specified for a displacement/migration-safe evaluation, while the coach repeatedly calls it a strong close and gives it an 8/10. That overstates the quality of the mutual action plan and underplays the absence of concrete POC guardrails, owners, timeline, data requirements, and rollback/coexistence details.

Strongest findings

Excellent identification of Marcus’s feature-stacking: the coach correctly ties the repeated product lists to a failure to diagnose incumbent gaps before pitching Cloudflare breadth.
Strong handling of latency nuance: the coach spotlights the key credibility issue, especially Marcus’s 'common thread' framing and Leah’s correction that proximity does not solve every collaboration/API bottleneck.
Very good qualification critique: the coach identifies missing incumbent names, renewal windows, RUM tooling, operational embeddedness, ownership, and 'what works well today.'
Accurate praise for Priya: the coach correctly identifies Priya’s cache-miss-versus-dynamic-path question and log-only parity proposal as the most technically credible seller moments.
Actionable coaching plan: the recommended drills and follow-up questions are practical, account-specific, and mostly grounded in the transcript.

Biggest misses

The coach underweights the weak mutual action plan. The final next step is an agenda, not a migration-safe POC with co-created metrics, data requirements, owners, timeline, and rollback/coexistence guardrails.
The coach’s 8/10 next-step score conflicts with the hidden benchmark’s mixed-negative outcome bias. The buyer’s agreement is polite and low-commitment, not evidence of serious displacement momentum.
The coach treats the login-abuse scope as more buyer-validated than it really is. Daniel suggested it as an example, and Marcus accepted, but the measurement criteria were not actually agreed.
The coach does not sufficiently call out the absence of migration mechanics: shadow traffic/log-only is mentioned by Priya for security parity, but there is no broader phased rollout, rollback, or coexistence plan tied to a displacement motion.

15 · 83 · gpt-5.4 low · Mostly accurate, with one material over-credit on next steps and migration rigor

Overall: 83

Needle recall: 84

Evidence grounding: 84

False-positive control: 78

Prioritization: 82

Actionability: 90

Sales instinct: 83

Technical accuracy: 87

How this model did

The coach output aligns well with the hidden ground truth: it correctly identifies Marcus’s tendency to turn discovery into Cloudflare platform breadth, his flattening of Canva’s nuanced regional/performance distinctions, the incomplete current-state qualification, and the genuine category fluency/technical credibility shown by the Cloudflare team—especially Priya. The main weakness is that the coach overstates the quality of the close and risk handling. The benchmark frames the follow-up as soft and under-scoped, whereas the coach calls it a “concrete next step” and gives risk/change management a high score. The coach does partially flag that scope narrowing was reactive and migration mechanics were underdeveloped, but it does not weight that flaw strongly enough. There is also one unsupported/hallucinated quote about Daniel asking who owns policy cutover.

Strongest findings

Correctly identifies Marcus’s repeated platform-breadth/product-listing behavior after buyer current-state and proof signals.
Accurately captures the key latency nuance problem: Marcus collapses static asset, dynamic API, collaboration, origin-routing, and mobile variability issues into a generic edge-network narrative.
Well-grounded praise of Priya’s technical precision, especially separating cacheable assets from dynamic collaboration paths and proposing log-only validation for security policy parity.
Good identification of incomplete current-state discovery around architecture, ownership, incumbent gaps, and switching threshold.
Actionable coaching plan with useful drills: reflect before positioning, answer proof questions with evaluation design, and narrow scope earlier.

Biggest misses

The coach underweights the hidden benchmark’s concern that the final follow-up was weak, low-commitment, and not a true migration-safe POC plan.
It praises the close too strongly despite the lack of agreed stakeholders, timeline, data prerequisites, rollback guardrails, or mutual success criteria.
It includes one fabricated/unsupported attribution that Daniel asked who owns policy cutover.
It could have more explicitly called out missing commercial/contractual qualification such as renewal window, procurement process, timeline, and decision criteria.

16 · 78 · gemini 3.1 pro preview · Good but incomplete. The coach correctly diagnosed the central failure mode (Marcus feature-dumping and flattening Canva’s nuanced needs) but under-called the qualification and migration-planning gaps and over-rewarded the soft next step.

Overall: 78

Needle recall: 74

Evidence grounding: 88

False-positive control: 78

Prioritization: 86

Actionability: 82

Sales instinct: 84

Technical accuracy: 83

How this model did

The coach output is well grounded on the most important visible pattern: Marcus repeatedly turns buyer signals into a broad Cloudflare platform/consolidation pitch, while Priya does the better technical discovery. It also catches the buyer corrections around consolidation and the late-call attempt to broaden the follow-up. However, it does not fully surface the deeper displacement qualification gaps: current architecture detail, incumbent strengths, switching threshold, stakeholder/process/timing, and commercial or migration constraints. It also praises the next step too much; Canva only agreed to review a lightweight agenda, with no mutual action plan, timeline, committed stakeholders, data-sharing plan, success criteria, rollback plan, or scoped POC beyond a tentative login-abuse working session.

Strongest findings

Correctly identifies Marcus’s repeated feature dumping and broad Cloudflare platform pitch as the central problem.
Accurately uses Leah’s correction—consolidation is not necessarily the main problem—as evidence that Marcus was not adapting to buyer feedback.
Correctly credits Priya for the more precise technical discovery around Mumbai/Jakarta p99s and static versus dynamic paths.
Correctly flags Daniel’s pushback on ‘the whole thing’ as evidence that Marcus was trying to make the follow-up too broad.

Biggest misses

Did not fully diagnose the lack of competitive displacement qualification: incumbent strengths, pain severe enough to switch, decision process, renewal/timing, commercial constraints, and security/procurement review were largely unexplored.
Did not sufficiently emphasize that the next step was low-commitment and not a mutual action plan.
Did not call out enough missing current-state detail around architecture, traffic patterns, cloud regions, API surfaces, observability/runbooks, and cross-functional ownership.
Could have more explicitly coached Marcus to ask what Canva’s current setup does well before contrasting Cloudflare.
Could have better separated Marcus’s category fluency from Priya’s technical rescue; Marcus knew the category, but used that knowledge too early and too broadly.

17 · 78 · opus 4.7 low · Mostly accurate, with one important contradiction

Overall: 78

Needle recall: 82

Evidence grounding: 84

False-positive control: 72

Prioritization: 76

Actionability: 88

Sales instinct: 74

Technical accuracy: 86

How this model did

The coach correctly diagnosed the central pattern: Marcus is credible but repeatedly converts Canva’s nuanced current-state signals into Cloudflare platform/capability stacking, while Priya provides the more precise technical discovery. The output is well grounded on the latency-nuance and security-parity moments and gives actionable coaching. The main miss is that it over-rewards the next step as a “clean, scoped, measurable” advance. Hidden ground truth treats the follow-up as soft and underdeveloped: no concrete POC plan, limited stakeholder commitment, no timeline, no data prerequisites, and no migration/rollback guardrails beyond Priya’s log-only security suggestion.

Strongest findings

Correctly identified Marcus’s repeated capability stacking when buyers asked for precision.
Strongly captured Leah’s corrections around cacheable assets versus dynamic collaboration/API paths and regional p99 nuance.
Correctly praised Priya’s diagnostic questions and log-only security-parity framing as the most credible technical moments.
Actionable coaching plan: reflect-before-respond, answer the buyer’s verb, quantify pain, and use the SE earlier.

Biggest misses

The coach materially over-praised the close and next step, which the benchmark treats as soft and underdeveloped.
It did not sufficiently emphasize the absence of a migration-safe mutual action plan with workload, timeline, data, owners, rollout/rollback, and success criteria.
It framed the call as more successful than the hidden ground truth’s mixed-negative outcome bias; Canva’s agreement was polite and low-commitment, not a strong advance.

18 · 78 · deepseek v4 pro · Worst · Mostly aligned, with one material miss on next-step quality.

Overall: 78

Needle recall: 74

Evidence grounding: 90

False-positive control: 76

Prioritization: 78

Actionability: 86

Sales instinct: 77

Technical accuracy: 87

How this model did

The coach correctly diagnosed the central pattern: Marcus was credible but too quick to convert discovery signals into Cloudflare platform positioning, and he flattened Canva’s nuanced regional/static-vs-dynamic performance concerns. The output is well grounded in transcript evidence and gives actionable coaching. However, it materially over-rewards the close, calling the follow-up “solid” and scoring next steps highly, when the benchmark expects a weak, low-commitment follow-up without a migration-safe POC plan, success criteria, owners, timeline, shadow/canary approach, or rollback guardrails.

Strongest findings

Correctly identifies the early Cloudflare feature laundry list as a discovery failure after Canva described an embedded incumbent stack.
Accurately spots Leah’s pushback that consolidation is not necessarily the main problem.
Strongly captures Marcus’s flattening of static asset delivery versus dynamic collaboration/API paths.
Uses well-chosen transcript evidence, especially the “whole thing” and “common thread” quotes, to show buyer resistance.
Provides actionable coaching drills around asking one diagnostic question before positioning Cloudflare.

Biggest misses

The coach materially over-rewards next steps; the benchmark expects a weak, soft follow-up, not a strong close.
It does not fully develop the missing displacement qualification: switching threshold, renewal/timing, decision process, procurement/security review, ownership, commercial constraints, and incumbent strengths.
It underemphasizes migration-safe evaluation planning: phased rollout, shadow/log-only comparison as a formal plan, rollback, workload/region scope, and data prerequisites.
It praises the late login-abuse narrowing more than the transcript supports; Daniel, not Marcus, largely forced that focus.