salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.
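As a quick sanity check on the headline numbers, the evaluation count follows from pairing every model configuration with every call. A minimal sketch, assuming one judged coaching note per (model, call) pair; the names here are illustrative and not part of the benchmark's own tooling:

```python
from itertools import product

N_MODELS = 18  # model configurations, as stated above
N_CALLS = 25   # synthetic sales calls, as stated above


def evaluation_pairs(n_models: int, n_calls: int) -> list[tuple[int, int]]:
    """Return every (model_index, call_index) pair that gets one judged note."""
    return list(product(range(n_models), range(n_calls)))


pairs = evaluation_pairs(N_MODELS, N_CALLS)
print(len(pairs))  # 450
```

This matches the 450 evaluations reported for the run.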

Calls: 25
Models: 18
Evaluations: 450
Mean: 89.8
Metric: Overall · Build-time static data · Evals completed Apr 30, 2026

The 25 calls

Open a call to read its answer key and how every model did on it.

Amazon Cloud operating model discussion for internal platform teams with HashiCorp

Discovery · flawed · 26m · 22 turns
Seller: HashiCorp
Buyer: Amazon

The call should sound professionally executed on the surface, with the HashiCorp seller able to explain Terraform, Vault, policy guardrails, and cloud operating model concepts. The hidden flaw is that the seller treats Amazon like a less mature enterprise, becomes too confident teaching governance basics, and does not sufficiently investigate Amazon’s actual internal constraints, build-vs-buy posture, AWS-native tooling, or platform-team operating realities. The buyer should appear sophisticated and polite but progressively less engaged because the seller is not adapting.

Profile: Flawed
Flaws / Strengths: 5 / 1
Duration: 26m · 22 turns

What this call should surface

  • flaw: Misses Amazon-specific internal constraints (Discovery · subtle)
  • flaw: Overconfident governance lecture to a sophisticated buyer (Communication Style · moderate)
  • flaw: Positions HashiCorp broadly instead of complementing AWS-native and internal tooling (Value Alignment · moderate)
  • flaw: Weak qualification despite senior enterprise context (Qualification · subtle)
  • flaw: Vague follow-up instead of mutual action plan (Next Steps · moderate)
  • strength: Fluent but insufficiently tailored HashiCorp platform explanation (Technical Knowledge · obvious)

22 speaker turns · 26m timeline

Transcript

The exact speaker-labeled transcript the coach models saw.

Speakers: Marissa Klein (Seller) · Devon Patel (Seller) · Anjali Rao (Buyer) · Michael Tanaka (Buyer)
  1. MK

    Marissa Klein

    Seller

    Hi everyone, thanks for making the time. I’m Marissa Klein with HashiCorp, I lead the Amazon relationship on our side. I know your teams are operating at a level of scale and sophistication that’s pretty different from a typical enterprise cloud conversation, so we’ll try to keep this practical. My hope today is to spend a few minutes on what we mean by a cloud operating model for internal platform teams, hear where that’s relevant or not relevant for you, and then, if there’s a useful thread, talk about whether a deeper technical session makes sense. Devon’s here with me from our solutions engineering team as well.

  2. DP

    Devon Patel

    Seller

    Thanks, Marissa. Hi all, I’m Devon, solutions engineer on the HashiCorp side. I mostly spend time with platform and security teams on Terraform and Vault operating patterns, so I’ll jump in if we get into workflow or architecture details.

  3. AR

    Anjali Rao

    Buyer

    Thanks, Marissa. I’m Anjali Rao, I lead one of the internal platform engineering groups here. We support paved-road workflows for a few Amazon business units, mostly around provisioning, service onboarding, and developer experience. I’m mainly here to understand where you think HashiCorp adds value beyond patterns we already have internally or get natively from AWS.

  4. MT

    Michael Tanaka

    Buyer

    Hey, Michael Tanaka here. I’m in platform security, mostly looking at secrets, identity boundaries, and auditability. I’m interested in where this would sit alongside what teams already get natively or through our internal control planes.

  5. MK

    Marissa Klein

    Seller

    Got it. Are you mainly looking to standardize provisioning workflows across those business units?

  6. AR

    Anjali Rao

    Buyer

    Not exactly. We already have fairly opinionated paved roads for a lot of that, and teams have autonomy where they need it. The question is less “can we standardize provisioning” and more where standardization actually reduces operational burden without becoming another control plane people have to work around. So I’d separate baseline workflow consistency from the harder parts: exception paths, ownership boundaries, and maintenance cost of internal tooling.

  7. MK

    Marissa Klein

    Seller

    Yeah, that makes sense. And honestly that’s very consistent with what we see in large platform organizations: the initial win is getting teams onto a paved road, but the operating model shift is really about making that paved road consumable through standard workflows, with policy guardrails embedded so teams aren’t waiting on tickets or interpreting controls differently. That’s where we tend to see Terraform Enterprise or HCP Terraform come in — module registry, workspace patterns, approvals, policy as code — not to slow teams down, but to give them a consistent way to provision while the platform team maintains control over the standards.

  8. AR

    Anjali Rao

    Buyer

    Right, we do something similar today. The tricky bit is usually not the workflow primitive, it’s deciding where another layer is worth owning.

  9. MK

    Marissa Klein

    Seller

    Yeah, exactly. And I think that’s where having a common workflow layer can reduce the amount of bespoke glue your platform teams have to maintain. The way we usually frame it is: teams still get self-service and autonomy, but the platform team publishes approved modules, embeds policy checks before provisioning, and has a central audit trail for what changed and who approved it. So it’s less another control plane for developers and more a governed path that abstracts the complexity underneath.

  10. MT

    Michael Tanaka

    Buyer

    Where would that policy decision actually live in your model? Because for us, the hard part usually isn’t having a policy check; it’s ownership of exceptions and evidence when teams already have native controls and internal control planes in the path.

  11. DP

    Devon Patel

    Seller

    Yeah, I can take that. In our model the decision point is usually in the Terraform workflow itself — so Sentinel or OPA-style checks in the run, plus approvals where a policy requires human review. We’re not trying to become the source of truth for every control you already have; we’d typically integrate with existing identity, ticketing, logging, and cloud-native controls, and then provide a consistent evidence trail around the infrastructure change: what module was used, what policy evaluated, who approved the exception, and what actually got applied. So it tends to sit alongside the internal control plane rather than replacing it.

  12. MT

    Michael Tanaka

    Buyer

    Okay, that’s helpful. The distinction for us is whether that evidence trail is additive versus duplicative.

  13. MK

    Marissa Klein

    Seller

    Exactly, and that’s the bar we’d want to clear. What we typically see is that once the workflow is standardized, the evidence becomes much cleaner because it’s attached to the provisioning event instead of scattered across tickets, logs, and team-specific processes. So even where you have strong native controls, Terraform can give the platform team a common system of record for infrastructure intent, policy evaluation, and approvals. Devon mentioned this, but the key is you’re not asking every service team to reinvent that pattern.

  14. AR

    Anjali Rao

    Buyer

    Yeah, I think we understand the pattern. For us the question is still which domains would benefit from an external system of record versus the ones where our existing platform signals are already authoritative.

  15. MK

    Marissa Klein

    Seller

    Totally. And we wouldn’t suggest every domain needs to move into a HashiCorp-managed workflow. The places we usually see the most value are provisioning, secrets, and secure access patterns where teams want a repeatable control and audit model. Maybe the useful thing is for us to show how that operating model looks across Terraform and Vault, and you can react to where it feels additive versus duplicative.

  16. AR

    Anjali Rao

    Buyer

    That’s probably fine. I’d keep it lightweight, though — maybe send the Terraform/Vault operating model material and we can sanity-check internally where, if anywhere, it maps.

  17. MK

    Marissa Klein

    Seller

    Yep, that works. I’ll send over the Terraform and Vault operating model deck, plus a couple of examples of policy and approval workflows. And if it looks relevant after your internal sanity check, we can set up a lighter technical session with Devon and whoever on your platform or security side wants to react to it.

  18. MT

    Michael Tanaka

    Buyer

    Yeah, and from my side, maybe just flag the evidence and exception-handling pieces. I’m less interested in a Vault 101, more in where teams actually see net-new signal.

  19. MK

    Marissa Klein

    Seller

    Absolutely — that’s a good filter. We’ll keep it focused on the operating model, evidence trails, exception workflows, and where Terraform or Vault tends to produce signal beyond the native controls. I’ll send that over after this, and then Anjali, you can tell us if it’s worth pulling a few folks into a follow-up.

  20. AR

    Anjali Rao

    Buyer

    Okay, that sounds good. Send it over, and I’ll route it to the right folks internally. If there’s a clear fit, we’ll come back with a narrower agenda.

  21. MK

    Marissa Klein

    Seller

    Perfect. Thanks, Anjali, Michael — appreciate the time today. I’ll package that up and send it by end of day, and we’ll stay flexible on whether a follow-up makes sense after you’ve had a look.

  22. AR

    Anjali Rao

    Buyer

    Great, thanks both. We’ll look for the email and take it from there. Have a good rest of your day.

Sorted by overall

How each model scored this call

Click a row to read the model's coaching note and the judge's read on it.

1. gpt-5.4 medium · 94 · Best
Strong pass: the coach output closely matches the hidden ground truth and is well grounded in the transcript.
Overall: 94
Needle recall: 95
Evidence grounding: 97
False-positive control: 96
Prioritization: 95
Actionability: 96
Sales instinct: 94
Technical accuracy: 93
How this model did

The coach correctly saw that the call was polished and technically credible on the surface but strategically weak for a sophisticated Amazon audience. It identified the central failure: the sellers did not convert Amazon’s cues about exception handling, ownership boundaries, internal tooling maintenance, additive evidence, and native/internal control planes into deeper discovery or a qualified next step. The output also appropriately gave limited credit for the opening humility and Devon’s complement-not-replace answer, rather than over-penalizing every moment. Minor gaps: the coach could have been even more explicit about missing active initiative, budget, timeline, and decision-process qualification, but it covered the broader qualification weakness well.

Strongest findings
  • Correctly identified the main strategic failure: the sellers did not pursue Amazon’s specific internal constraints after the buyer handed them concrete discovery threads.
  • Very strong evidence grounding, with the most important buyer quotes cited and interpreted accurately.
  • Balanced assessment of complementarity: the coach praised Devon’s strong answer while still noting that HashiCorp’s differentiation was not mapped to a specific Amazon gap.
  • Excellent diagnosis of the weak close: sending a deck and waiting for Amazon to come back is not a mutual action plan.
  • Actionable coaching recommendations were specific, including suggested follow-up questions, role-play drills, and a better end-of-call evaluation lens.
Biggest misses
  • The coach could have named missing budget, timeline, active initiative, and decision-process qualification more explicitly under the qualification critique.
  • The coach could have more directly described the buyer’s progressive disengagement or polite distancing, though it did capture the key signals such as “we understand the pattern” and “less interested in a Vault 101.”
2. opus 4.7 medium · 93
Excellent coaching evaluation; it captured the central hidden flaw and most benchmark needles with strong transcript grounding.
Overall: 93
Needle recall: 92
Evidence grounding: 95
False-positive control: 93
Prioritization: 94
Actionability: 94
Sales instinct: 95
Technical accuracy: 94
How this model did

The coach correctly avoided over-crediting the seller for polish and technical fluency. It identified that Marissa did not adapt to Amazon’s sophistication, over-explained familiar governance concepts, missed Amazon-specific discovery opportunities, and accepted a low-commitment next step. It also gave appropriate limited credit to Devon’s concrete answer about Terraform policy checks, evidence trails, and additive positioning. The only minor limitation is that the coach could have framed the AWS-native/internal-tooling complementarity issue as its own explicit value-alignment risk, though it did cover the substance through “net-new signal,” “additive vs duplicative,” and lack of specific Amazon constraints.

Strongest findings
  • Correctly identified the central strategic weakness: the seller sounded polished but failed to adapt to Amazon’s sophistication.
  • Strong transcript grounding around Anjali’s corrective cues: “not exactly,” “we already have paved roads,” and “we understand the pattern.”
  • Accurately praised Devon’s concrete technical answer while not letting that redeem the overall discovery failure.
  • Correctly diagnosed the close as a soft deck-send rather than a qualified next step.
  • Actionable coaching plan with specific drills for discovery discipline, pre-call hypotheses, and mutual next-step definition.
Biggest misses
  • The coach could have more explicitly separated the AWS-native/internal-tooling complementarity issue as a distinct value-alignment flaw, rather than mostly folding it into generic value articulation and net-new-signal comments.
  • It did not deeply analyze the tension between Marissa’s humble opening and her later failure to sustain that humility, though it did mention this as a strength that the rest of the call did not fulfill.
3. opus 4.7 max · 93
Excellent alignment with the hidden benchmark
Overall: 93
Needle recall: 94
Evidence grounding: 96
False-positive control: 94
Prioritization: 95
Actionability: 93
Sales instinct: 95
Technical accuracy: 92
How this model did

The coach correctly identified the central hidden failure mode: a polished, technically credible HashiCorp conversation that did not earn relevance with a highly sophisticated Amazon platform audience. It captured the lack of deep discovery, the tendency to re-explain generic cloud operating model concepts, the unresolved additive-vs-duplicative question, weak qualification, and the vague send-the-deck close. It also fairly credited the technically strong and complementary answer from Devon without letting that obscure the strategic weakness of the call. Minor gaps: the coach could have been even more explicit about active initiative/budget/timeline qualification and the broader build-vs-buy/AWS-native posture, but these are small omissions in an otherwise very strong evaluation.

Strongest findings
  • Correctly identified Anjali’s “exception paths, ownership boundaries, and maintenance cost” comment as the single richest missed discovery opening.
  • Correctly diagnosed the seller’s repeated generic operating-model explanation after buyers signaled they already understood the pattern.
  • Fairly credited Devon’s integration-first response as the strongest moment while still treating the overall positioning as under-tailored.
  • Accurately called out the unresolved “additive versus duplicative” evaluation bar and recommended turning it into a concrete test.
  • Strongly identified the weak close: send materials, possible follow-up, no named use case, no success criteria, and no mutual action plan.
Biggest misses
  • The coach could have more explicitly called out missing qualification around active initiative, budget/resource commitment, decision process, and timeline, though it did capture weak qualification overall.
  • The coach could have tied the positioning flaw even more directly to Amazon’s AWS-native tooling and build-vs-buy posture. It did address internal control planes and native controls, but this could have been sharpened.
  • The technical-strength discussion emphasized Devon’s answer more than Marissa’s broader accurate Terraform/HCP Terraform/Vault operating-model explanation, though the conclusion remained correct.
4. opus 4.7 high · 92
strong
Overall: 92
Needle recall: 91
Evidence grounding: 95
False-positive control: 91
Prioritization: 94
Actionability: 93
Sales instinct: 94
Technical accuracy: 91
How this model did

The coach output closely matches the hidden ground truth. It correctly sees the call as polished but strategically weak: the sellers explain HashiCorp’s cloud operating model fluently, but fail to adapt to Amazon’s sophistication, under-discover the buyer’s actual internal constraints, and settle for a low-commitment send-materials follow-up. The coach is especially strong on the central flaw: Anjali and Michael repeatedly hand the sellers advanced cues about exception ownership, internal control planes, evidence, and maintenance burden, yet Marissa keeps returning to generic platform-governance explanation. The main gap is that qualification discipline is called out, but not as fully as the benchmark would want around initiative status, decision process, timeline, budget/resource commitment, and success criteria.

Strongest findings
  • Correctly identifies Anjali’s “exception paths, ownership boundaries, and maintenance cost” comment as the most important missed discovery moment.
  • Correctly diagnoses Marissa’s pattern of acknowledging sophisticated buyer signals and then reverting to a generic cloud operating model pitch.
  • Appropriately praises Devon’s answer as the best moment because it directly addresses policy decision location, integration with existing controls, and complementary positioning.
  • Accurately labels the final outcome as polite but low-conviction: send materials, sanity-check internally, maybe follow up.
  • Provides practical coaching moves, especially the reset phrase: “I’m probably telling you things your teams already know — where does governance or self-service still create friction at your scale?”
Biggest misses
  • The coach could have more explicitly scored qualification against enterprise deal criteria: active initiative, evaluation owner, decision process, timeline, budget/resource commitment, and measurable success criteria.
  • The coach might have separated Marissa’s weak qualification from the next-step issue more cleanly; both are related but distinct benchmark flaws.
  • The coach’s critique of “generic value claims around evidence and audit” is valid, but the transcript also shows Devon handled part of that theme well, so the critique should remain balanced as the coach mostly does.
5. opus 4.7 xhigh · 92
strong pass
Overall: 92
Needle recall: 93
Evidence grounding: 95
False-positive control: 94
Prioritization: 96
Actionability: 94
Sales instinct: 93
Technical accuracy: 92
How this model did

The coach output closely matches the hidden ground truth. It correctly avoids over-crediting the polished HashiCorp pitch, identifies the central failure to adapt to Amazon’s sophistication, and grounds the critique in buyer cues around existing paved roads, exception ownership, internal tooling maintenance, and additive-versus-duplicative evidence. The biggest gap is that formal qualification issues were covered mostly through next-step critique rather than explicitly calling out absence of initiative status, budget/resources, timeline, decision process, and ownership.

Strongest findings
  • Correctly identified the core flaw: the sellers sounded polished but failed to adapt to Amazon’s sophistication or uncover Amazon-specific constraints.
  • Strong evidence use around Anjali’s “exception paths, ownership boundaries, and maintenance cost of internal tooling” cue and Michael’s “additive versus duplicative” criterion.
  • Accurately distinguished Devon’s strong complementary technical answer from Marissa’s more generic operating-model repetition.
  • Correctly downgraded the close as a buyer-controlled send-the-deck outcome rather than a real mutual action plan.
  • Actionable coaching recommendations were practical: pivot from teaching to probing, mirror buyer-stated criteria, test complementarity hypotheses, and strengthen next steps.
Biggest misses
  • Formal qualification could have been called out more explicitly: no active initiative, decision process, economic/resource commitment, timeline, executive sponsor, or defined evaluation owner was established.
  • The coach could have more directly stated that the sellers left without a validated business impact or priority level, not just without a sharper technical agenda.
  • The technical strength assessment focused heavily on Devon; Marissa’s Terraform explanation was technically coherent too, even though poorly calibrated.
6. gpt-5.4 high · 91
strong
Overall: 91
Needle recall: 89
Evidence grounding: 95
False-positive control: 93
Prioritization: 94
Actionability: 95
Sales instinct: 92
Technical accuracy: 94
How this model did

The coach output is highly aligned with the hidden ground truth. It correctly sees that the call was polished and technically credible on the surface but strategically weak because the sellers did not sufficiently diagnose Amazon-specific friction, over-relied on broad cloud operating model explanations, failed to land a narrow additive wedge versus AWS-native/internal systems, and closed on a low-commitment materials follow-up. The main gap is that the coach underdevelops the formal qualification miss around initiative status, decision process, timeline, budget/resources, and evaluation ownership.

Strongest findings
  • Correctly identifies the central strategic weakness: the sellers preserved credibility but did not earn relevance with Amazon-specific discovery.
  • Strongly grounds the critique in buyer cues such as “we already do something similar,” “we understand the pattern,” “additive versus duplicative,” and “where another layer is worth owning.”
  • Accurately distinguishes between good technical credibility and poor value mapping, especially around Devon’s solid policy/evidence answer that was not converted into deeper discovery.
  • Provides actionable coaching moves: turn technical answers back into discovery, isolate a narrow wedge, and replace collateral-only follow-up with a focused working session.
Biggest misses
  • The formal qualification flaw is only partially developed. The coach mentions weak qualification symptoms but does not explicitly assess active initiative, evaluation process, timeline, budget/resource commitment, or decision authority.
  • The coach may slightly over-credit audience calibration with an 8/10 because the strong humble opener did not carry through into the body of the call, where adaptation remained limited.
7. gpt-5.4 xhigh · 91
Excellent / strong benchmark alignment
Overall: 91
Needle recall: 90
Evidence grounding: 95
False-positive control: 96
Prioritization: 93
Actionability: 94
Sales instinct: 92
Technical accuracy: 90
How this model did

The coach output correctly identifies the central hidden flaw: HashiCorp sounded polished and technically credible but failed to adapt to Amazon’s sophistication, diagnose Amazon-specific constraints, or turn the conversation into a qualified next step. It is well grounded in transcript evidence, especially Anjali’s and Michael’s repeated cues around exception handling, ownership boundaries, maintenance cost, and additive-versus-duplicative evidence. The main gap is that the coach could have been more explicit about formal qualification misses such as initiative status, decision process, budget/resources, and timeline.

Strongest findings
  • Correctly names the core failure mode: polished platform-governance explanation without enough Amazon-specific diagnosis.
  • Uses the most important buyer cues as evidence: “Not exactly,” “we already have fairly opinionated paved roads,” “exception paths,” “ownership boundaries,” “maintenance cost,” and “additive versus duplicative.”
  • Balances the critique well by crediting Devon’s coexistence/evidence-trail answer instead of unfairly claiming HashiCorp positioned itself only as a replacement.
  • Accurately identifies the weak close as a passive materials review rather than a qualified technical follow-up or mutual action plan.
  • Provides highly actionable coaching questions and drills that align with the actual missed discovery paths.
Biggest misses
  • The coach could have more explicitly flagged the absence of formal qualification around active initiative, decision process, timeline, budget/resource commitment, and executive or evaluation ownership.
  • The coach’s technical-strength discussion focuses heavily on Devon’s answer; it could also have explicitly credited Marissa’s accurate Terraform operating-model explanation while still noting it was too generic.
  • No major unsupported claims or material benchmark misses were present.
8. gpt-5.5 none · 90
Strong pass
Overall: 90
Needle recall: 88
Evidence grounding: 95
False-positive control: 94
Prioritization: 91
Actionability: 93
Sales instinct: 89
Technical accuracy: 92
How this model did

The coach output closely matches the hidden ground truth. It correctly sees the call as polished and technically credible on the surface but strategically weak because the sellers did not deeply diagnose Amazon-specific constraints, over-relied on a generic cloud operating model narrative, failed to prove additive value versus native/internal tooling, and ended with a lightweight collateral-based follow-up. The main limitation is that the coach somewhat softened the severity of the “teaching governance to Amazon” issue and did not fully develop qualification gaps around initiative status, decision process, timeline, or budget. Overall, it is well grounded, nuanced, and actionable.

Strongest findings
  • Accurately identified that Anjali’s “not exactly” correction should have triggered diagnostic discovery instead of more operating-model explanation.
  • Correctly elevated “additive versus duplicative” as the buyer’s central evaluation criterion and a missed opportunity for qualification.
  • Strongly grounded the weak next-step critique in the actual close: send materials, Amazon sanity-checks internally, possible later lightweight session.
  • Balanced praise and criticism well by crediting Devon’s technically sound complement-not-replace answer without overvaluing it as a qualified opportunity.
Biggest misses
  • The coach could have been more explicit about enterprise qualification gaps: no active initiative, sponsor, budget/resource commitment, decision process, timeline, or formal evaluation criteria were established.
  • The coach slightly softened the hidden flaw around overconfident governance explanation by describing the call as “not tone-deaf” and “solid,” though it still captured the underlying issue.
  • The positioning critique was nuanced and mostly fair, but it could have more directly stated that broad operating-model positioning remained risky for Amazon despite Devon’s good complement-not-replace language.
9. gpt-5.5 medium · 90
Strong judge pass: the coach identified the central hidden flaw and grounded it well, with only minor under-emphasis on formal qualification and the breadth/complementarity issue.
Overall: 90
Needle recall: 88
Evidence grounding: 96
False-positive control: 93
Prioritization: 92
Actionability: 94
Sales instinct: 91
Technical accuracy: 95
How this model did

The coaching output correctly saw past the polished HashiCorp explanation and flagged the strategic weakness: Amazon repeatedly signaled sophistication and a need to understand additive value, exception ownership, authoritative systems, and internal tooling cost, while the seller continued too much generic operating-model explanation. It hit the main discovery, lecture/adaptation, next-step, and technical-fluency needles with strong transcript evidence. The main gaps are that it was slightly generous about the seller's complementary positioning because Devon did give one good non-replacement answer, but the team still did not map HashiCorp to a specific AWS-native/internal tooling gap; and it only partially developed the formal qualification miss around initiative status, decision process, timeline, budget/resources, and success criteria.

Strongest findings
  • Correctly identified Anjali’s correction about exception paths, ownership boundaries, and internal tooling maintenance cost as the central missed discovery opening.
  • Correctly treated buyer phrases like “we do something similar” and “we understand the pattern” as signals to stop explaining and pivot to Amazon-specific discovery.
  • Strongly grounded the critique of next steps: materials and a conditional follow-up were not a mutual action plan.
  • Accurately credited Devon’s technically precise answer on policy decision points, Sentinel/OPA-style checks, approvals, and integration with existing controls.
  • Actionable coaching plan was strong, including drills for pivoting after maturity signals and building an additive-versus-duplicative evaluation frame.
Biggest misses
  • Formal qualification could have been sharper: the coach did not fully call out missing initiative status, decision process, timeline, budget/resource commitment, and evaluation ownership.
  • The coach somewhat softened the AWS-native/internal tooling complementarity flaw by elevating Devon’s one good integration answer into a broader strength, even though the seller still failed to validate a specific complementary use case.
  • The coach’s tone was a bit generous in saying the team avoided major missteps; strategically, the main misstep was meaningful for an Amazon-level buyer.
10. gpt-5.4 none · 89
Strong evaluation with minor underweighting of the audience-calibration flaw.
Overall: 89
Needle recall: 90
Evidence grounding: 94
False-positive control: 87
Prioritization: 92
Actionability: 93
Sales instinct: 91
Technical accuracy: 90
How this model did

The coach largely identified the hidden benchmark’s core issue: the call sounded polished and technically credible but failed to diagnose Amazon-specific constraints, qualify a real opportunity, or create a concrete mutual next step. The output is well grounded in transcript evidence and appropriately prioritizes discovery depth, additive-vs-duplicative positioning, and weak next-step control. The main limitation is that the coach somewhat over-credits the seller for avoiding a lecture; the seller did continue a generic governance/platform explanation after repeated buyer cues that Amazon already understood the pattern.

Strongest findings
  • Correctly identified that Anjali gave the sellers a clear discovery roadmap—exception paths, ownership boundaries, and internal-tooling maintenance—but the sellers did not probe it.
  • Correctly centered Michael’s “additive versus duplicative” comment as the key evaluation criterion for Amazon.
  • Accurately criticized the close as a generic collateral handoff rather than a mutual action plan.
  • Balanced technical credit with strategic critique: the coach did not mistake fluent Terraform/Vault explanation for a strong enterprise sales outcome.
  • Provided actionable follow-up questions that would have improved the call, especially around exception workflows, evidence gaps, and build-vs-buy pressure.
Biggest misses
  • The coach underweighted the hidden tone/audience-calibration issue: the seller’s generic governance explanation should have been framed more sharply as a credibility risk with an Amazon-level buyer.
  • Qualification critique was good but could have been more explicit about missing active initiative, decision process, budget/resource commitment, timeline, and evaluation ownership.
  • The coach’s high executive-presence score is defensible on polish but slightly generous given that the seller did not adapt quickly enough after advanced buyer cues.
#11 · 89 · gpt-5.5 high · Strong judge-aligned coaching with minor under-emphasis on formal qualification and AWS-native/internal tooling mapping.
Overall: 89
Needle recall: 87
Evidence grounding: 94
False-positive control: 92
Prioritization: 90
Actionability: 93
Sales instinct: 88
Technical accuracy: 91
How this model did

The coach correctly recognized the central hidden issue: the call sounded polished and technically credible, but HashiCorp did not sufficiently adapt to Amazon’s sophistication or discover a specific Amazon-scale constraint. The output is well grounded in transcript evidence, especially around Anjali’s cues on exception paths, ownership boundaries, maintenance cost, and Michael’s additive-versus-duplicative criterion. The main gaps are that the coach somewhat over-credited the sellers for complementing existing systems and did not fully press the formal qualification misses around active initiative, decision process, timeline, budget/resources, and success criteria.

Strongest findings
  • Correctly prioritized the missed discovery after Anjali named exception paths, ownership boundaries, and internal tooling maintenance cost.
  • Accurately identified that the sellers continued a generic operating-model explanation after Amazon signaled it already understood the pattern.
  • Strongly captured Michael’s “additive versus duplicative” comment as the real buying criterion that should have shaped qualification and follow-up.
  • Correctly criticized the close as a low-commitment deck send rather than a mutual action plan tied to a validated use case.
  • Used transcript quotes consistently and proposed practical coaching drills and alternative questions.
Biggest misses
  • The coach could have been more explicit that no active initiative, timeline, budget/resource commitment, executive sponsor, or formal evaluation process was qualified.
  • It somewhat over-credited the complement-not-replace positioning; Devon’s answer was good, but the team still failed to map HashiCorp to a specific AWS-native or internal-platform gap.
  • It could have more directly stated that the seller left without a precise pain, affected team, business impact, or success metric.
  • It did not emphasize as strongly as the benchmark that Amazon’s build-vs-buy posture and internal tooling economics needed direct exploration.
#12 · 89 · sonnet 4.6 · Strong evaluation with minor gaps
Overall: 89
Needle recall: 88
Evidence grounding: 93
False-positive control: 90
Prioritization: 92
Actionability: 91
Sales instinct: 88
Technical accuracy: 89
How this model did

The coach output correctly identifies the central hidden issue: HashiCorp sounded polished and technically fluent but failed to uncover Amazon-specific constraints or qualify a real opportunity. It is well grounded in transcript evidence, especially around Anjali’s signals about exception paths, ownership boundaries, maintenance cost, and Michael’s concern about additive versus duplicative evidence. The main limitations are that the coach somewhat softens the hidden complementarity/build-vs-buy flaw because Devon did provide a credible additive positioning answer, and it treats qualification mostly through the lens of weak next steps rather than fully calling out lack of initiative, timeline, owner, budget, and success criteria.

Strongest findings
  • Correctly identified Anjali’s “exception paths, ownership boundaries, and maintenance cost of internal tooling” comment as the richest missed discovery thread in the call.
  • Correctly warned that “we understand the pattern” should have triggered a pivot from explanation to discovery.
  • Accurately characterized the close as weak and unstructured, with the ball left in Amazon’s court.
  • Fairly praised Devon’s additive, integration-aware answer to Michael’s policy decision/evidence question without letting that positive moment obscure the broader discovery weakness.
  • Provided actionable coaching drills and replacement questions that map well to the actual missed moments in the transcript.
Biggest misses
  • The coach could have more explicitly separated general weak next steps from broader enterprise qualification gaps: active initiative, evaluation owner, timeline, budget/resource commitment, decision process, and success metrics.
  • The coach partially softened the hidden complementarity flaw by emphasizing Devon’s strong answer; it could have stated more sharply that one additive positioning answer did not validate where HashiCorp complements Amazon’s native and internal platforms in a specific domain.
  • Some suggested missed opportunities, such as acquired-business or heterogeneous-environment discovery, are useful hypotheses but should remain clearly framed as hypotheses because the transcript itself did not establish those as Amazon’s actual pain.
#13 · 88 · opus 4.7 low · Strong pass
Overall: 88
Needle recall: 89
Evidence grounding: 92
False-positive control: 88
Prioritization: 91
Actionability: 90
Sales instinct: 89
Technical accuracy: 91
How this model did

The coach output substantially matches the hidden ground truth. It correctly sees that the call was polished but strategically weak: Marissa over-explained generic platform/governance concepts to a highly sophisticated Amazon team, missed explicit buyer cues about exception handling, ownership boundaries, maintenance cost, and additive evidence, and left with only a lightweight materials-review next step. The feedback is well grounded in transcript evidence and prioritizes the right coaching areas. The main gaps are that the coach could have more explicitly scored the lack of formal qualification—initiative status, owner, timeline, budget/resources, and success criteria—and could have more directly framed the AWS-native/internal-platform complementarity issue as a qualification/value-positioning gap rather than mostly as a Devon strength.

Strongest findings
  • Correctly identifies Anjali’s “not exactly” and “we do something similar today” responses as corrective signals that Marissa failed to pursue.
  • Correctly calls out the missed opportunity to drill into “maintenance cost of internal tooling,” which was one of the clearest buyer-provided discovery openings.
  • Accurately distinguishes Devon’s stronger complementary technical answer from Marissa’s more generic operating-model explanation.
  • Correctly interprets the final deck-send as a soft next step rather than a qualified follow-up.
  • Provides actionable coaching drills, especially the rule that after a buyer says “we already…,” the seller must respond with a question rather than another explanation.
Biggest misses
  • Could have more explicitly addressed formal qualification gaps: active initiative, decision owner, evaluation process, timeline, budget/resource commitment, and success metrics.
  • Could have made the HashiCorp-as-complement issue more central: what exactly would be additive to AWS-native and Amazon internal systems, and where would HashiCorp be duplicative?
  • Some missed-opportunity examples, like acquired businesses, non-AWS environments, and secrets sprawl, are reasonable but partly extrapolated rather than directly evidenced as buyer-stated needs.
#14 · 86 · deepseek v4 pro · Strong evaluation with one notable gap on formal qualification
Overall: 86
Needle recall: 84
Evidence grounding: 92
False-positive control: 94
Prioritization: 89
Actionability: 90
Sales instinct: 84
Technical accuracy: 91
How this model did

The coach output correctly caught the central hidden issue: the seller sounded polished but failed to earn relevance with a very sophisticated Amazon audience. It was well grounded in transcript evidence around Anjali’s and Michael’s cues, the generic operating-model explanation, and the passive send-materials close. The biggest weakness is that the coach treated “discovery and qualification” mostly as pain discovery and did not explicitly call out missing enterprise qualification elements such as active initiative, decision process, owner, timeline, budget/resources, or success criteria.

Strongest findings
  • Correctly identified the central flaw: HashiCorp explained a generic cloud operating model instead of diagnosing Amazon-specific constraints at scale.
  • Strong transcript grounding around Anjali’s cues: “exception paths,” “ownership boundaries,” “maintenance cost,” and “where another layer is worth owning.”
  • Accurately flagged the mature-buyer calibration risk: Amazon repeatedly signaled they already understood the basics.
  • Very strong critique of the close: the next step was a passive materials send, not a mutual action plan.
  • Fairly credited Devon’s technical answer while still holding the overall call accountable for weak discovery.
Biggest misses
  • The coach did not explicitly evaluate formal qualification: active initiative, decision process, evaluation owner, budget/resource commitment, timeline, and success criteria.
  • The complementarity critique could have been sharper around AWS-native tooling, homegrown control planes, build-vs-buy posture, and identifying a specific wedge use case.
  • The coach could have more clearly stated that the seller left without a named constraint, affected team, current workaround, business impact, or priority level.
#15 · 84 · gpt-5.4 low · Strong coaching output with a few under-called flaws
Overall: 84
Needle recall: 80
Evidence grounding: 94
False-positive control: 86
Prioritization: 85
Actionability: 92
Sales instinct: 82
Technical accuracy: 94
How this model did

The coach correctly identified the central strategic weakness: HashiCorp sounded credible but failed to dig into Amazon-specific constraints after the buyers gave clear signals around exception handling, ownership boundaries, maintenance burden, additive evidence, and native/internal tooling. The output is well grounded in transcript quotes and provides useful, actionable coaching. Its main limitation is that it somewhat over-credits the seller’s executive handling and complementarity positioning, and it does not fully call out the governance-lecture dynamic or disciplined opportunity qualification gaps around initiative status, owner, timeline, budget, and success criteria.

Strongest findings
  • Correctly identifies Anjali’s early statement about exception paths, ownership boundaries, and internal-tooling maintenance cost as the key missed discovery opening.
  • Correctly flags that the conversation never landed on a specific Amazon domain, pain, business impact, or technical gap.
  • Accurately praises Devon’s technical answer while still noting that technical fluency did not equal strategic relevance.
  • Strongly diagnoses the passive deck-send next step and recommends a more specific working session tied to evidence trails and exception workflows.
  • Provides highly actionable follow-up questions that map well to the buyer’s actual cues.
Biggest misses
  • The coach underplays the governance-lecture problem; it notices standard messaging but does not fully call out that the seller continued explaining basics after Amazon said it already understood the pattern.
  • The coach does not explicitly evaluate qualification gaps around active initiative, sponsor, decision process, timeline, budget, and success criteria.
  • The coach is a bit too generous on complementarity. Devon did say “alongside,” but the seller still failed to validate where HashiCorp would be additive versus duplicative in Amazon’s actual AWS-native and internal ecosystem.
#16 · 84 · gpt-5.5 low · Strong coach output with a few important under-calls. The coach correctly saw that the call was polished but strategically shallow, especially around Amazon-specific discovery, additive-versus-duplicative value, and passive follow-up. The main weakness is that it softened the benchmark’s central flaw by saying the team avoided lecturing Amazon, when the transcript shows the seller repeatedly reverted to generic operating-model explanation after sophisticated buyer cues.
Overall: 84
Needle recall: 81
Evidence grounding: 92
False-positive control: 84
Prioritization: 86
Actionability: 93
Sales instinct: 82
Technical accuracy: 90
How this model did

The coaching model was well grounded in the transcript and identified most of the hidden flaws: insufficient discovery into Amazon’s real constraints, failure to unpack exception handling and ownership boundaries, conceptual rather than proven differentiation, and a passive deck-based next step. It also fairly credited the seller’s product fluency and Devon’s accurate complementary positioning. However, it underweighted the degree to which Marissa continued a governance/platform lecture after Anjali said Amazon already had similar paved roads, and it did not fully diagnose classic qualification gaps such as active initiative, owner, timeline, decision process, budget, and success criteria.

Strongest findings
  • Correctly identified Anjali’s “exception paths, ownership boundaries, and maintenance cost of internal tooling” as the key discovery opening the seller failed to pursue.
  • Accurately captured the buyer’s central evaluation lens: whether HashiCorp would create additive evidence or net-new signal versus duplicating Amazon’s native and internal controls.
  • Gave transcript-grounded coaching to stop explaining once the buyer says they already understand the pattern and pivot into diagnostic questions.
  • Fairly credited Devon’s technically sound “alongside, not replacing” answer without letting it obscure the lack of tailored differentiation.
  • Provided actionable follow-up questions and practice drills that would improve future calls with sophisticated platform buyers.
Biggest misses
  • Understated the central benchmark flaw by saying the team avoided lecturing Amazon, even though the seller continued a generic governance explanation after multiple buyer cues.
  • Did not fully call out formal opportunity qualification gaps: active initiative, evaluation owner, decision process, budget/resource commitment, timeline, and success criteria.
  • Scored the next step too generously despite the absence of a mutual action plan or validated pain.
  • Could have been firmer that the call ended without a named Amazon constraint, affected group, current workaround, business impact, or priority level.
#17 · 84 · gpt-5.5 xhigh · Strong pass: the coach model captured the central strategic flaw and most hidden needles, with minor under-weighting of qualification gaps and some over-credit for complementary positioning.
Overall: 84
Needle recall: 85
Evidence grounding: 92
False-positive control: 82
Prioritization: 86
Actionability: 90
Sales instinct: 84
Technical accuracy: 90
How this model did

The coach output is well grounded in the transcript and largely aligned with the hidden benchmark. It correctly identifies that the HashiCorp team sounded professional and technically credible but failed to turn Amazon’s sophisticated cues into deeper discovery about internal constraints, additive-vs-duplicative value, exception handling, evidence ownership, and maintenance burden. It also catches the weak, passive follow-up. The main shortcomings are that it slightly over-praises the seller’s complementary positioning, softens the “lecturing a sophisticated buyer” flaw, and does not fully call out disciplined enterprise qualification gaps such as initiative status, evaluation owner, timeline, decision process, or success criteria.

Strongest findings
  • Correctly identifies the central issue: the sellers did not convert Amazon’s sophisticated corrections into targeted discovery.
  • Uses excellent transcript evidence, especially Anjali’s “not exactly,” “tricky bit,” and “we understand the pattern” cues, plus Michael’s “additive versus duplicative” and “net-new signal” comments.
  • Accurately distinguishes technical fluency from strategic relevance: Devon’s answer was credible, but the team still did not earn a specific use case.
  • Strongly catches the passive next step and recommends a more specific, low-pressure mutual action plan.
  • Actionable coaching is practical and sales-relevant: ask three discovery questions after buyer corrections, build an additive-vs-duplicative diagnostic, and use the SE as a discovery partner.
Biggest misses
  • The coach underdevelops the enterprise qualification critique: no active initiative, owner, decision process, timeline, budget/resource commitment, or success criteria were established.
  • It slightly softens the benchmark’s communication-style flaw by describing the call as a solid, credible first conversation rather than more directly calling out the risk of lecturing an expert buyer.
  • It over-credits complementary positioning based on Devon’s one strong answer, while the broader seller motion still defaulted to a generic Terraform/Vault operating-model pitch.
  • It could have more explicitly tied the missed discovery to Amazon-specific build-vs-buy posture and AWS-native/homegrown tooling decisions.
#18 · 84 · gemini 3.1 pro preview · Worst · Strong evaluation with one notable gap on formal qualification.
Overall: 84
Needle recall: 80
Evidence grounding: 90
False-positive control: 88
Prioritization: 90
Actionability: 82
Sales instinct: 84
Technical accuracy: 88
How this model did

The coach output correctly identified the central hidden flaw: Marissa opened with appropriate humility but then reverted to a generic cloud operating model pitch instead of probing Amazon-specific constraints. It was well grounded in transcript evidence, especially around Anjali’s cues about existing paved roads, maintenance cost, exception paths, and the weak deck-based follow-up. The coach also appropriately credited Devon’s bounded technical answer. The main miss is that it did not fully call out enterprise qualification gaps such as initiative status, owner, decision process, timeline, budget/resource commitment, or success criteria. It also only partially addressed the AWS-native/internal-tooling complementarity issue: it praised Devon’s complementary positioning but did not fully separate that from the broader failure to map HashiCorp to a validated Amazon-specific gap.

Strongest findings
  • Correctly identified that the seller reverted to generic cloud operating model explanation after Amazon showed high maturity.
  • Strong transcript-grounded use of Anjali’s “maintenance cost of internal tooling” and “we do something similar today” cues.
  • Accurately flagged the weak, passive close where Amazon agreed only to review materials and come back if relevant.
  • Appropriately credited Devon’s technical answer while still noting the missed opportunity to ask Michael about current-state gaps.
Biggest misses
  • Did not fully diagnose formal qualification gaps: no initiative status, owner, timeline, budget/resources, decision process, or success criteria were uncovered.
  • Only partially framed the AWS-native/internal-tooling complementarity issue; it recognized the need to ask where native controls fall short but did not fully evaluate the lack of a specific wedge use case.
  • The next-step coaching focused more on getting a placeholder meeting than on building a mutual action plan tied to a validated constraint.