salesevals.com · Evaluated Apr 30, 2026

Which models know sales?

Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.
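The setup described above implies a full grid: every model configuration coaches every call, which gives 18 × 25 = 450 evaluations. The following is a minimal sketch of that arithmetic, not the site's actual pipeline. The placeholder model and call names and the helper functions `evaluation_grid` and `overall_mean` are hypothetical, and the simple unweighted average is an assumption, since the page does not say how its mean is aggregated.

```python
from itertools import product
from statistics import mean

NUM_MODELS = 18  # model configurations, as stated on the page
NUM_CALLS = 25   # synthetic sales calls, as stated on the page


def evaluation_grid(models, calls):
    """Pair every model configuration with every call (one evaluation per pair)."""
    return list(product(models, calls))


def overall_mean(scores):
    """Average judge scores, rejecting anything outside the 0-100 scale.

    Assumption: a plain unweighted mean rounded to one decimal place.
    """
    scores = list(scores)
    for score in scores:
        if not 0 <= score <= 100:
            raise ValueError(f"judge score out of range: {score}")
    return round(mean(scores), 1)


# Hypothetical placeholder names; the real identifiers are not used here.
models = [f"model-{i}" for i in range(1, NUM_MODELS + 1)]
calls = [f"call-{i}" for i in range(1, NUM_CALLS + 1)]

grid = evaluation_grid(models, calls)
print(len(grid))  # 18 * 25 = 450 evaluations
```

Passing a list of per-evaluation scores to `overall_mean` would give a headline figure comparable to the page's mean, under the averaging assumption stated above.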

Calls: 25
Models: 18
Evaluations: 450
Mean: 89.8
25 calls · 450 evaluations · Metric: Overall · Build-time static data · Evals completed Apr 30, 2026

The 25 calls

Open a call to read its answer key and how every model did on it.

ExxonMobil AI governance and safety review for energy operations with Anthropic

Product demo · mixed · 39m · 30 turns
Seller: Anthropic
Buyer: ExxonMobil

The call should feel credible and safety-forward, with the Anthropic seller earning trust by framing Claude as governed decision support for high-consequence energy workflows and by refusing to bluff on an unanswered deployment-control detail. However, the seller should leave some coaching room: the conversation should not fully translate governance discussion into ExxonMobil-specific pilot success criteria, decision gates, or a crisp mutual action plan.

Profile: Mixed
Flaws / Strengths: 3 / 3
Duration: 39m · 30 turns

What this call should surface

+ strength

Frames AI around ExxonMobil’s safety-critical operating context, not generic productivity

Executive Alignment · moderate

+ strength

Explains a layered governance model for Claude in operational decision-support use cases

Technical Knowledge · obvious

+ strength

Handles an unanswered deployment-control question transparently instead of bluffing

Objection Handling · moderate

flaw

Governance discovery is thoughtful but not sufficiently ExxonMobil-specific

Discovery · subtle

flaw

Safety posture is strong, but business value and pilot success criteria remain underdefined

Value Alignment · subtle

flaw

Next step is directionally right but not a crisp mutual action plan

Next Steps · moderate

30 speaker turns · 39m timeline

Transcript

The exact speaker-labeled transcript the coach models saw.

Maya Chen (Seller) · Laura Mitchell (Buyer) · Omar Haddad (Buyer) · Devin Patel (Seller)
  1. MC

    Maya Chen

    Seller

    Good morning, everyone — thanks for making the time. I’m Maya Chen with Anthropic, and I lead strategic accounts in industrial and energy sectors. I know for ExxonMobil this conversation is not about chasing generic AI productivity; it’s about whether a model like Claude can be introduced in a way that respects operational safety, reliability, cybersecurity, compliance, and human accountability. What I’d suggest for today is pretty simple: first, quick introductions; then we’ll understand where you’re thinking about AI use cases and risk boundaries; Devin can walk through how we think about governance and controls; and we’ll leave time for the hard questions on deployment, auditability, and what would need to be true before anything gets near operational decision support. Does that work as a starting point?

  2. LM

    Laura Mitchell

    Buyer

    Yes, that works. I’m Laura Mitchell — I run digital operations governance on our side. My focus today is really whether your approach fits the discipline we need around operating environments, not just whether the model is capable. If we leave with a clearer sense of boundaries and what a responsible first step could look like, that’s useful.

  3. OH

    Omar Haddad

    Buyer

    Sure. Omar Haddad here — I’m on the OT cybersecurity and process safety systems side. I’ll mostly be listening for where the boundaries are: what the model can see, what it can influence, how we audit it, and how we prevent advisory use from drifting into operational decision-making without the right controls.

  4. DP

    Devin Patel

    Seller

    Thanks, Omar. I’m Devin Patel, solutions engineer on Anthropic’s enterprise AI safety team. I’m here to get concrete on controls — access, logging, evals, escalation, and where Claude should simply not be in the loop.

  5. MC

    Maya Chen

    Seller

    Great. Laura, maybe start with the use cases you’re considering first?

  6. LM

    Laura Mitchell

    Buyer

    Yeah. We’re looking at this in phases. The near-term interest is mostly lower-risk: internal knowledge search, summarizing technical documentation, helping teams navigate procedures or standards without replacing the procedure itself. If that works, people will naturally ask about engineering and maintenance support — troubleshooting context, draft work instructions, that kind of thing. That’s where our concern goes up quickly. We need to understand how you prevent a plausible-sounding answer from becoming operational guidance before it has been reviewed through our normal channels.

  7. MC

    Maya Chen

    Seller

    That distinction makes sense. We would not frame Claude as replacing the approved procedure or the accountable engineer. For the first phase, are you mainly thinking about enterprise knowledge users — engineering, operations support, corporate functions — or a narrower governed group where you can control access and review patterns more tightly?

  8. LM

    Laura Mitchell

    Buyer

    Initially narrower. We’d probably start with a governed group in engineering and operations support, plus some corporate users for policy and standards search. I don’t think we’re ready to put this broadly in front of field teams until we understand the controls and audit trail.

  9. MC

    Maya Chen

    Seller

    Got it. That narrower start is consistent with what we’d recommend. Devin, maybe it’s worth laying out the control layers before we get into specific deployment questions.

  10. DP

    Devin Patel

    Seller

    Yeah, absolutely. I’d think about it in layers, and the first layer is actually scope: we separate low-risk knowledge retrieval from anything that looks like engineering or maintenance decision support. For the early group, Claude can help find and summarize approved material, but the source of truth remains your controlled documents. Second is access and data boundaries — role-based access, which repositories are connected, what data classes are excluded, and clear rules that the model is not connected to control systems or triggering operational actions. Third is auditability: prompts, responses, users, timestamps, source references where retrieval is used, and retention aligned to your policy. And then fourth is safety assurance: pre-deployment evaluation sets, red-team tests for unsafe or overconfident outputs, monitoring after launch, and escalation paths when the model gives an answer that should be reviewed. The key point is Claude is advisory decision support, not an autonomous operational actor.

  11. OH

    Omar Haddad

    Buyer

    Okay, that’s helpful. The piece I’d want to stress is workflow drift. A user asks for a summary today, tomorrow they ask, “what should I do next?” So how do you distinguish harmless retrieval from advice that should be blocked or routed to a human reviewer?

  12. DP

    Devin Patel

    Seller

    Yeah — that drift is exactly the risk pattern we’d design around. Practically, we’d treat it as both a policy problem and a product-control problem. You define allowed intents up front: summarize this approved standard, compare two approved documents, extract references. Then you define restricted intents: prescribe an operational action, change a procedure, troubleshoot equipment without review. For the restricted category, the right behavior is not “answer more carefully.” It’s refuse, redirect to the approved workflow, or flag that a qualified reviewer has to be involved. And we’d test those boundaries before launch with prompts that deliberately try to move from retrieval into advice.

  13. OH

    Omar Haddad

    Buyer

    Right. But is that enforced technically, or is it mostly acceptable-use policy? That distinction matters a lot for us.

  14. DP

    Devin Patel

    Seller

    It needs to be both. Acceptable-use policy by itself is not enough for this environment. Technically, you’d want the application layer to classify the request, apply role and use-case permissions, log the interaction, and either allow, refuse, or route to a review workflow. The model behavior is one control, but we would not rely on model behavior alone. Where I’d be careful is saying the exact enforcement pattern before we know the deployment path — API, enterprise UI, retrieval layer, identity provider, your workflow tooling. But the principle is: prohibited operational guidance should have a technical gate and an audit trail, not just a training slide.

  15. OH

    Omar Haddad

    Buyer

    Let me make it concrete. If we designate an OT user group, can you enforce a hard runtime block that prevents procedure-change or troubleshooting prompts across both the API and the enterprise UI — and can we audit that the block fired, not just that the user got a warning?

  16. DP

    Devin Patel

    Seller

    I don’t want to guess on a control detail for an environment this critical. The answer may differ between API, enterprise UI, and whatever enforcement layer sits in your workflow. Let me take that as an action item and bring back our deployment/security specialist with a written answer: what can be hard-blocked, where the audit event is generated, constraints, and who owns implementation on each side.

  17. OH

    Omar Haddad

    Buyer

    That’s the right answer. I’d rather hear “we need to verify” than get a confident maybe. For us, that written control detail is going to matter before this gets anywhere near OT-adjacent workflows.

  18. LM

    Laura Mitchell

    Buyer

    Good, and Omar, I agree — we’ll need that in writing. Maybe zooming out for a second: if we started with lower-risk documentation and knowledge workflows, what would Anthropic suggest the pilot is actually proving, beyond “the model behaved safely in a sandbox”?

  19. MC

    Maya Chen

    Seller

    Yeah, that’s a fair challenge. I would not frame the first pilot as just a model test. I’d frame it as proving the operating model: can Claude work against approved content, give traceable answers with citations, stay inside the allowed-use boundary, and fit into your review and escalation process without creating new unmanaged risk. The initial value is usually faster access to internal standards, procedures, and engineering knowledge, with less time spent hunting across repositories. But I’d keep that first phase deliberately away from any autonomous operational recommendation.

  20. LM

    Laura Mitchell

    Buyer

    Okay, that makes sense as a first boundary. I’d just flag we’ll need to be clearer on what “faster access” means and what evidence would justify moving beyond documentation.

  21. MC

    Maya Chen

    Seller

    Agreed. We shouldn’t treat “faster” as a hand-wavy benefit. I think part of the next session should be separating two things: evidence that the controls are working, and evidence that the workflow is actually useful enough for your teams to adopt. We can come back with a starter set of pilot measures, but I’d want to calibrate those with your governance and operations folks before pretending they’re final.

  22. LM

    Laura Mitchell

    Buyer

    Okay. I think that points us toward a broader governance review rather than trying to bless a pilot off this call. We’d want operations, OT cyber, legal, compliance, and data governance in the room — and we’ll need Omar’s control question answered before we get too enthusiastic.

  23. MC

    Maya Chen

    Seller

    That’s exactly the right forum. We can set up a safety and governance workshop with those groups, bring the written control response Omar asked for, and use the session to map allowed use cases, boundaries, audit expectations, and what a responsible first pilot could look like. I’ll coordinate with Devin on the technical pre-read and send over a proposed outline.

  24. LM

    Laura Mitchell

    Buyer

    That works. Send the outline and the control write-up, and I’ll circulate internally. I’m not ready to commit attendees until we see what you’re proposing, but directionally this is the right next step.

  25. DP

    Devin Patel

    Seller

    Yep, and I’ll own getting the control question routed to the right deployment and security folks on our side. We’ll make sure the write-up is specific enough for Omar’s team to react to, not just a generic architecture note.

  26. OH

    Omar Haddad

    Buyer

    That’s helpful, Devin. If the write-up can separate what’s enforceable technically versus what relies on ExxonMobil workflow controls, that’ll make our review a lot cleaner.

  27. DP

    Devin Patel

    Seller

    Absolutely. We’ll split that cleanly — technical enforcement, customer-side workflow controls, audit evidence, and any open constraints we still need to validate.

  28. MC

    Maya Chen

    Seller

    Great. Laura, Omar, thank you both — this was very helpful. Maya here, I’ll send a follow-up email with the workshop outline and Devin’s control write-up called out separately, and then we can figure out the right internal audience on your side from there.

  29. LM

    Laura Mitchell

    Buyer

    Thanks, Maya. Appreciate the candor today. We’ll look for the email, and then I’ll pull the right people together on our side.

  30. MC

    Maya Chen

    Seller

    Perfect. Thanks everyone — we’ll get that over to you and follow up from there. Have a good rest of the day.

Sorted by overall

How each model scored this call

Click a row to read the model's coaching note and the judge's read on it.

1. gpt-5.4 low · 96 · Best · excellent
Overall: 96
Needle recall: 98
Evidence grounding: 97
False-positive control: 98
Prioritization: 96
Actionability: 95
Sales instinct: 96
Technical accuracy: 97
How this model did

The coach output very closely matches the hidden ground truth. It correctly recognizes the call as credible, safety-forward, and moderately positive, with Anthropic earning trust through ExxonMobil-specific risk framing, layered governance controls, and transparent handling of an unanswered deployment-control question. It also identifies the key coaching gaps: discovery stayed too generic for ExxonMobil’s actual operating environment, pilot value/success criteria were underdefined, and the workshop next step lacked a crisp mutual action plan. Evidence is well grounded in the transcript and there are no material unsupported claims.

Strongest findings
  • Correctly identifies the opening as strong executive alignment to ExxonMobil’s safety-critical operating context.
  • Correctly praises the layered governance/control model, including access controls, auditability, evals, red-teaming, monitoring, and human oversight.
  • Correctly treats Devin’s refusal to speculate on a hard deployment-control question as a trust-building strength.
  • Accurately surfaces the main commercial weaknesses: underdefined pilot metrics, value case, approval path, and next-step structure.
  • Provides actionable coaching recommendations and follow-up questions that are tightly connected to the transcript and benchmark gaps.
Biggest misses
  • No material misses. The only minor gap is that the coach could have been more explicit about asking for ExxonMobil-specific evaluation artifacts, historical unsafe-output examples, or incident/near-miss learning to tailor domain-specific evals.
2. gpt-5.4 medium · 96 · Excellent match to ground truth
Overall: 96
Needle recall: 97
Evidence grounding: 96
False-positive control: 95
Prioritization: 96
Actionability: 96
Sales instinct: 97
Technical accuracy: 97
How this model did

The coach output accurately captured the hidden benchmark’s mixed assessment: a credible, safety-forward Anthropic call that built trust with ExxonMobil through strong context-setting, layered governance, and transparent handling of an unknown technical control detail, while leaving room to improve ExxonMobil-specific discovery, pilot value metrics, and mutual action planning. The analysis is strongly transcript-grounded, prioritizes the right issues, and contains no material unsupported claims.

Strongest findings
  • Correctly praised Maya’s opening for anchoring the call in ExxonMobil’s safety-critical context rather than generic AI productivity.
  • Correctly identified Devin’s layered governance model as a major strength, with concrete controls rather than vague safety branding.
  • Correctly treated Devin’s refusal to bluff on the hard runtime-block question as a trust-building moment.
  • Correctly surfaced the main coaching gaps: insufficient ExxonMobil-specific discovery, underdefined pilot value metrics, and a soft next step lacking mutual action-plan rigor.
Biggest misses
  • No major misses. The only slight gap is that the coach could have more explicitly called out the absence of domain-specific evaluation design, such as unsafe-output examples, historical incidents, or ExxonMobil-specific failure modes.
  • The coach’s numeric scoring for next-step control was somewhat generous, but the narrative still correctly identified the weakness.
3. gpt-5.5 high · 96 · Excellent alignment with the hidden ground truth
Overall: 96
Needle recall: 98
Evidence grounding: 95
False-positive control: 92
Prioritization: 96
Actionability: 97
Sales instinct: 96
Technical accuracy: 95
How this model did

The coach correctly captured the mixed but positive nature of the call: Anthropic built trust with a safety-forward posture, gave a concrete governance/control framework, and handled the hard deployment-control question with appropriate candor. It also identified the main coaching gaps: discovery did not go deeply enough into ExxonMobil-specific workflows and governance artifacts, pilot value metrics were underdefined, and the close lacked a crisp mutual action plan with dates, owners, and decision criteria. Evidence use was strong and largely transcript-grounded. The only minor issue is that the coach slightly treated the inability to answer Omar’s runtime-block question live as a technical-readiness gap, whereas the benchmark specifically views the transparent non-answer plus written follow-up as the right behavior.

Strongest findings
  • Correctly identified the opening as strong account-specific executive alignment around safety, reliability, cybersecurity, compliance, and human accountability.
  • Accurately credited the layered governance/control model rather than treating the call as vague AI safety positioning.
  • Correctly treated Devin’s refusal to bluff on the hard control question as a major trust-building moment.
  • Identified the subtle core gap: the call stayed too conceptual and did not map deeply enough to ExxonMobil’s workflows, data classes, approval processes, and governance artifacts.
  • Precisely diagnosed the weak pilot framing: value metrics and success criteria were not defined until Laura challenged the “faster access” claim.
  • Correctly coached the close from a soft workshop proposal toward a mutual action plan with owners, dates, deliverables, and decision criteria.
Biggest misses
  • No major hidden benchmark needle was missed.
  • The coach could have been even clearer that Devin’s lack of a live answer to Omar’s hard-block question should not count against the seller, provided the written follow-up is specific and owned.
  • The discovery coaching could have more explicitly mentioned buyer-specific evaluation design, such as using ExxonMobil-defined unsafe-output examples, internal standards, historical incidents, or near-miss patterns.
4. gpt-5.4 none · 95 · Excellent benchmark alignment
Overall: 95
Needle recall: 97
Evidence grounding: 96
False-positive control: 97
Prioritization: 95
Actionability: 96
Sales instinct: 95
Technical accuracy: 97
How this model did

The coach output closely matches the hidden mixed-call ground truth. It correctly praises the seller’s safety-critical framing, layered governance articulation, and transparent handling of an unanswered deployment-control question. It also captures the main coaching gaps: discovery did not go deeply enough into ExxonMobil-specific workflows/governance, value and pilot metrics were underdefined, and the next step was directionally right but not yet a mutual action plan. Evidence is well grounded in the transcript with minimal unsupported claims.

Strongest findings
  • Correctly treated the deployment-control uncertainty as a trust-building strength rather than penalizing Devin for not knowing the exact enforcement details live.
  • Strongly grounded the call’s biggest strength in Maya’s opening framing around safety, reliability, cybersecurity, compliance, and human accountability.
  • Accurately identified the layered governance model as concrete and credible, including access controls, auditability, red-teaming, monitoring, and restricted-use boundaries.
  • Captured the main commercial gap: safety was strong, but value metrics, pilot success criteria, and scaling rationale were still too abstract.
  • Correctly diagnosed that the workshop was a good next step but lacked the specificity of a mutual action plan.
Biggest misses
  • Very little. The coach could have been slightly sharper in calling out ExxonMobil-specific operational discovery gaps, such as concrete maintenance, engineering change, refinery operations, trading/logistics, subsurface data, or historical near-miss examples.
  • The phrase “Strong discovery” in the executive summary slightly overstates the discovery quality, though the rest of the output correctly qualifies discovery as high-level and needing more specificity.
5. gpt-5.4 high · 94 · Excellent alignment with the hidden ground truth
Overall: 94
Needle recall: 97
Evidence grounding: 96
False-positive control: 95
Prioritization: 93
Actionability: 94
Sales instinct: 94
Technical accuracy: 95
How this model did

The coach output accurately captured the mixed-call pattern: Anthropic earned trust through safety-forward framing, a concrete governance model, and transparent handling of an unanswered deployment-control question, while leaving coaching room around deeper ExxonMobil-specific discovery, measurable pilot value, and a crisper mutual action plan. The feedback is well grounded in transcript evidence and avoids penalizing Devin for appropriately deferring the hard technical-control question. Minor weakness: the coach slightly over-scored/over-praised the next step relative to the benchmark flaw, though it still identified the lack of timing, decision cadence, and firmer commitments.

Strongest findings
  • Correctly praised the opening for framing the discussion around ExxonMobil’s safety-critical operating context rather than generic productivity.
  • Correctly identified Devin’s transparent deferral on the hard runtime-block/auditability question as a trust-building strength, not a weakness.
  • Accurately captured the layered governance model and its relevance to operational decision-support boundaries.
  • Strongly surfaced the key coaching gap around pilot success criteria and measurable business value.
  • Provided actionable follow-up questions and coaching drills that align with the benchmark’s recommended improvements.
Biggest misses
  • No major hidden-ground-truth miss. The coach covered all six benchmark needles.
  • The coach could have been slightly sharper in downgrading the close: the next step lacked not only timeline but also explicit ExxonMobil owner, decision criteria, and mutual action-plan structure.
  • The coach’s note that the governance/technical score was not higher because enforcement details were deferred is acceptable, but it should be framed carefully so the seller is not penalized for appropriately refusing to bluff.
6. gpt-5.5 none · 94 · Strong pass
Overall: 94
Needle recall: 97
Evidence grounding: 96
False-positive control: 92
Prioritization: 94
Actionability: 95
Sales instinct: 95
Technical accuracy: 96
How this model did

The coach output closely matches the hidden benchmark. It correctly recognizes the call as moderately positive and safety-forward, praises the seller’s executive framing, layered governance discussion, and transparent handling of the hard deployment-control question, while also identifying the main coaching gaps around ExxonMobil-specific discovery, pilot value metrics, decision process, and next-step rigor. The feedback is well grounded in transcript evidence and mostly avoids unsupported claims. Minor room for improvement: the coach slightly over-scores the next-step quality relative to the benchmark’s emphasis that it was not a crisp mutual action plan, and it adds a low-priority competitive-alternatives point outside the core ground truth, though this is not materially harmful.

Strongest findings
  • Correctly identifies the opening executive framing around safety-critical operations as a major strength.
  • Correctly credits Devin’s layered governance model with concrete controls rather than vague AI-safety language.
  • Correctly praises the refusal to bluff on the hard runtime-control question and ties it to buyer trust.
  • Correctly elevates pilot success criteria and value metrics as the top coaching opportunity.
  • Correctly distinguishes a reasonable workshop next step from a true mutual action plan with dates, owners, decision criteria, and required stakeholders.
Biggest misses
  • The coach could have been slightly sharper that the next step remained non-committal and under-owned, rather than giving mutual action planning an 8/10.
  • The coach did not fully unpack the benchmark’s most ExxonMobil-specific discovery gaps, such as probing historical unsafe-output examples, incident/near-miss learning, or current model-risk/process-safety/change-management artifacts.
  • The competitive-alternatives point is plausible but not central to this benchmark and could distract from the more important qualification gaps around value, approval path, and pilot gates.
7. gpt-5.5 xhigh · 94 · Excellent evaluator output; strongly aligned with the hidden ground truth.
Overall: 94
Needle recall: 97
Evidence grounding: 96
False-positive control: 92
Prioritization: 94
Actionability: 96
Sales instinct: 95
Technical accuracy: 95
How this model did

The coach correctly captured the mixed-but-positive pattern: Anthropic built trust through safety-first framing, concrete governance controls, and transparent handling of an unknown deployment-control detail, while leaving room to improve around ExxonMobil-specific discovery, pilot success metrics, decision process, and a crisper mutual action plan. The assessment is well grounded in transcript evidence and identifies all six benchmark needles with only minor calibration issues, mainly that some numeric scores, especially close/next steps and overall call rating, are a touch generous given the underdeveloped mutual plan.

Strongest findings
  • Correctly identified the safety-first executive framing as a major trust-builder for ExxonMobil’s high-consequence operating context.
  • Accurately credited the layered governance model, including use-case boundaries, access controls, logging/auditability, evals, red-teaming, monitoring, and escalation.
  • Properly treated Devin’s refusal to bluff on the hard deployment-control question as a strength, citing Omar’s explicit positive reaction.
  • Identified the key commercial gap: value and pilot success criteria were underdefined after Laura challenged the meaning of “faster access.”
  • Turned the closing weakness into actionable coaching around a mutual action plan, workshop deliverables, timing, stakeholders, and decision criteria.
Biggest misses
  • No major misses. The coach found all six hidden needles.
  • The coach could have gone slightly deeper on ExxonMobil-specific discovery gaps such as internal process-safety/change-management frameworks, historical failure modes, unsafe-output examples, and domain-specific evaluation-set design.
  • The coach’s numeric scoring was a little generous for next steps and overall opportunity advancement, given the absence of a firm workshop commitment or explicit approval path.
8. gpt-5.4 xhigh · 94 · Excellent coaching output; it captures the mixed-call ground truth with strong semantic alignment and transcript-grounded evidence.
Overall: 94
Needle recall: 96
Evidence grounding: 94
False-positive control: 92
Prioritization: 95
Actionability: 96
Sales instinct: 95
Technical accuracy: 95
How this model did

The coach correctly recognized the call as credible, safety-forward, and moderately positive, while also identifying the main advancement gaps: shallow ExxonMobil-specific discovery, underdefined pilot success criteria, and a non-calendarized next step. It properly praised the seller for executive alignment, layered governance explanation, and transparent handling of an unanswered deployment-control question. The feedback is well prioritized and actionable, with only minor room to tighten a few claims around decision process and commercial qualification.

Strongest findings
  • Correctly praised the opening as strong executive alignment with ExxonMobil’s safety-critical operating context.
  • Correctly identified Devin’s layered governance model as a major technical credibility strength.
  • Correctly treated the unanswered runtime-blocking question as a trust-building moment because Devin avoided bluffing and assigned written follow-up.
  • Accurately diagnosed the main coaching gaps: insufficient buyer-specific discovery, undefined pilot metrics, and a non-calendarized next step.
  • Provided actionable next-call coaching, especially around a pilot scorecard, control matrix, discovery sequence, and mutual action plan discipline.
Biggest misses
  • No major hidden-ground-truth miss. The coach covered all six benchmark needles with strong semantic accuracy.
  • The coach could have made the opportunity outcome language slightly more explicit: moderately positive continuation into a workshop, but not fully advanced because the approval path and decision gates remain underdeveloped.
  • A few additional missed-opportunity items, such as prior AI efforts and funding/timeline qualification, are reasonable but go beyond the benchmark and should remain secondary to the core governance/value/MAP gaps.
9. opus 4.7 max · 94 · strong
Overall: 94
Needle recall: 98
Evidence grounding: 93
False-positive control: 90
Prioritization: 95
Actionability: 96
Sales instinct: 95
Technical accuracy: 94
How this model did

The coach output aligns very closely with the hidden ground truth. It correctly recognizes the call as credible, safety-forward, and moderately positive, while still identifying the key coaching gaps: discovery did not become ExxonMobil-specific enough, value and pilot success criteria remained underdeveloped, and the close lacked a crisp mutual action plan. The coach also properly treats Devin’s refusal to guess on the deployment-control question as a strength rather than a weakness. Minor issues: the executive summary slightly overstates the agreed deliverable as a “written control matrix,” when the transcript supports a control write-up and workshop outline, not a fully agreed matrix.

Strongest findings
  • Correctly identified the strongest opening move: Maya anchored the call in ExxonMobil’s safety-critical operating context rather than generic AI productivity.
  • Accurately praised Devin’s layered control model, including scope, access/data boundaries, auditability, evals/red-teaming, monitoring, and escalation.
  • Properly treated the unanswered hard-block question as a credibility-building moment because Devin refused to bluff and committed to a specialist-written response.
  • Captured the mixed-call pattern: strong trust-building and governance fluency, but incomplete value definition, pilot success criteria, and commercial progression.
  • Provided actionable coaching recommendations: discovery questionnaire, pilot scorecard, timeline/deliverable structure, stakeholder pre-read, and control-framework artifact.
Biggest misses
  • The coach slightly overclaimed that a written control matrix was an agreed next step, when the transcript only supports a control write-up and workshop outline.
  • The coach could have been even more explicit that the opportunity outcome is only moderately positive: ExxonMobil is willing to continue, but the evaluation remains unqualified until value, approval path, and implementation gates are clearer.
  • Some additional missed opportunities, such as comparable deployment references, are reasonable but less central than the hidden benchmark’s core focus on ExxonMobil-specific discovery, value metrics, and mutual action planning.
10 · 93 · gpt-5.5 low · Excellent judge-aligned coaching output
Overall 93
Needle recall 95
Evidence grounding 95
False-positive control 94
Prioritization 92
Actionability 94
Sales instinct 93
Technical accuracy 94
How this model did

The coach output closely matches the hidden ground truth. It recognizes the call as credible, safety-forward, and moderately positive, while correctly preserving the mixed-call pattern: Anthropic earned trust through strong governance framing and transparent handling of a technical unknown, but did not fully convert the discussion into measurable pilot success criteria, stakeholder/decision mapping, or a crisp mutual action plan. Evidence is well grounded in the transcript, with no material hallucinated findings.

Strongest findings
  • Correctly praised the opening for anchoring in ExxonMobil’s safety-critical operating context rather than generic AI productivity.
  • Correctly identified the layered governance model as a core strength, including use-case boundaries, access controls, logging, evals, red-teaming, escalation, and human review.
  • Correctly treated Devin’s refusal to bluff on the hard-block/auditability question as a trust-building strength, not a weakness.
  • Correctly surfaced the main opportunity-development gaps: measurable pilot success criteria, decision-process discovery, stakeholder mapping, timeline, and mutual action planning.
  • Recommendations were highly actionable, especially the suggested pilot scorecard, stakeholder-specific workshop mapping, and control matrix by deployment path.
Biggest misses
  • The coach could have made the ExxonMobil-specific discovery gap slightly sharper by explicitly naming missing probes into operational workflows, risk tiers, data classifications, process-safety/change-management artifacts, and domain-specific eval examples.
  • The coach’s discovery critique leaned somewhat toward generic enterprise sales process items like timeline and approval path, though these were still relevant and grounded.
11 · 93 · opus 4.7 xhigh · Strong pass
Overall 93
Needle recall 97
Evidence grounding 90
False-positive control 86
Prioritization 95
Actionability 95
Sales instinct 96
Technical accuracy 93
How this model did

The coach output closely matches the hidden ground truth. It correctly recognizes the call as credible and safety-forward, credits the safety-critical framing, layered governance explanation, and transparent handling of the deployment-control gap, while also identifying the key coaching gaps around ExxonMobil-specific discovery, pilot value metrics, and a more concrete mutual action plan. The main deductions are minor: a couple of evidence claims overstate what was in the transcript, especially around Anthropic-side ownership and a quoted/attributed point about change management and executive accountability.

Strongest findings
  • Correctly treats Devin’s transparent non-answer on the hard runtime-block question as a high-value trust-building moment, not a weakness.
  • Accurately identifies the layered governance model as the technical center of the call and grounds it in scope, access/data boundaries, auditability, evals, red-teaming, monitoring, and escalation.
  • Clearly captures the mixed-call pattern: strong safety credibility and buyer trust, but underdeveloped business value metrics and pilot success criteria.
  • Provides actionable coaching by recommending a pilot scorecard with separate control-effectiveness and workflow-usefulness measures.
  • Correctly identifies that the close needs dates, stakeholder mapping, ownership, and decision criteria to become a mutual action plan.
Biggest misses
  • The coach did not explicitly emphasize the opening executive-alignment moment as much as it could have, although it captured the substance elsewhere.
  • The coach could have been more precise in the discovery critique by naming ExxonMobil-specific workflows and data domains that were not explored, such as maintenance, engineering change, refinery operations, trading/logistics, or subsurface data.
  • A couple of evidence claims are slightly overreaching or unsupported, especially around Anthropic-side delivery ownership and Laura’s supposed change-management/executive-accountability framing.
12 · 92 · gpt-5.5 medium · Strong judge match: the coach captured the mixed-call pattern very well, identifying the major strengths around safety-forward positioning, layered governance, and transparent non-bluffing, while also coaching the real gaps around pilot metrics, deeper ExxonMobil-specific discovery, and a sharper mutual action plan.
Overall 92
Needle recall 94
Evidence grounding 95
False-positive control 90
Prioritization 91
Actionability 94
Sales instinct 92
Technical accuracy 96
How this model did

The coach output is highly aligned to the hidden ground truth. It correctly treats the call as moderately positive rather than overly negative, praises the seller for executive alignment and safety credibility, and does not penalize Devin for refusing to guess on the hard runtime-blocking question. The main imperfection is that the coach’s discovery critique is somewhat broader/generic than the benchmark’s more specific point about ExxonMobil workflows, data classes, risk tiers, and domain-specific evaluation design. It also adds a minor competitive-differentiation coaching point that is not strongly supported by the call context. Overall, this is a well-grounded, actionable evaluation.

Strongest findings
  • Correctly praised the opening for explicitly rejecting generic AI productivity framing and anchoring on safety, cybersecurity, compliance, reliability, and accountability.
  • Accurately identified Devin’s layered governance explanation as a major technical-credibility strength.
  • Correctly treated Devin’s refusal to guess on a hard deployment-control question as a trust-building strength, not a weakness.
  • Strongly captured the main commercial gap: pilot value, metrics, and success criteria were not sufficiently defined.
  • Correctly coached the close from a good workshop proposal toward a more rigorous mutual action plan with timing, owners, artifacts, and decision criteria.
Biggest misses
  • The discovery critique could have been more ExxonMobil-specific: actual operational workflows, risk tiers, internal standards, historical unsafe-output examples, near-miss learning, and domain-specific evaluation design.
  • The coach slightly over-credited the next step as a secured advancement, despite Laura not yet committing attendees and no date or decision gate being agreed.
  • A small amount of generic sales coaching appeared, especially around competitive differentiation and budget, which was not central to this call’s benchmark.
13 · 92 · sonnet 4.6 · Strong pass: the coach captured the hidden mixed-call pattern very well, with only minor overstatement and a few unsupported embellishments.
Overall 92
Needle recall 97
Evidence grounding 89
False-positive control 86
Prioritization 91
Actionability 93
Sales instinct 92
Technical accuracy 92
How this model did

The coaching output aligns closely with the benchmark. It correctly praises the seller’s safety-forward executive framing, layered governance discussion, and transparent handling of Omar’s deployment-control question. It also identifies the main coaching gaps: discovery was not sufficiently ExxonMobil-specific, pilot value/success criteria were thin, and the workshop next step lacked crisp mutual-action-plan mechanics. The main weaknesses in the coach output are tonal over-optimism around the close being “well-defined” and a few invented or exaggerated details, such as calling Laura a VP and attributing more proactive hallucination-risk language than the transcript supports.

Strongest findings
  • Correctly identifies Maya’s opening as strong executive alignment with ExxonMobil’s safety-critical operating context.
  • Accurately credits Devin’s layered governance framework, including scope controls, access boundaries, auditability, evals/red-teaming, monitoring, escalation, and non-autonomous positioning.
  • Excellent treatment of Devin’s refusal to bluff on the hard runtime-block/auditability question; the coach recognizes this as a trust-building strength.
  • Strong diagnosis of the value gap, especially using Laura’s “faster access” challenge as evidence that pilot success criteria need to be defined.
  • Strong close coaching: the recommendation to propose a target week, clarify Laura’s internal review process, and convert the workshop into a timeline is directly actionable.
Biggest misses
  • No major hidden benchmark needle was missed.
  • The coach is slightly more bullish than the benchmark in places, especially by describing the next step as “well-defined” despite later recognizing the lack of a mutual action plan.
  • A few details are not transcript-grounded, such as Laura’s supposed VP title and some exact governance language attributed to her.
  • The coach could have more explicitly summarized the overall opportunity state as moderately positive but still partially unqualified around approval path, decision gates, and implementation gating.
14 · 87 · opus 4.7 medium · Strong judgeable coaching output with one notable gap
Overall 87
Needle recall 85
Evidence grounding 92
False-positive control 90
Prioritization 83
Actionability 88
Sales instinct 86
Technical accuracy 91
How this model did

The coach accurately captured the main mixed-call pattern: Anthropic built credibility through safety-first positioning, layered governance controls, and transparent handling of an unanswered deployment-control question, while leaving pilot value and deeper customer-specific discovery underdeveloped. The biggest scoring weakness is that the coach was too generous on the close: it called the next step “clear” and scored it highly, while the benchmark expected coaching on the lack of a crisp mutual action plan with timing, owners, decision criteria, and ExxonMobil-side commitments.

Strongest findings
  • Correctly praised the opening safety-critical framing and avoidance of generic AI productivity messaging.
  • Correctly identified Devin’s layered governance explanation as a major strength, including access controls, auditability, evals, red-teaming, and advisory-only boundaries.
  • Correctly treated the unanswered hard-block question as a trust-building moment rather than a seller weakness.
  • Strongly captured the value/pilot-success gap and offered actionable metric examples.
  • Identified that discovery should have probed existing governance artifacts and data classification more deeply.
Biggest misses
  • Underweighted the benchmark flaw around next steps: the close lacked a mutual action plan with timing, owners, decision criteria, and buyer-side commitments.
  • Scored the close too generously despite Laura not committing attendees and no decision gate being defined.
  • Did not explicitly coach the seller to confirm who at ExxonMobil owns the evaluation, who signs off, and what the workshop should decide.
  • Only lightly connected the missing control matrix/pilot readiness deliverable to the broader MAP weakness.
15 · 87 · opus 4.7 high · Strong coaching output with one material calibration issue
Overall 87
Needle recall 88
Evidence grounding 91
False-positive control 86
Prioritization 82
Actionability 92
Sales instinct 86
Technical accuracy 94
How this model did

The coach accurately captured the main hidden pattern: Anthropic built credibility through safety-forward framing, a concrete governance/control model, and transparent handling of an unanswered deployment-control question, while leaving room to improve discovery depth and pilot value metrics. The biggest weakness is that the coach overpraised the close as a clear, mutually agreed next step and scored Next Steps very high, despite the benchmark expecting criticism that the workshop was not converted into a crisp mutual action plan with dates, stakeholder commitments, decision gates, and buyer-side ownership.

Strongest findings
  • Correctly elevated Maya’s safety-critical opening as a major executive-alignment strength.
  • Accurately credited Devin’s layered governance model and the distinction between model behavior, application-layer controls, policy, and auditability.
  • Correctly treated Devin’s transparent non-answer to Omar’s hard control question as a strength, not a weakness.
  • Strongly identified the underdeveloped pilot value story and proposed relevant starter KPI categories.
  • Caught the discovery-depth issue around data classes, existing controls, integrations, AI policy, and decision process.
Biggest misses
  • The coach materially over-scored the close and did not sufficiently frame the workshop as lacking a crisp mutual action plan.
  • The coach’s next-step assessment underweighted Laura’s explicit lack of commitment to attendees and the absence of dates, decision gates, and buyer-side ownership.
  • The discovery critique was good but could have gone deeper on ExxonMobil-specific operational workflows, risk tiers, incident/near-miss examples, and domain-specific evaluation design.
16 · 86 · opus 4.7 low · strong
Overall 86
Needle recall 83
Evidence grounding 88
False-positive control 86
Prioritization 80
Actionability 89
Sales instinct 85
Technical accuracy 91
How this model did

The coach output closely matches the hidden benchmark’s mixed-positive read: it correctly praises the safety-forward framing, layered governance discussion, and Devin’s transparent handling of the unresolved control question, while also identifying underdeveloped discovery and vague pilot success metrics. The main gap is that it under-coaches the close: the benchmark expected a clear callout that the workshop next step was not yet a crisp mutual action plan with timing, buyer-side owners, decision gates, and success criteria. The coach instead scored next steps fairly high and did not prioritize mutual-plan discipline.

Strongest findings
  • Correctly treated Devin’s transparent uncertainty on the hard runtime-block question as a high-value trust-building behavior.
  • Accurately recognized the layered governance/control narrative as a major strength, including scope, access, auditability, evals, monitoring, and escalation.
  • Captured the key business-value gap: the pilot was framed responsibly, but measurable success criteria remained vague.
  • Identified that discovery needed to go deeper into data classes, integration realities, and buyer-specific control requirements.
  • Provided actionable next-step coaching such as a control matrix, draft measurement framework, and technical pre-read questionnaire.
Biggest misses
  • Underweighted the lack of a crisp mutual action plan at the close; the coach should have called out missing timing, buyer-side ownership, required attendees, decision criteria, and deadlines.
  • Did not fully emphasize that ExxonMobil-specific operational workflow discovery was thin, beyond technical environment questions.
  • Some evidence for proprietary/subsurface/commercial data came from account context rather than the transcript, though the coaching point itself was valid.
17 · 84 · deepseek v4 pro · Strong but slightly overgenerous evaluation. The coach captured the major strengths and the main value/metrics gap, but underweighted the weakness around mutual action planning and only partially captured the ExxonMobil-specific discovery gap.
Overall 84
Needle recall 86
Evidence grounding 90
False-positive control 78
Prioritization 78
Actionability 86
Sales instinct 82
Technical accuracy 88
How this model did

The coach correctly praised the seller for anchoring the conversation in ExxonMobil’s safety-critical context, explaining layered governance controls, and transparently refusing to bluff on Omar’s hard deployment-control question. It also identified the underdefined pilot success criteria and some discovery gaps. However, it rated the next steps too highly and described them as clear/well-defined even though the transcript leaves timing, ExxonMobil ownership, attendees, decision gates, and a mutual action plan unresolved. The coach’s overall tone is more positive than the hidden benchmark’s “moderately positive” outcome, but most findings are transcript-grounded and directionally useful.

Strongest findings
  • Correctly identified the safety-critical executive framing in Maya’s opening as a major trust builder.
  • Correctly praised Devin’s layered governance/control explanation with scope, access, auditability, evals, red-teaming, monitoring, and escalation.
  • Correctly treated Devin’s refusal to bluff on the hard runtime-block question as a strength, not a weakness.
  • Correctly flagged undefined pilot success criteria and recommended concrete KPIs for a documentation-search pilot.
  • Provided useful follow-up discovery questions around metrics, audit practices, governance frameworks, and access management.
Biggest misses
  • Underweighted the next-step weakness: the workshop proposal lacked timing, named ExxonMobil owners, decision criteria, attendee commitments, and a mutual action plan.
  • Only partially captured the need for more ExxonMobil-specific governance discovery, such as concrete workflows, data classes, risk tiers, process-safety artifacts, and domain-specific evaluation examples.
  • The scoring was too generous in areas where the transcript showed acknowledged gaps, especially Next Steps at 9 and Value Articulation at 8.
  • The coach did not clearly state that the call outcome was moderately positive rather than fully advanced.
18 · 70 · gemini 3.1 pro preview · Worst · Partially aligned. The coach accurately recognized the strongest trust-building and safety-positioning moments, but it over-celebrated the call and missed important mixed-call caveats around ExxonMobil-specific discovery and the lack of a true mutual action plan.
Overall 70
Needle recall 62
Evidence grounding 82
False-positive control 63
Prioritization 58
Actionability 76
Sales instinct 72
Technical accuracy 82
How this model did

The coach did a strong job on the obvious strengths: Maya’s opening framed the conversation around ExxonMobil’s safety-critical context, and Devin’s refusal to bluff on the runtime-block question was correctly praised as a major trust builder. The coach also caught the need to quantify pilot value. However, it inflated the overall call quality with language like “exceptionally strong,” “perfectly,” and “successfully advanced,” while the hidden benchmark expects a more mixed read. The biggest gaps are that the coach did not identify the insufficiently ExxonMobil-specific governance discovery and directly contradicted the benchmark on next steps by scoring the close highly despite no date, no mutual owner map, no decision criteria, and only a directionally accepted workshop.

Strongest findings
  • Correctly praised Maya’s opening for anchoring the conversation in ExxonMobil’s safety-critical environment rather than generic AI productivity.
  • Correctly identified Devin’s refusal to guess on a critical deployment-control question as a major trust-building moment.
  • Correctly surfaced the need to quantify pilot value and move from vague “faster access” language to measurable success criteria.
  • Grounded most major claims in real transcript quotes, especially Omar’s validation and Laura’s challenge on pilot evidence.
Biggest misses
  • Did not identify the lack of ExxonMobil-specific governance discovery around operational workflows, data classes, risk tiers, internal standards, and domain-specific failure modes.
  • Contradicted the benchmark on next steps by treating a directionally accepted workshop as a secured, high-quality close.
  • Underweighted the importance of pilot success criteria by labeling business-value definition as a low/minor improvement.
  • Did not fully call out the layered governance model as a distinct technical strength with its concrete controls.