Which models know sales?
Eighteen model configurations coach the same 25 synthetic sales calls. Each call has hidden ground truth. A judge scores every coaching note from 0–100 on whether it found the real strengths, flaws, and next moves.
- Calls
- 25
- Models
- 18
- Evaluations
- 450
- Mean
- 89.8
The 25 calls
Open a call to read its answer key and how every model did on it.
- CollibraBerkshire HathawayBerkshire Hathaway Data governance discovery across decentralized business units with CollibraEasiestDiscoveryflawed95.4
- StripePavePave Pricing and packaging objection call with StripeCompetitive displacementflawed94.3
- VercelMercuryMercury First discovery for frontend platform consolidation with VercelDiscoveryflawed94.1
- AtlassianDelta Air LinesDelta Air Lines Enterprise discovery for service management modernization with AtlassianDiscoveryflawed94.0
- MongoDBWayfairWayfair Integration deep dive for catalog modernization with MongoDBProduct demoexcellent93.7
- TwilioThe Home DepotThe Home Depot Renewal save call after usage and support concerns with TwilioRenewal saveflawed93.7
- Palo Alto NetworksAppleApple Technical security review for zero trust architecture with Palo Alto NetworksProduct demoexcellent93.2
- AmplitudeDuolingoDuolingo Renewal QBR and expansion planning with AmplitudeQBRexcellent92.4
- OpenAICVS HealthCVS Health AI contact-center transformation discovery with OpenAIDiscoveryexcellent92.0
- GitHubRipplingRippling Product-led expansion discovery for developer workflow with GitHubDiscoveryexcellent91.8
- WorkdayMcKessonMcKesson HR transformation qualification and stakeholder mapping with WorkdayDiscoveryflawed91.1
- AnthropicExxonMobilExxonMobil AI governance and safety review for energy operations with AnthropicProduct demomixed90.9
- CrowdStrikeTargetTarget Security architecture review for endpoint consolidation with CrowdStrikeProduct demoexcellent90.8
- DatadogLinearLinear Technical demo for observability and incident response with DatadogProduct demoexcellent90.4
- ElasticJPMorgan ChaseJPMorgan Chase Technical workshop for search and observability consolidation with ElasticProduct demoexcellent90.4
- NVIDIAWalmartWalmart Executive discovery for AI infrastructure and store operations with NVIDIADiscoveryexcellent89.3
- HashiCorpAmazonAmazon Cloud operating model discussion for internal platform teams with HashiCorpDiscoveryflawed89.1
- ServiceNowFord Motor CompanyFord Motor Company Procurement negotiation for workflow automation with ServiceNowCompetitive displacementmixed88.6
- SnowflakeToastToast Data platform proof-of-concept kickoff with SnowflakeProduct demoflawed87.0
- CloudflareCanvaCanva Competitive displacement discovery for edge security with CloudflareCompetitive displacementflawed85.8
- FigmaThe Walt Disney CompanyThe Walt Disney Company Design collaboration demo with brand and asset workflow discussion with FigmaProduct demomixed85.8
- OktaSweetgreenSweetgreen Executive alignment for identity modernization with OktaQBRmixed85.2
- SalesforceUnitedHealth GroupUnitedHealth Group Healthcare CRM expansion objection handling with SalesforceRenewal savemixed84.9
- SnykRunwayRunway Security review before developer-tool rollout with SnykProduct demomixed82.5
- MicrosoftCostco WholesaleCostco Wholesale Proof-of-concept readout for analytics and productivity workflow with MicrosoftHardestProduct demomixed79.7
Duolingo Renewal QBR and expansion planning with Amplitude
The target call should feel like a high-quality incumbent renewal QBR that earns the right to discuss expansion. The seller team should lead with Duolingo-specific preparation, validate adoption and metric movement without overclaiming unsupported facts, connect Amplitude usage to executive priorities like subscriber growth, learner engagement, retention, AI-feature adoption, and experimentation velocity, then convert the QBR into a buyer-authored expansion plan with clear owners, success criteria, and timeline. The call may include one minor imperfection, such as leaving commercial packaging details for a follow-up rather than resolving every pricing question live, but the overall coaching judgment should be strongly positive.
- Profile
- Excellent
- Flaws / Strengths
- 1 / 4
- Duration
- 52m · 40 turns
What this call should surface
Earns expansion by first anchoring the QBR in adoption patterns and metric movement
Value Alignment · moderate
Connects Amplitude capabilities to Duolingo’s executive priorities rather than generic analytics features
Executive Alignment · subtle
Turns expansion planning into a buyer-authored prioritization exercise
Discovery · moderate
Closes with a crisp mutual action plan including owners, dates, success criteria, and procurement path
Next Steps · obvious
Minor imperfection: defers detailed packaging or pricing mechanics instead of fully resolving them live
Objection Handling · subtle
Transcript
The exact speaker-labeled transcript the coach models saw.
- MC
Mara Chen
Seller
Hi everyone, thanks for making the time. I’m Mara Chen from Amplitude, and I own the Duolingo relationship on our side. The goal today is pretty simple: first, validate the QBR view we prepared around how your product, growth, data, lifecycle, and subscription teams are using Amplitude today; second, connect that to the outcomes you actually care about — activation, habit formation, trial-to-paid, retention, Super and Max engagement; and then, only after we agree on the value story, spend the back half on renewal planning and any expansion areas that are worth your team’s time. Devon’s here with me for the data quality and workflow pieces. Does that agenda still work for everyone?
- SM
Sofia Morales
Buyer
Yep, works for me. I’m Sofia, I lead growth product — so I’m especially listening for what’s helping us move faster on activation, paywalls, and retention, not just dashboard usage.
- EK
Ethan Kim
Buyer
Ethan here, data platform and analytics. I’m mainly looking at metric consistency, taxonomy health, and whether the QBR lines up with our internal source of truth.
- DP
Devon Park
Seller
Hey all, Devon Park on the solutions side. I’ll keep us honest on instrumentation, governance, and what would actually change operationally for your teams.
- MC
Mara Chen
Seller
Great, thanks. Let me start with the value readout, and please pressure-test the assumptions.
- MC
Mara Chen
Seller
So, in the workbook we pulled — and Ethan, I’ll caveat this as Amplitude-side usage and analysis patterns, not your financial source of truth — we’re seeing three clusters of value. One is growth product teams coming back to onboarding and early habit dashboards: first session completion, lesson start-to-finish, streak continuation, those kinds of views. Second is subscription and lifecycle work, especially funnel and cohort analysis around paywall exposure, trial starts, and retention behavior after someone engages with Super or Max-related surfaces. And then third is the data org usage: saved cohorts, shared definitions, and a handful of executive-facing dashboards that look like they’re being reused rather than rebuilt every week. The pattern we prepared is that Amplitude is less about “more dashboards” at this point and more about faster answers to, like, where are learners dropping, which segments behave differently, and did a paywall or onboarding change move the right leading indicator without hurting engagement. Before I go any further, Sofia, does that match how your teams experience it, or would you frame the value differently?
- SM
Sofia Morales
Buyer
Yeah, that’s directionally right. The places my PMs cite Amplitude most are onboarding drop-off, streak behavior in the first week, and paywall test readouts — especially when we need an answer same day versus waiting on a custom analysis. I’d sharpen the value story a bit, though: execs won’t care that a dashboard was reused; they’ll care that we made faster calls on activation and trial conversion with less debate.
- MC
Mara Chen
Seller
That’s fair — and I’d use your wording, honestly: faster calls on activation and trial conversion with less metric debate. Ethan, I’d love to sanity-check that against your view before we move on: where do you still see definition drift or trust issues getting in the way?
- EK
Ethan Kim
Buyer
Yeah, mostly. The trust issue isn’t that people don’t use Amplitude — they do. It’s that activation or “retained learner” can mean slightly different things depending on whether growth, lifecycle, or finance is presenting it. We still reconcile against our warehouse for exec readouts. So if the renewal story is faster decisions, I’m with you, but I’d want the next phase to include tighter governed definitions and a validation loop back to our source-of-truth metrics.
- DP
Devon Park
Seller
Yep, that’s exactly the right next layer. Operationally, I’d separate two things: Amplitude remains the fast exploration layer for PMs, but the definitions for activation, retained learner, trial conversion, et cetera need an agreed owner and a warehouse validation check before they show up in exec reporting. We can map where those definitions diverge today and make that part of the renewal plan, not a side project.
- EK
Ethan Kim
Buyer
That distinction helps. I’d want to see the divergence map before we bless anything for exec reporting, but conceptually, yes, that’s the right lane.
- MC
Mara Chen
Seller
Good, we’ll put the divergence map in the plan. Sofia, before we talk any expansion paths, which two outcomes should we optimize the renewal story around — activation, trial conversion, subscriber retention, Max feature adoption?
- SM
Sofia Morales
Buyer
For my side, I’d anchor it on activation and trial conversion. Subscriber retention matters, obviously, but the work my team can sponsor this quarter is first-week habit formation and paywall or trial flows. Max adoption is important too, but I’d treat that as a segment in the analysis rather than the headline renewal case.
- MC
Mara Chen
Seller
Perfect. Then I won’t make this a broad “all the things Amplitude can do” conversation. If activation and trial conversion are the anchors, I see two plausible next-term workstreams: one, an Experiment pilot around first-week habit and paywall or trial flows, with guardrails for lesson completion and engagement; and two, the governance track Ethan just described so those test readouts don’t turn into a metrics argument two weeks later. Sofia, from your side, would that combination be defensible, or would you rather separate experimentation from the renewal case and keep this mostly on analytics plus governance?
- SM
Sofia Morales
Buyer
I think the combination is defensible, as long as the Experiment piece is very scoped. One onboarding or paywall pilot, not a platform rollout. If we can show faster test readouts and fewer definition debates, I can take that upstairs.
- DP
Devon Park
Seller
Yeah, scoped is the right word. Operationally, I’d define that pilot as one surface, one primary metric, and two guardrails. So, for example: first-week activation or trial start as the primary, then lesson completion and streak continuation as guardrails so we’re not optimizing the paywall in a way that hurts learning engagement. We’d also pre-agree the warehouse validation step with Ethan’s team before the readout.
- EK
Ethan Kim
Buyer
That works for me, assuming my team gets to sanity-check the event definitions before the pilot starts, not after results are in.
- DP
Devon Park
Seller
Absolutely — pre-launch, not post-hoc. We’ll make that a gate in the pilot checklist.
- EK
Ethan Kim
Buyer
One thing I want to be explicit about: we do have internal experimentation and warehouse reporting in the mix already. So for me the bar isn’t “can Amplitude run a test,” it’s whether this reduces the cycle time for product teams without creating a second source of truth.
- DP
Devon Park
Seller
Totally fair. The pilot should not create a new truth layer. We’d treat Amplitude as the product team’s workflow for setup, segmentation, and readout, but the canonical metric definitions and final validation stay aligned to your warehouse. If we can’t show cycle-time reduction on that basis, then it’s not a good expansion case.
- SM
Sofia Morales
Buyer
That’s the right bar. For growth, I’d want to baseline how long it takes today from test idea to trusted readout, and then see if the pilot actually cuts that down. If it’s just a prettier dashboard, that won’t be enough.
- MC
Mara Chen
Seller
That’s exactly the success criterion I’d put in the renewal case: not “more dashboards,” but shorter path from idea to trusted decision. Let’s capture baseline cycle time, target reduction, and the two guardrails Devon mentioned. Then we can decide if the Experiment pilot earns broader rollout.
- EK
Ethan Kim
Buyer
Yep. And I’d want that baseline taken from our actual current workflow, not a vendor estimate.
- DP
Devon Park
Seller
Agreed. We’ll use your current workflow data — idea intake, launch approval, readout, and sign-off timestamps — and I’ll work with whoever Ethan names to map that before we touch pilot setup.
- EK
Ethan Kim
Buyer
Okay. I’ll put Priya from analytics on that. She owns the experiment metrics layer today, so she can pull the timestamps and flag any taxonomy weirdness before we scope the pilot.
- MC
Mara Chen
Seller
Perfect, thank you. So Priya owns baseline and taxonomy validation on Ethan’s side. Sofia, on the growth side, would you want the pilot anchored on onboarding activation, paywall tests, or one Super/Max flow first?
- SM
Sofia Morales
Buyer
I’d start with paywall tests, specifically trial-to-paid for high-intent learners, and use onboarding activation as the guardrail. Super/Max is important, but I don’t want the pilot to sprawl. If we can prove faster readouts there, it’s much easier for me to defend broader rollout.
- MC
Mara Chen
Seller
Great, that’s clean. Let’s make the pilot paywall trial-to-paid for high-intent learners, with onboarding activation as the guardrail and cycle time as the operating metric.
- EK
Ethan Kim
Buyer
That scope works for me. I’d just add one non-negotiable: we agree upfront which subscription metrics are source-of-truth versus exploratory, so the readout doesn’t become another metrics debate.
- DP
Devon Park
Seller
Yep, totally fair. Operationally, we’ll tag the pilot metrics in three buckets: source-of-truth subscription metrics, experiment decision metrics, and exploratory diagnostics. Priya and I can draft that before the scoping session so we’re not debating definitions in the readout.
- SM
Sofia Morales
Buyer
That makes sense. One practical question before we lock next steps: is this pilot treated as part of renewal, or is Experiment a separate commercial add-on with its own seat count?
- MC
Mara Chen
Seller
Yeah, fair question. I don’t want to guess on packaging live because it depends on seat mix and whether we structure it as a time-boxed Experiment pilot or bundle it into the renewal order form. I’ll own that with our commercial team and send two options by Friday: a pilot-only path and a renewal-plus-pilot path, both tied back to the success criteria we just defined.
- SM
Sofia Morales
Buyer
Okay, that’s fine. If you can include the seat assumptions and what changes at rollout versus pilot, I can bring it into our Monday growth review.
- MC
Mara Chen
Seller
Yep, I’ll include seat assumptions, pilot versus rollout treatment, and the two commercial paths in the Friday note. I’ll also attach the one-page value summary from today so your Monday review has the “why now” and not just pricing mechanics.
- EK
Ethan Kim
Buyer
Good. On our side, I can have analytics validate the metric buckets by Wednesday, and Sofia can bring the pilot framing Monday. Procurement-wise, assume we need a stakeholder readout the week after that if this is going into the renewal package.
- MC
Mara Chen
Seller
Perfect. Then we’ve got owners: Ethan on metric validation Wednesday, Sofia for Monday growth review, Devon for the metric-bucket draft, and me for the Friday value summary plus commercial options. I’ll propose two readout slots for the following week in the same note.
- SM
Sofia Morales
Buyer
Works for me. Mara, send it to both of us, and I’ll make sure the growth review has the right context Monday.
- MC
Mara Chen
Seller
Will do. Thanks both — this was really helpful. I’ll get the note out Friday, and Devon will send the metric draft ahead of Ethan’s Wednesday validation. We’ll keep it tight for Monday.
- EK
Ethan Kim
Buyer
Thanks, everyone. Friday note works — we’ll validate on our side and keep the stakeholder readout penciled in.
- DP
Devon Park
Seller
Thanks, everyone — I’ll send the metric-bucket draft tomorrow so Ethan has time before Wednesday. Have a good one.
How each model scored this call
Click a row to read the model's coaching note and the judge's read on it.
195gpt-5.5 mediumBestStrongly aligned with the hidden benchmark
The coach accurately recognized this as an excellent incumbent renewal QBR with strong expansion momentum. It identified the core strengths: value before expansion, Duolingo-specific executive alignment, buyer validation, source-of-truth handling, a scoped Experiment pilot, and a crisp mutual action plan. It also correctly treated the commercial/packaging deferral as a low-severity issue. The coaching is well grounded in the transcript with no material hallucinations. The only slight calibration issue is that it gives somewhat more weight to missing quantified QBR proof points and extra decision-process discovery than the hidden benchmark requires, but those are reasonable coaching improvements and do not distort the overall assessment.
- Correctly labeled the overall call as excellent and positive, matching the benchmark’s intended outcome bias.
- Accurately identified the strongest sales motion: value validation before expansion discussion.
- Strong transcript grounding throughout, with relevant quotes from Mara, Sofia, Ethan, and Devon.
- Correctly recognized Devon’s source-of-truth framing as a trust-building move rather than a weakness.
- Correctly praised the scoped Experiment pilot as outcome-tied and buyer-validated.
- Correctly treated the packaging deferral as low severity because Mara assigned a concrete owner and deadline.
- No major hidden benchmark misses. The coach found all five benchmark needles.
- The coach slightly overemphasized the need for more quantified QBR proof points. That is a reasonable improvement, but the hidden guidance explicitly says exact real-world metrics should not be required.
- The coach introduced a few additional missed opportunities, such as deeper executive approval discovery and more competitive/internal tooling exploration. These are transcript-supported and useful, but they are not central benchmark requirements.
295gpt-5.5 xhighStrong match / excellent judging by the coach
The coach output aligns very closely with the hidden ground truth. It correctly recognizes the call as an excellent incumbent renewal QBR with positive renewal-and-expansion momentum, not a closed deal. It identifies the key strengths: value-first sequencing, Duolingo-specific executive alignment, buyer-authored expansion prioritization, strong technical governance handling, and a crisp mutual action plan. It also correctly treats the packaging/pricing deferral as a minor, well-managed commercial follow-up rather than a serious flaw. The critique is mostly grounded and useful, though it slightly over-indexes on additional quantification and procurement-process gaps relative to a call that already satisfied the benchmark extremely well.
- Correctly rated the call as excellent and positive rather than forcing artificial criticism.
- Accurately identified the value-before-expansion sequencing as the core reason the expansion conversation felt earned.
- Strongly grounded praise in transcript evidence, especially Mara’s QBR caveat, Sofia’s executive-language refinement, and Mara’s adoption of that language.
- Captured the Duolingo-specific executive alignment around activation, habit formation, paywalls, trial-to-paid, retention, Super/Max, and metric confidence.
- Recognized the buyer-authored nature of the expansion plan: Sofia and Ethan shaped priorities, constraints, success criteria, and validation requirements.
- Correctly praised Devon’s technical handling of governance, warehouse validation, metric buckets, and avoiding a second source of truth.
- Correctly treated the commercial packaging deferral as a small, well-managed imperfection with a concrete Friday follow-up.
- Identified the mutual action plan with named owners and dates as a strong close.
- No major hidden-ground-truth misses. The coach found all five benchmark needles.
- The coach could have more explicitly stated that the call outcome is positive renewal and expansion momentum but not a closed deal, though this is implied throughout.
- The coach’s improvement areas are useful but somewhat more demanding than the benchmark requires, particularly around quantified proof and procurement mapping.
394gpt-5.4 noneExcellent coaching output; it correctly recognized the call as a strong incumbent renewal/QBR with earned expansion momentum and only minor, well-grounded coaching opportunities.
The coach closely matched the hidden ground truth. It identified the core strengths: value-before-expansion sequencing, Duolingo-specific business alignment, buyer-authored prioritization, disciplined handling of data-trust concerns, scoped Experiment expansion, and a crisp mutual action plan. It also correctly treated the packaging deferral as a low-severity commercial follow-up rather than a major flaw. The output is well grounded in transcript evidence and does not materially invent claims. Its improvement areas—more quantified business case, sharper differentiation versus internal tooling, clearer commercial decision criteria, and stakeholder mapping—are reasonable, though slightly more critical than the benchmark requires for an excellent call.
- Correctly identified the call’s core sequencing: QBR value validation first, then buyer priority selection, then scoped expansion, then mutual action plan.
- Strongly grounded the assessment in transcript evidence, including direct quotes from Mara, Sofia, Ethan, and Devon.
- Correctly praised the seller’s handling of Ethan’s source-of-truth and governance concerns by separating exploratory analytics from canonical warehouse validation.
- Correctly recognized Sofia’s buyer-authored phrase—faster calls on activation and trial conversion with less debate—as the central renewal value message.
- Correctly treated the Experiment expansion as a scoped pilot with success criteria rather than a broad platform upsell.
- Appropriately classified the packaging deferral as a low-severity coaching point with a concrete follow-up.
- No major hidden-needle miss. The coach covered all five benchmark needles.
- The coach could have more explicitly praised the seller’s careful qualification of QBR findings as Amplitude-side patterns rather than unsupported Duolingo performance facts.
- The coach’s recommendation to quantify more value is reasonable, but it should be balanced with the benchmark caution not to invent precise account metrics without buyer validation.
- The coach gave less emphasis to Super/Max or AI-feature adoption than the hidden ground truth included, although this was not central to the final buyer-selected pilot.
494gpt-5.4 lowExcellent coaching output with only minor over-coaching.
The coach correctly recognized the call as a strong incumbent renewal QBR with positive renewal and expansion momentum. It identified the core benchmark strengths: value before expansion, Duolingo-specific executive alignment, buyer-authored prioritization, strong governance/source-of-truth handling, and a crisp mutual action plan. The evidence is largely transcript-grounded and the recommendations are actionable. The main imperfection is that the coach somewhat over-emphasized lack of live quantification as a high-severity risk and treated the packaging deferral as more coachable than the benchmark would; hidden ground truth frames pricing/package deferral as only a minor acceptable imperfection when captured in next steps, which it was.
- Correctly praised the seller for leading with Duolingo-specific business outcomes before discussing renewal expansion.
- Accurately highlighted Mara’s use of buyer language after Sofia reframed the value around faster activation and trial-conversion decisions with less debate.
- Strongly identified Devon’s nuanced handling of source-of-truth, governance, and warehouse validation concerns.
- Correctly recognized the narrow, buyer-authored Experiment pilot as a low-risk expansion path.
- Fully captured the high-quality mutual action plan with named owners, deadlines, deliverables, and stakeholder-readout path.
- The coach slightly over-prioritized quantification as the main coaching opportunity, whereas the benchmark primarily views the call as already excellent and does not require precise metrics live.
- The coach treated commercial packaging deferral as a medium risk, while the benchmark treats it as a minor acceptable imperfection because Mara acknowledged it and assigned a dated follow-up.
- The coach could have more explicitly stated that the seller’s qualified language around Amplitude-side findings avoided unsupported metric claims, although it did mention avoiding overclaiming.
594gpt-5.5 lowExcellent match to ground truth
The coach correctly recognized this as a strong incumbent renewal QBR with positive renewal-and-expansion momentum. It captured the major benchmark strengths: value before expansion, Duolingo-specific executive alignment, buyer-authored prioritization, mature handling of source-of-truth concerns, and a concrete mutual action plan. The output is well grounded in the transcript and uses accurate evidence. Its main imperfection is that it slightly over-emphasizes quantification/commercial-discovery gaps as medium coaching risks, whereas the hidden benchmark frames the call as broadly excellent with only a minor commercial-packaging deferral.
- Correctly identified that the seller earned the right to expansion by validating current value first rather than starting with a renewal uplift or product pitch.
- Accurately praised the use of Duolingo-specific business language: activation, habit formation, trial-to-paid, paywalls, Super/Max, retention, and metric debate reduction.
- Strongly captured the source-of-truth handling: Amplitude as fast exploration/workflow layer while warehouse-aligned definitions remain canonical for executive reporting.
- Correctly recognized the buyer-authored expansion motion, especially Sofia narrowing the scope to a paywall trial-to-paid pilot and Ethan defining the source-of-truth bar.
- Very strong assessment of the mutual action plan with named owners, dates, deliverables, success criteria, and stakeholder-readout momentum.
- The coach slightly over-weighted the lack of current quantified baselines as a medium risk. The transcript shows the team agreed to baseline cycle time using Duolingo’s actual workflow, so this is more of a refinement than a meaningful flaw.
- The commercial-packaging issue was correctly identified, but the coach’s recommendation to go deeper on budget ownership and approval criteria somewhat expands beyond the hidden benchmark’s intended minor imperfection.
- The coach could have more explicitly celebrated that Mara appropriately caveated Amplitude-side findings as not Duolingo’s financial source of truth, which is an important part of avoiding unsupported metric claims.
693gpt-5.4 mediumCoach output is highly aligned with the hidden benchmark. It correctly treats the call as an excellent incumbent renewal/QBR, identifies the major strengths around value-before-expansion, Duolingo-specific executive alignment, buyer-authored expansion scoping, and a strong mutual action plan. It also correctly flags the packaging deferral as acceptable but needing crisp follow-up. The main imperfection is that the coach slightly over-emphasizes quantification/commercial qualification gaps relative to the ground truth, but those comments are transcript-grounded and do not distort the overall positive assessment.
The coach gave a strong, well-grounded evaluation. It praised the seller team for sequencing the meeting properly, using Duolingo-specific business language, handling Ethan’s source-of-truth concerns credibly, narrowing expansion to a scoped Experiment pilot, and closing with owners and dates. It also recognized the late packaging question and Mara’s disciplined deferral. There are no major hallucinations or unsupported claims. The coach’s risks are mostly fair, though somewhat more critical than the hidden profile requires, especially around the lack of quantified QBR metrics and the idea that the expansion case depends heavily on future pilot proof.
- Correctly praised the agenda structure: QBR value first, expansion only after value validation.
- Correctly recognized buyer-authored language around 'faster calls on activation and trial conversion with less debate' as central to the renewal case.
- Correctly identified Devon’s handling of Ethan’s source-of-truth concern as a major credibility move.
- Correctly praised the scoped Experiment pilot: one surface, one primary metric, guardrails, warehouse validation, and cycle-time reduction.
- Correctly highlighted the strong mutual action plan with named owners, dates, deliverables, and stakeholder readout path.
- No major hidden-ground-truth miss. The coach found all five benchmark needles.
- The coach could have more explicitly stated that the packaging deferral was only a minor imperfection in an otherwise excellent call, rather than grouping it into a medium commercial qualification risk.
- The coach could have been slightly more generous on the QBR value story, since the seller appropriately avoided unsupported precise metrics and got buyer validation of the value narrative.
793gpt-5.5 noneStrongly aligned with the ground truth; excellent coaching evaluation with minor over-critique.
The coach correctly recognized this as a highly effective incumbent renewal QBR: value was established before expansion, Duolingo-specific executive priorities were used throughout, the Experiment expansion was buyer-authored and tightly scoped, and the call closed with named owners, dates, and follow-up deliverables. The coach also caught the minor commercial-packaging deferral. The main limitations are that the coach slightly overemphasized the need for more quantification and commercial qualification, and one critique understated the qualitative adoption/value evidence that was actually present in the transcript.
- Correctly judged the call as highly effective rather than forcing artificial criticism.
- Accurately identified the value-before-expansion sequence as a major strength.
- Strongly captured Duolingo-specific executive alignment around activation, trial conversion, retention, habit formation, Super/Max, and metric confidence.
- Recognized the buyer-authored expansion motion and the disciplined narrowing to a scoped Experiment pilot.
- Well-grounded praise for handling Ethan’s source-of-truth and governance concerns without positioning Amplitude as a competing truth layer.
- Correctly called out the mutual action plan with named owners, dates, deliverables, and buyer-side responsibilities.
- The coach mildly over-weighted the commercial-packaging deferral; the transcript shows disciplined deferral with a clear Friday follow-up, which the ground truth treats as only a minor imperfection.
- The critique about needing a more quantified QBR scorecard is useful, but the coach under-credited the qualitative adoption and workflow-specific value evidence already present.
- The coach could have more explicitly stated that unsupported precise metrics would have been a negative, and that the seller’s qualified language was exactly the right approach.
893gpt-5.4 highStrong pass
The coach output is highly aligned with the hidden benchmark. It correctly recognizes the call as an excellent incumbent renewal/QBR motion, praises the seller for sequencing current value before expansion, highlights Duolingo-specific executive alignment, identifies buyer-authored expansion planning, and gives strong credit for the mutual action plan. It also catches the late packaging deferral and treats it as disciplined rather than a major flaw. The main imperfection in the coaching is a slight overemphasis on quantification and approval-process gaps relative to the benchmark’s mostly-positive profile, but those points are transcript-grounded and commercially reasonable.
- Correctly praised the seller’s sequencing: QBR value validation first, expansion planning second.
- Accurately identified the use of buyer language around faster decisions, activation, trial conversion, and less metric debate.
- Strongly captured Devon’s technical credibility around source-of-truth governance and warehouse validation.
- Correctly recognized the buyer-authored, tightly scoped Experiment pilot with guardrails and cycle-time success criteria.
- Excellent recognition of the mutual action plan with named owners, dates, deliverables, and stakeholder readout path.
- The coach did not fully frame the packaging deferral as a minor imperfection; it treated it mostly as a strength.
- The coach somewhat underplayed that the hidden benchmark views the overall call as excellent with only minor imperfection, by making quantification and approval mapping feel like larger gaps than necessary.
- The coach could have more explicitly credited Duolingo-specific Super/Max or AI-feature context, although it captured the broader subscription and growth priorities well.
993gpt-5.5 highPass — highly aligned with the hidden ground truth
The coach correctly recognized the call as an excellent incumbent renewal QBR with positive renewal and expansion momentum. It identified all major benchmark strengths: value before expansion, Duolingo-specific executive alignment, buyer-authored prioritization, a concrete mutual action plan, and the minor commercial packaging deferral. The output is well grounded in transcript evidence and offers practical coaching. The only modest issue is that it slightly over-emphasizes quantification/commercial qualification gaps as medium risks in a call the benchmark treats as already excellent, but those observations are still transcript-supported and do not materially distort the assessment.
- Correctly characterized the call as a high-performing, consultative renewal QBR rather than hunting for artificial negatives.
- Strongly identified the value-before-expansion sequence and supported it with the opening agenda and QBR validation evidence.
- Accurately praised the seller’s handling of Ethan’s source-of-truth and governance concerns, including the distinction between Amplitude as exploration workflow and warehouse-validated metrics for executive reporting.
- Correctly recognized that the Experiment expansion was scoped, buyer-led, and tied to trial-to-paid/paywall priorities rather than a broad platform pitch.
- Well-grounded praise for the mutual action plan, including named owners, Wednesday/Friday/Monday deadlines, and stakeholder-readout planning.
- No material hidden-ground-truth misses. All five benchmark needles were identified at least substantially.
- The coach slightly over-indexed on improvement areas such as quantification, approval mapping, and stakeholder-readout shaping. These are useful and transcript-grounded, but the benchmark frames the call as excellent with only a minor packaging deferral.
- The coach could have more explicitly stated that the outcome is positive renewal and expansion momentum without implying the deal is closed, though this is mostly present in its overall assessment.
1093opus 4.7 mediumexcellent
The coach output is highly aligned with the hidden benchmark. It correctly recognizes the call as a strong incumbent renewal QBR: value was established before expansion, Duolingo-specific business outcomes drove the discussion, the buyer co-authored the Experiment pilot scope, governance concerns were handled credibly, and the call closed with a real mutual action plan. The coach also correctly identifies the late packaging deferral as a manageable commercial-readiness gap. The only notable calibration issue is that the coach somewhat over-prioritizes commercial packaging and lack of quantified value as medium risks, whereas the benchmark treats the packaging deferral as a minor imperfection in an otherwise excellent call.
- Correctly characterizes the call as a strong, disciplined incumbent renewal QBR rather than forcing unnecessary criticism.
- Accurately highlights buyer-led value framing, especially Mara adopting Sofia’s executive language about faster activation and trial-conversion decisions with less debate.
- Identifies the Experiment expansion as properly scoped and buyer-authored: paywall trial-to-paid for high-intent learners, with onboarding activation and cycle time as success constraints.
- Strongly captures Devon’s technical/governance credibility around warehouse validation, avoiding a second source of truth, and metric-bucket definitions.
- Correctly praises the mutual action plan with named owners, dates, and a stakeholder/procurement path.
- Slightly over-weighted the commercial packaging deferral as a medium risk and top coaching priority; the benchmark treats it as a minor, acceptable deferral because Mara assigned a clear Friday follow-up.
- The coach could have more explicitly called out the seller’s Duolingo-specific executive preparation around Super/Max, AI-feature adoption, learner engagement, and subscriber retention, though it did reference several of these indirectly.
- The critique about lack of quantified value is fair but a bit strong for this benchmark; the call appropriately used qualified QBR patterns and buyer validation rather than unsupported hard metrics.
1193deepseek v4 proStrong pass: the coach accurately recognized this as an excellent incumbent QBR/renewal-expansion call and captured nearly all benchmark strengths.
The coach output is highly aligned with the hidden ground truth. It correctly praised the seller for sequencing value validation before expansion, using Duolingo-specific business language, co-creating a scoped Experiment pilot, addressing Ethan’s governance concerns, and closing with a concrete mutual action plan. It also correctly noticed the late commercial/packaging deferral and treated it as transparent, manageable follow-up rather than a serious flaw. Minor issues: the coach slightly over-indexed on additional quantification and resource-dependency coaching that were not central benchmark gaps, and it did not explicitly emphasize the seller’s careful caveating of Amplitude-side usage data versus Duolingo’s financial/source-of-truth metrics.
- Correctly identified the call as exceptional rather than manufacturing major criticism.
- Accurately praised the seller’s sequencing: value validation first, buyer priority confirmation second, scoped expansion third, mutual action plan last.
- Strongly captured Ethan’s metric-consistency concern and Devon’s governance response as a key technical and political unlock.
- Recognized that the Experiment pilot was scoped around business outcomes and operating metrics, not positioned as a broad platform rollout.
- Precisely cited the final mutual action plan with named owners, dates, deliverables, and stakeholder readout path.
- Did not explicitly highlight Mara’s careful caveat that Amplitude-side usage patterns were not Duolingo’s financial source of truth, which was an important credibility move in the benchmark.
- Slightly underplayed the Duolingo-specific Super/Max or AI-feature-adoption language, though the buyer ultimately deprioritized Max as the headline case.
- Added low-priority coaching around Priya’s availability and earlier commercial discovery; both are plausible but not central to the benchmark’s main evaluation criteria.
1293gpt-5.4 xhighStrong pass
The coach output is well aligned to the hidden ground truth. It correctly recognizes the call as an excellent incumbent renewal QBR, identifies the value-before-expansion sequencing, Duolingo-specific executive alignment, credible handling of metric governance/source-of-truth concerns, buyer-authored Experiment pilot scoping, and a very strong mutual action plan. It also correctly treats the packaging/pricing deferral as acceptable commercial discipline with a follow-up rather than a serious flaw. The coaching is transcript-grounded and commercially sensible. The only slight caveat is that it adds several improvement areas—quantified proof, stakeholder mapping, current workflow discovery—that are valid but could be weighted a bit heavily for a call the benchmark views as already excellent.
- Correctly assessed the overall call as high-quality and deal-advancing rather than forcing artificial negativity.
- Accurately identified the value-before-expansion sequencing as a major strength.
- Strongly captured the source-of-truth and governance handling with Ethan, including the exploration layer versus canonical warehouse framing.
- Correctly praised the buyer-authored, tightly scoped Experiment pilot with primary metric, guardrails, baseline cycle time, and pre-launch validation.
- Correctly recognized the close as a strong mutual action plan with named owners, deadlines, and buyer-side commitments.
- Handled the packaging deferral with the right nuance: acceptable not to guess live, but follow up with concrete commercial options.
- The coach could have more explicitly named the buyer-authored prioritization behavior as a strategic strength, not just the resulting pilot design.
- The coach did not heavily emphasize the Duolingo-specific Super/Max/AI-feature context, although this is a minor miss because the buyer deliberately chose trial-to-paid and activation as the anchor.
- The improvement plan is somewhat extensive for a benchmark-excellent call; the coach’s risks are mostly valid, but the weighting could be slightly more celebratory and less remedial.
1393opus 4.7 maxExcellent coach output with one calibration issue
The coach correctly recognized the call as a strong renewal QBR and captured all major hidden benchmark themes: value before expansion, Duolingo-specific executive alignment, buyer-authored expansion scoping, operational rigor around Experiment/governance, and a concrete mutual action plan. The evaluation is well grounded in transcript evidence and appropriately positive overall. The main weakness is that the coach slightly overstates the packaging deferral as lacking any directional commercial frame, even though Mara did name the two likely structures and committed to Friday options. That should be treated as a minor imperfection, not a meaningful sales risk.
- Correctly labeled the call as a strong, executive-grade renewal QBR rather than searching for artificial negatives.
- Accurately identified the value-before-expansion sequence and the buyer validation from Sofia before moving into Experiment.
- Captured the strongest executive-value phrase: faster calls on activation and trial conversion with less metric debate.
- Recognized Devon’s operational rigor around one surface, one primary metric, guardrails, pre-launch validation, source-of-truth alignment, and metric buckets.
- Detailed the mutual action plan with real owners and dates, including Priya, Ethan, Sofia, Devon, Mara, Wednesday validation, Monday growth review, Friday note, and the following-week stakeholder readout.
- The coach slightly over-penalized the packaging moment by calling it medium severity and saying there was no directional frame, despite Mara naming the two likely commercial structures.
- Some additional missed opportunities, such as parking Q+1 expansion threads or commercially framing governance, are plausible but not central to the hidden benchmark and should not materially reduce the excellent rating.
- The coach could have more explicitly stated that the pricing/packaging issue is the benchmark’s intended minor imperfection and does not undermine the overall renewal/expansion momentum.
1492gemini 3.1 pro previewstrong pass
The coach accurately recognized this as an excellent incumbent renewal QBR with strong value validation, buyer-authored expansion planning, technical objection handling, and a concrete mutual action plan. It correctly caught the minor commercial-packaging deferral and did not manufacture major flaws. The main gaps are that it under-emphasized the early QBR value narrative/adoption evidence and slightly over-prioritized commercial ambiguity and executive stakeholder mapping relative to the hidden benchmark’s strongest excellence markers.
- Correctly judged the overall call as exceptionally strong rather than forcing unnecessary criticism.
- Accurately praised Mara’s mirroring of Sofia’s language: faster calls on activation and trial conversion with less metric debate.
- Strongly identified Devon’s technical handling of Ethan’s source-of-truth concern by positioning Amplitude as workflow/exploration rather than a competing truth layer.
- Correctly recognized the scoped Experiment pilot design with one surface, primary metric, guardrails, baseline cycle time, and warehouse validation.
- Correctly flagged the late packaging/commercial deferral as a minor improvement area with a concrete follow-up.
- The coach did not fully unpack the early QBR value readout: adoption across product, growth, data, lifecycle, and subscription teams; onboarding/streak/paywall workflows; and buyer validation before expansion.
- It under-mentioned some Duolingo-specific executive themes present in the transcript, including Super/Max engagement, learner habit formation, streak continuation, lesson completion, and AI-feature-adjacent measurement.
- It slightly over-prioritized commercial ambiguity and stakeholder mapping as coaching-plan items relative to the benchmark’s main lesson: this was an excellent buyer-authored renewal/expansion plan with only a minor packaging gap.
- It did not explicitly praise Mara’s careful qualification of Amplitude-side findings versus Duolingo’s source of truth, which was important to avoiding unsupported metric claims.
1590opus 4.7 highStrong pass: the coach output is well aligned with the hidden benchmark, with only mild over-coaching of the commercial-packaging deferral.
The coach correctly recognized the call as a strong incumbent renewal QBR: Amplitude led with value, used Duolingo-specific business language, invited buyer prioritization, co-authored a scoped Experiment/governance pilot, and closed with a concrete mutual action plan. The output is highly transcript-grounded and captures all four major strengths. Its main imperfection is that it treats the packaging deferral as a medium commercial-acumen risk, whereas the benchmark frames it as a minor, acceptable imperfection because Mara acknowledged the question and committed to concrete Friday follow-up options.
- Correctly framed the overall call as a strong consultative renewal motion rather than forcing unnecessary criticism.
- Accurately identified the seller’s value-before-expansion sequencing as a major strength.
- Highlighted the high-credibility handling of Ethan’s source-of-truth concern and Amplitude’s role as exploration/workflow layer rather than canonical metric layer.
- Captured the buyer-authored prioritization around activation, trial conversion, paywall tests, guardrails, and cycle-time reduction.
- Recognized the strong mutual action plan with named owners, dates, deliverables, and stakeholder-readout path.
- The coach somewhat over-penalized the commercial packaging deferral; the benchmark views it as a small, acceptable gap because it was acknowledged and converted into a concrete follow-up.
- The output adds several non-benchmark missed opportunities, such as internal experimentation probing, seat expansion, and Max/AI monetization. These are mostly grounded and useful, but they slightly dilute the benchmark’s core message that this was an excellent call.
- The coach could have stated more explicitly that the call outcome shows positive renewal and expansion momentum without implying the deal was closed.
1690opus 4.7 xhighStrong pass: the coach accurately recognized this as an excellent incumbent renewal QBR with value-first sequencing, Duolingo-specific executive alignment, buyer-authored expansion, and a crisp mutual action plan. The main weakness is slight over-coaching: the model treated the packaging deferral and a few optional additions as more material risks than the benchmark intended.
The coach captured all major hidden ground-truth strengths and used transcript-grounded evidence well. It correctly praised Mara and Devon for validating current value before discussing expansion, adopting Sofia’s executive framing, respecting Ethan’s warehouse/source-of-truth concerns, narrowing the Experiment pilot based on buyer priorities, and closing with named owners and dates. It also identified the minor packaging/pricing deferral, but somewhat over-weighted it as a medium/P1 risk even though the benchmark treats it as a small, well-handled imperfection. A few missed-opportunity claims, such as AI/Max measurement, peer reference proof, and governance commercialization, are plausible but less central and partly speculative relative to the transcript.
- Correctly identified the value-first, expansion-second structure as a major strength.
- Correctly highlighted the seller’s adoption of Sofia’s executive language: faster calls on activation and trial conversion with less metric debate.
- Strongly captured Devon’s technical credibility around warehouse validation, source-of-truth boundaries, divergence mapping, and metric buckets.
- Accurately recognized that the Experiment pilot became buyer-authored and tightly scoped rather than vendor-pushed.
- Accurately captured the final mutual action plan with named owners, dates, deliverables, and stakeholder readout path.
- Slightly over-penalized the commercial packaging deferral despite the seller handling it cleanly with a specific follow-up.
- Introduced some generic missed opportunities, such as peer references and governance commercialization, that are not central to the benchmark.
- The AI/Max critique underweights the fact that Sofia intentionally deprioritized Max to prevent pilot sprawl.
- The category scores of 7 for value/commercial and 6 for executive orchestration are somewhat harsher than the hidden ground truth’s “excellent” profile warrants.
1788opus 4.7 lowpass_strong
The coach output is highly aligned with the hidden ground truth. It correctly recognizes that this was a strong renewal QBR: value was established before expansion, Duolingo-specific priorities were used, expansion was buyer-authored, the Experiment pilot was tightly scoped, source-of-truth concerns were handled well, and the call ended with a concrete mutual action plan. The main weakness is calibration: the coach slightly under-rates an excellent call as merely “above-average” and over-weights the packaging deferral and lack of quantified impact as medium risks, even though the benchmark treats packaging deferral as a minor imperfection and encourages avoiding unsupported precise metrics.
- Correctly recognized the value-before-expansion sequence and praised Mara for validating the QBR story before introducing Experiment.
- Accurately highlighted the executive value reframing from dashboard usage to faster activation/trial-conversion decisions with less metric debate.
- Strongly identified the buyer-authored pilot scope: paywall trial-to-paid for high-intent learners, onboarding activation as guardrail, and cycle-time reduction as the operating metric.
- Well-grounded technical praise for Devon’s handling of source-of-truth and warehouse validation concerns.
- Correctly noted the strong mutual action plan with owners, dates, deliverables, and stakeholder/procurement path.
- The coach under-called the overall call quality by describing it as “above-average” rather than clearly excellent, despite matching nearly all excellence markers.
- It over-weighted commercial packaging deferral, which the benchmark treats as a small and well-managed imperfection.
- It over-emphasized the absence of quantified impact, potentially encouraging unsupported metric claims when the seller appropriately deferred quantification to buyer-validated baselining.
- Some low-priority missed opportunities around retention and Max/AI are plausible but could conflict with the buyer’s explicit request to keep the pilot tightly scoped.
1888sonnet 4.6WorstStrong evaluation with a few over-coaching / false-positive risks
The coach correctly recognized the call as an excellent incumbent renewal QBR and captured nearly all of the benchmark strengths: value before expansion, Duolingo-specific executive alignment, buyer-authored expansion prioritization, and a crisp mutual action plan. Its evidence is mostly transcript-grounded and its sales instincts are strong. The main weakness is that it overstates some risks that the hidden benchmark treats as refinements at most, especially the need for quantified ROI, competitive risk from internal experimentation, and peer benchmarks. It also treats the packaging deferral mostly as a strength rather than the minor imperfection identified in the benchmark, though it does recognize the behavior and the concrete follow-up.
- Correctly rated the call as excellent rather than forcing negative feedback.
- Accurately identified the disciplined sequence: QBR value validation before expansion discussion.
- Strongly captured buyer-authored prioritization around activation, trial conversion, and a scoped Experiment pilot.
- Highlighted Devon’s technically credible handling of governance, source-of-truth, metric buckets, and guardrails.
- Correctly praised the mutual action plan with named owners, dates, success criteria, and stakeholder readout.
- Used direct transcript quotes effectively and generally tied claims to real evidence.
- Did not fully align with the benchmark’s treatment of packaging deferral as the main minor imperfection; it framed the deferral almost entirely as a strength.
- Over-indexed on quantified ROI as a gap despite the benchmark’s caution against unsupported account-specific metrics.
- Elevated internal experimentation to a high-severity competitive risk even though the seller team addressed it well and converted it into clear pilot success criteria.
- Added optional coaching ideas, such as peer benchmarks and Max/AI future hooks, that are not wrong but are less central than the hidden benchmark priorities.