A shared evaluation framework for Commercial Journeys, covering the full Journey lifecycle: Recommendation Quality (pre-click) and Output Quality (post-click).
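Every metric is evaluated along two independent dimensions: the judgment type (how a single Journey is judged) and the tolerance (how many failures are allowed across a batch). Both must be defined for each metric.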
How do we judge a single Journey on this metric?
| Type | Judgment | Meaning |
|---|---|---|
| 🔴 Pure Gate | Pass / Fail | Binary. The Journey either meets the bar or it doesn’t. No partial credit. |
| 🟡 Has Gate Threshold | Pass / Fail with two severity levels | Same metric measures two kinds of failure: severe (gate) and mild (quality). Each level has its own tolerance. |
| 🟢 Pure Quality | 1–5 Score | Graded on a spectrum. No removal — only quality improvement. |
Across a batch of Journeys, how many failures do we allow?
| Tolerance | Definition | When to use |
|---|---|---|
| Zero Tolerance | 100% must pass. A single failure blocks release. | Safety, compliance, privacy — any failure is a trust catastrophe. |
| Partial Tolerance | ≤ X% may fail (or ≥ Y% must pass). Defined per metric. | Most metrics — real-world signals are noisy. |
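One way to keep both dimensions explicit is to record them side by side for each metric. The sketch below is illustrative only: the names `JudgmentType`, `MetricSpec`, `gate_max_fail_rate`, and `quality_min_pass_rate` are hypothetical and not defined by the framework, and the example values are placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JudgmentType(Enum):
    PURE_GATE = "pure_gate"            # pass / fail only
    HAS_GATE_THRESHOLD = "has_gate"    # severe (gate) and mild (quality) failures
    PURE_QUALITY = "pure_quality"      # 1-5 score only


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical record tying one metric to both dimensions."""
    name: str
    judgment: JudgmentType
    # Max share of Journeys allowed to fail the gate (0.0 means Zero Tolerance).
    gate_max_fail_rate: Optional[float] = None
    # Min share of Journeys that must score >= 4 on the 1-5 scale.
    quality_min_pass_rate: Optional[float] = None

    def validate(self) -> None:
        """Check that the tolerances present match the judgment type."""
        needs_gate = self.judgment in (
            JudgmentType.PURE_GATE, JudgmentType.HAS_GATE_THRESHOLD)
        needs_quality = self.judgment in (
            JudgmentType.HAS_GATE_THRESHOLD, JudgmentType.PURE_QUALITY)
        if needs_gate != (self.gate_max_fail_rate is not None):
            raise ValueError(f"{self.name}: gate tolerance does not match judgment type")
        if needs_quality != (self.quality_min_pass_rate is not None):
            raise ValueError(f"{self.name}: quality tolerance does not match judgment type")


# Placeholder metrics; names and numbers are illustrative.
specs = [
    MetricSpec("example_pure_gate", JudgmentType.PURE_GATE, gate_max_fail_rate=0.0),
    MetricSpec("example_has_gate", JudgmentType.HAS_GATE_THRESHOLD,
               gate_max_fail_rate=0.0, quality_min_pass_rate=0.75),
    MetricSpec("example_pure_quality", JudgmentType.PURE_QUALITY,
               quality_min_pass_rate=0.80),
]
for spec in specs:
    spec.validate()
```

A metric with a missing or extra tolerance raises `ValueError`, which mirrors the rule that both dimensions must be defined.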
Evaluates whether the system recommends the right Journeys, in the right order, with clear and honest presentation.
4 categories, 13 metrics. Evaluate whether each individual Journey is compliant, safe, eligible, correctly understood, and clearly presented.
Using out-of-scope data (tenant boundary, retention, consent, permission) is a compliance violation that can expose Microsoft to legal liability.
Any Journey that references data outside the user’s permitted scope (wrong tenant, expired retention, no consent, higher permission tier).
Zero Tolerance. 100% compliance rate. A single violation blocks release.
Exposing sensitive content on a visible card layer is a trust catastrophe and compliance incident.
Any instance where the card layer surfaces sensitive content (PII, health, financial, HR, legal, credentials).
Zero Tolerance. 100% block rate on the sensitive-tagged negative (NEG) test set.
A Journey for non-actionable information has zero user value and trains the user to ignore the feature.
Journey is generated from a non-task signal: mass email, notification, FYI-only item, or background noise.
Partial Tolerance. Non-task rate ≤ 2% of all generated Journeys.
Showing a Journey for a completed/cancelled task signals the system is out of date and erodes trust.
Task has clear completion/cancellation/delegation signals yet Journey is still surfaced.
Partial Tolerance. Stale-task rate ≤ 5%.
Recommending AI help for trivial tasks insults the user and erodes perceived value of the feature.
Task requires ≤ 1 step or ≤ 30 seconds to complete without AI. No synthesis, drafting, or research needed.
Partial Tolerance. Trivial-task rate ≤ 5%.
Tasks requiring physical presence, emotional judgment, or actions AI cannot perform create false promises.
Task completion requires actions AI fundamentally cannot perform: physical action, real-time human interaction, or purely relational judgment.
Partial Tolerance. AI-unfit rate ≤ 3%.
A fabricated task wastes user time and destroys trust. A real task with minor errors is annoying but recoverable.
Journey describes a task that does not exist in the user’s actual work context.
5 = all details (goal, deadline, stakeholder, action) perfectly accurate; 4 = one minor inaccuracy (e.g., off-by-one-day deadline); 3 = notable errors but task is recognizable; 2 = multiple major errors; 1 = barely resembles the real task.
Gate level: Zero Tolerance. Phantom task rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Hallucinated details destroy user trust and can lead to embarrassing or incorrect actions.
A core claim (person, event, deadline, document) is entirely fabricated with no source signal.
5 = every claim precisely matches source; 4 = one minor paraphrasing drift; 3 = noticeable approximation gaps; 2 = multiple unsupported inferences; 1 = mostly ungrounded narrative.
Gate level: Zero Tolerance. Full hallucination rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Too broad = user can’t act; too narrow = trivial sub-step that doesn’t warrant a Journey card.
5 = perfect granularity; 4 = slightly too broad/narrow; 3 = noticeably off; 2 = significantly misscoped; 1 = unusable scope.
Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Recommending tasks the user isn’t responsible for wastes attention and signals poor understanding of role context.
Task clearly belongs to someone else (user is CC, optional attendee, or task was explicitly delegated away).
5 = unambiguous ownership (direct assignee, sole recipient, explicit request); 4 = strong signals (primary on thread, named in action); 3 = reasonable but debatable; 2 = weak signals, likely wrong user; 1 = clearly someone else’s task.
Gate level: Partial Tolerance. Wrong-owner rate ≤ 3%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
If the user can’t understand the card in 3 seconds, they skip it.
5 = instantly clear; 4 = clear with brief thought; 3 = requires re-reading; 2 = confusing; 1 = incomprehensible.
Partial Tolerance. ≥ 85% of Journeys score ≥ 4.
Fabricated urgency signals destroy trust faster than missing signals. Users rely on reason labels to decide priority.
Reason label claims an urgency/trigger that has no basis in source data.
5 = reason label precisely matches evidence (correct trigger, correct timing); 4 = directionally correct with minor imprecision; 3 = loosely supported; 2 = misleading framing of real signal; 1 = reason contradicts source data.
Gate level: Zero Tolerance. Fabricated reason rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Over-promising and under-delivering is the fastest way to kill repeat usage.
Card promises something the system fundamentally cannot deliver (e.g., write access it doesn’t have).
5 = output matches or exceeds card promise; 4 = slight under-delivery on one aspect; 3 = noticeable gap between promise and output; 2 = significant over-promise; 1 = card promise is completely unmet despite being technically possible.
Gate level: Zero Tolerance. Impossible promise rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
5 categories, 5 metrics. Evaluate the set of Journeys presented together as a slate — ranking, coverage, diversity, and deduplication.
Missing important tasks is the most damaging failure for a proactive assistant — user loses trust that the system has their back.
5 = all important tasks covered; 4 = one minor miss; 3 = notable gaps; 2 = major tasks missing; 1 = slate misses most important work.
Partial Tolerance. ≥ 80% of slates score ≥ 4 on coverage.
Users look at the top few items first. Poor ranking means the most important tasks are buried.
5 = perfect priority order; 4 = minor swap needed; 3 = noticeably wrong order; 2 = important items buried; 1 = random/inverse order.
Partial Tolerance. ≥ 80% of slates score ≥ 4.
Top 3 is the “hero zone” — most users only engage with the first few items. Getting these wrong is the highest-impact ranking failure.
5 = all 3 are the right picks; 4 = 2 of 3 correct; 3 = 1 of 3 correct; 2 = none correct but relevant; 1 = irrelevant items in top 3.
Partial Tolerance. ≥ 75% of slates score ≥ 4.
A slate dominated by one trigger (e.g., 5 Journeys from same email) feels broken and misses other important work.
5 = well-balanced coverage; 4 = slightly concentrated; 3 = noticeably dominated by one source; 2 = heavily skewed; 1 = all from single trigger.
Partial Tolerance. ≥ 80% of slates score ≥ 4.
Duplicates waste slots and feel broken. Bad splits confuse; bad merges lose task identity.
Two Journeys in the same slate describe the exact same task (same action, same object, same context).
5 = every Journey maps to exactly one distinct task, no fragmentation or merging; 4 = one borderline split/merge case; 3 = noticeable boundary issues (2+ cases); 2 = significant fragmentation or loss from merging; 1 = slate is riddled with split/merge problems.
Gate level: Zero Tolerance. Exact duplicate rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of slates score ≥ 4.
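The exact-duplicate gate (same action, same object, same context) can be pictured as a key comparison. The sketch below assumes each Journey has already been reduced to three normalized strings, which is a simplification; the `JourneyKey` fields are hypothetical and the framework does not define them.

```python
from collections import Counter
from typing import NamedTuple


class JourneyKey(NamedTuple):
    # Hypothetical normalized fields used only for this illustration.
    action: str
    obj: str
    context: str


def exact_duplicates(slate: list[JourneyKey]) -> list[JourneyKey]:
    """Return every key that appears more than once in the slate."""
    counts = Counter(slate)
    return [key for key, n in counts.items() if n > 1]


slate = [
    JourneyKey("draft reply", "budget email", "Q3 planning thread"),
    JourneyKey("prepare agenda", "team sync", "weekly meeting"),
    JourneyKey("draft reply", "budget email", "Q3 planning thread"),
]
dupes = exact_duplicates(slate)
gate_passed = not dupes  # False here: the first and third Journeys collide
```

Borderline splits and merges are not covered by this check; those fall under the 1–5 quality score.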
Evaluates whether the AI output delivered after the user clicks a Journey card fulfills the promise, is correct, and is useful. 3 categories, 6 metrics.
The card sets an expectation. If the output doesn’t match, user feels deceived regardless of output quality.
Output is about a different topic or task than what was promised on the card.
5 = output fully delivers everything the card promised; 4 = one minor element missing; 3 = right topic but notable gaps vs. promise; 2 = significant under-delivery; 1 = barely related to promise.
Gate level: Zero Tolerance. Complete mismatch rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Hallucinated facts in outputs can lead to incorrect actions with real business consequences.
Output contains a factual claim (name, date, number, decision) with no basis in source data.
5 = every fact precisely matches source; 4 = one minor imprecision (rounded number, approximate time); 3 = noticeable inaccuracies but gist correct; 2 = multiple factual errors; 1 = output is largely inaccurate.
Gate level: Zero Tolerance. Fabricated fact rate = 0%.
Quality level (1–5): Partial Tolerance. ≥ 80% of Journeys score ≥ 4.
Incomplete output forces the user to find and fill gaps, reducing time savings.
5 = all key information covered; 4 = one minor gap; 3 = notable gaps; 2 = major omissions; 1 = barely started.
Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
Wrong format adds conversion work. A draft email should look like an email; a meeting prep should be structured talking points.
5 = perfect scenario match; 4 = acceptable format; 3 = workable but not ideal; 2 = awkward format; 1 = completely wrong format.
Partial Tolerance. ≥ 85% of Journeys score ≥ 4.
Output that requires significant rework defeats the purpose of proactive AI assistance.
5 = directly usable as-is; 4 = minor edits needed; 3 = moderate rework; 2 = heavy rework; 1 = start over.
Partial Tolerance. ≥ 75% of Journeys score ≥ 4.
The ultimate measure of output value: did it actually move the user forward on their task?
5 = task meaningfully advanced, clear next step; 4 = mostly advanced, minor gap; 3 = some progress; 2 = marginal help; 1 = no advancement, user still at square one.
Partial Tolerance. ≥ 70% of Journeys score ≥ 4.
Evaluates each individual Journey within a slate. Is the Journey compliant, a real work task, accurately described, and clearly presented?
Evaluates the full set of Journeys as a collection. Is the ranking good, are important tasks covered, is there duplication?
Relationship: L1 examines each Journey in isolation; L2 examines the group as a whole. Both are assessed independently and produce separate conclusions.
The input to a machine eval run is a Batch containing M eval units. Each eval unit is one user’s full context + the ordered set of Journeys (slate) generated by the prompt for that context.
Example: a batch of 10 users, each with 3–7 Journeys in their slate, totaling 50 Journeys. With N denoting the total number of Journeys in the batch, M = 10 and N = 50.
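The batch structure can be sketched as two small records. The class and field names (`EvalUnit`, `Batch`, `slate`) are assumptions made for illustration, and the slate sizes are one arbitrary split that satisfies the example.

```python
from dataclasses import dataclass, field


@dataclass
class EvalUnit:
    """One user's context plus the ordered slate generated for it."""
    user_id: str
    context: str
    slate: list[str] = field(default_factory=list)  # ordered Journey ids


@dataclass
class Batch:
    units: list[EvalUnit]

    @property
    def m(self) -> int:
        """Number of eval units (users)."""
        return len(self.units)

    @property
    def n(self) -> int:
        """Total number of Journeys across all slates."""
        return sum(len(unit.slate) for unit in self.units)


# Ten users with slate sizes between 3 and 7 that sum to 50.
sizes = [3, 4, 5, 5, 5, 5, 5, 5, 6, 7]
batch = Batch([
    EvalUnit(f"user{i}", "context placeholder", [f"u{i}-j{k}" for k in range(size)])
    for i, size in enumerate(sizes)
])
```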
The machine judge receives one eval unit: a user’s context + the ordered slate of Journeys generated for that context. It scores every sub-check for every Journey (L1) and for the slate as a whole (L2).
For each Journey in the slate, the judge evaluates 18 sub-checks against the user context. Each Journey produces an independent L1 metric vector.
| Sub-check | Journey 1 | Journey 2 | Journey 3 | Journey 4 |
|---|---|---|---|---|
| 1.1_gate | pass | pass | pass | pass |
| 1.2_gate | pass | pass | pass | pass |
| 2.1_gate | pass | pass | fail | pass |
| 2.2_gate | pass | fail | pass | pass |
| 2.3_gate | pass | pass | pass | pass |
| 2.4_gate | pass | pass | pass | pass |
| 3.1_gate | pass | pass | pass | pass |
| 3.1_quality | 5 | 4 | 3 | 4 |
| 3.2_gate | pass | pass | pass | pass |
| 3.2_quality | 4 | 3 | 2 | 5 |
| 3.3_quality | 4 | 4 | 3 | 5 |
| 3.4_gate | pass | pass | pass | pass |
| 3.4_quality | 3 | 4 | 2 | 4 |
| 4.1_quality | 5 | 4 | 4 | 5 |
| 4.2_gate | pass | pass | pass | pass |
| 4.2_quality | 4 | 3 | 4 | 5 |
| 4.3_gate | pass | pass | pass | pass |
| 4.3_quality | 4 | 4 | 3 | 4 |
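As a rough illustration, one Journey's metric vector can be held as a flat mapping from sub-check id to result. The dictionary layout is an assumption; the values are Journey 3's column from the table above.

```python
# Journey 3's column from the table above, keyed by sub-check id.
journey_3 = {
    "1.1_gate": "pass", "1.2_gate": "pass",
    "2.1_gate": "fail", "2.2_gate": "pass", "2.3_gate": "pass", "2.4_gate": "pass",
    "3.1_gate": "pass", "3.1_quality": 3,
    "3.2_gate": "pass", "3.2_quality": 2,
    "3.3_quality": 3,
    "3.4_gate": "pass", "3.4_quality": 2,
    "4.1_quality": 4,
    "4.2_gate": "pass", "4.2_quality": 4,
    "4.3_gate": "pass", "4.3_quality": 3,
}

# Gate sub-checks that failed for this Journey.
failed_gates = [k for k, v in journey_3.items() if k.endswith("_gate") and v == "fail"]
# Quality sub-checks scoring below the ">= 4 counts as pass" bar.
below_bar = [k for k, v in journey_3.items() if k.endswith("_quality") and v < 4]
```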
The same slate is evaluated as a whole on 6 sub-checks covering coverage, ranking, diversity, and deduplication.
| Sub-check | User A’s Slate |
|---|---|
| 5.1_quality (coverage) | 4 |
| 5.2_quality (ranking) | 3 |
| 5.3_quality (top 3) | 3 |
| 5.4_quality (diversity) | 5 |
| 5.5_gate | pass |
| 5.5_quality | 4 |
A batch of M users all go through this process. In our running example: M=10, N=50.
After Phase 1 completes for all M users (totaling N Journeys), each sub-check is aggregated across the full batch.
Gate sub-checks (pass/fail): Aggregated as failure rate = fail count / total count.
Quality sub-checks (1–5 score): Multiple statistics are produced, not just pass rate. The reason: pass rate depends on the “≥4 counts as pass” bar, which may need adjustment during framework tuning. We output the pass rate (share of scores ≥ 4), the mean score, and the full score distribution (counts of scores 1 through 5).
This way, if the bar is later adjusted (e.g., ≥3 becomes acceptable for a metric), the pass rate can be recalculated directly from the distribution, with no need to re-run the eval.
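The aggregation can be sketched as follows. Function names are hypothetical, and the sample scores are rebuilt from an example distribution of 0 / 3 / 11 / 28 / 8 over 50 Journeys.

```python
def failure_rate(results: list[str]) -> float:
    """Share of 'fail' outcomes among gate results."""
    return results.count("fail") / len(results)


def quality_stats(scores: list[int], bar: int = 4) -> dict:
    """Pass rate at the given bar, mean, and the 1-5 distribution."""
    distribution = [scores.count(s) for s in range(1, 6)]
    return {
        "pass_rate": sum(1 for s in scores if s >= bar) / len(scores),
        "mean": sum(scores) / len(scores),
        "distribution": distribution,
    }


def pass_rate_from_distribution(distribution: list[int], bar: int) -> float:
    """Recompute the pass rate for a new bar without re-running the judge."""
    return sum(distribution[bar - 1:]) / sum(distribution)


# Scores rebuilt from a distribution of 0 / 3 / 11 / 28 / 8 (50 Journeys).
scores = [2] * 3 + [3] * 11 + [4] * 28 + [5] * 8
stats = quality_stats(scores)          # pass rate 0.72, mean 3.82
relaxed = pass_rate_from_distribution(stats["distribution"], bar=3)  # 0.94
```

Lowering the bar from 4 to 3 moves the pass rate from 72% to 94% using only the stored distribution.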
| Sub-check | Type | Failure Rate | Pass Rate (≥4) | Mean | Distribution (1/2/3/4/5) |
|---|---|---|---|---|---|
| 1.1_gate | gate | 0% (0/50) | — | — | — |
| 1.2_gate | gate | 0% (0/50) | — | — | — |
| 2.1_gate | gate | 4% (2/50) | — | — | — |
| 2.2_gate | gate | 6% (3/50) | — | — | — |
| 2.3_gate | gate | 2% (1/50) | — | — | — |
| 2.4_gate | gate | 2% (1/50) | — | — | — |
| 3.1_gate | gate | 0% (0/50) | — | — | — |
| 3.1_quality | quality | — | 72% (36/50) | 3.8 | 0 / 3 / 11 / 28 / 8 |
| 3.2_gate | gate | 0% (0/50) | — | — | — |
| 3.2_quality | quality | — | 80% (40/50) | 4.0 | 0 / 2 / 8 / 30 / 10 |
| 3.3_quality | quality | — | 84% (42/50) | 4.1 | 0 / 1 / 7 / 28 / 14 |
| 3.4_gate | gate | 2% (1/50) | — | — | — |
| 3.4_quality | quality | — | 70% (35/50) | 3.8 | 0 / 4 / 11 / 25 / 10 |
| 4.1_quality | quality | — | 86% (43/50) | 4.2 | 0 / 0 / 7 / 25 / 18 |
| 4.2_gate | gate | 0% (0/50) | — | — | — |
| 4.2_quality | quality | — | 78% (39/50) | 4.0 | 0 / 1 / 10 / 28 / 11 |
| 4.3_gate | gate | 0% (0/50) | — | — | — |
| 4.3_quality | quality | — | 76% (38/50) | 4.0 | 0 / 2 / 10 / 26 / 12 |
| Sub-check | Type | Failure Rate | Pass Rate (≥4) | Mean | Distribution (1/2/3/4/5) |
|---|---|---|---|---|---|
| 5.1_quality | quality | — | 70% (7/10) | 3.9 | 0 / 0 / 3 / 5 / 2 |
| 5.2_quality | quality | — | 80% (8/10) | 4.0 | 0 / 0 / 2 / 6 / 2 |
| 5.3_quality | quality | — | 60% (6/10) | 3.7 | 0 / 1 / 3 / 4 / 2 |
| 5.4_quality | quality | — | 80% (8/10) | 4.1 | 0 / 0 / 2 / 5 / 3 |
| 5.5_gate | gate | 10% (1/10) | — | — | — |
| 5.5_quality | quality | — | 80% (8/10) | 4.0 | 0 / 0 / 2 / 6 / 2 |
This phase has two parts: Part A, a hard gate check on Zero Tolerance sub-checks, followed by Part B, layered weighted scoring for Partial Tolerance sub-checks.
Scan all Zero Tolerance sub-checks. If any has failure rate > 0%, the prompt is immediately judged FAIL.
| Sub-check | Tolerance | Failure Rate | Pass? |
|---|---|---|---|
| 1.1_gate | Zero | 0% | Pass |
| 1.2_gate | Zero | 0% | Pass |
| 3.1_gate | Zero | 0% | Pass |
| 3.2_gate | Zero | 0% | Pass |
| 4.2_gate | Zero | 0% | Pass |
| 4.3_gate | Zero | 0% | Pass |
| 5.5_gate | Zero | 10% | Fail |
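A minimal sketch of the hard gate scan, using the failure rates from the table above. The function name `hard_gate` and the return shape are assumptions.

```python
# Failure rates for the Zero Tolerance sub-checks, copied from the table above.
zero_tolerance_failure_rates = {
    "1.1_gate": 0.0, "1.2_gate": 0.0,
    "3.1_gate": 0.0, "3.2_gate": 0.0,
    "4.2_gate": 0.0, "4.3_gate": 0.0,
    "5.5_gate": 0.10,
}


def hard_gate(failure_rates: dict[str, float]) -> tuple[str, list[str]]:
    """Return the verdict and the sub-checks that caused a FAIL."""
    blockers = [name for name, rate in failure_rates.items() if rate > 0.0]
    return ("FAIL" if blockers else "PASS", blockers)


verdict, blockers = hard_gate(zero_tolerance_failure_rates)
# verdict == "FAIL", blockers == ["5.5_gate"]
```

In the running example the prompt is judged FAIL because of the single exact-duplicate slate counted under 5.5_gate.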
For all Partial Tolerance sub-checks, compute a normalized 0–1 score, then aggregate upward through Category → Level → Overall.
| Sub-check | Observed | Threshold | Normalized Score | Pass? |
|---|---|---|---|---|
| 2.1_gate | 4% | ≤ 2% | 0.50 | Fail |
| 2.2_gate | 6% | ≤ 5% | 0.80 | Fail |
| 2.3_gate | 2% | ≤ 5% | 1.00 | Pass |
| 2.4_gate | 2% | ≤ 3% | 1.00 | Pass |
| 3.1_quality | 72% | ≥ 75% | 0.96 | Fail |
| 3.2_quality | 80% | ≥ 80% | 1.00 | Pass |
| 3.3_quality | 84% | ≥ 80% | 1.00 | Pass |
| 3.4_gate | 2% | ≤ 3% | 1.00 | Pass |
| 3.4_quality | 70% | ≥ 75% | 0.93 | Fail |
| 4.1_quality | 86% | ≥ 85% | 1.00 | Pass |
| 4.2_quality | 78% | ≥ 80% | 0.98 | Fail |
| 4.3_quality | 76% | ≥ 75% | 1.00 | Pass |
| 5.1_quality | 70% | ≥ 80% | 0.88 | Fail |
| 5.2_quality | 80% | ≥ 80% | 1.00 | Pass |
| 5.3_quality | 60% | ≥ 75% | 0.80 | Fail |
| 5.4_quality | 80% | ≥ 80% | 1.00 | Pass |
| 5.5_quality | 80% | ≥ 80% | 1.00 | Pass |
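The normalization rule is not spelled out here, so the sketch below assumes a ratio-to-threshold rule capped at 1.0. It reproduces the table's values for 2.1_gate, 3.1_quality, 3.4_quality, and 5.3_quality. For 2.2_gate it gives 0.83 rather than the listed 0.80, so the rule actually used may differ.

```python
def normalize(observed: float, threshold: float, lower_is_better: bool) -> float:
    """Assumed rule: ratio to threshold, capped at 1.0 once the threshold is met."""
    if lower_is_better:
        # Failure-rate metrics: the threshold is a ceiling (e.g. <= 2%).
        if observed <= threshold:
            return 1.0
        return threshold / observed
    # Pass-rate metrics: the threshold is a floor (e.g. >= 75%).
    if observed >= threshold:
        return 1.0
    return observed / threshold


# Inputs are percentages taken from the table above.
score_2_1 = round(normalize(4, 2, lower_is_better=True), 2)      # 0.5
score_3_1q = round(normalize(72, 75, lower_is_better=False), 2)  # 0.96
score_3_4q = round(normalize(70, 75, lower_is_better=False), 2)  # 0.93
score_5_3q = round(normalize(60, 75, lower_is_better=False), 2)  # 0.8
score_2_3 = normalize(2, 5, lower_is_better=True)                # 1.0
```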
Sub-check scores within a category are averaged (equal weight by default) to produce a category score.
| Category | Sub-checks (scores) | Weighting | Category Score |
|---|---|---|---|
| Cat 1: Safety | All Zero Tolerance (handled in Part A) | | |
| Cat 2: Eligibility | 2.1(0.50), 2.2(0.80), 2.3(1.00), 2.4(1.00) | Equal | 0.83 |
| Cat 3: Task Understanding | 3.1q(0.96), 3.2q(1.00), 3.3(1.00), 3.4g(1.00), 3.4q(0.93) | Equal | 0.98 |
| Cat 4: Presentation | 4.1(1.00), 4.2q(0.98), 4.3q(1.00) | Equal | 0.99 |
| Cat 5: Coverage | 5.1(0.88) | — | 0.88 |
| Cat 6: Prioritization | 5.2(1.00) | — | 1.00 |
| Cat 7: Top-N | 5.3(0.80) | — | 0.80 |
| Cat 8: Portfolio | 5.4(1.00) | — | 1.00 |
| Cat 9: Set Hygiene | 5.5q(1.00) | — | 1.00 |
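The equal-weight category roll-up is a plain mean of the normalized sub-check scores. A minimal sketch with the values from the table above; the dictionary layout is an assumption.

```python
from statistics import mean

# Normalized sub-check scores per category, copied from the table above.
category_subchecks = {
    "Cat 2: Eligibility": [0.50, 0.80, 1.00, 1.00],
    "Cat 3: Task Understanding": [0.96, 1.00, 1.00, 1.00, 0.93],
    "Cat 4: Presentation": [1.00, 0.98, 1.00],
}

# Equal weighting within a category is the document's stated default.
category_scores = {name: mean(scores) for name, scores in category_subchecks.items()}
```

Cat 2 averages to 0.825, which the table shows as 0.83; Cat 3 averages to 0.978 (shown as 0.98).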
Category scores are weighted into Level scores.
| Level | Categories | Weights | Level Score |
|---|---|---|---|
| L1: Single-Journey | Cat 2 (0.83), Cat 3 (0.98), Cat 4 (0.99) | 30% / 40% / 30% | 0.94 |
| L2: Slate-Level | Coverage(0.88), Prioritization(1.00), Top-N(0.80), Portfolio(1.00), Hygiene(1.00) | Equal (20% each) | 0.94 |
| Layer | Score | Weight | Rationale |
|---|---|---|---|
| L1 Score | 0.94 | 60% | Individual Journey quality is foundational |
| L2 Score | 0.94 | 40% | Slate quality enhances overall experience |
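The level and overall roll-ups are weighted averages. The sketch below uses the category scores and weights from the tables above; `weighted_score` and the short dictionary keys are hypothetical. The overall figure is not stated in the source and is simply what the listed weights imply.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average; weights are expected to sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in weights)


l1 = weighted_score(
    {"cat2": 0.83, "cat3": 0.98, "cat4": 0.99},
    {"cat2": 0.30, "cat3": 0.40, "cat4": 0.30},
)
l2 = weighted_score(
    {"coverage": 0.88, "prioritization": 1.00, "top_n": 0.80,
     "portfolio": 1.00, "hygiene": 1.00},
    {"coverage": 0.20, "prioritization": 0.20, "top_n": 0.20,
     "portfolio": 0.20, "hygiene": 0.20},
)
# Level scores are rounded to two decimals before the final roll-up,
# matching the precision shown in the tables.
overall = weighted_score(
    {"l1": round(l1, 2), "l2": round(l2, 2)},
    {"l1": 0.60, "l2": 0.40},
)
```

With these inputs L1 is 0.938 and L2 is 0.936, both shown as 0.94, and the implied overall score also rounds to 0.94.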
The eval system produces a structured report combining the hard verdict with full quality diagnostics and iteration guidance.
All weights are initial suggested values. The selection logic:
| Weight Decision | Initial Value | Rationale |
|---|---|---|
| Metrics within a Category | Equal weight | No prior reason to favor one metric over another; calibrate after experience. |
| L1: Cat 2 vs Cat 3 vs Cat 4 | 30% / 40% / 30% | Task Understanding (Cat 3) is the foundation for everything else; Eligibility and Presentation are equally important relative to each other. |
| L2: 5 categories | Equal weight (20% each) | Same rationale — calibrate after experience. |
| L1 vs L2 | 60% / 40% | Individual Journey quality is more foundational; slate quality enhances overall experience. |
After receiving an eval report, the team compares scores against actual user experience: