48 scenarios · 3 trials each · seeds pinned per trial
Review before deploy. The challenger improves resolution quality on 38 of 48 scenarios and reduces cost by 9.8%, but introduces one P0 regression in `refund_policy_edge_case_17`: an unauthorized refund path that the baseline correctly blocked. Latency is within the configured SLO but trending upward; investigate the tool-retry loop in `vip_refund_override` before the next run.
Top scenarios this run
| Scenario | Baseline | Challenger | Delta | Evidence |
| --- | --- | --- | --- | --- |
| refund_policy_edge_case_17 | pass | fail | new P0 | |
| vip_refund_override | pass | warn | +340ms | |
| partial_refund_calculation | pass | warn | +0.04 retry | |
Recommendation
Review before deploy.
The challenger is the better model on quality and cost, but it introduces an unauthorized billing-tool path that the baseline refused. We recommend a targeted fix to the policy-gate prompt section, followed by a re-run of the regression cluster, before promoting to production.
Quality improved on 38 / 48 scenarios; 4 previously failing scenarios now pass.
Cost reduced 9.8% per run; under the $0.40 / run budget.
Latency p95 grew 220ms, within the 1.5s SLO but worth watching.
1 new P0: `refund_policy_edge_case_17` (override=true on a closed dispute).
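The way these four findings combine into a "Review before deploy" verdict can be sketched as a small gate function. This is a hypothetical illustration, not the tool's actual decision logic: the `RunSummary` fields, the `verdict` function, the "Block" and "Promote" labels, and the rule ordering are all assumptions. The report only states that cost is under the $0.40 budget, so the cost figure in the usage example is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one baseline-vs-challenger run."""
    new_p0_regressions: int
    cost_per_run_usd: float
    cost_budget_usd: float
    latency_p95_s: float
    latency_slo_s: float


def verdict(summary: RunSummary) -> str:
    """Assumed gate: hard limits block outright, a new P0 forces human review."""
    if summary.cost_per_run_usd > summary.cost_budget_usd:
        return "Block"
    if summary.latency_p95_s > summary.latency_slo_s:
        return "Block"
    if summary.new_p0_regressions > 0:
        return "Review before deploy"
    return "Promote"


# p95, SLO, budget, and P0 count come from the report; 0.36 is a placeholder
# because the report does not state the actual cost per run.
this_run = RunSummary(
    new_p0_regressions=1,
    cost_per_run_usd=0.36,
    cost_budget_usd=0.40,
    latency_p95_s=1.40,
    latency_slo_s=1.5,
)
print(verdict(this_run))
```

Under this assumed rule set, the quality and cost gains never override a new P0: the run stays in review until the regression cluster is clean.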
Trial distribution
p50 980ms · p95 1.40s · max 2.18s across 144 trials
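For readers who want to reproduce figures like these from raw per-trial latencies, here is a minimal sketch using the nearest-rank percentile definition. The report does not say which percentile method it uses, so the method and the names `percentile` and `summarize` are assumptions; interpolating methods can give slightly different values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Rank is 1-based; multiply before dividing to keep the arithmetic exact for integer pct.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Return the p50 / p95 / max summary for a list of trial latencies."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "max": max(latencies_ms),
    }
```

With 48 scenarios at 3 trials each, `latencies_ms` would hold 144 samples, and the nearest-rank p95 would be the 137th smallest value.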