Taso is an eval platform that tests model, prompt, tool, and MCP changes before production. It replays the same pinned scenarios with the current and proposed versions of your agent, then returns a deploy, review, or block decision with evidence.

What kinds of agent changes does Taso test?

Taso tests model swaps, prompt edits, tool and function changes, and MCP server updates. These changes can alter agent behavior without breaking traditional CI.

How does Taso fit into my CI/CD pipeline?

Taso runs from the CLI, an API, GitHub Actions, or webhooks and returns a CI-ready deploy / review / block verdict alongside quality, cost, latency, and reliability deltas.
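As a rough illustration of how a pipeline step might act on that verdict, the sketch below maps a verdict string to a process exit code. The mapping and the `exit_code_for` helper are assumptions made for this example, not Taso's documented behavior. Unknown values fail closed.

```python
# Hypothetical convention: only "deploy" lets the pipeline continue unattended.
EXIT_CODES = {"deploy": 0, "review": 1, "block": 2}


def exit_code_for(verdict: str) -> int:
    """Translate a verdict into an exit code; unrecognized verdicts are treated as block."""
    return EXIT_CODES.get(verdict.strip().lower(), EXIT_CODES["block"])
```

A CI job could pass the result to `sys.exit` so that a review or block verdict stops the deploy step.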

What does a Taso run measure?

Each run compares a baseline against a challenger across pinned scenarios. Taso reports quality, cost, latency, and reliability changes with trace evidence behind every decision.
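To make the baseline-versus-challenger comparison concrete, here is a minimal sketch that groups scenarios by how their outcome changed. The `classify` function and the outcome labels are illustrative assumptions, not Taso's scoring logic. The scenario names are borrowed from the sample report on this page.

```python
def classify(baseline: dict[str, str], challenger: dict[str, str]) -> dict[str, list[str]]:
    """Group scenario names by how their outcome changed between the two runs."""
    groups = {"new_failures": [], "fixed_failures": [], "stable": [], "other_changes": []}
    # Only scenarios present in both runs can be compared.
    for name in sorted(baseline.keys() & challenger.keys()):
        before, after = baseline[name], challenger[name]
        if before == after:
            groups["stable"].append(name)
        elif before == "pass" and after == "fail":
            groups["new_failures"].append(name)
        elif before == "fail" and after == "pass":
            groups["fixed_failures"].append(name)
        else:
            groups["other_changes"].append(name)
    return groups


baseline = {
    "refund_policy_edge_case_17": "pass",
    "vip_refund_override": "pass",
    "duplicate_charge_escalation": "pass",
}
challenger = {
    "refund_policy_edge_case_17": "fail",
    "vip_refund_override": "warn",
    "duplicate_charge_escalation": "pass",
}
result = classify(baseline, challenger)
```

Here `result["new_failures"]` holds only `refund_policy_edge_case_17`, the scenario the baseline passed and the challenger failed.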

Which model providers does Taso support?

Taso works with the providers and tooling that teams already use, including Anthropic, OpenAI, Gemini, and OpenRouter, plus eval tooling such as Braintrust and Langfuse.

Report artifacts

01 / 03

● review · ci_48291

Release decision

Review before deploy

Review before deploy. The revised prompt improves resolution quality and cost, but opens one unauthorized refund path.

refund_prompt_v12 → refund_prompt_v13

Quality: 0.81→0.85
Cost / run: $0.41→$0.37
p95 latency: 1.18s→1.40s
New P0 failures: 0→1

02 / Evidence · refund_policy_edge_case_17

Trace root cause

Find the behavior that changed.

Expected: Policy gate blocks the closed-dispute refund.

Observed: The revised prompt authorized a $480 refund on a closed dispute without checking the policy gate.

03 / Follow-up · Affected suite

Suggested eval maintenance

Turn the finding into a stronger gate.

Restore the policy-gate instruction and rerun the affected refund scenarios before deployment.

01 Promote the regression: Add this production behavior to the pinned suite.
02 Update the rubric: Capture the expected policy boundary explicitly.
03 Rerun the cluster: Verify the fix before the challenger is promoted.

Sandboxes S01 to S06: controlled state · real tools · live

Test every agent change before it reaches production.

Taso replays the same production-like scenarios with your current agent and the proposed version. Using real API calls, it shows what changed in behavior, quality, cost, and latency, so you can deploy, review, or block with evidence.

Book a discovery call · View a sample report

01 / BUILD THE EVAL SUITE

Build evals grounded in real behavior.

Build runnable eval environments from production traces, then combine them with private SME tasks, rubrics, datasets, and the suites your team already trusts.

01 Production traces → runnable environments
02 Private SME tasks, rubrics, and datasets
03 Existing eval suites and datasets

02 / VERIFY THE CHANGE

Replay every agent change in production-like sandboxes.

Run the baseline and challenger through the same scenarios, real API calls, and controlled environment state. Find the exact behavior that changed before deploy.

VERIFIED: pinned eval suite · before deploy

CHANGE: refund_prompt_v12 → refund_prompt_v13

NEW P0: refund_policy_edge_case_17 · baseline pass → challenger fail

S01 to S06: Live API

03 / RELEASE DECISION

Know what changed before you ship.

Get a deploy, review, or block decision with the divergent trace, expected behavior, and recommended fix attached.

REVIEW: refund_policy_edge_case_17

CHANGE IMPACT REPORT

A release decision with receipts.

Deploy, review, or block, backed by a signed Change Impact Report with scenario-level deltas, new failures, fixed failures, cost/latency movement, transcript evidence, and pinned run metadata.

TASO · CHANGE IMPACT REPORT

REVIEW

artifact ci_48291 · suite refund_edges_v3 · generated 14:02 UTC

Regression caught: 1 unauthorized refund path

Workflow: Support Refund Agent
Change: refund_prompt_v12 → refund_prompt_v13
Baseline: gpt-5.6-sol · tools_v4 · prompt_v12
Challenger: gpt-5.6-terra · tools_v4 · prompt_v13
Scenarios: 48 scenarios · 3 trials each

Review before deploy. The challenger improves resolution quality and reduces cost, but introduces one billing-tool regression in scenario #17, an unauthorized refund path that the baseline blocked.

Top scenarios this run

Scenario	Baseline	Challenger	Delta
refund_policy_edge_case_17	pass	fail	new P0
vip_refund_override	pass	warn	+340ms
duplicate_charge_escalation	pass	pass	stable

Open the full sample report →

PUBLIC CONTRIBUTIONS

Why trust the evals?

Our public environments and reward-hacking research are how we pressure-test the same evaluation toolkit used in Taso reports.

ENVIRONMENTS

Strategy Bench

Multi-agent environments with custom metric discovery, built to surface how agents actually behave under planning, deception, cooperation, and risk. We use the same approach to design custom evals for your agent.

Open ClashAI Dataset on Hugging Face

RESEARCH

Reward Hacking

Peer-reviewed research on agents that learn to game the score instead of doing the job. The detection methods power the scorers behind every Taso report.

Read the paper

TOOLKIT

Environments, Adapters, Scorers

The building blocks we contribute back. MIT-licensed, extensible, and growing. Build on the toolkit natively or reach out and we'll help you ship it.

clashai-environments · freeciv-llm · hack-verifiable-environments

We're contributing to the frontier of agent evaluation. The same toolkit becomes your eval suite.

INTEGRATION

Connect the workflow you already have.

Same verdict on every integration path. No migration. No new framework.

Works with your stack

Langfuse
Braintrust
OpenTelemetry
Weights & Biases
Prime Intellect
Thinking Machines
harbor

~/agent · taso change-impact

$ taso compare \
  --workflow support-refund-agent \
  --baseline production \
  --challenger prompt_refund_v13 \
  --suite refund_edges_v3 \
  --gate support_release_gate

✓ pinned 48-scenario suite / 3 trials
✓ replayed baseline + challenger
✓ compared quality / cost / latency / reliability
! 1 new P0 regression: refund_policy_edge_case_17

verdict: REVIEW

report: taso.run/r/ci_48291

Result

run: ci_48291
verdict: REVIEW
quality: +4.9%
cost: −9.8%
p95 latency: +220ms
new P0: 1
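The percentages above appear to be relative changes in the raw metrics from the sample report (quality 0.81→0.85, cost per run $0.41→$0.37, p95 latency 1.18s→1.40s). The sketch below reproduces that arithmetic. The rounding to one decimal place and the helper name `relative_change_pct` are assumptions for illustration.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Percentage change from before to after, rounded to one decimal place."""
    return round((after - before) / before * 100, 1)


quality_pct = relative_change_pct(0.81, 0.85)   # quality score
cost_pct = relative_change_pct(0.41, 0.37)      # cost per run in USD
latency_ms = round((1.40 - 1.18) * 1000)        # p95 latency, seconds to milliseconds

print(f"quality: {quality_pct:+.1f}%")
print(f"cost: {cost_pct:+.1f}%")
print(f"p95 latency: {latency_ms:+d}ms")
```

Under these assumptions the script yields +4.9%, -9.8%, and +220ms, matching the figures in the result.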

BEFORE YOUR NEXT RELEASE

Catch an agent failure before your customers do.

In 30 minutes, we'll map your agent workflow to a pilot and show you the change report your team would receive before deployment.