Notes from the Taso Labs team on evaluating agents, deterministic replay, and pre-deploy verification.
Towards Measuring an AI Civilization
Lessons from CivBench Season #001: why MiniMax 2.5 won, and other insights into long-horizon planning
What if AI ran its own cities, governments, or even space programs?
As a society, we're not quite there yet, but the game Civilization is a useful proxy for what matters in the real world: long-term planning, adaptation against live adversaries, and decision-making under hidden information. Where and why does AI fall short?
That's the question behind CivBench, a process-level analysis of 8 frontier models competing in long-horizon strategy games. On Saturday we completed Season #001 on the environment we [open-sourced](https://github.com/taso-labs/freeciv-llm).
Introducing CivBench Season #001
CivBench evaluates AI agents in long-horizon, adversarial environments, starting with FreeCiv
We started with FreeCiv because it stresses behaviors that static benchmarks usually miss. It combines long decision horizons (~200 turns), adversarial adaptation, multi-objective planning (economy, military, expansion, tech), and outcomes that compound from earlier decisions. It also provides structured state and legal-action interfaces, which makes reproducible agent evaluation practical.
What we want to measure: long-horizon planning quality, adaptation under uncertainty and pressure, strategy identity over time, execution efficiency (actions/turn, tool use, latency, token cost), and stability across full-match trajectories (not one-turn snapshots).
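To make the execution-efficiency items concrete, here is a minimal sketch of how per-turn logs could be rolled up into match-level numbers. It is an illustration only: the `TurnRecord` schema, the `summarize_efficiency` helper, and the per-token price are all hypothetical and are not taken from the CivBench codebase.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnRecord:
    """One turn of one agent's match log (hypothetical schema)."""
    turn: int
    actions: int       # game actions issued this turn
    tool_calls: int    # tool invocations made this turn
    latency_s: float   # wall-clock seconds spent deciding
    tokens: int        # prompt + completion tokens consumed


def summarize_efficiency(records, usd_per_1k_tokens):
    """Aggregate per-turn records into match-level efficiency metrics."""
    if not records:
        raise ValueError("need at least one turn record")
    total_tokens = sum(r.tokens for r in records)
    return {
        "turns": len(records),
        "actions_per_turn": mean(r.actions for r in records),
        "tool_calls_per_turn": mean(r.tool_calls for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "total_tokens": total_tokens,
        # Price per 1k tokens is an assumed input, not a real rate.
        "token_cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
    }


if __name__ == "__main__":
    # Made-up values, purely to show the shape of the output.
    log = [
        TurnRecord(turn=1, actions=4, tool_calls=2, latency_s=1.5, tokens=1000),
        TurnRecord(turn=2, actions=6, tool_calls=4, latency_s=2.5, tokens=3000),
    ]
    print(summarize_efficiency(log, usd_per_1k_tokens=0.5))
```

Averaging over the whole match, rather than inspecting a single turn, matches the emphasis on full-match trajectories. Planning quality and strategy identity would need richer signals than these counters.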
Why a Live Scoreboard for AI Agents
When agents take actions in the world, static benchmarks stop telling you what you need to know
AI is moving from answering questions to taking actions in our world.
That's a bigger shift than it sounds. Most of our evaluation habits were built for a different world, one where you could run a test, check a score, and move on. But when agents actually do things in the world, a static benchmark stops telling you what you need to know.
What matters is how an agent behaves. Not in a demo. Not in a single prompt. But when it's stuck in a messy situation, facing an opponent that adapts, with consequences that compound over time.