NOTES · FIELD REPORTS · METHOD

Blog

Notes from the Taso Labs team on evaluating agents, deterministic replay, and pre-deploy verification.

What if AI ran its own cities, governments, or even space programs?

As a society, we're not quite there, but the game Civilization is a useful proxy for what matters in the real world: long-term planning, adaptation against live adversaries, and decision-making under hidden information. Where and why does AI fall short?

That's the question behind CivBench, a process-level analysis of 8 frontier models competing in long-horizon strategy games. On Saturday we completed Season #001 on the environment we [open-sourced](https://github.com/taso-labs/freeciv-llm).

FreeCiv combines long decision horizons (~200 turns), adversarial adaptation, multi-objective planning (economy, military, expansion, tech), and compounding outcomes from earlier decisions. It also provides structured state and legal action interfaces, which makes reproducible agent evaluation practical.
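To make the "structured state and legal action interfaces" point concrete, here is a minimal sketch of a seeded evaluation loop. Everything in it is hypothetical: `ToyEnv`, `run_match`, and the action names are illustrative stand-ins and are not the freeciv-llm API. The idea it shows is that when the environment is seeded and the agent may only choose from an explicit legal-action list, the same seed and policy should yield the same trajectory.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a turn-based environment (not the freeciv-llm API)."""
    seed: int
    max_turns: int = 5
    turn: int = 0
    rng: random.Random = field(init=False)

    def __post_init__(self):
        # All randomness flows from one seeded generator, so runs are repeatable.
        self.rng = random.Random(self.seed)

    def state(self) -> dict:
        # Structured state: plain data the agent can inspect.
        return {"turn": self.turn, "gold": 10 * self.turn}

    def legal_actions(self) -> list:
        # The legal set varies per turn, but deterministically given the seed.
        actions = ["build_city", "end_turn", "move_unit", "research"]
        k = self.rng.randint(2, len(actions))
        return sorted(self.rng.sample(actions, k))

    def step(self, action: str) -> None:
        self.turn += 1


def run_match(seed: int, policy) -> list:
    """Play one match and return the (turn, action) trajectory."""
    env = ToyEnv(seed)
    trajectory = []
    while env.turn < env.max_turns:
        legal = env.legal_actions()
        action = policy(env.state(), legal)
        if action not in legal:
            # Assumed fallback: replace an illegal proposal with the first legal action.
            action = legal[0]
        trajectory.append((env.turn, action))
        env.step(action)
    return trajectory


def first_legal(state: dict, legal: list) -> str:
    """Trivial baseline policy: always take the first legal action."""
    return legal[0]
```

Calling `run_match(7, first_legal)` twice should return identical trajectories, which is the property a reproducible evaluation relies on.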

CivBench is built to evaluate AI agents in long-horizon, adversarial environments. We started with FreeCiv because it stresses behaviors static benchmarks usually miss.

What we want to measure: long-horizon planning quality, adaptation under uncertainty and pressure, strategy identity over time, execution efficiency (actions/turn, tool use, latency, token cost), and stability across full-match trajectories (not one-turn snapshots).
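As a rough illustration of the execution-efficiency and stability measurements listed above, the sketch below aggregates per-turn records into match-level numbers. The `TurnRecord` fields, the `summarize` function, the token price parameter, and the use of score-delta spread as a stability proxy are all assumptions made for this example, not CivBench's actual schema or scoring.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class TurnRecord:
    """Hypothetical per-turn log entry; field names are illustrative."""
    turn: int
    actions: int
    tool_calls: int
    latency_s: float
    tokens: int
    score: float


def summarize(records: list, usd_per_1k_tokens: float) -> dict:
    """Aggregate a full-match trajectory into efficiency and stability numbers."""
    if not records:
        raise ValueError("need at least one turn record")
    total_tokens = sum(r.tokens for r in records)
    # Turn-over-turn score changes; their spread is used here as a crude
    # stability proxy (an assumption, not an established metric).
    deltas = [b.score - a.score for a, b in zip(records, records[1:])]
    return {
        "turns": len(records),
        "actions_per_turn": mean(r.actions for r in records),
        "tool_calls_per_turn": mean(r.tool_calls for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "token_cost_usd": total_tokens / 1000 * usd_per_1k_tokens,
        "score_delta_stdev": pstdev(deltas) if deltas else 0.0,
    }
```

Summarizing over the whole trajectory, rather than a single turn, is what lets a metric like the score-delta spread say anything about stability across a match.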

AI is moving from answering questions to taking actions in our world.

That's a bigger shift than it sounds. Most of our evaluation habits were built for a different world, one where you could run a test, check a score, and move on. But when agents actually do things in the world, a static benchmark stops telling you what you need to know.

What matters is how an agent behaves: not in a demo, not in a single prompt, but when it is stuck in a messy situation, facing an opponent that adapts, with consequences that compound over time.

Towards Measuring an AI Civilization

Introducing CivBench Season #001

Why a Live Scoreboard for AI Agents