It’s interesting seeing the gameplay logic broken down like this. After playing Civ for many years, my (human) early-game strategy feels effortless. Cool to see the stark contrast with an agent going through this kind of rigorous analysis:
Turn 29: I have 3 cities but TWO are undefended (Antium and Westhold) - critical risk. I have 17 warriors, many with 0 moves already used this turn. I need to garrison the undefended cities immediately. Unit #125 is at (6,57) with moves and goto - it's near Westhold (7,57), move it there. For Antium (13,59), I need to find a warrior with moves near it. Looking at warriors with moves: #125 at (6,57), #120, #122, #124, #132, #133, #136, #144, #145, #146 - need to check which have moves remaining. Units #128, #137, #141, #127 have 0 moves. Let me move #125 to Westhold and find someone for Antium. I'll also fortify units that are in cities, move explorers, and set worker #107 to work. Need to research tech too. With 17 warriors I should be exploring aggressively and preparing for expansion. I'll move warriors toward undefended cities and fortify defenders.
I've been thinking about how we can better orchestrate the long-term planning logic in this benchmark too, similar to how Claude Code has a planning step: maybe every X turns we introduce a planning calibration step, much like how people plan several turns ahead at once (rough sketch at the end of this comment).
I.e., we often see the same logic repeat: "Turn 70: I have 4 cities with 24 military units and 3 workers. Critical issues: Roma and Antium are flagged as undefended. I see phalanx #160 at Roma (10,58) and phalanx #171 at Antium (13,59) - they need to fortify for defense. I have a massive army of warriors that should be ..."
And just earlier: "Turn 68: I have 4 cities, opponent location unknown. Critical: Southgate (7,60) is undefended - Phalanx #167 is at (7,60), so I need to fortify it there. I have 23 military units but no enemy sighted yet. Priority: 1) Garrison Southgate with phalanx #167, 2) Fortify defenders in cities, 3) ..."
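To make the calibration idea concrete, here's roughly what I have in mind. Just a sketch: memory, llm, and game_state stand in for whatever the harness actually exposes.

    # Hypothetical periodic planning step; all interfaces here are made up.
    PLAN_EVERY = 10  # recalibrate every N turns

    def maybe_replan(turn, game_state, memory, llm):
        """Every PLAN_EVERY turns, ask the model for a multi-turn plan and store
        it, so per-turn prompts can reference the plan instead of re-deriving
        the same 'garrison the undefended city' logic."""
        if turn % PLAN_EVERY != 0:
            return memory.get("current_plan")

        prompt = (
            f"You are planning the next {PLAN_EVERY} turns.\n"
            f"Game state summary: {game_state.summary()}\n"
            f"Previous plan: {memory.get('current_plan')}\n"
            "List 3-5 strategic goals and the conditions under which each "
            "should be abandoned or revised."
        )
        plan = llm.complete(prompt)
        memory.set("current_plan", plan)
        return plan

The per-turn prompt would then include the current plan, so the model spends its tokens on deltas instead of restating the whole situation every turn.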
Opus 4.6 just dropped, so we’re tossing it straight into the arena.
CivBench measures agents the hard way: long-horizon strategy in a Civilization-style simulator. The benchmark is full of hidden information, shifting incentives, and an adversary that’s actively trying to ruin your plan. Hundreds of turns where small mistakes compound.
In 15 minutes we're running an exhibition match: Claude Opus 4.6 vs GPT-5.2, live.
One note on the setup: we’re running GPT-5.2 right now, and we’ll switch to 5.3-Codex the moment it’s available via API.
After the game, we'll have full receipts: the replay, logs, and transparent ELO. No “trust us” charts. If you want to see how these models actually behave under pressure (not just how they score on static tests), come watch live.
Feedback welcome, especially from people working on agent evals or RL.
What sort of context do you give the models when you start the game? Do they need to learn the rules as they go?
We have a standard harness for each of the models that we test. Each prompt includes the rules, access to memory, and a lookup of the complete ruleset. The prompt adapts every turn, adding the legal actions for that turn and guidance keyed to the stage of the game (updated based on the player's technological progress; rough sketch below).
Unlike RL algorithms, these LLMs wouldn't learn quickly enough without the prior knowledge the harness provides.
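For a rough idea of what that assembly looks like each turn, here's a simplified sketch; the function and variable names (build_turn_prompt, STAGE_GUIDANCE, etc.) are illustrative, not the actual harness code:

    # Simplified sketch of per-turn prompt assembly; names are illustrative.
    def build_turn_prompt(ruleset, memory, state):
        stage = stage_from_tech(state.known_techs)   # e.g. "early expansion", "classical"
        sections = [
            "== Rules summary ==",
            ruleset.summary(),
            "== Saved memory ==",
            memory.recall(limit=20),                 # notes the model wrote on earlier turns
            "== Current game state ==",
            state.describe(),
            "== Legal actions this turn ==",
            "\n".join(action.describe() for action in state.legal_actions()),
            "== Stage guidance ==",
            STAGE_GUIDANCE[stage],                   # hand-written tips keyed to tech progress
        ]
        return "\n".join(sections)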
what do you use for memory?
A tool call over Redis for now; would be cool to experiment with different context/memory-management systems for the agents though!
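For anyone curious about the shape of it, here's a minimal sketch with redis-py; the key names and tool signatures are made up for illustration, not the production setup:

    # Minimal sketch of a Redis-backed memory tool; key names are invented.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    def memory_write(agent_id: str, key: str, value: str) -> str:
        """Tool the model can call to persist a note across turns."""
        r.hset(f"civbench:{agent_id}:memory", key, value)
        return "ok"

    def memory_read(agent_id: str, key: str = "") -> str:
        """Tool the model can call to read one note, or dump everything."""
        h = f"civbench:{agent_id}:memory"
        if key:
            return r.hget(h, key) or ""
        return json.dumps(r.hgetall(h))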
Using a complex environment like Freeciv as your vehicle for benchmarking is impressive, but it also means you have a lot of confounding variables at play. How do you extract meaningful capability insight from that, as opposed to simpler benchmarks like MMLU or GSM8K?
Like you said, there's a lot of complexity in the decision-making here. To get statistically significant results we need to run these simulations many times. We record latency, tool calls, token consumption, etc., as well as results. Since we log the actions and their final outcomes, we can later analyze how individual decisions correlate with success (rough sketch of that analysis at the end of this comment). Our hypothesis is that games are an important benchmark for how these models' intelligence will generalize as they become more capable.
For example, I'm sure an RL bot could figure out an optimal strategy over millions of simulations that defeats current LLMs even when they have context, but that may not always hold true.
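To give a flavor of the kind of post-hoc analysis I mean, a sketch; the log file and its columns are invented for the example:

    # Sketch of the post-hoc analysis; the log schema here is invented.
    import json
    import pandas as pd

    # One JSON object per logged decision, joined with the game's final result.
    with open("game_logs.jsonl") as f:
        rows = [json.loads(line) for line in f]
    df = pd.DataFrame(rows)  # e.g. columns: turn, action_type, latency_ms, tokens, won (0/1)

    # Average outcome of decisions, grouped by action type.
    print(df.groupby("action_type")["won"].mean().sort_values())

    # Does per-decision token spend or latency correlate with winning?
    print(df[["tokens", "latency_ms", "won"]].corr()["won"])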
Super interesting, thanks for sharing.
thanks for checking it out, let me know if there are other game environments you'd want to see!
betting wen?
we need polymarket!
polymarket market soon??