
Cooperation failures in multi-agent interactions could lead to catastrophic outcomes even among aligned AI agents. Classic cooperation problems such as the Prisoner’s Dilemma or the Tragedy of the Commons have been useful for illustrating and exploring this challenge, but toy experiments with current language models cannot provide robust evidence for how advanced agents will behave in real-world settings. To better understand how to prevent cooperation failures among AI agents we propose a shift in focus from simulating entire scenarios to studying specific agent properties. If we can (1) understand the causal relationships between properties of agents and outcomes in multi-agent interactions and (2) evaluate to what extent those properties are present in agents, this provides a path towards actionable results that could inform agent design and regulation. We provide an initial list of agent properties to be considered as targets for such work.
The dynamics of AI interactions are rapidly changing in two important ways. Firstly, as more agentic systems are deployed, there is a shift from dyadic (between a single model and a single user) to multi-agent interactions (potentially involving many different systems, agents and humans). Secondly, and as a result of the first change, the nature of these interactions changes from predominantly cooperative to mixed-motive (Hammond et al. 2025).
A well-known challenge of mixed-motive interactions is cooperation problems: situations where rational agents fail to achieve mutually beneficial outcomes. In human societies there are many mechanisms in place that mitigate such effects, from biological traits that favour cooperation to social norms and formal laws and institutions (Melis and Semmann 2010). As more agentic AI systems are deployed, it is important to note that corresponding mechanisms are mostly not in place to make their mixed-motive interactions safe.
Cooperation failures have been extensively studied in game theory. Canonical problems such as the Prisoners' Dilemma and the Tragedy of the Commons are therefore natural starting points for work on cooperative AI, and such setups have been explored both in the context of multi-agent reinforcement learning (Perolat et al. 2017; Haupt et al. 2024; Sandholm and Crites 1996; Babes et al. 2008) and large language models (Piatti et al. 2024; Chan et al. 2023; Fontana et al. 2024; Brookins and DeBacker 2024; Akata et al. 2025). A benefit of such experiments is that the observed results can be classified in a relatively straightforward way as cooperative or uncooperative, or as more or less fair.
There are, however, several reasons why cooperative behaviour in these experiments is insufficient to make us confidently expect cooperative behaviour in real-world situations. Firstly, these cooperation scenarios and experimental environments tend to be overly simplistic. While simplicity and close resemblance to classical game-theoretic problems makes analysing the results of such experiments easier, it also makes generalisation to more complex settings hard. Using complex cooperation scenarios, on the other hand, makes it difficult to interpret results and to distinguish cooperativeness from more generic reasoning capabilities (Hua et al. 2024).
Secondly, current LLM-based agents are highly context-dependent. When agent behaviour is heavily influenced by variations in framing that are independent of the game-theoretic concepts that we aim to study, this severely limits the usefulness of the results (Lorè and Heydari 2023). Relatedly, these classical games are well-known and can be expected to occur in the training data, which should also be expected to influence agent behaviour in ways that might not generalise to more complex settings.
This leads to the third issue, which is that current LLM-based agents are not very strategic or goal-oriented (Hua et al. 2024). If an evaluation or benchmark is centred around agents pursuing different goals but the agents in question are not sufficiently competent at pursuing such goals, observations of “cooperation” or “defection” might not in fact indicate any meaningful strategic intentions. While we can expect that the strategic capabilities of agents will improve, there is a risk in the short term that cooperation resulting from such strategic incompetence could lead to false reassurance if it is interpreted as evidence of safe behaviour in a broader sense. Unless we thoroughly understand what causes defection or cooperation in an experiment we cannot draw conclusions about how agents would behave in real-world settings.
Finally, evaluation through simulated cooperation scenarios makes it difficult to control for ‘situational awareness’. Recent work points to LLM behaviour changing when the model recognises the setting as an evaluation (Kovarik et al. 2025; Schoen et al. 2025; Greenblatt et al. 2024; Phuong et al. 2025; Needham et al. 2025), and this risk seems particularly salient in simulations of scenarios that are very different from settings in which LLM agents are currently deployed.
We propose an alternative approach to measuring cooperative or un-cooperative behaviour directly, which is to break this behaviour down into constituent capabilities and propensities. The scope here is limited to the properties of the agents, while acknowledging that external infrastructure will also be important to ensure safe agent interactions (Chan et al. 2025).
Evaluation of cooperativeness through constituent capabilities and propensities depends on two complementary types of work:
This approach has several advantages compared to simulation of cooperation problem scenarios. When we focus on one property at a time, that property can more easily be systematically assessed using many different and complementary scenarios or tasks. This makes it possible to achieve more robust results without necessarily escalating task complexity too much, and to isolate the effects of reasoning capabilities, framing variations and other confounding variables (Mallen et al. 2025). We can also more systematically reason about plausible dependencies on goal-directedness, and observe how a propensity for a given behaviour correlates with increasing general capabilities (Oesterheld et al. 2025; Ren et al. 2024). At least in some cases, it also seems more feasible to construct tasks that are difficult to recognise as evaluations, mitigating the challenge of situational awareness.
We propose five clusters of agent properties, listed in Table 1, as relevant for predicting the outcomes of multi-agent interactions2. Importantly, there is no claim here about whether the listed properties would be desirable or not, as this will depend both on the combination of properties that an agent has and on the context of its deployment.
We also recognise that for certain properties, the financial incentives for development and deployment are strong and the prospects for regulating them are limited. The central question might then not be “is this a property that we should allow?” but rather “how can we mitigate the risks as agents with these properties are deployed?”.
While the cluster headings are intended to cover the agent properties that are important for predicting outcomes in cooperation problems, the properties listed under each are not meant to be exhaustive and many of them are overlapping and interconnected. For example, credibility in communication builds to a large extent on the ability to model the agent you communicate with; trustworthiness can rely on consistency and accountability; exploitability may be heavily influenced by the baseline assumptions an agent makes about unknown properties of other agents. Such connections are important to consider when developing evaluations for AI agents. In particular, we need to consider how results may be influenced by strategic (in)competence when other properties are assessed.
In practice, we expect that the definition of each specific property will need to be refined and potentially broken down to a finer granularity in the process of evaluation. In the case of normative compliance, it may, for example, be important to distinguish instrumental compliance where the agent strategically avoids the consequences of norm transgressions, from compliant behaviour that is independent of such explicit strategising. Properties may also manifest differently depending on the qualities of the other agents; an agent may exhibit a certain property in one group but not another, and it is important to identify such phenomena to understand how results will or will not generalise.
Many properties can also be considered either from the perspective of capabilities or propensities. “Capabilities” refers to behaviours or actions that an autonomous agent is able to perform while “propensities” refers to the tendency of an agent to express one behaviour over another. A clear example is deception, where the capability of deception is distinct from the propensity to deceive. Propensities and fundamental motivational drivers in current models can be expected to mainly be steered by fine-tuning for harmless and helpful behaviours, and this is likely not very informative of how future agents that are trained for different purposes would behave. Monitoring how these properties develop with increasing goal-directedness and decreasing exploitability will however be key to understanding what multi-agent dynamics can be expected to emerge over time.
There are great financial incentives for the development and deployment of more agentic AI, and if this is done without consideration for multi-agent safety the result could be catastrophic (Hammond et al., 2025). At the same time, it is intrinsically challenging to do meaningful work on risks that have not yet materialised, and that are rooted in properties that current models do not display. While some of the behaviours that are strongly related to multi-agent risks are unlikely to arise until agents become more consistently goal-directed and less exploitable, this should not be taken as an argument for inaction until such agents arrive. We propose studying the constituent properties that predict agent behaviour in cooperation problems as a tractable approach to make progress on multi-agent safety.
An important uncertainty with this approach is how sensitive it is to the precise way agent properties are specified. If some key properties are neglected, misunderstood, or ill-defined, does this render past analysis obsolete? Simulations of scenarios with cooperation challenges may therefore be valuable complements to property-based evaluations, as they could confirm that the outcomes in multi-agent interactions turn out as predicted among agents with a specific set of properties. Using scenario simulations in such hypothesis-testing experiments is, however, quite different from making them the basis of agent evaluations.
Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. “Playing Repeated Games with Large Language Models.” Nature Human Behaviour 9 (7): 1380–90. https://doi.org/10.1038/s41562-025-02172-y.
Babes, Monica, Enrique Munoz de Cote, and Michael L. Littman. 2008. “Social Reward Shaping in the Prisoner’s Dilemma.” Paper presented at Proceedings of the International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS) (01/01/08). https://eprints.soton.ac.uk/266919/.
Brookins, Philip, and Jason DeBacker. 2024. “Playing Games with GPT: What Can We Learn about a Large Language Model from Canonical Strategic Games?” Economics Bulletin 44 (1): 25–37.
Chan, Alan, Maxime Riché, and Jesse Clifton. 2023. “Towards the Scalable Evaluation of Cooperativeness in Language Models.” arXiv:2303.13360. Preprint, arXiv, March 16. https://doi.org/10.48550/arXiv.2303.13360.
Chan, Alan, Kevin Wei, Sihao Huang, et al. 2025. “Infrastructure for AI Agents.” arXiv:2501.10114. Preprint, arXiv, June 19. https://doi.org/10.48550/arXiv.2501.10114.
Fontana, Nicoló, Francesco Pierri, and Luca Maria Aiello. 2024. “Nicer Than Humans: How Do Large Language Models Behave in the Prisoner’s Dilemma?” arXiv:2406.13605. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2406.13605.
Greenblatt, Ryan, Carson Denison, Benjamin Wright, et al. 2024. “Alignment Faking in Large Language Models.” arXiv:2412.14093. Preprint, arXiv, December 20. https://doi.org/10.48550/arXiv.2412.14093.
Hammond, Lewis, Alan Chan, Jesse Clifton, et al. 2025. “Multi-Agent Risks from Advanced AI.” arXiv:2502.14143. Preprint, arXiv, February 19. https://doi.org/10.48550/arXiv.2502.14143.
Haupt, Andreas A., Phillip J. K. Christoffersen, Mehul Damani, and Dylan Hadfield-Menell. 2024. “Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL.” arXiv:2208.10469. Preprint, arXiv, January 29. https://doi.org/10.48550/arXiv.2208.10469.
Hua, Wenyue, Ollie Liu, Lingyao Li, et al. 2024. “Game-Theoretic LLM: Agent Workflow for Negotiation Games.” arXiv:2411.05990. Preprint, arXiv, November 12. https://doi.org/10.48550/arXiv.2411.05990.
Kovarik, Vojtech, Eric Olav Chen, Sami Petersen, Alexis Ghersengorin, and Vincent Conitzer. 2025. “AI Testing Should Account for Sophisticated Strategic Behaviour.” arXiv:2508.14927. Preprint, arXiv, August 19. https://doi.org/10.48550/arXiv.2508.14927.
Lorè, Nunzio, and Babak Heydari. 2023. “Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing.” arXiv:2309.05898. Preprint, arXiv, September 12. https://doi.org/10.48550/arXiv.2309.05898.
Mallen, Alex, Charlie Griffin, Misha Wagner, Alessandro Abate, and Buck Shlegeris. 2025. “Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols?” arXiv:2412.12480. Preprint, arXiv, April 4. https://doi.org/10.48550/arXiv.2412.12480.
Melis, Alicia P., and Dirk Semmann. 2010. “How Is Human Cooperation Different?” Philosophical Transactions of the Royal Society B: Biological Sciences 365 (1553): 2663–74. https://doi.org/10.1098/rstb.2010.0157.
Needham, Joe, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. 2025. “Large Language Models Often Know When They Are Being Evaluated.” arXiv:2505.23836. Preprint, arXiv, July 16. https://doi.org/10.48550/arXiv.2505.23836.
Oesterheld, Caspar, Emery Cooper, Miles Kodama, Linh Chi Nguyen, and Ethan Perez. 2025. “A Dataset of Questions on Decision-Theoretic Reasoning in Newcomb-like Problems.” arXiv:2411.10588. Preprint, arXiv, June 16. https://doi.org/10.48550/arXiv.2411.10588.
Perolat, Julien, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. “A Multi-Agent Reinforcement Learning Model of Common-Pool Resource Appropriation.” arXiv:1707.06600. Preprint, arXiv, September 6. https://doi.org/10.48550/arXiv.1707.06600.
Phuong, Mary, Roland S. Zimmermann, Ziyue Wang, et al. 2025. “Evaluating Frontier Models for Stealth and Situational Awareness.” arXiv:2505.01420. Preprint, arXiv, July 3. https://doi.org/10.48550/arXiv.2505.01420.
Piatti, Giorgio, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. “Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents.” arXiv:2404.16698. Preprint, arXiv, December 8. https://doi.org/10.48550/arXiv.2404.16698.
Ren, Richard, Steven Basart, Adam Khoja, et al. 2024. “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” arXiv:2407.21792. Preprint, arXiv, December 27. https://doi.org/10.48550/arXiv.2407.21792.
Sandholm, Tuomas W., and Robert H. Crites. 1996. “Multiagent Reinforcement Learning in the Iterated Prisoner’s Dilemma.” Biosystems 37 (1): 147–66. https://doi.org/10.1016/0303-2647(95)01551-5.
Schoen, Bronson, Evgenia Nitishinskaya, Mikita Balesni, et al. 2025. “Stress Testing Deliberative Alignment for Anti-Scheming Training.” arXiv:2509.15541. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2509.15541.
