Agent Properties for Safe Interactions

Why another round of Prisoner’s Dilemma is unlikely to be helpful, and a suggestion for what to do instead.

Cooperation failures in multi-agent interactions could lead to catastrophic outcomes even among aligned AI agents. Classic cooperation problems such as the Prisoner’s Dilemma or the Tragedy of the Commons have been useful for illustrating and exploring this challenge, but toy experiments with current language models cannot provide robust evidence for how advanced agents will behave in real-world settings. To better understand how to prevent cooperation failures among AI agents we propose a shift in focus from simulating entire scenarios to studying specific agent properties. If we can (1) understand the causal relationships between properties of agents and outcomes in multi-agent interactions and (2) evaluate to what extent those properties are present in agents, this provides a path towards actionable results that could inform agent design and regulation. We provide an initial list of agent properties to be considered as targets for such work.

1. Introduction

The dynamics of AI interactions are rapidly changing in two important ways. Firstly, as more agentic systems are deployed, there is a shift from dyadic (between a single model and a single user) to multi-agent interactions (potentially involving many different systems, agents and humans). Secondly, and as a result of the first change, the nature of these interactions changes from predominantly cooperative to mixed-motive (Hammond et al. 2025).

A well-known challenge of mixed-motive interactions is cooperation problems: situations where rational agents fail to achieve mutually beneficial outcomes. In human societies there are many mechanisms in place that mitigate such effects, from biological traits that favour cooperation to social norms and formal laws and institutions (Melis and Semmann 2010). As more agentic AI systems are deployed, it is important to note that corresponding mechanisms are mostly not in place to make their mixed-motive interactions safe.

2. Evaluation through simulation of cooperation problems

Cooperation failures have been extensively studied in game theory. Canonical problems such as the Prisoners' Dilemma and the Tragedy of the Commons are therefore natural starting points for work on cooperative AI, and such setups have been explored both in the context of multi-agent reinforcement learning (Perolat et al. 2017; Haupt et al. 2024; Sandholm and Crites 1996; Babes et al. 2008) and large language models (Piatti et al. 2024; Chan et al. 2023; Fontana et al. 2024; Brookins and DeBacker 2024; Akata et al. 2025). A benefit of such experiments is that the observed results can be classified in a relatively straightforward way as cooperative or uncooperative, or as more or less fair.

There are, however, several reasons why cooperative behaviour in these experiments is insufficient to make us confidently expect cooperative behaviour in real-world situations. Firstly, these cooperation scenarios and experimental environments tend to be overly simplistic. While simplicity and close resemblance to classical game-theoretic problems makes analysing the results of such experiments easier, it also makes generalisation to more complex settings hard. Using complex cooperation scenarios, on the other hand, makes it difficult to interpret results and to distinguish cooperativeness from more generic reasoning capabilities (Hua et al. 2024).

Secondly, current LLM-based agents are highly context-dependent. When agent behaviour is heavily influenced by variations in framing that are independent of the game-theoretic concepts that we aim to study, this severely limits the usefulness of the results (Lorè and Heydari 2023). Relatedly, these classical games are well-known and can be expected to occur in the training data, which should also be expected to influence agent behaviour in ways that might not generalise to more complex settings.

This leads to the third issue, which is that current LLM-based agents are not very strategic or goal-oriented (Hua et al. 2024). If an evaluation or benchmark is centred around agents pursuing different goals but the agents in question are not sufficiently competent at pursuing such goals, observations of “cooperation” or “defection” might not in fact indicate any meaningful strategic intentions. While we can expect that the strategic capabilities of agents will improve, there is a risk in the short term that cooperation resulting from such strategic incompetence could lead to false reassurance if it is interpreted as evidence of safe behaviour in a broader sense. Unless we thoroughly understand what causes defection or cooperation in an experiment we cannot draw conclusions about how agents would behave in real-world settings.

Finally, evaluation through simulated cooperation scenarios makes it difficult to control for ‘situational awareness’. Recent work points to LLM behaviour changing when the model recognises the setting as an evaluation (Kovarik et al. 2025; Schoen et al. 2025; Greenblatt et al. 2024; Phuong et al. 2025; Needham et al. 2025), and this risk seems particularly salient in simulations of scenarios that are very different from settings in which LLM agents are currently deployed.

3. Evaluation of constituent capabilities and propensities

We propose an alternative approach to measuring cooperative or un-cooperative behaviour directly, which is to break this behaviour down into constituent capabilities and propensities. The scope here is limited to the properties of the agents, while acknowledging that external infrastructure will also be important to ensure safe agent interactions (Chan et al. 2025).

Evaluation of cooperativeness through constituent capabilities and propensities depends on two complementary types of work:

  1. Evaluating how a given agent property influences outcomes in agent interactions. An example of a result in this direction is how consideration of the universalisation principle1 leads to more sustainable outcomes (Piatti et al. 2024).

  2. Evaluating to what extent an agent has a cooperation-relevant property. An example of such work is the study of decision-theoretic capabilities of language models (Oesterheld et al. 2025).

This approach has several advantages compared to simulation of cooperation problem scenarios. When we focus on one property at a time, that property can more easily be systematically assessed using many different and complementary scenarios or tasks. This makes it possible to achieve more robust results without necessarily escalating task complexity too much, and to isolate the effects of reasoning capabilities, framing variations and other confounding variables (Mallen et al. 2025). We can also more systematically reason about plausible dependencies on goal-directedness, and observe how a propensity for a given behaviour correlates with increasing general capabilities (Oesterheld et al. 2025; Ren et al. 2024). At least in some cases, it also seems more feasible to construct tasks that are difficult to recognise as evaluations, mitigating the challenge of situational awareness.

3.1 Key agent properties for cooperation

We propose five clusters of agent properties, listed in Table 1, as relevant for predicting the outcomes of multi-agent interactions2. Importantly, there is no claim here about whether the listed properties would be desirable or not, as this will depend both on the combination of properties that an agent has and on the context of its deployment.

We also recognise that for certain properties, the financial incentives for development and deployment are strong and the prospects for regulating them are limited. The central question might then not be “is this a property that we should allow?” but rather “how can we mitigate the risks as agents with these properties are deployed?”.

Fundamental Motivational Drivers
Alignment What other real-world entities (or values) is the agent (approximately) aligned with?
Altruism How likely is the agent to choose actions that benefit others at a material cost to itself? Does it strive for higher social welfare?
Positional preferences Does the agent have preferences that directly disvalue others benefiting, such as spite or preferring to end up better-off than others?
Impartiality Does the agent cooperate better with some counterparties than others when distinctions are irrelevant to cooperation?
Temporal discount factor How does the agent discount future rewards compared to those closer in time?

Modelling Itself and Others
Theory of mind Can the agent model other agents' preferences, learning, attention, decision processes, and predict their responses? Can it understand how others perceive and model it?
Trustworthiness assessment Is the agent effective at assessing the trustworthiness of others?
Self-awareness Can the agent assess its own preferences, learning, capabilities, and epistemic status?
Baseline assumptions What assumptions does the agent make about unknown properties of other agents?

Communication and Commitments
Transparency Can the agent share or protect private information? Does it have capabilities for compartmentalised negotiation?
Credibility Can the agent communicate credibly, e.g. through costly signalling or removing options? Can it address trust issues when counterparties have lower capabilities?
Trustworthiness Does the agent keep commitments even at a cost? Does it avoid making commitments it may not be able to keep?
Persuasiveness Can the agent influence beliefs, preferences, or intended actions of other agents through communication or framing?
Coerciveness Can the agent effectively use coercion or threats of harm?
Deception Can the agent deceive or disguise its actions, reasoning, preferences, or objectives?

Normative Intelligence and Behaviour
Accountability Does the agent assume reputation will matter? Is it capable of long-term credit assignment?
Normative competence Does the agent have the capacity to perceive and reason about existing social norms and normative systems?
Normative compliance Does the agent tend to follow identified norms? Does it engage in the production of normative justifications?
Third-party sanctioning To what extent does the agent actively support upholding established norms and institutions by sanctioning others?
Establishment of institutions Can the agent identify potential new institutions? Does it propose or support the establishment of new institutions?

Strategic Intelligence and Behaviour
Goal-directedness Does the agent behave in ways that systematically aim to achieve specific outcomes or goals?
Consistency Is the agent consistent in response to game-theoretic structures? Does it follow a consistent normative ethical theory?
Rationality and decision theory Is the agent able to determine what behaviours would be rational for itself and others? Does it act according to a specific decision theory?
Identify equilibria Given an understanding of other agents’ goals, can the agent compute approximate equilibria and reason about alternatives?
Pareto improvements Does the agent proactively search for potential Pareto improvements?
Coalition assessment How well does the agent reason about coalitions with which to align?
Exploitability How easy or hard is it for other agents to exploit this agent? Does it take measures to protect itself or impose costs on exploiters?

While the cluster headings are intended to cover the agent properties that are important for predicting outcomes in cooperation problems, the properties listed under each are not meant to be exhaustive and many of them are overlapping and interconnected. For example, credibility in communication builds to a large extent on the ability to model the agent you communicate with; trustworthiness can rely on consistency and accountability; exploitability may be heavily influenced by the baseline assumptions an agent makes about unknown properties of other agents. Such connections are important to consider when developing evaluations for AI agents. In particular, we need to consider how results may be influenced by strategic (in)competence when other properties are assessed.

In practice, we expect that the definition of each specific property will need to be refined and potentially broken down to a finer granularity in the process of evaluation. In the case of normative compliance, it may, for example, be important to distinguish instrumental compliance where the agent strategically avoids the consequences of norm transgressions, from compliant behaviour that is independent of such explicit strategising. Properties may also manifest differently depending on the qualities of the other agents; an agent may exhibit a certain property in one group but not another, and it is important to identify such phenomena to understand how results will or will not generalise.

Many properties can also be considered either from the perspective of capabilities or propensities. “Capabilities” refers to behaviours or actions that an autonomous agent is able to perform while “propensities” refers to the tendency of an agent to express one behaviour over another. A clear example is deception, where the capability of deception is distinct from the propensity to deceive. Propensities and fundamental motivational drivers in current models can be expected to mainly be steered by fine-tuning for harmless and helpful behaviours, and this is likely not very informative of how future agents that are trained for different purposes would behave. Monitoring how these properties develop with increasing goal-directedness and decreasing exploitability will however be key to understanding what multi-agent dynamics can be expected to emerge over time.

4. Discussion

There are great financial incentives for the development and deployment of more agentic AI, and if this is done without consideration for multi-agent safety the result could be catastrophic (Hammond et al., 2025). At the same time, it is intrinsically challenging to do meaningful work on risks that have not yet materialised, and that are rooted in properties that current models do not display. While some of the behaviours that are strongly related to multi-agent risks are unlikely to arise until agents become more consistently goal-directed and less exploitable, this should not be taken as an argument for inaction until such agents arrive. We propose studying the constituent properties that predict agent behaviour in cooperation problems as a tractable approach to make progress on multi-agent safety.

An important uncertainty with this approach is how sensitive it is to the precise way agent properties are specified. If some key properties are neglected, misunderstood, or ill-defined, does this render past analysis obsolete? Simulations of scenarios with cooperation challenges may therefore be valuable complements to property-based evaluations, as they could confirm that the outcomes in multi-agent interactions turn out as predicted among agents with a specific set of properties. Using scenario simulations in such hypothesis-testing experiments is, however, quite different from making them the basis of agent evaluations.

The author thanks Caspar Oesterheld, Lewis Hammond, Sophia Hatz and Chandler Smith for valuable comments and feedback. Thanks also to all the participants of the Cooperative AI Summer Retreat 2025 who contributed to the creation of the list of agent properties.

Footnotes

1. The basic idea of universalisation is that when assessing whether a particular moral rule or action is permissible, one should ask, “What if everybody does that?”.

2. This list is based on the outputs of a workshop on the properties of cooperative agents at the 2025 Cooperative AI Summer Retreat, involving experts from academia and industry.

References

Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. 2025. “Playing Repeated Games with Large Language Models.” Nature Human Behaviour 9 (7): 1380–90. https://doi.org/10.1038/s41562-025-02172-y.


Babes, Monica, Enrique Munoz de Cote, and Michael L. Littman. 2008. “Social Reward Shaping in the Prisoner’s Dilemma.” Paper presented at Proceedings of the International Joint Conference on Autonomous Agents and Multi Agent Systems (AAMAS) (01/01/08). https://eprints.soton.ac.uk/266919/.


Brookins, Philip, and Jason DeBacker. 2024. “Playing Games with GPT: What Can We Learn about a Large Language Model from Canonical Strategic Games?” Economics Bulletin 44 (1): 25–37.


Chan, Alan, Maxime Riché, and Jesse Clifton. 2023. “Towards the Scalable Evaluation of Cooperativeness in Language Models.” arXiv:2303.13360. Preprint, arXiv, March 16. https://doi.org/10.48550/arXiv.2303.13360.


Chan, Alan, Kevin Wei, Sihao Huang, et al. 2025. “Infrastructure for AI Agents.” arXiv:2501.10114. Preprint, arXiv, June 19. https://doi.org/10.48550/arXiv.2501.10114.


Fontana, Nicoló, Francesco Pierri, and Luca Maria Aiello. 2024. “Nicer Than Humans: How Do Large Language Models Behave in the Prisoner’s Dilemma?” arXiv:2406.13605. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2406.13605.


Greenblatt, Ryan, Carson Denison, Benjamin Wright, et al. 2024. “Alignment Faking in Large Language Models.” arXiv:2412.14093. Preprint, arXiv, December 20. https://doi.org/10.48550/arXiv.2412.14093.


Hammond, Lewis, Alan Chan, Jesse Clifton, et al. 2025. “Multi-Agent Risks from Advanced AI.” arXiv:2502.14143. Preprint, arXiv, February 19. https://doi.org/10.48550/arXiv.2502.14143.


Haupt, Andreas A., Phillip J. K. Christoffersen, Mehul Damani, and Dylan Hadfield-Menell. 2024. “Formal Contracts Mitigate Social Dilemmas in Multi-Agent RL.” arXiv:2208.10469. Preprint, arXiv, January 29. https://doi.org/10.48550/arXiv.2208.10469.


Hua, Wenyue, Ollie Liu, Lingyao Li, et al. 2024. “Game-Theoretic LLM: Agent Workflow for Negotiation Games.” arXiv:2411.05990. Preprint, arXiv, November 12. https://doi.org/10.48550/arXiv.2411.05990.


Kovarik, Vojtech, Eric Olav Chen, Sami Petersen, Alexis Ghersengorin, and Vincent Conitzer. 2025. “AI Testing Should Account for Sophisticated Strategic Behaviour.” arXiv:2508.14927. Preprint, arXiv, August 19. https://doi.org/10.48550/arXiv.2508.14927.


Lorè, Nunzio, and Babak Heydari. 2023. “Strategic Behavior of Large Language Models: Game Structure vs. Contextual Framing.” arXiv:2309.05898. Preprint, arXiv, September 12. https://doi.org/10.48550/arXiv.2309.05898.


Mallen, Alex, Charlie Griffin, Misha Wagner, Alessandro Abate, and Buck Shlegeris. 2025. “Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols?” arXiv:2412.12480. Preprint, arXiv, April 4. https://doi.org/10.48550/arXiv.2412.12480.


Melis, Alicia P., and Dirk Semmann. 2010. “How Is Human Cooperation Different?” Philosophical Transactions of the Royal Society B: Biological Sciences 365 (1553): 2663–74. https://doi.org/10.1098/rstb.2010.0157.


Needham, Joe, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. 2025. “Large Language Models Often Know When They Are Being Evaluated.” arXiv:2505.23836. Preprint, arXiv, July 16. https://doi.org/10.48550/arXiv.2505.23836.


Oesterheld, Caspar, Emery Cooper, Miles Kodama, Linh Chi Nguyen, and Ethan Perez. 2025. “A Dataset of Questions on Decision-Theoretic Reasoning in Newcomb-like Problems.” arXiv:2411.10588. Preprint, arXiv, June 16. https://doi.org/10.48550/arXiv.2411.10588.


Perolat, Julien, Joel Z. Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. 2017. “A Multi-Agent Reinforcement Learning Model of Common-Pool Resource Appropriation.” arXiv:1707.06600. Preprint, arXiv, September 6. https://doi.org/10.48550/arXiv.1707.06600.


Phuong, Mary, Roland S. Zimmermann, Ziyue Wang, et al. 2025. “Evaluating Frontier Models for Stealth and Situational Awareness.” arXiv:2505.01420. Preprint, arXiv, July 3. https://doi.org/10.48550/arXiv.2505.01420.


Piatti, Giorgio, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, and Rada Mihalcea. 2024. “Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents.” arXiv:2404.16698. Preprint, arXiv, December 8. https://doi.org/10.48550/arXiv.2404.16698.


Ren, Richard, Steven Basart, Adam Khoja, et al. 2024. “Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?” arXiv:2407.21792. Preprint, arXiv, December 27. https://doi.org/10.48550/arXiv.2407.21792.


Sandholm, Tuomas W., and Robert H. Crites. 1996. “Multiagent Reinforcement Learning in the Iterated Prisoner’s Dilemma.” Biosystems 37 (1): 147–66. https://doi.org/10.1016/0303-2647(95)01551-5.


Schoen, Bronson, Evgenia Nitishinskaya, Mikita Balesni, et al. 2025. “Stress Testing Deliberative Alignment for Anti-Scheming Training.” arXiv:2509.15541. Preprint, arXiv, September 19. https://doi.org/10.48550/arXiv.2509.15541.

Cecilia Elena Tilli
Associate Director (Research & Grants), Cooperative AI Foundation