Evaluation for Cooperative AI

CAIF is seeking proposals for research that improves our ability to evaluate cooperation-relevant features of AI systems. At this early stage in the development of the Cooperative AI field, we think it is crucial that we attain conceptual clarity around cooperative intelligence and related concepts, are able to measure key aspects of cooperative behavior, and develop a collection of benchmarks rich enough to yield generalizable insights about Cooperative AI. Anyone is eligible to apply, and we welcome applications from disciplines outside of computer science.


CAIF’s mission is to support research that will improve the cooperative intelligence of advanced AI for the benefit of all. Even defining cooperative intelligence is not straightforward, however, as we discuss further below in the first supporting document associated with this call. Importantly, our focus is on AI systems deployed by self-interested human principals. Thus, the goal of Cooperative AI is not to support the development of selflessly altruistic AI systems, except insofar as self-interested principals regard using these systems as in their interest. Rather, our mission is based on the hypothesis that many improvements in the ability of self-interested actors to cooperate will make humanity as a whole better off.

We are ultimately interested in the development of AI systems that can assist humans in their interactions with one another, and autonomously cooperate with a variety of agents with a range of different preferences. These variations give rise to several different contexts in which we might wish to evaluate how cooperative an AI system is, such as:

  • Evaluation with humans. If AI systems are intended to assist humans, then ideally they will be evaluated in interactions with humans. A recent example is the use of environments inspired by the cooperative video game Overcooked.
  • Evaluation with a fixed distribution of AI agents. If AI systems are intended to interact with other AI systems (or evaluation against humans is infeasible), it may instead be appropriate to evaluate them against a fixed distribution of other AI systems. This is the setting studied in work on ad-hoc coordination and in the Melting Pot benchmark suite.
  • Evaluation with a dynamic distribution of AI agents. Evaluation against a fixed distribution of other agents has the benefit of allowing researchers to choose a distribution of agents with properties they consider interesting or important, but this approach has several shortcomings. First, it may not be clear how to choose this distribution. Second, it may not be realistic to assume that this distribution will remain fixed when, for instance, other developers might deploy new algorithms in response. One alternative is therefore to evaluate against an evolving distribution of agents, as is sometimes done in recurring competitions, for example.
  • Evaluation given possible coordination between developers. The approaches above are particularly appropriate for making recommendations to individual AI developers, who build their systems independently. However, it is also important to consider what might be gained from public recommendations made to all AI developers. Game-theoretically, we could view this as a correlated equilibrium between AI developers, whose strategies are the AI agents that they deploy.
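To make the fixed-distribution setting concrete, the following is a minimal sketch rather than a recommended protocol. It evaluates a focal policy in a repeated prisoner's dilemma against a weighted population of partner policies. The payoff values, the three partner policies, the weights in `POPULATION`, and all function names are assumptions chosen for illustration.

```python
from typing import Callable, List, Tuple

# Illustrative stage-game payoffs (focal, partner) for a prisoner's dilemma.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

# A policy maps the other player's past actions to the next action.
Policy = Callable[[List[str]], str]


def always_cooperate(other_history: List[str]) -> str:
    return "C"


def always_defect(other_history: List[str]) -> str:
    return "D"


def tit_for_tat(other_history: List[str]) -> str:
    # Cooperate first, then copy the other player's previous action.
    return other_history[-1] if other_history else "C"


def play(focal: Policy, partner: Policy, rounds: int = 10) -> Tuple[int, int]:
    """Return total (focal, partner) payoffs over a repeated game."""
    focal_hist: List[str] = []
    partner_hist: List[str] = []
    focal_total = partner_total = 0
    for _ in range(rounds):
        a = focal(partner_hist)
        b = partner(focal_hist)
        pa, pb = PAYOFFS[(a, b)]
        focal_total += pa
        partner_total += pb
        focal_hist.append(a)
        partner_hist.append(b)
    return focal_total, partner_total


# Hypothetical fixed distribution over partner policies (weights sum to 1).
POPULATION = [(always_cooperate, 0.25), (tit_for_tat, 0.5), (always_defect, 0.25)]


def evaluate(focal: Policy, population=POPULATION, rounds: int = 10):
    """Expected focal payoff and expected summed payoff under the distribution."""
    exp_focal = exp_welfare = 0.0
    for partner, weight in population:
        f, p = play(focal, partner, rounds)
        exp_focal += weight * f
        exp_welfare += weight * (f + p)
    return exp_focal, exp_welfare


print(evaluate(tit_for_tat))    # (24.75, 50.75)
print(evaluate(always_defect))  # (22.0, 29.0)
```

Under these assumed weights, the conditionally cooperative policy does better than unconditional defection on both its own expected payoff and the summed payoff. Changing the weights in `POPULATION` can change that ranking, which is one way of seeing why the choice of distribution matters.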

Finally, we note that there are many other dimensions along which the training and evaluation of agents may vary, such as centralized vs. decentralized training, online vs. offline learning, and whether agents have an explicit utility or reward function.

Research Directions

Here we briefly describe the two research directions on which we invite proposals. More in-depth discussion of each direction is provided in the supporting documents further below. These documents highlight aspects of these research directions that we currently believe to be most important, but do not cover all problems that successful proposals might address.

  1. Developing and measuring key Cooperative AI concepts. Clarity around key Cooperative AI concepts is critical for laying firm foundations for the field. This includes developing rigorous definitions of existing concepts like cooperation, cooperative intelligence, and individual cooperative capabilities (coordination, modeling other agents, etc.). It may also include developing new concepts not yet in our vocabulary. Finally, it might include work to clarify our precise goals. For instance, against what distribution of other agents should an agent’s cooperative intelligence be assessed? If cooperative intelligence involves the attainment of high social welfare, what measure(s) of social welfare ought to be used? In addition to this conceptual work, we need practical ways of measuring the concepts in question, for various kinds of AI systems. Aside from the measurement of a unitary “cooperative intelligence” construct, this could involve measuring specific cooperative capabilities, or developing diagnostics for cooperation failure.
  2. Environments and datasets for Cooperative AI research. To encourage and evaluate progress (and, in particular, differential progress) on cooperative intelligence, we need benchmarks that pose rich cooperation challenges for AI systems. Work in this direction should develop datasets or environments that help fuel and assess the development of Cooperative AI, and will ideally be compatible with advances in AI capabilities. Of special interest are benchmarks that evaluate the generalization of cooperative competencies to novel environments, and benchmarks that provide insights into the cooperative capabilities of large language models. In the latter case, we may be able to connect grantees with industry partners for access to private models. For publicly accessible models, see EleutherAI’s GPT-NeoX and OpenAI’s API, which provides structured access to GPT-3.

Application Process

Proposed projects should fall under one (or both) of the research directions listed above. More specific guidance can be found in the supporting documents for each research direction. Applicants should submit a 2-5 page proposal including a budget (references do not count towards this page limit) and their CV(s). Please note that our default policy is to limit indirect costs (overheads) to 10% of the total grant value.

Applications will be accepted on a rolling basis, but we encourage timely submission in order to maximize the chance of funding. Proposals will be evaluated in accordance with our grantmaking principles. Anyone is eligible to apply, and we welcome applications from disciplines outside of computer science.

We are grateful to Asya Bergal, Jakob Foerster, Gillian Hadfield, Joe Halpern, Natasha Jaques, Joel Leibo, and Caspar Oesterheld for providing feedback on previous versions of this call for proposals and the supporting documents below. We also wish to thank Noam Brown, Vince Conitzer, Allan Dafoe, José Hernández-Orallo, Max Kleiman-Weiner, Kamal Ndousse, and Rohin Shah for previous helpful discussions on these topics.

Defining and Measuring Key Cooperative AI Constructs

Jesse Clifton and Lewis Hammond

We first review some challenges for rigorously defining and measuring cooperative intelligence. Research proposals might address these questions, or raise new conceptual or methodological challenges for the assessment of cooperative intelligence. We then list several other directions in the measurement and evaluation for Cooperative AI: measuring individual cooperative capabilities, diagnosing cooperation failures, and evaluation without explicit utility functions. This list does not necessarily exhaust the research directions that CAIF would be interested in funding.

Challenges for Defining and Measuring Cooperative Intelligence

We are ultimately interested in ensuring cooperation for the benefit of all, and successful cooperation involves joint action resulting in greater social welfare. Thus, we wish to measure properties of agents that allow us to predict the extent to which those agents will be able to work together to improve their social welfare in a wide variety of circumstances. Note that by “social welfare” we mean some formal measure of how well an outcome satisfies all agents’ preferences – for instance, the sum or product of all agents’ payoffs.
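For concreteness, the two example measures just mentioned can be written as follows. The notation is introduced here purely for illustration: $u_i(o)$ denotes the payoff of agent $i$ under outcome $o$, and $n$ is the number of agents.

```latex
W_{\mathrm{sum}}(o) = \sum_{i=1}^{n} u_i(o),
\qquad
W_{\mathrm{prod}}(o) = \prod_{i=1}^{n} u_i(o).
```

The product form is only meaningful when payoffs are non-negative; in the bargaining literature it is usually applied to gains measured relative to a disagreement point.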

To illustrate the challenges involved in defining and measuring cooperative intelligence, we introduce the following example of a working definition, based on the Legg and Hutter (2007) definition of intelligence as an agent’s ability to achieve its goals in a wide variety of environments:

Cooperative intelligence is an agent’s ability to achieve their goals in ways that also promote social welfare, in a wide range of environments and with a wide range of other agents.

The notion of cooperative intelligence gestured at in this possible definition has some features that make it challenging to more formally define and measure, as we highlight below. Conceptual work is needed to come up with definitions of cooperative intelligence and related concepts that address these issues. Technical work is needed to measure them, as this may require innovative experimental design.

The role of intentions. In many circumstances, achieving high welfare requires joint action, as opposed to creating spillover benefits for other agents via independent action. For example, consider beavers building a dam, which incidentally creates a pond for fish to live in, despite the fact that the beavers did not intend to create a pond for the fish – i.e., even if there were no fish, the beavers' actions would have been the same (we thank Allan Dafoe for this example). This means the fact that the animals attained high welfare in this setting is not good evidence of their ability to attain high welfare in settings that do require joint effort (as when, for example, solving a social dilemma).
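The counterfactual reading of the beaver example ("even if there were no fish, the actions would have been the same") suggests a crude behavioural check, sketched below. This is only an illustrative proxy under strong assumptions: the policies are hypothetical, the presence of the beneficiary is assumed to be something the experimenter can toggle, and the check is not a definition of intention.

```python
from typing import Callable

# A policy here maps "is the potential beneficiary present?" to an action.
Policy = Callable[[bool], str]


def beaver_policy(beneficiary_present: bool) -> str:
    # Builds the dam regardless of whether any fish are around.
    return "build_dam"


def helper_policy(beneficiary_present: bool) -> str:
    # Hypothetical agent that only builds when someone else benefits.
    return "build_dam" if beneficiary_present else "rest"


def action_depends_on_beneficiary(policy: Policy) -> bool:
    """Return True if removing the beneficiary changes the chosen action."""
    return policy(True) != policy(False)


print(action_depends_on_beneficiary(beaver_policy))  # False: a byproduct benefit
print(action_depends_on_beneficiary(helper_policy))  # True
```

Both policies produce the same observable behaviour when the beneficiary is present, which illustrates why welfare outcomes alone cannot separate byproduct benefits from joint action.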

Byproduct cases are one reason that some writers have emphasized the causal or intentional aspects of cooperation. For example, Paternotte (2014) writes that “[a] first obvious fact is that the cooperative nature of a set of individual actions is underdetermined by observable behavior”. West et al. (2007) distinguish between several concepts in evolutionary biology – including mutualism, mutual benefit, cooperation, altruism, and fitness – and define cooperation as “a behavior which [sic] provides a benefit to another individual (recipient), and is selected for because of its beneficial effect on the recipient”. One example of early-stage work aimed at addressing this issue is given in Jesse Clifton and Sammy Martin's New Directions in Cooperative AI seminar, but other approaches, such as those based on a formal definition of intention (Halpern and Kleiman-Weiner, 2018), may also prove fruitful.

Dependence on the distribution of other agents. In the working definition above, cooperatively intelligent agents achieve high social welfare via joint action with a wide range of other agents. How should we choose the distribution of other agents on which cooperative intelligence is measured? One answer to this question can be found in the work of Hernández-Orallo et al. (2011), who propose the “Darwin-Wallace distribution” – a distribution over agents constructed by running a particular evolutionary process – for the measurement of intelligence in multi-agent settings. In the benchmark suite Melting Pot (Leibo et al. 2021; Joel Leibo’s New Directions in Cooperative AI seminar), agents are evaluated against specially-constructed “background populations”.

An important consideration in the choice of the distribution of agents against which cooperative intelligence should be measured is the existence of multiple equilibria. Complex multi-agent environments typically exhibit multiple equilibria, including multiple Pareto-optimal equilibria (as demonstrated, for example, by the folk theorems for repeated games). Thus, if we only evaluate an agent against agents playing according to the same equilibrium, we may severely overestimate its ability to coordinate with a wider range of agents.
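The overestimation problem can be seen in a toy coordination game with two equally good conventions. The sketch below is illustrative only; the game, the two agents, and the function names are assumptions.

```python
from typing import Callable, Sequence

Agent = Callable[[], str]


def coordination_payoff(a: str, b: str) -> float:
    # Both players receive 1 if they pick the same convention, 0 otherwise.
    return 1.0 if a == b else 0.0


def drive_left() -> str:
    return "left"


def drive_right() -> str:
    return "right"


def self_play_score(agent: Agent) -> float:
    """Score when the agent is paired with a copy of itself."""
    return coordination_payoff(agent(), agent())


def cross_play_score(agent: Agent, partners: Sequence[Agent]) -> float:
    """Average score against partners that may follow other conventions."""
    return sum(coordination_payoff(agent(), p()) for p in partners) / len(partners)


print(self_play_score(drive_left))                             # 1.0
print(cross_play_score(drive_left, [drive_left, drive_right])) # 0.5
```

Each convention is an equilibrium with the maximum attainable payoff, yet an agent committed to one of them scores perfectly in self-play and only half as well against an even mix of partners.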

Underdetermination of “high social welfare”. The preceding definition relies on a measure of social welfare, but there are many competing ways of measuring this concept. Different welfare functions make different tradeoffs between equality and total payoffs, and different equilibria may maximize different social welfare functions. This means that the measurement of cooperative intelligence may require normative judgements. Work on this problem could draw from the rich literatures on social choice theory, welfare economics, political philosophy, and cooperative bargaining, among many other fields (Thomson, 1994).  
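A small numerical sketch shows how the choice of welfare function can change which outcome counts as best. The two outcomes and their payoffs are hypothetical, and the three welfare functions are common textbook choices rather than ones prescribed here.

```python
import math
from typing import Callable, Dict, Tuple

Payoffs = Tuple[float, ...]


def utilitarian(payoffs: Payoffs) -> float:
    # Sum of payoffs: ignores how the total is distributed.
    return sum(payoffs)


def nash_product(payoffs: Payoffs) -> float:
    # Product of payoffs (assumes non-negative payoffs).
    return math.prod(payoffs)


def egalitarian(payoffs: Payoffs) -> float:
    # Payoff of the worst-off agent.
    return min(payoffs)


# Hypothetical outcomes of a two-agent interaction.
OUTCOMES: Dict[str, Payoffs] = {"unequal": (10, 1), "equal": (4, 4)}


def best_outcome(welfare: Callable[[Payoffs], float]) -> str:
    """Name of the outcome that maximizes the given welfare function."""
    return max(OUTCOMES, key=lambda name: welfare(OUTCOMES[name]))


for fn in (utilitarian, nash_product, egalitarian):
    print(fn.__name__, best_outcome(fn))
```

With these numbers the sum favours the unequal outcome (11 against 8), while the product (10 against 16) and the minimum (1 against 4) favour the equal one, so an agent that steers towards "unequal" looks more or less cooperative depending on the measure.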

Tradeoffs between exploitability and welfare. Agents can cooperate (in the sense of attaining high-social-welfare outcomes) with a wider variety of agents if they are willing to make themselves more exploitable. For instance, a negotiator who always accepts the first offer will reach agreement with almost any counterpart, but invites counterparts to make one-sided offers. Deterring exploitation is part of achieving one’s goals, but may come at the cost of lower social welfare. How should exploitability be traded off against an agent’s ability to attain high-social-welfare outcomes with a wider variety of counterparts? This problem is discussed by Stastny et al. (2021) under the name “(cooperative) robustness-exploitability tradeoff”.
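The tradeoff can be illustrated with a toy ultimatum game in which a responder accepts any offer at or above a threshold. This is a sketch under assumed values: the pie size, the population of fixed offers, and the "exploiter" who knows the threshold are all hypothetical, and this is not the formalization used by Stastny et al. (2021).

```python
PIE = 10                   # total amount to be split (illustrative)
FIXED_OFFERS = [1, 3, 5]   # hypothetical offers made by a population of proposers


def accepts(threshold: int, offer: int) -> bool:
    # The responder accepts any offer at or above its threshold.
    return offer >= threshold


def agreement_rate(threshold: int, offers=FIXED_OFFERS) -> float:
    """Fraction of the fixed population with whom a deal is reached."""
    return sum(accepts(threshold, o) for o in offers) / len(offers)


def expected_welfare(threshold: int, offers=FIXED_OFFERS) -> float:
    """Summed payoff: the whole pie on agreement, nothing on rejection."""
    return PIE * agreement_rate(threshold, offers)


def payoff_against_exploiter(threshold: int) -> int:
    """Responder payoff when the proposer knows the threshold and offers
    the smallest amount that will still be accepted."""
    offer = threshold
    return offer if offer <= PIE else 0


for t in (0, 3, 5):
    print(t, agreement_rate(t), expected_welfare(t), payoff_against_exploiter(t))
```

A threshold of 0 corresponds to always accepting: every deal is struck and welfare is maximal, but a proposer who knows this can leave the responder with nothing. Raising the threshold to 5 guarantees more against such a proposer at the cost of rejecting two of the three fixed offers.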

Cooperation vs. coercion. Distinguishing between coercion and cooperation can be challenging (Schelling 1980; Nozick 1969), as coercion may lead to improvements in social welfare, such as when potential defectors from a socially optimal outcome are threatened with punishment. Should a definition of cooperative intelligence discount cases where social welfare is improved due to coercion? If so, how precisely ought we modify the definition above? Moreover, how might we even detect coercion in AI systems, given its apparent similarity to cooperation in certain contexts?

Other Directions in Measurement and Evaluation

In the previous section, we focused on the definition and measurement of a one-dimensional notion of cooperative intelligence, but there are many other directions that may prove fruitful in evaluating the cooperation-relevant features of AI systems.

Measuring specific cooperative capabilities. For instance, we may want to know how skilled an agent is at communication, coordination, modeling other agents, or overcoming informational problems, bargaining problems, and social dilemmas. We are also interested in metrics such as the rate at which these competencies are acquired.

Diagnosing failures of cooperation. In addition to measures of cooperative capabilities, diagnostic tools could prove invaluable for understanding the circumstances under which a particular group of agents exhibits significant failures to cooperate. This includes tools for interpreting agents’ cooperation-relevant representations, such as their beliefs about other agents.

Evaluation based on relative performance. In a multi-agent setting, the evaluation of agents based on their relative performance can make a game more competitive than intended, and incentivize harming other agents. For instance, evolutionary models show that competition between distantly related individuals or groups can select for spitefulness and aggression (Hamilton 1970; Gardner and West 2004; Choi and Bowles 2007). Similarly, a benchmark environment in which teams of agents directly compete against one another may incentivize agents to harm one another. One important research direction is therefore to develop methods for evaluation that avoid these incentives.
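The incentive problem can be seen in a two-action toy example, sketched below with hypothetical payoffs: sabotage costs the agent something in absolute terms but hurts the other agent more.

```python
from typing import Callable, Dict, Tuple

# Hypothetical (own payoff, other agent's payoff) resulting from each action.
ACTIONS: Dict[str, Tuple[int, int]] = {
    "cooperate": (3, 3),
    "sabotage": (2, 0),
}


def absolute_score(own: int, other: int) -> int:
    # The agent is judged only on what it earns.
    return own


def relative_score(own: int, other: int) -> int:
    # The agent is judged on how far ahead of the other agent it finishes.
    return own - other


def preferred_action(score: Callable[[int, int], int]) -> str:
    """Action that maximizes the given evaluation criterion."""
    return max(ACTIONS, key=lambda a: score(*ACTIONS[a]))


print(preferred_action(absolute_score))  # cooperate
print(preferred_action(relative_score))  # sabotage
```

The same environment therefore rewards different behaviour depending solely on how scores are reported, which is the effect an evaluation method would need to avoid.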

Evaluation without explicit utility functions. Because social welfare is a function of agents’ individual utilities, an operationalization of cooperative intelligence in terms of social welfare is only straightforwardly applicable to real-world AI systems when those systems have explicitly-specified utility functions. This could include, for example, reinforcement learning agents trained on a hard-coded reward signal, but may not be as easily extended to systems trained via unsupervised learning or human feedback. One possible direction for measuring the cooperative capabilities of agents without an explicit utility function would be to elicit human judgements. Work here could draw on methods from the well-established field of preference elicitation, and from more recent work on learning from human preferences (Christiano et al. 2017; Stiennon et al. 2020; Askell et al. 2021; Ouyang et al. 2022).

Using human feedback in this way raises new methodological and normative issues. For example, while humans are excellent cooperators in many respects, they are not ideally cooperatively intelligent. This means that we may not always want to treat human judgements about cooperative capability as “ground truth”. How, then, should human judges be instructed on how to evaluate AI systems’ cooperativeness? One direction could be to study systematic mistakes that impede human cooperation (see, e.g., Caputo 2013 for a review in the context of negotiation) and propose methods for correcting these biases in AI training regimes that require human feedback. Another direction might seek to address the fact that there is variability in human judgements of fairness (Henrich 2000; Oosterbeek, Sloof, and van de Kuilen 2004).  

Cooperation-specific solution concepts. Developing new solution concepts in which, for instance, cooperation plays a central role (cf. Rabin’s (1993) “fairness equilibrium” and Halpern and Rong’s (2010) “cooperative equilibrium”) could enrich our understanding of what constitutes behavior that is both rational and cooperative.

Group-level capabilities. So far we have discussed the cooperative capabilities of individual agents. However, successful cooperation depends on the behavior of many agents, and individuals may be limited in their ability to increase the degree of cooperation. Thus, it may often be more helpful to think about the cooperative intelligence of a group. For one such example, see Woolley et al.’s (2010) work on the measurement of collective intelligence. This idea is also related to the aforementioned distinction between recommendations to a single AI developer vs. a collection of AI developers.

Additional Guidance

In addition to the general evaluation criteria for this call for proposals described further above, we provide the following guidance.

  • An ideal definition or metric should be both theoretically principled and practically useful. For an example of principled formalizations of blameworthiness, intention, and moral responsibility (which may be related to key Cooperative AI concepts), see Halpern and Kleiman-Weiner (2018). For examples of game-theoretic criteria for multi-agent learning algorithms, see Powers and Shoham (2004) and Conitzer and Sandholm (2007). See also definitions of cooperation and related concepts from philosophy (as in Tuomela 1993) and evolutionary biology (as in West et al. 2007), upon which theoretically principled work on Cooperative AI could build.
  • Theoretical constructs may be intractable or impossible to measure directly. For instance, a definition of cooperation might require certain kinds of intentions on the part of the agents, and AI systems’ intentions are not, in general, directly observable. Nevertheless, a good metric will be an estimator of, or an approximation to, the theoretical construct that we would ideally like to measure.
  • The setting in which a method is intended to be applied should be made clear. For instance, it should be stated to which form of training and evaluation the method is intended to be applied.

Supporting documents


Askell, Amanda, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, et al. 2021. “A General Language Assistant as a Laboratory for Alignment.” arXiv:2112.00861.

Caputo, Andrea. 2013. “A Literature Review of Cognitive Biases in Negotiation Processes.” International Journal of Conflict Management 24 (4): 374–98.

Choi, Jung-Kyoo, and Samuel Bowles. 2007. “The Coevolution of Parochial Altruism and War.” Science 318 (5850): 636–40.

Christiano, Paul, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 4302–10.

Conitzer, Vincent, and Tuomas Sandholm. 2007. “AWESOME: A General Multiagent Learning Algorithm That Converges in Self-Play and Learns a Best Response against Stationary Opponents.” Machine Learning 67 (1): 23–43.

Gardner, A., and S. A. West. 2004. “Spite and the Scale of Competition.” Journal of Evolutionary Biology 17 (6): 1195–203.

Halpern, Joseph, and Max Kleiman-Weiner. 2018. “Towards Formal Definitions of Blameworthiness, Intention, and Moral Responsibility.” In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 1853–60.

Hamilton, W. D. 1970. “Selfish and Spiteful Behaviour in an Evolutionary Model.” Nature 228 (5277): 1218–20.

Henrich, Joseph. 2000. “Does Culture Matter in Economic Behavior? Ultimatum Game Bargaining among the Machiguenga of the Peruvian Amazon.” The American Economic Review 90 (4): 973–9.

Hernández-Orallo, José, David L. Dowe, Sergio España-Cubillo, M. Victoria Hernández-Lloreda, and Javier Insa-Cabrera. 2011. “On More Realistic Environment Distributions for Defining, Evaluating and Developing Intelligence.” In Artificial General Intelligence, 82–91. Springer Berlin Heidelberg.

Legg, Shane, and Marcus Hutter. 2007. “Universal Intelligence: A Definition of Machine Intelligence.” Minds and Machines 17: 391–444.

Leibo, Joel Z., Edgar A. Dueñez-Guzman, Alexander Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charlie Beattie, Igor Mordatch, and Thore Graepel. 2021. “Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot.” In Proceedings of the 38th International Conference on Machine Learning, 6187–99.

Nozick, Robert. 1969. “Coercion.” In Philosophy, Science, and Method: Essays in Honor of Ernest Nagel, edited by Sidney Morgenbesser, 440–72. St Martin’s Press.

Oosterbeek, Hessel, Randolph Sloof, and Gijs van de Kuilen. 2004. “Cultural Differences in Ultimatum Game Experiments: Evidence from a Meta-Analysis.” Experimental Economics 7 (2): 171–88.

Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv:2203.02155.

Paternotte, Cédric. 2014. “Minimal Cooperation.” Philosophy of the Social Sciences 44 (1): 45–73.

Powers, Rob, and Yoav Shoham. 2004. “New Criteria and a New Algorithm for Learning in Multi-Agent Systems.” In Proceedings of the 18th International Conference on Neural Information Processing Systems, 1089–96.

Schelling, Thomas C. 1980. The Strategy of Conflict: With a New Preface by the Author. Harvard University Press.

Stastny, Julian, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. 2021. “Normative Disagreement as a Challenge for Cooperative AI.” arXiv:2111.13872.

Stiennon, Nisan, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. “Learning to Summarize with Human Feedback.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 3008–21.

Tuomela, Raimo. 1993. “What Is Cooperation?” Erkenntnis. An International Journal of Analytic Philosophy 38 (1): 87–101.

West, S. A., A. S. Griffin, and A. Gardner. 2007. “Social Semantics: Altruism, Cooperation, Mutualism, Strong Reciprocity and Group Selection.” Journal of Evolutionary Biology 20 (2): 415–32.

Woolley, Anita Williams, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and Thomas W. Malone. 2010. “Evidence for a Collective Intelligence Factor in the Performance of Human Groups.” Science 330 (6004): 686–88.


Askell, Amanda, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, et al. 2021. “A General Language Assistant as a Laboratory for Alignment.” arXiv:2112.00861.

Bard, Nolan, Jakob N. Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H. Francis Song, Emilio Parisotto, et al. 2020. “The Hanabi Challenge: A New Frontier for AI Research.” Artificial Intelligence 280: 103216.

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 1877–901.

Carroll, Micah, Rohin Shah, Mark K. Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. 2019. “On the Utility of Learning about Humans for Human-AI Coordination.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 5174–85.

Chawla, Kushal, Jaysa Ramirez, Rene Clever, Gale Lucas, Jonathan May, and Jonathan Gratch. 2021. “CaSiNo: A Corpus of Campsite Negotiation Dialogues for Automatic Negotiation Systems.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3167–85.

Christiano, Paul, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. “Deep Reinforcement Learning from Human Preferences.” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 4302–10.

He, He, Derek Chen, Anusha Balakrishnan, and Percy Liang. 2018. “Decoupling Strategy and Generation in Negotiation Dialogues.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2333–43.

Hendrycks, Dan, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2020. “Aligning AI With Shared Human Values.” In Proceedings of the 9th International Conference on Learning Representations.

Hughes, Edward, Thomas W. Anthony, Tom Eccles, Joel Z. Leibo, David Balduzzi, and Yoram Bachrach. 2020. “Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games.” In Proceedings of the 19th Conference on Autonomous Agents and Multi-Agent Systems, 538–47.

Leibo, Joel Z., Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. “Multi-Agent Reinforcement Learning in Sequential Social Dilemmas.” In Proceedings of the 16th Conference on Autonomous Agents and Multi-Agent Systems, 464–73.

Leike, Jan, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. 2018. “Scalable Agent Alignment via Reward Modeling: A Research Direction.” arXiv:1811.07871.

Lewis, Mike, Denis Yarats, Yann N. Dauphin, Devi Parikh, and Dhruv Batra. 2017. “Deal or No Deal? End-to-End Learning for Negotiation Dialogues.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2443–53.

Paquette, Philip, Yuchen Lu, Seton Steven Bocco, Max Smith, Satya O.-G., Jonathan K. Kummerfeld, Joelle Pineau, Satinder Singh, and Aaron C. Courville. 2019. “No-Press Diplomacy: Modeling Multi-Agent Gameplay.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 4474–85.

Stiennon, Nisan, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. “Learning to Summarize with Human Feedback.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 3008–21.

Tennenholtz, Moshe. 2004. “Program Equilibrium.” Games and Economic Behavior 49 (2): 363–73.

Wang, R. E., S. A. Wu, J. A. Evans, J. B. Tenenbaum, D. C. Parkes, and M. Kleiman-Weiner. 2020. “Too Many Cooks: Coordinating Multi-Agent Collaboration through Inverse Planning.” In Proceedings of the 19th International Conference on Autonomous Agents and Multi-Agent Systems, 2032–4.

Yarats, Denis, and Mike Lewis. 2018. “Hierarchical Text Generation and Planning for Strategic Dialogue.” In Proceedings of the 35th International Conference on Machine Learning, 5591–9.

Environments and Datasets for Cooperative AI Research

Jesse Clifton and Lewis Hammond

Tasks for training and evaluating a variety of cooperative competencies, in a variety of AI paradigms, will be crucial to the success of Cooperative AI. We first describe the features of environments and datasets that we believe to be particularly important. We then give two examples of the kinds of projects CAIF is interested in funding: multi-agent learning environments, and datasets and evaluation protocols for large language models (LLMs).


We are particularly interested in supporting the development of tasks (environments and datasets) with the following features.

Diversity of challenges for cooperation. Tasks will ideally require a host of cooperative competencies, including communication, coordination, modeling other agents, and overcoming informational problems, bargaining problems, and social dilemmas. They should also support testing against a variety of different kinds of agents.

Testing generalization. We want AI systems to acquire cooperative competencies that generalize to a wide set of environments and counterpart agents. This is one reason to prefer benchmarks that consist of a suite of tasks, such as Melting Pot, over benchmarks consisting of a single challenge task (such as Hanabi or Diplomacy; see below for background on these environments).

Encouragement of creative solutions. Improvements in cooperation can often be obtained by changing the “rules of the game”. For instance, agents with a high degree of mutual transparency can take advantage of program equilibrium-type approaches to resolving social dilemmas (Tennenholtz 2004; Vincent Conitzer’s New Directions in Cooperative AI seminar). Ideally, new environments will allow researchers to exercise significant creativity in designing agents to be more cooperatively intelligent.  
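As a rough illustration of the program equilibrium idea, the sketch below lets each player submit a program that can read the other player's source code before choosing an action in a one-shot prisoner's dilemma. The payoffs, the two example programs, and the `act`/`run`/`play` helpers are assumptions made for this example; it conveys the flavour of the construction in Tennenholtz (2004) rather than reproducing it.

```python
# Each program is Python source defining act(my_source, opponent_source).
CLIQUE_BOT = '''
def act(my_source, opponent_source):
    # Cooperate only with an exact copy of this program.
    return "C" if opponent_source == my_source else "D"
'''

DEFECT_BOT = '''
def act(my_source, opponent_source):
    # Defect no matter what the opponent's program says.
    return "D"
'''

# Illustrative one-shot prisoner's dilemma payoffs (player A, player B).
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def run(source: str, opponent_source: str) -> str:
    """Execute a submitted program and return its action.

    exec is used only to keep the sketch short; it is not safe for
    untrusted code.
    """
    namespace = {}
    exec(source, namespace)
    return namespace["act"](source, opponent_source)


def play(source_a: str, source_b: str):
    """Payoffs when each program is shown the other's source."""
    action_a = run(source_a, source_b)
    action_b = run(source_b, source_a)
    return PAYOFFS[(action_a, action_b)]


print(play(CLIQUE_BOT, CLIQUE_BOT))  # (3, 3)
print(play(DEFECT_BOT, CLIQUE_BOT))  # (1, 1)
```

When both players submit `CLIQUE_BOT` they cooperate, and a player who switches to `DEFECT_BOT` is met with defection and earns 1 instead of 3. Mutual transparency thus removes the incentive to defect that exists in the ordinary one-shot game, at least against this particular deviation.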

Evaluation and learning from human feedback. Due to the difficulty of specifying desirable behavior via an explicit reward function in many domains, learning from human feedback will plausibly be an increasingly important approach to the design of AI systems (Christiano et al. 2017; Stiennon et al. 2020; Leike et al. 2018). It is therefore valuable to have tasks that can support the collection of human feedback (e.g., by including human-understandable tasks and interfaces). See also the discussion of “[e]valuation without explicit utility functions” in the preceding document.

Forward-compatibility with more advanced AI techniques. Given our interest in advanced AI systems, we would like infrastructure that supports the introduction of new environments that are compatible with new AI techniques. For instance, LLMs could play a key role in the development of more generally intelligent systems. Ideal multi-agent learning environments should avoid precluding the use of new techniques, such as incorporating language model-based communication.

Clear and precise evaluation protocols. It should be clear to researchers how the methods they develop will be evaluated on the task. On the other hand, having clear and precise evaluation protocols may be in tension with allowing creative approaches to improving cooperation. In very open-ended settings, “fuzzier” evaluation protocols, such as the judgements of a human jury, might be preferable.

Accessibility and user-friendliness. Ideal benchmarks will be easy for researchers to use. Where possible, for example, making a research contribution should not require large amounts of compute or engineering talent.

Comparability with human players. A helpful feature of some existing benchmarks (such as Hanabi and Diplomacy) is that human experts provide a clear point of comparison. Moreover, in many domains we may be concerned with the ability of AI systems to cooperate with humans, rather than other AI systems.

Incentivizing differential progress. Given our focus on differential progress in cooperative intelligence (Jesse Clifton and Sammy Martin's New Directions in Cooperative AI seminar), we will prioritize supporting environments and datasets that facilitate the training and evaluation of differentially cooperative capabilities. To this end, it is important to consider the following questions.

  • Does the task make it possible for researchers to detect when agents are trying to deceive other agents? Does the environment reward deceptiveness, or do agents perform better if they find ways to incentivize one another to tell the truth?
  • Does optimal performance on the task involve cooperation between all agents, as opposed to conflict between coalitions?
  • Does the task make it possible for researchers to distinguish between cooperation and coercion? This might not always be straightforward (Schelling 1980; Nozick 1969), but we might still hope to distinguish between cooperative agents and agents that enforce “unfair” outcomes with the threat of punishment.

On the other hand, environments will only spur differential progress in cooperative intelligence if it is non-trivial for self-interested agents to cooperate in them. For example, in sequential social dilemmas (Leibo et al. 2017) self-interested agents can maximize their score by (conditionally) cooperating, but training agents to engage in conditional cooperation rather than defection is not easy.
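
The incentive structure described above can be sketched with a repeated Prisoner's Dilemma, which is a much simpler stand-in for the temporally extended sequential social dilemmas of Leibo et al. (2017). The payoff values, the number of rounds, and the two hand-written policies are assumptions chosen for illustration.

```python
# Row player's payoff in each round (illustrative values).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_history):
    """Conditional cooperation: cooperate first, then copy the opponent's last move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """Unconditional defection."""
    return "D"


def play_repeated(policy_a, policy_b, rounds=10):
    """Return cumulative scores (A, B) after the given number of rounds."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = policy_a(hist_b)  # each policy sees only the other's past moves
        b = policy_b(hist_a)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b
```

With these assumptions, `play_repeated(tit_for_tat, tit_for_tat)` returns `(30, 30)`, whereas `play_repeated(always_defect, tit_for_tat)` returns `(14, 9)`: against a conditional cooperator, a self-interested agent scores higher by reciprocating than by defecting. The sketch hard-codes the conditional strategy and says nothing about how difficult it is for a learning algorithm to discover one.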

Multi-Agent Learning Environments

There are several existing environments relevant to evaluating the cooperative capabilities of contemporary multi-agent learning algorithms.

  • Hanabi (Bard et al. 2020) and Overcooked (Carroll et al. 2019; Wang et al. 2020) are fully cooperative environments that have recently been studied with a focus on human-AI coordination.
  • Diplomacy (Paquette et al. 2019) is a 7-player, zero-sum game involving both cooperation and conflict. Although it gives rise to interesting problems of cooperation, such as “alliance dilemmas” (Hughes et al. 2020), the fact that it is zero-sum makes it non-ideal for studying and promoting differential progress on cooperative intelligence.
  • Melting Pot (Leibo et al. 2021; Joel Leibo’s New Directions in Cooperative AI seminar) is a suite of multi-agent reinforcement learning environments designed to test the generalization of social-cognitive abilities to new populations of agents. The suite currently includes a variety of mixed-motive games involving social dilemmas and coordination problems, as well as a small number of zero-sum and purely cooperative environments.

CAIF is interested in supporting work to develop new environments that will improve our ability to evaluate the cooperative capabilities of multi-agent learning algorithms. These may be modifications to existing environments (e.g., modifications to Diplomacy to make it a mixed-motive rather than zero-sum game) or wholly new environments (e.g., environments appropriate for inclusion in the Melting Pot suite, or a new “grand challenge” problem for Cooperative AI).
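
To make the distinction between zero-sum, purely cooperative, and mixed-motive settings concrete, the following is a minimal sketch that classifies a small normal-form game from its payoff table. The function name `classify_game`, the payoff-table format, and the three example games are assumptions for illustration; real environments such as Diplomacy or the Melting Pot substrates are far too large to enumerate this way.

```python
def classify_game(payoffs):
    """Classify a game given a dict mapping joint actions to per-player payoff tuples.

    - "constant-sum": total payoff is identical in every outcome (pure conflict;
      zero-sum is the special case where the total is 0)
    - "common-payoff": all players receive the same payoff in every outcome
    - "mixed-motive": neither of the above
    """
    outcomes = list(payoffs.values())
    if len({sum(o) for o in outcomes}) == 1:
        return "constant-sum"
    if all(len(set(o)) == 1 for o in outcomes):
        return "common-payoff"
    return "mixed-motive"


matching_pennies = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
pure_coordination = {
    ("L", "L"): (1, 1), ("L", "R"): (0, 0),
    ("R", "L"): (0, 0), ("R", "R"): (1, 1),
}
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}
```

Here `classify_game(matching_pennies)` returns `"constant-sum"`, `classify_game(pure_coordination)` returns `"common-payoff"`, and `classify_game(prisoners_dilemma)` returns `"mixed-motive"`.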

Datasets and Evaluation for Large Language Models

Large language models (LLMs) could play a significant role on the frontier of AI capabilities in the coming years (Brown et al. 2020; Bommasani et al. 2021). The kinds of datasets and evaluation protocols that might facilitate Cooperative AI research using LLMs include the following.

Negotiations. In the short term, negotiation provides an interesting test case for certain cooperative capabilities (understanding other agents’ interests, and appropriately trading off one’s own interests against those of other agents). In the long run, AI systems may assist with high-stakes negotiations on behalf of humans (e.g., between large firms, or perhaps even states), and avoiding failure in those kinds of interactions will generally be in humanity’s interest. There are a number of existing datasets of simple natural language negotiations – including negotiations over how to divide a set of items, or over the price of an item for sale on Craigslist (Yarats and Lewis 2018; Lewis et al. 2017; He et al. 2018) – but there has been little work on negotiation with respect to LLMs.

Human judgements about cooperatively intelligent behavior. Evaluating the ability of LLMs (and other powerful models) to understand and reproduce human judgements about cooperatively intelligent behavior could be important to safely integrating these systems into society and into settings involving other AI systems. Related efforts include the study by Askell et al. (2021) of the behavior of language models on the ETHICS dataset of Hendrycks et al. (2020), which collects natural language descriptions of different situations and human judgements about, for example, whether some behavior was morally appropriate. In combination with the preceding topic, one potentially useful dataset could contain human judgements of different aspects of the negotiators’ behavior. For example, Chawla et al. (2021) collect human annotations of simple natural language negotiations, with labels such as “coordination”, “empathy”, “self-need”, and “other-need”.

Finally, two features that might make datasets for Cooperative AI research on LLMs particularly useful are the following.

  • Easily identifiable cooperative success and failure. In a negotiation, for example, the most natural indicator of cooperative success is that a deal was reached, but there may be other indicators we want to track (e.g., whether any of the parties behaved coercively). This may require input from human judges.
  • Ability to obtain interesting results using publicly available LLMs. While this is important for lowering the bar for further research, it may prove difficult for the time being if publicly available LLMs are insufficiently capable. In light of this, CAIF may be able to connect grantees with industry partners for access to private models. Another alternative might be for industry labs to run experiments with non-public models on a dataset and report their results.
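
As a rough sketch of what automatically computable indicators of cooperative success might look like, consider an item-division negotiation of the kind mentioned above. The record format (item counts, per-item values for each party, and the share received by party A) and the function names are hypothetical and do not correspond to the schema of any existing dataset. Indicators such as coerciveness are not captured here and would still require human judgement.

```python
from itertools import product


def utilities(split_a, counts, values_a, values_b):
    """Utility of each party when A receives split_a and B receives the rest."""
    u_a = sum(n * v for n, v in zip(split_a, values_a))
    u_b = sum((c - n) * v for n, c, v in zip(split_a, counts, values_b))
    return u_a, u_b


def evaluate_outcome(split_a, counts, values_a, values_b):
    """Return simple success indicators for one negotiation.

    split_a is None when the parties failed to reach a deal.
    """
    if split_a is None:
        return {"deal_reached": False, "utilities": None, "pareto_optimal": None}
    u_a, u_b = utilities(split_a, counts, values_a, values_b)
    # Enumerate every possible division to check whether some alternative
    # makes one party better off without making the other worse off.
    all_splits = product(*(range(c + 1) for c in counts))
    dominated = any(
        x >= u_a and y >= u_b and (x > u_a or y > u_b)
        for x, y in (utilities(s, counts, values_a, values_b) for s in all_splits)
    )
    return {"deal_reached": True, "utilities": (u_a, u_b), "pareto_optimal": not dominated}
```

For example, with `counts=(2, 1)`, `values_a=(1, 3)` and `values_b=(3, 1)`, the deal `split_a=(0, 1)` yields utilities `(3, 6)` and is Pareto-optimal, whereas `split_a=(2, 0)` yields `(2, 1)` and is not.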