7 Back-fire Risks
An important challenge for the field of cooperative AI is that the capabilities that would make AI systems more competent at cooperation may also make those systems more dangerous. Depending on the context of deployment, cooperation failures could be catastrophic, but the solution is not as simple as advancing every capability that is useful for cooperation, since doing so could also increase the risks of misuse or loss of control.
Dual-Use Capabilities
A term often used to describe this kind of phenomenon is dual-use capabilities. The term “dual-use” originated during the Cold War, in the context of export controls and arms regulation, and initially referred to technology that could be used for both civilian and military purposes. Today the term is used more broadly for technology that can be useful in both benign and harmful ways.
We will now cover a couple of salient examples of how cooperative capabilities can be dual-use, starting with the capability to make credible commitments. If we consider canonical cooperation problems such as the prisoner’s dilemma or the tragedy of the commons, the ability to make certain commitments could be helpful for cooperation.
Recall the payoff matrix of the prisoner’s dilemma.
We know that the Nash equilibrium of this game is (D, D). Now imagine that both players have the option to make the following commitment: if the other player defects, I will spend 3 utility points to reduce their utility by 3. (Perhaps the players can spend some of their time making the life of the other player more difficult in some way.) Write out a new 4x4 payoff matrix for this scenario, where the strategies for each player are: ‘cooperate’, ‘defect’, ‘commit and cooperate’, and ‘commit and defect’. Show that (commit and cooperate, commit and cooperate) is a (weak) Nash equilibrium of this new game.
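The exercise can also be checked numerically. The sketch below assumes a standard prisoner's dilemma payoff matrix (3,3 for mutual cooperation, 5,0 / 0,5 for unilateral defection, 1,1 for mutual defection), since the course's exact numbers are not reproduced here; the argument goes through for any payoffs where the punishment of 3 outweighs the temptation gain from defecting.

```python
# ASSUMED baseline prisoner's dilemma payoffs (row player, column player);
# the course's exact matrix may use different numbers.
BASE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Four strategies per player; the "K" prefix marks a commitment to punish defection.
STRATEGIES = ["C", "D", "KC", "KD"]

def payoffs(s1, s2):
    a1, a2 = s1[-1], s2[-1]  # underlying actions: C or D
    p1, p2 = BASE[(a1, a2)]
    if s1.startswith("K") and a2 == "D":
        p1 -= 3  # player 1 pays 3 to punish...
        p2 -= 3  # ...and player 2 loses 3
    if s2.startswith("K") and a1 == "D":
        p2 -= 3
        p1 -= 3
    return p1, p2

def is_nash(s1, s2):
    """A profile is a Nash equilibrium if no player strictly gains by deviating."""
    u1, u2 = payoffs(s1, s2)
    return (all(payoffs(d, s2)[0] <= u1 for d in STRATEGIES)
            and all(payoffs(s1, d)[1] <= u2 for d in STRATEGIES))

print(is_nash("KC", "KC"))  # True: mutual commit-and-cooperate is an equilibrium
print(is_nash("D", "D"))    # True: mutual defection remains an equilibrium
```

The equilibrium is only weak: against "commit and cooperate", deviating to plain "cooperate" yields the same payoff, so the deviation is not strictly worse, merely no better.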
This research agenda from the Center on Long-term Risk outlines the risks associated with commitment capabilities.
'Introduction' (excluding '1.2 Outline of the agenda'), '2.2 Commitment and transparency', and '2.5 Potential downsides of research on cooperation failures'
The research agenda from the Center on Long-term Risk above mentions the commitment race problem. In some situations, it is strategically advantageous to move first, or to credibly commit to a policy of behaviour that does not depend on strategic information about another agent. Come up with examples of such situations. Try to think of examples where multiple agents are incentivised to move first, and explain the problems that such a race creates in your examples.
Another dual-use capability that is central to cooperation is theory of mind (ToM): the ability to reason about the mental and emotional states of others. It is evident how understanding others can facilitate cooperation, but just as commitments can be used for threats and extortion, theory of mind can be used for manipulation and deception. The next resource covers the opportunities and risks of theory of mind in LLMs. Pay special attention to Section 3.2 ('Cooperation and competition'), which explains the dual-use aspects of theory of mind in multi-agent settings.
'1 - Introduction' and '3 - Group level'
From ‘LLM Theory of Mind and Alignment: Opportunities and Risks’: “If LLMs were to become more accurate at ToM or to achieve ToM at higher orders of intentionality than the users deploying them, or the other humans encountering them, there may be further risks”. Explain these risks in your own words.
Another dual-use capability that might be less obvious is the ability for strategic punishment, often referred to as sanctioning. If we think of this in the context of coercion or extortion, it might intuitively seem like a capability that we would not want AI systems or agents to have; it is very easy to imagine how this could be dangerous.
Sanctioning can, however, also be crucial for achieving sustainable cooperation. Think back to the Cooperative Infrastructure section of this course and the material on social norms: such norms are upheld by sanctioning and cannot function if agents lack sanctioning capabilities. A related dual-use capability (or set of capabilities) from the Cooperative Agents section is opponent shaping.
Essentially all capabilities that provide influence over others (over their learning, their behaviour, their understanding of the world) are dual-use and can contribute to either safety or risk depending on the specific circumstances. In particular, a misaligned AI system with strong capabilities to influence others would be much more dangerous than one without them, which is a good reason to be very careful about advancing such capabilities as long as we are not confident that alignment is solved. This brings us to another important concept: differential technological development (also referred to as differential progress).
Return to the article on Agent Properties for Safe Interactions. Pick out some dispositions and capabilities that you think might be particularly dual-use or robustly positive for cooperation. For a couple of the ones you’ve picked out, explain your reasoning.
Differential Technological Development
Differential technological development refers to the idea that how fast we develop certain technologies relative to others can greatly affect the long-term outcomes for humanity. An example often used for illustration is the relative timing of the invention (and deployment) of fast cars versus seat belts, and how this timing influences the number of traffic-related deaths. Applying this concept to AI, we would like to develop things that promote safety as early as possible, while potentially delaying the development of things that increase risk. This is, quite clearly, easier said than done. The next resource covers high-level practical challenges in achieving differential development (note that although the page prompts you to create an account, there is an option to download the paper without registering):
So what would differential technological development look like in cooperative AI? This is an important and difficult question that does not yet have a clear answer. There is an inherent difficulty in identifying safety technologies for risks that have not yet materialised. Returning to the seat-belt analogy, it would have been rather difficult to predict the importance of seat belts before the first car accidents had occurred.
In practice, for those who work in cooperative AI, the approach to differential progress is often based on actively considering risk whenever a choice is made about research directions or publications. Since a decision to carry out and publish research, or to deploy a technology, is often irreversible, it is good to err on the side of caution and to solicit advice from safety-oriented peers when considering potentially risk-increasing directions of work.
Collusion
In addition to the dual-use characteristics of many capabilities associated with cooperation, it is important to recognise that, while the word “cooperation” generally has positive connotations, cooperation among AI agents will not always be desirable. It is not only that capabilities useful for cooperation can be dangerous when used for other purposes; cooperation in and of itself can be bad. Undesired cooperation is often referred to as collusion: secretive cooperation between two or more parties at the expense of one or more other parties. The final resource of this section covers collusion risks for AI agents, including some specific examples of what this could mean and some important research directions to mitigate those risks.
