7 Back-fire Risks
An important challenge for the field of cooperative AI is that the capabilities that would make AI systems more competent at cooperation may also make those systems more dangerous. Depending on the context of deployment, cooperation failures could be catastrophic, but the solution is not as simple as advancing every capability that is useful for cooperation, since doing so could also increase the risks of misuse or loss of control.
Dual-Use Capabilities
A term often used to describe this kind of phenomenon is dual-use capabilities. The term “dual-use” originated during the Cold War, in the context of export controls and arms regulation, and initially referred to technology that could be used for both civilian and military purposes. Today the term is used more broadly for technology that can be useful in both benign and harmful ways.
We will now cover a couple of salient examples of how cooperative capabilities can be dual-use, starting with the capability to make credible commitments. If we consider canonical cooperation problems such as the prisoner’s dilemma or the tragedy of the commons, the ability to make certain commitments could be helpful for cooperation.
Recall the payoff matrix of the prisoner’s dilemma.
We know that the Nash equilibrium of this game is (D, D). Now imagine that both players have the option to make the following commitment: if the other player defects, I will spend 3 utility points to reduce their utility by 3. (Perhaps the players can spend some of their time making the life of the other player more difficult in some way.) Write out a new 4x4 payoff matrix for this scenario, where the strategies for each player are: ‘cooperate’, ‘defect’, ‘commit and cooperate’, and ‘commit and defect’. Show that (commit and cooperate, commit and cooperate) is a (weak) Nash equilibrium of this new game.
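The exercise can also be checked numerically. The sketch below assumes a standard prisoner's dilemma payoff matrix (3,3 for mutual cooperation, 5,0 / 0,5 for unilateral defection, 1,1 for mutual defection), since the course's exact numbers are not reproduced here; the argument goes through for any payoffs where the punishment of 3 outweighs the temptation gain from defecting.

```python
# ASSUMED baseline prisoner's dilemma payoffs (row player, column player);
# the course's exact matrix may use different numbers.
BASE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Four strategies per player; the "K" prefix marks a commitment to punish defection.
STRATEGIES = ["C", "D", "KC", "KD"]

def payoffs(s1, s2):
    a1, a2 = s1[-1], s2[-1]  # underlying actions: C or D
    p1, p2 = BASE[(a1, a2)]
    if s1.startswith("K") and a2 == "D":
        p1 -= 3  # player 1 pays 3 to punish...
        p2 -= 3  # ...and player 2 loses 3
    if s2.startswith("K") and a1 == "D":
        p2 -= 3
        p1 -= 3
    return p1, p2

def is_nash(s1, s2):
    """A profile is a Nash equilibrium if no player strictly gains by deviating."""
    u1, u2 = payoffs(s1, s2)
    return (all(payoffs(d, s2)[0] <= u1 for d in STRATEGIES)
            and all(payoffs(s1, d)[1] <= u2 for d in STRATEGIES))

print(is_nash("KC", "KC"))  # True: mutual commit-and-cooperate is an equilibrium
print(is_nash("D", "D"))    # True: mutual defection remains an equilibrium
```

The equilibrium is only weak: against "commit and cooperate", deviating to plain "cooperate" yields the same payoff, so the deviation is not strictly worse, merely no better.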
This research agenda from the Center on Long-term Risk outlines the risks associated with commitment capabilities.
'Introduction' (excluding '1.2 Outline of the agenda'), '2.2 Commitment and transparency', and '2.5 Potential downsides of research on cooperation failures'
The research agenda from the Center on Long-term Risk above mentions the commitment race problem. In some situations, it is strategically advantageous to move first, or to credibly commit to a policy of behaviour that does not depend on strategic information about another agent. Come up with examples of such situations. Try to think of examples where multiple agents are incentivised to move first, and explain the problems that such a race creates in your examples.
Another dual-use capability that is central to cooperation is theory of mind (ToM): the ability to reason about the mental and emotional states of others. It is evident how understanding others can facilitate cooperation, but just as commitments can be used for threats and extortion, theory of mind can be used for manipulation and deception. The next resource covers the opportunities and risks of theory of mind in LLMs. Pay special attention to Section 3.2 ('Cooperation and competition'), which explains the dual-use aspects of theory of mind in multi-agent settings.
'1 - Introduction' and '3 - Group level'
From ‘LLM Theory of Mind and Alignment: Opportunities and Risks’: “If LLMs were to become more accurate at ToM or to achieve ToM at higher orders of intentionality than the users deploying them, or the other humans encountering them, there may be further risks”. Explain these risks in your own words.
Another dual-use capability that might be less obvious is the ability for strategic punishment, often referred to as sanctioning. If we think of this in the context of coercion or extortion, it might intuitively seem like a capability that we would not want AI systems or agents to have; it is very easy to imagine how this could be dangerous.
Sanctioning can, however, also be crucial for achieving sustainable cooperation. Think back to the Cooperative Infrastructure section of this course and the material on social norms: such norms are upheld by sanctioning and cannot function if agents lack sanctioning capabilities. A related dual-use capability (or set of capabilities) from the Cooperative Agents section is opponent shaping.
Essentially all capabilities that provide influence over others (over their learning, their behaviour, their understanding of the world) are dual-use and can contribute to either safety or risk depending on the specific circumstances. In particular, a misaligned AI system with strong capabilities to influence others would be much more dangerous than one without them, which is a good reason to be very careful about advancing such capabilities as long as we are not confident that alignment is solved. This brings us to another important concept: differential technological development (also referred to as differential progress).
Return to the article on Agent Properties for Safe Interactions. Pick out some dispositions and capabilities that you think might be particularly dual-use or robustly positive for cooperation. For a couple of the ones you’ve picked out, explain your reasoning.
Differential Technological Development
Differential technological development refers to the idea that how fast we develop certain technologies relative to others can greatly affect the long-term outcomes for humanity. An example often used for illustration is the relative timing of the invention (and deployment) of fast cars versus seat belts, and how this timing influences the number of traffic-related deaths. Applying this concept to AI, we would like to develop things that promote safety as early as possible, while potentially delaying the development of things that increase risk. This is, quite clearly, easier said than done. The next resource covers high-level practical challenges in achieving differential development (note that although the page prompts you to create an account, there is an option to download the paper without registering):
So what would differential technological development look like in cooperative AI? This is an important and difficult question that does not yet have a clear answer. There is an inherent difficulty in identifying safety technologies for risks that have not yet materialised. Returning to the seat-belt analogy, it would have been rather difficult to predict the importance of seat belts before the first car accidents had occurred.
In practice, for those who work in cooperative AI, the approach to differential progress is often based on actively considering risk whenever a choice is made about research directions or publications. Since a decision to carry out and publish research, or to deploy a technology, is often irreversible, it is good to err on the side of caution and to solicit advice from safety-oriented peers when considering potentially risk-increasing directions of work.
Collusion
In addition to the dual-use characteristics of many capabilities associated with cooperation, it is important to recognise that, while the word “cooperation” generally has positive connotations, cooperation among AI agents will not always be desirable. It is not only that capabilities useful for cooperation can be dangerous when used for other purposes; cooperation in and of itself can be bad. Undesired cooperation is often referred to as collusion: secretive cooperation between two or more parties at the expense of one or more other parties. The final resource of this section covers collusion risks for AI agents, including some specific examples of what this could mean and some important research directions to mitigate those risks.
