Sandboxes and Testbeds
What It Is and Why It Matters

Empirical work on the safety of multi-principal, multi-agent deployments is bottlenecked by a lack of realistic and reproducible testbeds. Existing work has made important progress using stylised games and simulated social environments (e.g., Akata et al., 2025; Gandhi et al., 2023; Park et al., 2023), but these settings fall short of realistic deployment conditions. We therefore seek testbeds that allow researchers to study populations of frontier-model agents interacting over extended periods with realistic tools, memory systems, economic constraints, and communication channels (Kapoor et al., 2026). We welcome proposals that address this section alone, but we expect the resulting environments to be deliberately designed to support comparative evaluation of the theoretical measures, protocols, and oversight mechanisms discussed in later sections.

Specific Work We Would Like to Fund

Sandboxes and testbeds should be:

  • Scalable to a realistic number of agents and an appropriate diversity of agentic tools, knowledge bases, and other resources;
  • High-fidelity, capturing behaviours of frontier AI agents rather than coarse abstractions;
  • Externally valid, enabling a principled characterisation of the deployment conditions under which simulation-derived conclusions should and should not be trusted;
  • Safe and secure, allowing dangerous collective behaviours to be studied without risk of uncontrolled deployment;
  • Reproducible, enabling researchers to compare different methods.

We also welcome proposals that are not focused on a particular testbed or environment itself, but contribute significantly to addressing one or more of the problems above, such as:

  • Navigating the trade-off between scalability and fidelity, for example, by using smaller, distilled models to serve as faithful proxies for frontier agents in simulations;
  • Developing principled methods to utilise data from real-world deployments to design grounded environments and to evaluate their external validity;
  • Building environment-agnostic infrastructure that enables interoperation between testbeds, or developing methods for logging information across many agents.
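
As a rough illustration of the last item, the sketch below shows one way an environment-agnostic event log for many agents might look. It is only a minimal sketch under stated assumptions: the `AgentEvent` schema, its field names, and the helper functions are hypothetical and are not drawn from any existing testbed or from the works cited here. It uses JSON Lines (one JSON object per line) so that logs written by one environment could, in principle, be read by tooling built for another.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class AgentEvent:
    """One logged action by one agent (hypothetical schema, for illustration only)."""
    run_id: str        # identifies the experiment run
    step: int          # logical time step within the run
    agent_id: str      # which agent acted
    event_type: str    # e.g. "message", "tool_call", "payment"
    payload: Dict[str, Any] = field(default_factory=dict)


def to_jsonl(events: Iterable[AgentEvent]) -> str:
    """Serialise events as JSON Lines: one event per line, keys sorted for stable diffs."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def from_jsonl(text: str) -> List[AgentEvent]:
    """Parse JSON Lines back into events, skipping blank lines."""
    return [AgentEvent(**json.loads(line)) for line in text.splitlines() if line.strip()]


def events_by_agent(events: Iterable[AgentEvent]) -> Dict[str, List[AgentEvent]]:
    """Group events by agent, preserving the original order within each agent."""
    grouped: Dict[str, List[AgentEvent]] = {}
    for e in events:
        grouped.setdefault(e.agent_id, []).append(e)
    return grouped


# Example: three events from two agents in a single (made-up) run.
events = [
    AgentEvent("run-001", 0, "agent-a", "message", {"to": "agent-b", "text": "offer 5"}),
    AgentEvent("run-001", 1, "agent-b", "tool_call", {"tool": "calculator", "args": [5, 2]}),
    AgentEvent("run-001", 2, "agent-b", "message", {"to": "agent-a", "text": "accept"}),
]

log_text = to_jsonl(events)          # what an environment would write to disk
restored = from_jsonl(log_text)      # what analysis tooling would read back
grouped = events_by_agent(restored)  # per-agent traces for later comparison
```

Keeping the schema small and the serialisation deterministic (sorted keys, one event per line) is a design choice aimed at the reproducibility requirement above: two runs can be compared line by line, and per-agent traces can be recovered without knowledge of the environment that produced them.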

Key Considerations

Please see the guidelines on research areas and out-of-scope topics.

References

Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz (2025). "Playing Repeated Games with Large Language Models". Nature Human Behaviour (9), pp. 1380–1390.

Gandhi, Kanishk, Dorsa Sadigh, and Noah D. Goodman (2023). "Strategic Reasoning with Language Models". arXiv:2305.19165.

Kapoor, Sayash, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, et al. (2026). "Open-World Evaluations for Measuring Frontier AI Capabilities". arXiv:2605.20520.

Park, Joon Sung, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein (2023). "Generative Agents: Interactive Simulacra of Human Behavior". In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23).
