Empirical work on the safety of multi-principal, multi-agent deployments is bottlenecked by the need for realistic and reproducible testbeds. Existing work has made important progress using stylised games and simulated social environments (e.g., Akata et al., 2025; Gandhi et al., 2023; Park et al., 2023), though these settings lack much of the realism of actual deployments. We therefore seek testbeds that allow researchers to study populations of frontier-model agents interacting over extended periods with realistic tools, memory systems, economic constraints, and communication channels (Kapoor et al., 2026). We welcome proposals that address this section alone, but we expect such environments to be deliberately designed to support the comparative evaluation of the theoretical measures, protocols, and oversight mechanisms discussed in later sections.
Sandboxes and testbeds should be:
We also welcome proposals that are not focused on a particular testbed or environment itself, but contribute significantly to addressing one or more of the problems above, such as:
Please see the guidelines on research areas and out-of-scope topics.
Akata, Elif, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz (2025). "Playing Repeated Games with Large Language Models". Nature Human Behaviour (9), pp. 1380–1390.
Gandhi, Kanishk, Dorsa Sadigh, and Noah D. Goodman (2023). "Strategic Reasoning with Language Models". arXiv:2305.19165.
Kapoor, Sayash, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, et al. (2026). "Open-World Evaluations for Measuring Frontier AI Capabilities". arXiv:2605.20520.
Park, Joon Sung, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein (2023). "Generative Agents: Interactive Simulacra of Human Behavior". In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23).