Fellows' Spotlight

Reinforcement Learning (RL) is increasingly applied to social, data-limited environments where agents must interact with others without the luxury of online experimentation. However, current evaluations fail to capture this complexity: standard offline RL benchmarks are non-social. To bridge this gap, we introduce Molten Pot, a benchmark for offline mixed-motive social RL that spans five substrates from DeepMind's Melting Pot suite and roughly a terabyte of trajectory data. By testing agents across three levels of social complexity, ranging from static populations to zero-shot generalization across social shifts such as role changes and reciprocity thresholds, we reveal a significant 'social robustness gap'. Our findings show that while existing algorithms may achieve high returns, they often fail when social dynamics shift, establishing offline social evaluation as a critical and distinct challenge for the RL community.
We’re delighted to host this first seminar in our 'Fellows' Spotlight' series. Subscribe to our Google Calendar to stay up-to-date on all upcoming events.
Speakers

Claude Formanek (University of Cape Town)

Discussants
Time

16:00–17:00 UTC, 21 May 2026

Links
This seminar has now finished

Claude Formanek completed his undergraduate studies in Mathematics and Computer Science at the University of Cape Town (UCT), where he went on to an MSc and then a PhD. Under the supervision of Jonathan Shock, his doctoral research focused on Multi-Agent Reinforcement Learning (MARL), specifically the challenges of learning from static datasets. During his PhD, Claude published two papers on offline MARL at NeurIPS and one in the Journal of Data-centric Machine Learning Research (DMLR). He was awarded the Cooperative AI PhD Fellowship in January 2025 and submitted his thesis for examination in January 2026.

Benchmarking Offline RL in Mixed-Motive Social Settings