Multi-agent reinforcement learning

[[File:Magent-graph-2.gif|thumb|Two rival teams of agents face off in a MARL experiment.]]

'''Multi-agent reinforcement learning''' ('''MARL''') is a sub-field of [[reinforcement learning]]. It focuses on studying the behavior of multiple learning agents that coexist in a shared environment.<ref>Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer. ''Multi-Agent Reinforcement Learning: Foundations and Modern Approaches.'' MIT Press, 2024. https://www.marl-book.com/</ref> Each agent is motivated by its own rewards, and takes actions to advance its own interests; in some environments these interests are opposed to the interests of other agents, resulting in complex [[group dynamics]].

Multi-agent reinforcement learning is closely related to [[game theory]] and especially [[repeated game|repeated games]], as well as [[Multi-agent system|multi-agent systems]]. Its study combines the search for algorithms that maximize rewards with a more sociological set of concepts. While research in single-agent reinforcement learning is concerned with finding the algorithm that earns the highest reward for one agent, research in multi-agent reinforcement learning evaluates and quantifies social metrics, such as cooperation,<ref>{{cite arXiv | last1 = Lowe | first1 = Ryan | last2 = Wu | first2 = Yi | title = Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments | eprint = 1706.02275v4 | year = 2020 | class = cs.LG }}</ref> reciprocity,<ref>{{cite conference | last = Baker | first = Bowen | title = Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences | book-title = NeurIPS 2020 proceedings | year = 2020 | eprint = 2011.05373}}</ref> equity,<ref name="Hughes 2018 inequity">{{cite conference | last1 = Hughes | first1 = Edward | last2 = Leibo | first2 = Joel Z. | display-authors=etal | title = Inequity aversion improves cooperation in intertemporal social dilemmas | book-title = NeurIPS 2018 proceedings | year = 2018 | eprint = 1803.08884}}</ref> social influence,<ref>{{cite conference | last1 = Jaques | first1 = Natasha | last2 = Lazaridou | first2 = Angeliki | last3 = Hughes | first3 = Edward | display-authors=etal | title = Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning | book-title = Proceedings of the 35th International Conference on Machine Learning | year = 2019 | eprint = 1810.08647 }}</ref> language<ref>{{cite conference | last = Lazaridou | first = Angeliki | title = Multi-Agent Cooperation and The Emergence of (Natural) Language | book-title = ICLR 2017 | year = 2017 | eprint = 1612.07182}}</ref> and discrimination.<ref>{{cite arXiv | last = Duéñez-Guzmán | first = Edgar | display-authors=etal | title = Statistical discrimination in learning agents | eprint = 2110.11404v1 | year = 2021 | class = cs.LG }}</ref>

== Definition ==

Similarly to [[Reinforcement learning|single-agent reinforcement learning]], multi-agent reinforcement learning is modeled as some form of a [[Markov decision process|Markov decision process (MDP)]]. Fix a set of agents <math>I = \{1, \dots, N\}</math>. We then define:

* A set <math>S</math> of environment states.
* One set <math>\mathcal A_i</math> of actions for each of the agents <math>i \in I = \{1, \dots, N\}</math>.
* <math>P_{\vec{a}}(s,s')=\Pr(s_{t+1}=s' \mid s_t=s, \vec{a}_t=\vec{a})</math> is the probability of transition (at time <math>t</math>) from state <math>s</math> to state <math>s'</math> under joint action <math>\vec{a}</math>.
* <math>\vec{R}_{\vec{a}}(s,s')</math> is the immediate joint reward after the transition from <math>s</math> to <math>s'</math> with joint action <math>\vec{a}</math>.
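The components listed above can be collected into a small data structure. The following is a minimal sketch, not a standard library interface: the class name <code>StochasticGame</code>, its field names, and the two-agent toy game are all illustrative assumptions. Transition probabilities are represented as a dictionary from next states to probabilities, and the joint reward is a tuple with one entry per agent.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]  # one action per agent, ordered by agent index


@dataclass
class StochasticGame:
    """Hypothetical container for the tuple (S, {A_i}, P, R) defined above."""
    states: List[State]          # S
    actions: List[List[str]]     # A_i for each agent i
    # transition(s, a) -> {s': P_a(s, s')}
    transition: Callable[[State, JointAction], Dict[State, float]]
    # reward(s, a, s') -> joint reward vector, one entry per agent
    reward: Callable[[State, JointAction, State], Tuple[float, ...]]

    def step(self, state: State, joint_action: JointAction, rng: random.Random):
        """Sample s' from P_a(s, .) and return (s', joint reward)."""
        probs = self.transition(state, joint_action)
        next_states = list(probs)
        weights = [probs[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights)[0]
        return next_state, self.reward(state, joint_action, next_state)


# Toy two-agent game (an assumption for illustration): the state moves from
# "start" to "done" only if both agents choose "go".
def toy_transition(s: State, a: JointAction) -> Dict[State, float]:
    if s == "start" and a == ("go", "go"):
        return {"done": 1.0}
    return {s: 1.0}


def toy_reward(s: State, a: JointAction, s_next: State) -> Tuple[float, ...]:
    return (1.0, 1.0) if (s == "start" and s_next == "done") else (0.0, 0.0)


game = StochasticGame(
    states=["start", "done"],
    actions=[["stay", "go"], ["stay", "go"]],
    transition=toy_transition,
    reward=toy_reward,
)

print(game.step("start", ("go", "go"), random.Random(0)))    # ('done', (1.0, 1.0))
print(game.step("start", ("go", "stay"), random.Random(0)))  # ('start', (0.0, 0.0))
```

Keeping the reward as a vector rather than a single number is what distinguishes this structure from a single-agent MDP: each agent reads only its own entry.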

In settings with [[perfect information]], such as the games of [[chess]] and [[Go (game)|Go]], the MDP would be fully observable. In settings with imperfect information, especially in real-world applications like [[self-driving car|self-driving cars]], each agent would access an observation that only has part of the information about the current state. In the partially observable setting, the core model is the partially observable [[stochastic game]] in the general case, and the [[decentralized partially observable Markov decision process|decentralized POMDP]] in the cooperative case.

== Cooperation vs. competition ==

When multiple agents are acting in a shared environment their interests might be aligned or misaligned. MARL allows exploring all the different alignments and how they affect the agents' behavior:

* In [[#Pure competition settings|pure competition settings]], the agents' rewards are exactly opposite to each other, and therefore they are playing ''against'' each other.
* [[#Pure cooperation settings|Pure cooperation settings]] are the other extreme, in which agents get the exact same rewards, and therefore they are playing ''with'' each other.
* [[#Mixed-sum settings|Mixed-sum settings]] cover all the games that combine elements of both cooperation and competition.
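For a two-player matrix game, the three categories above can be told apart by comparing the two reward matrices entry by entry. The sketch below assumes each matrix is a list of rows indexed by the first agent's action; the function name <code>classify_setting</code> and the example payoffs are illustrative choices, not taken from the cited sources.

```python
def classify_setting(r1, r2):
    """Classify a two-player matrix game by how the agents' rewards relate.

    r1[i][j] and r2[i][j] are the rewards of agent 1 and agent 2 when
    agent 1 plays action i and agent 2 plays action j.
    """
    pairs = [(a, b) for row1, row2 in zip(r1, r2) for a, b in zip(row1, row2)]
    if all(a == b for a, b in pairs):
        return "pure cooperation"   # identical rewards
    if all(a == -b for a, b in pairs):
        return "pure competition"   # exactly opposite rewards
    return "mixed-sum"


# Matching pennies: one agent's gain is the other's loss.
pennies_1 = [[1, -1], [-1, 1]]
pennies_2 = [[-1, 1], [1, -1]]

# A coordination game: both agents receive the same reward.
coordination = [[1, 0], [0, 1]]

# Prisoner's dilemma with commonly used illustrative payoffs.
pd_1 = [[3, 0], [5, 1]]
pd_2 = [[3, 5], [0, 1]]

print(classify_setting(pennies_1, pennies_2))        # pure competition
print(classify_setting(coordination, coordination))  # pure cooperation
print(classify_setting(pd_1, pd_2))                  # mixed-sum
```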

=== Pure competition settings ===

When two agents are playing a [[zero-sum game]], they are in pure competition with each other. Many traditional games such as [[chess]] and [[Go (game)|Go]] fall under this category, as do two-player variants of video games like [[StarCraft]]. Because each agent can only win at the expense of the other agent, many complexities are stripped away. There is no prospect of communication or social dilemmas, as neither agent is incentivized to take actions that benefit its opponent.

The [[Deep Blue (chess computer)|Deep Blue]]<ref>{{Cite journal |last1=Campbell |first1=Murray |last2=Hoane |first2=A. Joseph Jr. |last3=Hsu |first3=Feng-hsiung |year=2002 |title=Deep Blue |journal=Artificial Intelligence |volume=134 |issue=1–2 |pages=57–83 |doi=10.1016/S0004-3702(01)00129-1 |issn=0004-3702 |publisher=Elsevier}}</ref> and [[AlphaGo]] projects demonstrate how to optimize the performance of agents in pure competition settings.

One complexity that is not stripped away in pure competition settings is [[#Autocurricula|autocurricula]]. As the agents' policy is improved using [[Self-play (reinforcement learning technique)|self-play]], multiple layers of learning may occur.

=== Pure cooperation settings ===

MARL is used to explore how separate agents with identical interests can communicate and work together. Pure cooperation settings are explored in recreational [[Cooperative video game|cooperative games]] such as [[Overcooked]],<ref>{{Cite arXiv | first = Micah | last = Carroll | display-authors=etal | eprint = 1910.05789 | title = On the Utility of Learning about Humans for Human-AI Coordination | year = 2019| class = cs.LG }}</ref> as well as real-world scenarios in [[robotics]].<ref>{{cite conference |last1=Xie |first1=Annie |last2=Losey |first2=Dylan |last3=Tolsma |first3=Ryan |last4=Finn |first4=Chelsea |author-link4=Chelsea Finn |last5=Sadigh |first5=Dorsa |date=November 2020 |title=Learning Latent Representations to Influence Multi-Agent Interaction |url=https://iliad.stanford.edu/pdfs/publications/xie2020learning.pdf |conference=CoRL}}</ref>

In pure cooperation settings all the agents get identical rewards, which means that social dilemmas do not occur.

In pure cooperation settings, there is often an arbitrary number of coordination strategies, and agents converge to specific "conventions" when coordinating with each other. The notion of conventions has been studied in language<ref>{{cite journal |last1=Clark |first1=Herbert |last2=Wilkes-Gibbs |first2=Deanna |title=Referring as a collaborative process |journal=Cognition |date=February 1986 |volume=22 |issue=1 |pages=1–39 |doi=10.1016/0010-0277(86)90010-7 |pmid=3709088 |s2cid=204981390 |doi-access=free }}</ref> and has also been alluded to in more general multi-agent collaborative tasks.<ref>{{cite journal |url=https://dl.acm.org/doi/10.5555/1029693.1029710 |title=Planning, learning and coordination in multiagent decision processes | journal=Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge | last1=Boutilier |first1=Craig |date=17 March 1996 |pages=195–210}}</ref><ref>{{cite conference |url=https://www.cs.utexas.edu/~pstone/Papers/bib2html/b2hd-AAAI10-adhoc.html |title=Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination|first1=Peter|last1=Stone|first2=Gal A.|last2=Kaminka|first3=Sarit|last3=Kraus|first4=Jeffrey S.|last4=Rosenschein|date=July 2010 |conference=AAAI 11}}</ref><ref>{{cite conference |title=Bayesian action decoder for deep multi-agent reinforcement learning | first1=Jakob N.|last1=Foerster|first2=H. Francis|last2=Song|first3=Edward|last3=Hughes|first4=Neil|last4=Burch|first5=Iain|last5=Dunning|first6=Shimon|last6=Whiteson|first7=Matthew M|last7=Botvinick|first8=Michael H.|last8=Bowling|arxiv=1811.01458 |conference=ICML 2019}}</ref><ref> {{cite conference |title=On the Critical Role of Conventions in Adaptive Human-AI Collaboration|first1=Andy|last1=Shih|first2=Arjun|last2=Sawhney|first3=Jovana|last3=Kondic|first4=Stefano|last4=Ermon|first5=Dorsa|last5=Sadigh|arxiv=2104.02871 |conference=ICLR 2021}}</ref>

=== Mixed-sum settings ===

[[File:Multi give way (4 agents, each trying to reach a specific point).gif|thumb|In this mixed-sum setting, each of the four agents is trying to reach a different goal. Each agent's success depends on the other agents clearing its way, even though they are not directly incentivized to assist each other.<ref>{{Cite journal |last1= Bettini |first1= Matteo |last2=Kortvelesy |first2= Ryan|last3= Blumenkamp|first3= Jan|last4= Prorok|first4= Amanda|year=2022 |title=VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning |journal=The 16th International Symposium on Distributed Autonomous Robotic Systems |publisher=Springer|arxiv= 2207.03530 }}</ref>]]

Most real-world scenarios involving multiple agents have elements of both cooperation and competition. For example, when multiple [[self-driving cars]] are planning their respective paths, each of them has interests that are diverging but not exclusive: each car is minimizing the amount of time it takes to reach its destination, but all cars have the shared interest of avoiding a [[traffic collision]].<ref>{{Cite arXiv | eprint = 1610.03295 | title = Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving | year = 2016 | last1 = Shalev-Shwartz | first1 = Shai | last2 = Shammah | first2 = Shaked | last3 = Shashua | first3 = Amnon | class = cs.AI }}</ref>

Zero-sum settings with three or more agents often exhibit similar properties to mixed-sum settings, since each pair of agents might have a non-zero utility sum between them.
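A small numeric check makes this concrete. The reward vector below is an invented example for a single outcome of a three-agent zero-sum game; the helper name <code>pairwise_sums</code> is likewise hypothetical.

```python
from itertools import combinations


def pairwise_sums(rewards):
    """Return {(i, j): r_i + r_j} for every pair of agents i < j."""
    return {(i, j): rewards[i] + rewards[j]
            for i, j in combinations(range(len(rewards)), 2)}


# One outcome of a three-agent zero-sum game: the rewards add up to zero.
rewards = (2, -1, -1)
print(sum(rewards))            # 0
print(pairwise_sums(rewards))  # {(0, 1): 1, (0, 2): 1, (1, 2): -2}
```

Although the total is zero, no pair of agents is in a two-player zero-sum relationship in this outcome: agents 1 and 2 (indices 1 and 2) both lose, so they may share an interest in acting against agent 0.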

Mixed-sum settings can be explored using classic [[normal-form game|matrix games]] such as [[prisoner's dilemma]], more complex [[#Sequential social dilemmas|sequential social dilemmas]], and recreational games such as [[Among Us]],<ref>{{cite arXiv | last1 = Kopparapu | first1 = Kavya | last2 = Duéñez-Guzmán | first2 = Edgar A. | last3 = Matyas | first3 = Jayd | last4 = Vezhnevets | first4 = Alexander Sasha | last5 = Agapiou | first5 = John P. | last6 = McKee | first6 = Kevin R. | last7 = Everett | first7 = Richard | last8 = Marecki | first8 = Janusz | last9 = Leibo | first9 = Joel Z. | last10 = Graepel | first10 = Thore | eprint = 2201.01816 | title = Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria | year = 2022 | class = cs.AI }}</ref> [[Diplomacy (game)|Diplomacy]]<ref>{{Cite journal |last1= Bakhtin|first1= Anton |last2=Brown|first2= Noam |title=Human-level play in the game of Diplomacy by combining language models with strategic reasoning |journal=Science |year= 2022 |volume= 378 |issue= 6624 |pages= 1067–1074 |publisher=Springer|doi= 10.1126/science.ade9097 |pmid= 36413172 |bibcode= 2022Sci...378.1067M |s2cid= 253759631 |url=https://www.science.org/doi/abs/10.1126/science.ade9097 | display-authors=etal|url-access= subscription }}</ref> and [[StarCraft II]].<ref>{{cite arXiv | first1 = Mikayel | last1 = Samvelyan | first2 = Tabish | last2 = Rashid | first3 = Christian Schroeder | last3 = de Witt | first4 = Gregory | last4 = Farquhar | first5 = Nantas | last5 = Nardelli | first6 = Tim G. J. | last6 = Rudner | first7 = Chia-Man | last7 = Hung | first8 = Philip H. S. 
| last8 = Torr | first9 = Jakob | last9 = Foerster | first10 = Shimon | last10 = Whiteson | eprint = 1902.04043 | title = The StarCraft Multi-Agent Challenge | year = 2019| class = cs.LG }}</ref><ref>{{cite arXiv | last1 = Ellis | first1 = Benjamin | last2 = Moalla | first2 = Skander | last3 = Samvelyan | first3 = Mikayel | last4 = Sun | first4 = Mingfei | last5 = Mahajan | first5 = Anuj | last6 = Foerster | first6 = Jakob N. | last7 = Whiteson | first7 = Shimon | eprint = 2212.07489 | title = SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning | year = 2022 | class = cs.LG }}</ref>

Mixed-sum settings can give rise to communication and social dilemmas.

== Social dilemmas ==

As in [[game theory]], much of the research in MARL revolves around [[Collective action problem|social dilemmas]], such as [[prisoner's dilemma]],<ref>{{cite journal | last1 = Sandholm | first1 = Toumas W. | last2 = Crites | first2 = Robert H. | title = Multiagent reinforcement learning in the Iterated Prisoner's Dilemma | journal=Biosystems | year = 1996 | volume = 37 | issue = 1–2 | pages = 147–166 | doi = 10.1016/0303-2647(95)01551-5 | pmid = 8924633 | bibcode = 1996BiSys..37..147S }}</ref> [[Chicken (game)|chicken]] and [[stag hunt]].<ref>{{cite conference | first1 = Alexander | last1 = Peysakhovich | first2 = Adam | last2 = Lerer | title = Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones | book-title = AAMAS 2018 | eprint = 1709.02865 | year = 2018 }}</ref>

While game theory research might focus on [[Nash equilibrium|Nash equilibria]] and what an ideal policy for an agent would be, MARL research focuses on how the agents would learn these ideal policies using a trial-and-error process. The [[reinforcement learning]] algorithms that are used to train the agents are maximizing the agent's own reward; the conflict between the needs of the agents and the needs of the group is a subject of active research.<ref>{{cite conference | last1 = Dafoe | first1 = Allan | last2 = Hughes | first2 = Edward | last3 = Bachrach | first3 = Yoram | display-authors=etal | title = Open Problems in Cooperative AI | book-title = NeurIPS 2020 | eprint = 2012.08630 | year = 2020 }}</ref>
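The trial-and-error process can be sketched with two independent, stateless Q-learners playing a repeated prisoner's dilemma. This is a deliberately simple illustration rather than the method of any cited paper: the payoff values, the learning rate <code>alpha</code>, the exploration rate <code>epsilon</code>, and the function name are all assumptions. Each agent updates only its own value estimates from its own reward, which is the selfish objective described above.

```python
import random

ACTIONS = ["C", "D"]  # cooperate, defect
# Illustrative payoffs: PAYOFFS[(a1, a2)] = (reward of agent 1, reward of agent 2)
PAYOFFS = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 5.0),
    ("D", "C"): (5.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}


def train_independent_q(rounds=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Train two stateless Q-learners that each maximize their own reward."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]  # one table per agent
    for _ in range(rounds):
        joint = []
        for agent_q in q:
            if rng.random() < epsilon:
                joint.append(rng.choice(ACTIONS))            # explore
            else:
                joint.append(max(ACTIONS, key=agent_q.get))  # exploit
        rewards = PAYOFFS[tuple(joint)]
        # Each agent moves its estimate for the action it took toward the
        # reward it received; it never looks at the other agent's reward.
        for agent_q, action, reward in zip(q, joint, rewards):
            agent_q[action] += alpha * (reward - agent_q[action])
    return q


q_tables = train_independent_q()
for i, table in enumerate(q_tables):
    print(f"agent {i}: greedy action = {max(ACTIONS, key=table.get)}, Q = {table}")
```

In simple setups like this one, selfish learners often drift toward mutual defection even though mutual cooperation would pay both of them more, but the outcome depends on the payoffs, the parameters and the random seed.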

Various techniques have been explored in order to induce cooperation in agents: modifying the environment rules,<ref>{{cite conference | first1 = Raphael | last1 = Köster | first2 = Dylan | last2 = Hadfield-Menell | first3 = Gillian K. | last3 = Hadfield | first4 = Joel Z. | last4 = Leibo | title = Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors | book-title = AAMAS 2020 | eprint = 2001.09318 }}</ref> adding intrinsic rewards,<ref name="Hughes 2018 inequity" /> and more.
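As an illustration of what an intrinsic reward can look like, the sketch below subtracts an inequity penalty from each agent's environment reward, in the style of the Fehr–Schmidt inequity-aversion utility from behavioral economics. It is an assumption that this simplified per-step form is a useful stand-in; the exact formulation in the cited work may differ, and the weights <code>alpha</code> and <code>beta</code> are arbitrary illustrative values.

```python
def inequity_averse_rewards(rewards, alpha=1.0, beta=0.5):
    """Shape each agent's reward with a Fehr-Schmidt style inequity penalty.

    alpha weights disadvantageous inequity (others earned more than me),
    beta weights advantageous inequity (I earned more than others).
    Requires at least two agents.
    """
    n = len(rewards)
    shaped = []
    for i, r_i in enumerate(rewards):
        others = [r_j for j, r_j in enumerate(rewards) if j != i]
        envy = sum(max(r_j - r_i, 0.0) for r_j in others)
        guilt = sum(max(r_i - r_j, 0.0) for r_j in others)
        shaped.append(r_i - alpha * envy / (n - 1) - beta * guilt / (n - 1))
    return shaped


# One agent takes everything: both agents end up penalized for the gap.
print(inequity_averse_rewards([4.0, 0.0]))  # [2.0, -4.0]
# Equal split: no penalty is applied.
print(inequity_averse_rewards([2.0, 2.0]))  # [2.0, 2.0]
```

Because the shaped reward falls when outcomes are unequal, an agent that maximizes it has a reason to avoid exploiting others, even though the learning algorithm itself is unchanged.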

=== Sequential social dilemmas ===

Social dilemmas like prisoner's dilemma, chicken and stag hunt are "matrix games". Each agent takes only one action from a choice of two possible actions, and a simple 2x2 matrix is used to describe the reward that each agent will get, given the actions that each agent took.
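A 2x2 matrix game can be written down directly as a table from joint actions to reward pairs. The following sketch uses commonly quoted illustrative payoffs (the specific numbers are assumptions) and enumerates the pure-strategy [[Nash equilibrium|Nash equilibria]], that is, the joint actions from which neither agent gains by changing its own action alone.

```python
from itertools import product


def pure_nash_equilibria(actions, payoffs):
    """Return joint actions where no agent benefits from deviating alone.

    payoffs[(a1, a2)] = (reward of agent 1, reward of agent 2).
    """
    equilibria = []
    for a1, a2 in product(actions, repeat=2):
        r1, r2 = payoffs[(a1, a2)]
        agent1_content = all(payoffs[(d, a2)][0] <= r1 for d in actions)
        agent2_content = all(payoffs[(a1, d)][1] <= r2 for d in actions)
        if agent1_content and agent2_content:
            equilibria.append((a1, a2))
    return equilibria


# Prisoner's dilemma: C = cooperate, D = defect.
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

# Stag hunt: S = hunt stag together, H = hunt hare alone.
stag_hunt = {
    ("S", "S"): (4, 4), ("S", "H"): (0, 3),
    ("H", "S"): (3, 0), ("H", "H"): (3, 3),
}

print(pure_nash_equilibria(["C", "D"], prisoners_dilemma))  # [('D', 'D')]
print(pure_nash_equilibria(["S", "H"], stag_hunt))          # [('S', 'S'), ('H', 'H')]
```

The prisoner's dilemma has mutual defection as its only pure equilibrium even though mutual cooperation pays both agents more, which is what makes it a dilemma; the stag hunt has two equilibria, so the difficulty there is coordinating on the better one.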

In humans and other living creatures, social dilemmas tend to be more complex. Agents take multiple actions over time, and the distinction between cooperating and defecting is not as clear cut as in matrix games. The concept of a '''sequential social dilemma (SSD)''' was introduced in 2017<ref>{{cite conference | first1 = Joel Z. | last1 = Leibo | first2 = Vinicius | last2 = Zambaldi | first3 = Marc | last3 = Lanctot | first4 = Janusz | last4 = Marecki | first5 = Thore | last5 = Graepel | title = Multi-agent Reinforcement Learning in Sequential Social Dilemmas | book-title = AAMAS 2017 | eprint = 1702.03037 | year = 2017 }}</ref> as an attempt to model that complexity. There is ongoing research into defining different kinds of SSDs and showing cooperative behavior in the agents that act in them.<ref>{{cite arXiv | first1 = Pinkesh | last1 = Badjatiya | first2 = Mausoom | last2 = Sarkar | eprint = 2001.05458 | title = Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss | year = 2020 | class = cs.AI }}</ref>

== Autocurricula ==

An autocurriculum<ref>{{cite arXiv | last1 = Leibo | first1 = Joel Z. | last2 = Hughes | first2 = Edward | display-authors=etal | title = Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research | year = 2019 | class = cs.AI | eprint = 1903.00742v2 }}</ref> (plural: autocurricula) is a reinforcement learning concept that is salient in multi-agent experiments. As agents improve their performance, they change their environment; this change in the environment affects themselves and the other agents. The feedback loop results in several distinct phases of learning, each depending on the previous one. The stacked layers of learning are called an autocurriculum. Autocurricula are especially apparent in adversarial settings,<ref>{{cite conference | last = Baker | first = Bowen | display-authors=etal | title = Emergent Tool Use From Multi-Agent Autocurricula | book-title = ICLR 2020 | year = 2020 | eprint = 1909.07528}}</ref> where each group of agents is racing to counter the current strategy of the opposing group.

The [https://www.youtube.com/watch?v=kopoLzvh5jY Hide and Seek game] is an accessible example of an autocurriculum occurring in an adversarial setting. In this experiment, a team of seekers is competing against a team of hiders. Whenever one of the teams learns a new strategy, the opposing team adapts its strategy to give the best possible counter. When the hiders learn to use boxes to build a shelter, the seekers respond by learning to use a ramp to break into that shelter. The hiders respond by locking the ramps, making them unavailable for the seekers to use. The seekers then respond by "box surfing", exploiting a [[Glitching|glitch]] in the game to penetrate the shelter. Each "level" of learning is an emergent phenomenon, with the previous level as its premise. This results in a stack of behaviors, each dependent on its predecessor.

Autocurricula in reinforcement learning experiments have been compared to the stages of the [[Evolution|evolution of life on Earth]] and the development of [[Culture|human culture]]. A major stage in evolution happened 2–3 billion years ago, when [[Photosynthesis|photosynthesizing life forms]] started to produce massive amounts of [[oxygen]], changing the balance of gases in the atmosphere.<ref>{{cite journal |last1=Kasting |first1=James F |last2=Siefert |first2=Janet L |title=Life and the evolution of earth's atmosphere |journal=Science |date=2002 |volume=296 |issue=5570 |pages=1066–1068 |doi=10.1126/science.1071184 |pmid=12004117 |bibcode=2002Sci...296.1066K |s2cid=37190778 }}</ref> In the next stages of evolution, oxygen-breathing life forms evolved, eventually leading up to land [[Mammal|mammals]] and human beings. These later stages could only happen after the photosynthesis stage made oxygen widely available. Similarly, human culture could not have gone through the [[Industrial Revolution]] in the 18th century without the resources and insights gained by the [[Neolithic Revolution|agricultural revolution]] around 10,000 BC.<ref>{{cite book |last1=Clark |first1=Gregory |title=A farewell to alms: a brief economic history of the world |date=2008 |publisher=Princeton University Press |isbn=978-0-691-14128-2}}</ref>

== Applications ==

Multi-agent reinforcement learning has been applied to a variety of use cases in science and industry:

{{columns-list|colwidth=18em|

* [[Broadband]] [[cellular networks]] such as [[5G]]<ref name="Li 2022" />
* [[Cache (computing)|Content caching]]<ref name="Li 2022" />
* [[Routing|Packet routing]]<ref name="Li 2022" />
* [[Computer vision]]<ref>{{cite arXiv | first1 = Ngan | last1 = Le | first2 = Vidhiwar Singh | last2 = Rathour | first3 = Kashu | last3 = Yamazaki | first4 = Khoa | last4 = Luu | first5 = Marios | last5 = Savvides | eprint = 2108.11510 | title = Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey | year = 2021| class = cs.CV }}</ref>
* [[Network security]]<ref name="Li 2022" />
* [[Power control#Transmit power control|Transmit power control]]<ref name="Li 2022" />
* [[Computation offloading]]<ref name="Li 2022" />
* [[Evolution of languages|Language evolution research]]<ref>{{cite arXiv | first1 = Clément | last1 = Moulin-Frier | first2 = Pierre-Yves | last2 = Oudeyer | eprint = 2002.08878 | title = Multi-Agent Reinforcement Learning as a Computational Tool for Language Evolution Research: Historical Context and Future Challenges | year = 2020| class = cs.MA }}</ref>
* [[Global health]]<ref>{{cite conference | last1 = Killian | first1 = Jackson | last2 = Xu | first2 = Lily | last3 = Biswas | first3 = Arpita | last4 = Verma | first4 = Shresth | display-authors=etal | title = Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program | conference = AAAI | year = 2023 }}</ref>
* [[Integrated circuit design]]<ref>{{cite arXiv | first1 = Srivatsan | last1 = Krishnan | first2 = Natasha | last2 = Jaques | first3 = Shayegan | last3 = Omidshafiei | first4 = Dan | last4 = Zhang | first5 = Izzeddin | last5 = Gur | first6 = Vijay Janapa | last6 = Reddi | first7 = Aleksandra | last7 = Faust | eprint = 2211.16385 | title = Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration | year = 2022| class = cs.AR }}</ref>
* [[Internet of Things]]<ref name="Li 2022">{{cite arXiv | first1 = Tianxu | last1 = Li | first2 = Kun | last2 = Zhu | first3 = Nguyen Cong | last3 = Luong | first4 = Dusit | last4 = Niyato | first5 = Qihui | last5 = Wu | first6 = Yang | last6 = Zhang | first7 = Bing | last7 = Chen | eprint = 2110.13484 | title = Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey | year = 2021| class = cs.AI }}</ref>
* [[Microgrid]] [[energy management]]<ref>{{cite journal | first1 = Yuanzheng | last1 = Li | first2 = Shangyang | last2 = He | first3 = Yang | last3 = Li | first4 = Yang | last4 = Shi | first5 = Zhigang | last5 = Zeng | arxiv = 2301.00641 | title = Federated Multiagent Deep Reinforcement Learning Approach via Physics-Informed Reward for Multimicrogrid Energy Management | journal = IEEE Transactions on Neural Networks and Learning Systems | year = 2023| volume = PP | issue = 5 | pages = 5902–5914 | doi = 10.1109/TNNLS.2022.3232630 | pmid = 37018258 | s2cid = 255372287 }}</ref>
* Multi-camera control<ref>{{cite conference | title=Proactive Multi-Camera Collaboration for 3D Human Pose Estimation | last1=Ci | first1=Hai | last2=Liu | first2=Mickel | last3=Pan | first3=Xuehai | last4=Zhong | first4=Fangwei | last5=Wang | first5=Yizhou | conference=International Conference on Learning Representations | year=2023 | url=https://openreview.net/forum?id=CPIy9TWFYBG }}</ref>
* [[Self-driving car|Autonomous vehicles]]<ref>{{cite conference | title=Benchmarks for reinforcement learning in mixed-autonomy traffic | last1=Vinitsky | first1=Eugene | last2=Kreidieh | first2=Aboudy | last3=Le Flem | first3=Luc | last4=Kheterpal | first4=Nishant | last5=Jang | first5=Kathy | last6=Wu | first6=Fangyu | last7=Liaw | first7=Richard | last8=Liang | first8=Eric | last9=Bayen | first9=Alexandre M. | conference=Conference on Robot Learning | year=2018 | url=http://proceedings.mlr.press/v87/vinitsky18a/vinitsky18a.pdf }}</ref>
* [[Sports analytics]]<ref>{{cite arXiv | first1 = Karl | last1 = Tuyls | first2 = Shayegan | last2 = Omidshafiei | first3 = Paul | last3 = Muller | first4 = Zhe | last4 = Wang | first5 = Jerome | last5 = Connor | first6 = Daniel | last6 = Hennes | first7 = Ian | last7 = Graham | first8 = William | last8 = Spearman | first9 = Tim | last9 = Waskett | first10 = Dafydd | last10 = Steele | first11 = Pauline | last11 = Luc | first12 = Adria | last12 = Recasens | first13 = Alexandre | last13 = Galashov | first14 = Gregory | last14 = Thornton | first15 = Romuald | last15 = Elie | first16 = Pablo | last16 = Sprechmann | first17 = Pol | last17 = Moreno | first18 = Kris | last18 = Cao | first19 = Marta | last19 = Garnelo | first20 = Praneet | last20 = Dutta | first21 = Michal | last21 = Valko | first22 = Nicolas | last22 = Heess | first23 = Alex | last23 = Bridgland | first24 = Julien | last24 = Perolat | first25 = Bart | last25 = De Vylder | first26 = Ali | last26 = Eslami | first27 = Mark | last27 = Rowland | first28 = Andrew | last28 = Jaegle | first29 = Remi | last29 = Munos | first30 = Trevor | last30 = Back | first31 = Razia | last31 = Ahamed | first32 = Simon | last32 = Bouton | first33 = Nathalie | last33 = Beauguerlange | first34 = Jackson | last34 = Broshear | first35 = Thore | last35 = Graepel | first36 = Demis | last36 = Hassabis | eprint = 2011.09192 | title = Game Plan: What AI can do for Football, and What Football can do for AI | year = 2020| class = cs.AI }}</ref>
* [[Traffic control]]<ref>{{cite journal | first1 = Tianshu | last1 = Chu | first2 = Jie | last2 = Wang | first3 = Lara | last3 = Codecà | first4 = Zhaojian | last4 = Li | arxiv = 1903.04527 | title = Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control | journal = IEEE Transactions on Intelligent Transportation Systems | year = 2019| volume = 21 | issue = 3 | page = 1086 | doi = 10.1109/TITS.2019.2901791 | bibcode = 2020ITITr..21.1086C }}</ref> ([[Ramp metering]]<ref>{{cite arXiv | first1 = Francois | last1 = Belletti | first2 = Daniel | last2 = Haziza | first3 = Gabriel | last3 = Gomes | first4 = Alexandre M. | last4 = Bayen | eprint = 1701.08832 | title = Expert Level control of Ramp Metering based on Multi-task Deep Reinforcement Learning | year = 2017| class = cs.AI }}</ref>)
* [[Unmanned aerial vehicles]]<ref>{{cite arXiv | first1 = Yahao | last1 = Ding | first2 = Zhaohui | last2 = Yang | first3 = Quoc-Viet | last3 = Pham | first4 = Zhaoyang | last4 = Zhang | first5 = Mohammad | last5 = Shikh-Bahaei | eprint = 2301.00912 | title = Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics | year = 2023| class = cs.LG }}</ref><ref name="Li 2022" />
* [[Wildlife conservation]]<ref>{{cite arXiv | first1 = Lily | last1 = Xu | first2 = Andrew | last2 = Perrault | first3 = Fei | last3 = Fang | first4 = Haipeng | last4 = Chen | first5 = Milind | last5 = Tambe | eprint = 2106.08413 | title = Robust Reinforcement Learning Under Minimax Regret for Green Security | year = 2021| class = cs.LG }}</ref>

}}

=== AI alignment ===

Multi-agent reinforcement learning has been used in research into [[AI alignment]]. The relationship between the different agents in a MARL setting can be compared to the relationship between a human and an AI agent. Research efforts in the intersection of these two fields attempt to simulate possible conflicts between a human's intentions and an AI agent's actions, and then explore which variables could be changed to prevent these conflicts.<ref>{{cite arXiv | eprint = 1711.09883 | last1 = Leike | first1 = Jan | last2 = Martic | first2 = Miljan | last3 = Krakovna | first3 = Victoria | last4 = Ortega | first4 = Pedro A. | last5 = Everitt | first5 = Tom | last6 = Lefrancq | first6 = Andrew | last7 = Orseau | first7 = Laurent | last8 = Legg | first8 = Shane | year = 2017 | title = AI Safety Gridworlds | class = cs.AI }}</ref><ref>{{cite arXiv | eprint = 1611.08219 | last1 = Hadfield-Menell | first1 = Dylan | last2 = Dragan | first2 = Anca | last3 = Abbeel | first3 = Pieter | last4 = Russell | first4 = Stuart | class = cs.AI | title = The Off-Switch Game | year = 2016 }}</ref>

== Limitations ==

There are some inherent difficulties in multi-agent [[deep reinforcement learning]].<ref>{{Cite journal|last1=Hernandez-Leal|first1=Pablo|last2=Kartal|first2=Bilal|last3=Taylor|first3=Matthew E.|date=2019-11-01|title=A survey and critique of multiagent deep reinforcement learning|journal=Autonomous Agents and Multi-Agent Systems|language=en|volume=33|issue=6|pages=750–797|doi=10.1007/s10458-019-09421-1|issn=1573-7454|arxiv=1810.05587|s2cid=52981002}}</ref> Because the other agents are learning at the same time, the environment is no longer stationary from the point of view of any single agent, and thus the [[Markov property]] is violated: transitions and rewards do not depend only on the current state of an agent.
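The effect can be seen in a worked example with prisoner's dilemma payoffs (the numbers are illustrative assumptions). From the point of view of one agent, the expected reward of a fixed action depends on how often the other agent currently cooperates, so the value of that action shifts as the other agent learns.

```python
# Agent 1's reward for (own action, other agent's action); illustrative values.
REWARD_AGENT_1 = {
    ("C", "C"): 3.0, ("C", "D"): 0.0,
    ("D", "C"): 5.0, ("D", "D"): 1.0,
}


def expected_reward(own_action, p_other_cooperates):
    """Expected reward of agent 1's action given the other agent's policy."""
    p = p_other_cooperates
    return (p * REWARD_AGENT_1[(own_action, "C")]
            + (1.0 - p) * REWARD_AGENT_1[(own_action, "D")])


# The same action by agent 1 is worth different amounts at different stages
# of the other agent's learning, although nothing else has changed.
for p in (1.0, 0.5, 0.0):
    print(p, expected_reward("C", p), expected_reward("D", p))
# 1.0 3.0 5.0
# 0.5 1.5 3.0
# 0.0 0.0 1.0
```

A single-agent algorithm that assumes a fixed reward distribution for each action is therefore chasing a moving target whenever the other agent's policy changes.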

== Further reading ==

{{Scholia|topic}}
* Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer. ''Multi-Agent Reinforcement Learning: Foundations and Modern Approaches''. MIT Press, 2024. [https://www.marl-book.com/ https://www.marl-book.com]
* Kaiqing Zhang, Zhuoran Yang, Tamer Basar. ''Multi-agent reinforcement learning: A selective overview of theories and algorithms''. Studies in Systems, Decision and Control, Handbook on RL and Control, 2021. [https://link.springer.com/chapter/10.1007/978-3-030-60990-0_12]
* {{cite arXiv | first1 = Yaodong | last1 = Yang | first2 = Jun | last2 = Wang | title = An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective | eprint = 2011.00583 | year = 2020 | class = cs.MA }}

== References ==

{{reflist}}

[[Category:Reinforcement learning]] [[Category:Multi-agent systems]] [[Category:Deep learning]]