Multi-Agent Reinforcement Learning
Reinforcement learning (RL) can be extended to multi-agent systems (MAS) by replacing the single-agent Markov decision process (MDP) formulation with a more general Markov game, also known as a stochastic game (Shapley, 1953). A Markov game, adopted by Littman (1994) as a multi-agent extension of the MDP formalism, is a mathematical framework that uses game theory to reason about multiple interacting agents in a shared environment. It is formalised (Yang and Wang, 2021) by the tuple:
$$\left\langle N, \mathbb{S}, \{\mathbb{A}^i\}_{i \in \{1, \ldots, N\}}, \mathcal{P}, \{\mathcal{R}^i\}_{i \in \{1, \ldots, N\}}, \gamma \right\rangle$$
where:
- $N$ is the number of agents;
- $\mathbb{S}$ is the state space observed by all agents;
- $\mathbb{A}^i$ is the set of actions for agent $i$, and $\mathbf{A}:=\mathbb{A}^1 \times \cdots \times \mathbb{A}^N$ is the joint action set;
- $\mathcal{P}: \mathbb{S} \times \mathbf{A} \rightarrow \Delta(\mathbb{S})$ is the transition probability from state $s \in \mathbb{S}$ to $s' \in \mathbb{S}$ given the joint action $\mathbf{a} \in \mathbf{A}$ at time $t$, where $\Delta(\mathbb{S})$ is the set of probability distributions over $\mathbb{S}$;
- $\mathcal{R}^i: \mathbb{S} \times \mathbf{A} \times \mathbb{S} \rightarrow \mathbb{R}$ is the reward function for agent $i$ after transitioning from state $s \in \mathbb{S}$ to $s' \in \mathbb{S}$ given the joint action $\mathbf{a} \in \mathbf{A}$ at time $t$; and
- $\gamma \in [0,1]$ is the discount factor dictating the importance of future rewards.
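To make the tuple concrete, below is a minimal Python sketch of a two-agent Markov game; the class `TwoAgentMarkovGame`, its methods, and the specific transition probabilities and rewards are illustrative assumptions rather than part of any established library.

```python
import random


class TwoAgentMarkovGame:
    """Minimal illustration of the tuple <N, S, {A^i}, P, {R^i}, gamma> for N = 2."""

    def __init__(self, gamma=0.95):
        self.n_agents = 2              # N
        self.states = ["s0", "s1"]     # S, shared and fully observed by both agents
        self.actions = [0, 1]          # A^i, here identical for both agents
        self.gamma = gamma             # discount factor
        self.state = "s0"

    def transition(self, state, joint_action):
        """P(. | s, a): a distribution over next states given the joint action."""
        if joint_action[0] == joint_action[1]:   # agents choose the same action
            return {"s0": 0.2, "s1": 0.8}
        return {"s0": 0.7, "s1": 0.3}

    def rewards(self, state, joint_action, next_state):
        """R^i(s, a, s'): one reward per agent for the same transition."""
        r1 = 1.0 if next_state == "s1" else 0.0                    # agent 1 wants to reach s1
        r2 = 1.0 if joint_action[0] == joint_action[1] else -1.0  # agent 2 wants coordination
        return (r1, r2)

    def step(self, joint_action):
        """Sample one transition of the game under the joint action (a^1, a^2)."""
        dist = self.transition(self.state, joint_action)
        next_state = random.choices(list(dist), weights=list(dist.values()))[0]
        rews = self.rewards(self.state, joint_action, next_state)
        self.state = next_state
        return next_state, rews
```

The key structural difference from a single-agent MDP interface is that `step` consumes a joint action and returns one reward per agent for the same transition.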
The goal of each agent $i$ in a multi-agent reinforcement learning (MARL) scenario is to solve the Markov game by finding the behavioural policy, known as a mixed strategy in game theory, $\pi^i \in \Pi^i: \mathbb{S} \rightarrow \Delta(\mathbb{A}^i)$, that maximises its discounted cumulative reward. Finding an optimal policy requires each agent to strategically account for the actions of the other agents when determining its own policy (Yang and Wang, 2021).
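Written out under the notation above, and assuming the standard expected-return objective (a sketch; the exact formulation varies slightly between sources), agent $i$ seeks

$$\max_{\pi^i} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t \, \mathcal{R}^i\left(s_t, \mathbf{a}_t, s_{t+1}\right)\right], \qquad \mathbf{a}_t = \left(a_t^1, \ldots, a_t^N\right), \quad a_t^j \sim \pi^j(\cdot \mid s_t),$$

so the return attainable by agent $i$ depends on the joint policy of all agents rather than on $\pi^i$ alone. The type of game is determined by the reward structure (Gronauer and Diepold, 2022):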
- In a fully-cooperative setting, all agents share the same reward function, $\mathcal{R} = \mathcal{R}^1 = \dots = \mathcal{R}^N$, for every state transition, which encourages collaboration to maximise team performance. In a more general cooperative game, agents are encouraged to collaborate but do not share identical rewards.
- In a fully-competitive setting, also known as a zero-sum Markov game, the rewards sum to zero for every state transition, i.e., $\sum_{i=1}^{N} \mathcal{R}^i(s, \mathbf{a}, s') = 0$ for all $s, s' \in \mathbb{S}$ and $\mathbf{a} \in \mathbf{A}$, so one agent can only gain at the expense of the others. In a more general competitive game, agents are encouraged to beat their opponents but the rewards do not necessarily sum to zero.
- A mixed setting, also known as a general-sum Markov game, is neither fully cooperative nor fully competitive and imposes no restrictions on the agents' reward functions.
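As a small illustration of these three reward structures, the hypothetical helper below inspects the per-agent rewards $\mathcal{R}^i(s, \mathbf{a}, s')$ produced by a single transition; note that the cooperative and zero-sum conditions must in fact hold for every transition, not just one, so this is only a sketch.

```python
def classify_transition_rewards(rewards, tol=1e-9):
    """Illustrative check of one transition's per-agent rewards R^i(s, a, s').

    A real classification would require the condition to hold for every
    state, joint action, and next state, not just a single transition.
    """
    if max(rewards) - min(rewards) < tol:
        return "fully cooperative (identical rewards)"
    if abs(sum(rewards)) < tol:
        return "fully competitive (zero-sum)"
    return "mixed (general-sum)"


print(classify_transition_rewards([1.0, 1.0, 1.0]))   # fully cooperative
print(classify_transition_rewards([2.0, -2.0]))       # zero-sum
print(classify_transition_rewards([1.5, -0.5, 0.0]))  # general-sum
```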
References
- Shapley, L.S., 1953. Stochastic Games. Proceedings of the National Academy of Sciences [Online], 39(10), pp.1095–1100.
- Littman, M.L., 1994. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann, pp.157–163.
- Yang, Y. and Wang, J., 2021. An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective [Online]. arXiv.
- Gronauer, S. and Diepold, K., 2022. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review [Online], 55(2), pp.895–943.