This tutorial provides a simple introduction to using multi-agent reinforcement learning, assuming a little experience in machine learning and knowledge of Python.
A Brief Introduction to Reinforcement Learning
Reinforcement learning stems from using machine learning to optimally control an agent in an environment. It works by learning a policy, a function that maps an observation obtained from the environment to an action. Policy functions are typically deep neural networks, which gives rise to the name “deep reinforcement learning.”
The goal of reinforcement learning is to learn an optimal policy, a policy that achieves the maximum expected reward from the environment when acting. The reward is a single scalar value that is returned by the environment immediately after an action. The whole process can be visualized like this:
This paradigm of reinforcement learning encompasses an incredible variety of scenarios, like a character in a computer game (e.g. an Atari game, where the reward is the change in score), a robot delivering food in a city (where the agent is rewarded positively for successfully completing a trip and penalized for taking too long), or a bot trading stocks (where the reward is money gained).
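To make that loop concrete, here is a minimal sketch of single-agent interaction using Gym’s CartPole environment (chosen purely for illustration, and assuming the older Gym API used throughout this tutorial), with a random policy standing in for a learned one:
import gym

# A random policy stands in for a learned one: it ignores the observation
# and samples an action uniformly from the action space.
env = gym.make("CartPole-v1")
observation = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # a learned policy would map observation -> action
    observation, reward, done, info = env.step(action)  # the environment returns a scalar reward
    total_reward += reward
print(total_reward)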
Multi-Agent Reinforcement Learning
Learning to play multiplayer games represents many of the most profound achievements of artificial intelligence in our lifetimes. These accomplishments include learning to play Go, DOTA 2, and StarCraft 2 at superhuman levels of performance. Using reinforcement learning to control multiple agents is, unsurprisingly, referred to as multi-agent reinforcement learning. In general it’s the same as single-agent reinforcement learning: each agent tries to learn its own policy that optimizes its own reward. Using a central policy for all agents is possible, but then the agents would have to communicate with a central server to compute their actions (which is problematic in most real-world scenarios), so in practice decentralized multi-agent reinforcement learning is used. This can be visualized as follows:
Multi-agent deep reinforcement learning, what we’ll be doing today, similarly just uses deep neural networks to represent the learned policies in multi-agent reinforcement learning.
Gym is a famous library in reinforcement learning developed by OpenAI that provides a standard API for environments, so that they can be easily learned with different reinforcement learning codebases, and so that different environments can easily be tried with the same learning codebase. PettingZoo is a newer library that’s like a multi-agent version of Gym. Its basic API usage looks like this:
from pettingzoo.butterfly import pistonball_v3
env = pistonball_v3.env()
env.reset()
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    action = policy(observation, agent) if not done else None
    env.step(action)
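Here policy stands in for whatever function you use to map an agent’s observation to an action. A minimal placeholder that just samples random legal actions (purely for illustration, and assuming the older PettingZoo API where action spaces are exposed through the action_spaces dict) could look like this:
def policy(observation, agent):
    # Random stand-in policy: ignore the observation and sample a legal action.
    # Newer PettingZoo releases expose env.action_space(agent) instead of this dict.
    return env.action_spaces[agent].sample()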
The environment we’ll be learning today is Pistonball, a cooperative environment from PettingZoo:
In it, each piston is an agent that can be separately controlled by a policy. The observation is the space above and next to the piston, e.g.:
The action the policy returns is the amount to raise or lower the piston (from -4 to 4 pixels). The goal is for the pistons to learn how to work together to roll the ball to the left wall as fast as possible. Each piston agent is rewarded negatively if the ball moves right, positively if the ball moves left, and receives a small amount of negative reward at every time step to incentivize moving to the left as fast as possible.
A plethora of techniques exist for learning a single-agent environment in reinforcement learning, and these serve as the basis for algorithms in multi-agent reinforcement learning. The simplest and most popular approach is to have a single policy network shared between all agents; this is often referred to simply as “parameter sharing”. That’s what we’ll be using today, together with PPO, a single-agent method that is one of the best for continuous control tasks like this one.
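Conceptually, parameter sharing just means that every agent’s action is computed by one and the same network, so experience from all agents updates a single set of weights. Here is a rough, self-contained sketch of the idea (shared_network is a hypothetical stand-in for the real policy network, and the observations are dummies):
import numpy as np

def shared_network(observation):
    # Hypothetical stand-in for the single policy network shared by all agents.
    return float(np.random.uniform(-1, 1))

# Dummy frame-stacked observations for two pistons.
observations = {"piston_0": np.zeros((84, 84, 3)), "piston_1": np.zeros((84, 84, 3))}

# Every agent queries the same network; only the observations differ.
actions = {agent: shared_network(obs) for agent, obs in observations.items()}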
First we begin with imports:
from stable_baselines.common.policies import CnnPolicy
from stable_baselines import PPO2
from pettingzoo.butterfly import pistonball_v3
import supersuit as ss
PettingZoo we’ve already discussed, but let’s talk about Stable Baselines. A few years back OpenAI released the “baselines” repository, which included implementations of most of the major deep reinforcement learning algorithms. That repository was later turned into the Stable Baselines library, intended to let beginners and practitioners of reinforcement learning easily learn Gym environments. The CnnPolicy it provides is simply a deep convolutional neural network that automatically resizes its input and output layers to match the observation and action spaces of the environment. SuperSuit is a package that provides preprocessing functions for both Gym and PettingZoo environments, as we’ll see below. Environments and wrappers are versioned to ensure that comparisons in academic research are precisely reproducible.
First, we initialize the PettingZoo environment:
env = pistonball_v3.parallel_env(n_pistons=20, local_ratio=0, time_penalty=-0.1, continuous=True, random_drop=True, random_rotate=True, ball_mass=0.75, ball_friction=0.3, ball_elasticity=1.5, max_cycles=125)
Each of those arguments controls how the environment functions in some way, and they are all documented here. The alternative parallel_env mode that we need to use here is documented here as well.
The first problem we have to deal with is that the environment’s observations are full-color images. We don’t need the color information, and processing all 3 color channels is roughly 3x more computationally expensive for the neural network than processing grayscale images. We can fix this by wrapping the environment with SuperSuit (remember, we imported it as ss above), as shown below:
env = ss.color_reduction_v0(env, mode='B')
Note that the B flag actually takes just the blue channel of the image rather than converting all the channels to grayscale; this saves processing time, since the conversion will be done hundreds of thousands of times during training. After this, observations will look like this:
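If the blue-channel trick sounds odd, the following rough NumPy illustration shows what it amounts to: keeping index 2 of the last axis rather than computing a weighted average of all three channels (the frame shape here is a dummy, not the exact Pistonball observation shape).
import numpy as np

rgb_frame = np.zeros((100, 100, 3), dtype=np.uint8)  # dummy full-color observation (H, W, RGB)
blue_only = rgb_frame[:, :, 2]  # roughly what mode='B' keeps: one channel instead of three
print(blue_only.shape)  # (100, 100)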
Despite the observations for each piston being grayscale, the images are still very large and contain more information than we need. Let’s shrink them down; 84×84 is a popular size for this in reinforcement learning because it was used in a famous paper by DeepMind. Fixing that with SuperSuit looks like this:
env = ss.resize_v0(env, x_size=84, y_size=84)
After this, the observations will look something like this:
The last major thing we want to do might seem slightly odd at first. Because the ball is in motion, we want to give the policy network an easy way of seeing how fast it’s moving and accelerating. The simplest way to do that is to stack the past few frames together as the channels of each observation. Stacking 3 frames together gives enough information to compute acceleration, though 4 is more standard. This is how you do that with SuperSuit:
env = ss.frame_stack_v1(env, 3)
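At this point the whole preprocessing pipeline is in place. As a quick sanity check (assuming the older PettingZoo API where parallel environments expose an observation_spaces dict; newer releases use env.observation_space(agent) instead), you can confirm that each agent now sees an 84×84 image with 3 stacked frames:
first_agent = env.possible_agents[0]
print(env.observation_spaces[first_agent].shape)  # expected: (84, 84, 3) after stacking 3 grayscale frames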
Next, we need to convert the environment’s API a little, which will cause Stable Baselines to do parameter sharing of the policy network on a multi-agent environment (instead of learning a single-agent environment like normal). The details of this are beyond the scope of this tutorial, but they are documented here for those who want to know more.
env = ss.pettingzoo_env_to_vec_env_v0(env)
Finally, we need to set the environment to run multiple versions of itself in parallel. Playing through the environment multiple times at once makes learning faster and is important for PPO’s learning performance. SuperSuit offers many ways to do this, and the one we want to use here is this:
env = ss.concat_vec_envs_v0(env, 8, num_cpus=4, base_class='stable_baselines')
8 refers to the number of times we’re duplicating the environment, and num_cpus is the number of CPU cores these copies will run on. Both are hyperparameters, and you’re free to play around with them. In our experience, running more than 2 environments per thread can get problematically slow, so keep that in mind.
Finally, we can get to some actual learning. This can be done pretty easily with Stable Baselines in a few more lines of code:
model = PPO2(CnnPolicy, env, verbose=3, gamma=0.99, n_steps=125, ent_coef=0.01, learning_rate=0.00025, vf_coef=0.5, max_grad_norm=0.5, lam=0.95, nminibatches=4, noptepochs=4, cliprange_vf=0.2)
model.learn(total_timesteps=2000000)
model.save("policy")
This instantiates the PPO learning object and then trains it. All the arguments are hyperparameters, which you can read about in great detail here. The total_timesteps argument in the .learn() method refers to actions taken by individual agents, not the total number of times the game is played. For a rough sense of scale: with 20 pistons and max_cycles=125, one full game contributes about 20 × 125 = 2,500 agent steps, so 2,000,000 timesteps corresponds to on the order of 800 games.
Training will take roughly 2 hours with a modern 8 core CPU and a 1080Ti (like all deep learning this is fairly GPU intensive). If you don’t have a GPU, training this on Google Cloud Platform with a T4 GPU should cost less than $2.
Watching Our Algorithm Play the Game
Once we’ve trained and saved this model, we can load our policy and watch it play. First, let’s reinstantiate the environment, using the normal API this time:
env = pistonball_v3.env()
env = ss.color_reduction_v0(env, mode='B')
env = ss.resize_v0(env, x_size=84, y_size=84)
env = ss.frame_stack_v1(env, 3)
Then, let’s load the policy:
model = PPO2.load("policy")
We can then use the policy to render the environment on our desktop as follows:
env.reset()
for agent in env.agent_iter():
    obs, reward, done, info = env.last()
    act = model.predict(obs, deterministic=True)[0] if not done else None
    env.step(act)
    env.render()
That should produce something like this gif:
Notice how this isn’t as clean as the gif shown at the beginning. That gif was generated with a hand-made policy that performs something closer to the optimal policy, available here. We leave tuning PPO’s hyperparameters to achieve that level of performance as an exercise for interested readers.
The full code for this tutorial is available here. If you found this tutorial useful, consider starring the projects involved (PettingZoo, SuperSuit and Stable Baselines).