The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system. (Forrester 1971)
In this article, I want to introduce the paper World Models by David Ha and Jürgen Schmidhuber.
In our daily life, we are confronted with tons of information from the world around us, streaming in through our different senses. Since we are not able to process in detail everything that we see, smell, feel, or hear, our brain learns abstract representations. These representations cover spatial and temporal aspects and help us navigate and interact with our world.
Based on these representations, we build our own model of the world surrounding us. It is important to note that this model differs from person to person, because of the various and diverse experiences, feelings, and situations that each of us has lived through.
However, for all of us, these models that we create subconsciously help us significantly in our daily life: physically, in how we move around and interact with the environment (reflexes), but also mentally, by giving us hints about how things could work out (intuition).
For example, imagine walking on a flat road that is interrupted by a construction site, so that instead of the flat, neat surface there are now pebbles of different sizes to walk on. Because we have often experienced such situations and our model includes them, our movements adapt automatically and instinctively, without much thought or planning.
In other words, our model of the world constantly makes small predictions about the sensory data we are about to observe, letting us react instinctively and adapt our “motor actions” without any explicit planning.
Another great example is hitting a flying ball, be it in table tennis or baseball. After playing table tennis for a while, we know quite well how the ball flies and behaves according to the physics of the world. Our own model is thus quite accurate about the ball’s behavior and how to react to it.
So in a professional match, when the ball is flying towards a player at up to 170 km/h, the player doesn’t stop to plan how to stand or how to hold hand and racket to hit the ball perfectly. All of this happens instinctively, driven by internal predictions of where the ball will go, based on the model the player has learned and developed.
Now, how can we make use of such world models for our reinforcement learning agent? Recurrent neural networks (RNNs) are highly expressive models, capable of learning rich spatial and temporal representations of data.
However, models that can capture information about complicated and high-resolution environments need to be huge, with a lot of parameters. In comparison, in model-free RL, neural networks are used to learn and represent a policy or a value function. But usually those networks are rather small, since the algorithms are often bottlenecked by the credit assignment problem, which makes it hard for traditional RL algorithms to learn the millions of weights of a large model.
The authors of the paper propose to:
1. Learn a large model of the environment in an unsupervised manner, yielding a compressed spatial and temporal representation.
2. Train a small controller model with the help of this learned world model.
But let’s have a look at how this might be done in detail…
In general, the agent can be divided into three sub-models: the V-Model, the M-Model, and the C-Model. Together, the V-Model and M-Model form the World Model. The C-Model, on the other hand, is a simple controller for the decision-making process and the interaction with the environment. But let’s have a more in-depth look at each of these models.
V-Model
When interacting with the environment, at each step the agent sees one 2D image frame which is part of a video sequence. The task of the V-Model is simply to compress what the agent sees at each frame into a small representative code.
There are different methods for encoding observations, such as PCA or autoencoders. In the paper, the authors used a simple Variational Autoencoder (VAE) for this task, compressing each image frame into a small latent vector z.
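To make the encoding step concrete, here is a minimal sketch of how a frame could be compressed into a latent vector z via the VAE’s reparameterization trick. The random linear encoder is an illustrative stand-in, not the paper’s convolutional architecture; the latent size of 32 follows the CarRacing setup described later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the VAE encoder: a single random linear map.
# (The paper's V-Model is a convolutional VAE; this sketch only
# illustrates the reparameterization step that produces z.)
FRAME_DIM = 64 * 64 * 3   # one RGB frame, flattened
Z_DIM = 32                # latent size used for CarRacing in the paper

W_mu = rng.normal(0, 0.01, (Z_DIM, FRAME_DIM))
W_logvar = rng.normal(0, 0.01, (Z_DIM, FRAME_DIM))

def encode(frame):
    """Compress one frame into a latent vector z via the VAE trick:
    z = mu + sigma * eps, with eps ~ N(0, I)."""
    x = frame.reshape(-1)
    mu = W_mu @ x
    logvar = W_logvar @ x
    eps = rng.standard_normal(Z_DIM)
    return mu + np.exp(0.5 * logvar) * eps

frame = rng.random((64, 64, 3))
z = encode(frame)
print(z.shape)  # (32,)
```

The sampling step (rather than a deterministic encoding) is what makes the VAE’s latent space smooth, which matters later when the M-Model has to predict future z vectors.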
M-Model or MDN-RNN
The V-Model compresses what the agent sees at each time step; however, we also want to compress what happens over time. For this purpose, the role of the M-Model is to predict the future: it serves as a predictive model of the future z vectors that the V-Model is expected to produce.
The M-Model is represented by an RNN which outputs a probability density function p(z). Since most complex environments have a stochastic nature, this probability function p(z) captures the stochasticity better than a deterministic prediction of z.
Hereby, p(z) is approximated as a mixture of Gaussian distributions, and the RNN is trained to output the probability distribution of the next latent vector z_t+1 given the current and past information made available to it. Further, a temperature parameter τ is added during sampling to control model uncertainty. This makes the M-Model a Mixture Density Network (MDN), here combined with an RNN to create an MDN-RNN.
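As a rough sketch of how such a mixture output is sampled, consider one latent dimension with five mixture components (the parameter values below are made up for illustration). The temperature handling follows the common convention: dividing the mixture logits by τ flattens the component weights, and scaling each Gaussian’s width by √τ adds noise to the sample itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdn(logit_pi, mu, sigma, tau=1.0):
    """Sample one value from a Gaussian mixture, as the MDN-RNN does
    per latent dimension. Higher tau = more random predictions."""
    logit = logit_pi / tau                    # temperature-adjusted weights
    pi = np.exp(logit - logit.max())
    pi /= pi.sum()                            # softmax over components
    k = rng.choice(len(pi), p=pi)             # pick a mixture component
    return rng.normal(mu[k], sigma[k] * np.sqrt(tau))

# made-up mixture parameters for one dimension of z_t+1
logit_pi = np.array([2.0, 0.5, 0.0, -1.0, -2.0])
mu = np.array([0.0, 1.0, -1.0, 2.0, -2.0])
sigma = np.array([0.1, 0.2, 0.1, 0.3, 0.2])

z_next = sample_mdn(logit_pi, mu, sigma, tau=1.15)
```

At τ → 0 the sample collapses toward the mean of the most likely component (a near-deterministic prediction); at large τ all components become roughly equally likely.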
C-Model
As the decision-making module, the task of the C-Model is to maximize the expected cumulative reward of the agent during a rollout of the environment. Since this is the only task of the C-Model, the authors tried to make it as simple and small as possible: it is represented by a single-layer linear model that maps the latent vector z and the hidden state h directly to an action a.
Combining z with h gives the controller C a good representation of both the current observation and what to expect in the future.
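In code, the controller really is just one linear map over the concatenation of z and h. The sizes follow the CarRacing setup (z of size 32, an LSTM hidden state of size 256, three actions); the tanh squashing into a valid action range is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

Z_DIM, H_DIM, A_DIM = 32, 256, 3   # latent, RNN hidden, and action sizes

# The controller is literally one linear layer: a = W_c [z; h] + b_c
W_c = rng.normal(0, 0.1, (A_DIM, Z_DIM + H_DIM))
b_c = np.zeros(A_DIM)

def controller(z, h):
    a = W_c @ np.concatenate([z, h]) + b_c
    return np.tanh(a)  # squash into [-1, 1] (an assumption of this sketch)

a = controller(rng.standard_normal(Z_DIM), rng.standard_normal(H_DIM))
print(a.shape)  # (3,)
```

Note how few parameters this is: (32 + 256 + 1) × 3 = 867 weights, compared to the millions in the World Model.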
Also, since the C-Model is rather simple, it is trained separately from the V- and M-Models, so that most of the agent’s complexity resides in the World Model. This simplicity and separation during training also make it possible to explore more unconventional ways to train C. That’s why, in the paper, the authors used evolution strategies (ES), specifically CMA-ES, to optimize the weights of C.
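A minimal evolution-strategy loop might look as follows. This is a simplified elite-averaging ES on a made-up toy fitness function, not the CMA-ES variant used in the paper; the point is only the shape of the idea: perturb the weights, evaluate, keep the best perturbations, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolution_strategy(fitness, dim, pop_size=50, elite=10, sigma=0.1, iters=100):
    """Simplified elite-averaging ES: sample a population of weight
    perturbations, average the top-`elite` candidates by fitness."""
    w = np.zeros(dim)
    for _ in range(iters):
        noise = rng.standard_normal((pop_size, dim))
        candidates = w + sigma * noise
        scores = np.array([fitness(c) for c in candidates])
        best = candidates[np.argsort(scores)[-elite:]]  # top-`elite`
        w = best.mean(axis=0)
    return w

# Toy stand-in for "cumulative reward of controller weights w":
# reward is highest when w matches a hidden target vector.
target = np.array([0.5, -0.3, 0.8])
fitness = lambda w: -np.sum((w - target) ** 2)

w_best = evolution_strategy(fitness, dim=3)
print(w_best)  # close to target
```

In the real setup, `fitness(w)` would run full environment rollouts with the controller weights w and return the cumulative reward, which is why a tiny controller is so convenient: the search space stays small.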
Putting it all together
The V- and M-Models, which together form the World Model, are each trained with backpropagation and produce an encoded latent vector of the environment observation. Combined with the hidden state of the MDN-RNN, the future prediction, the C-Model decides what action to take. This action is then passed to the environment and executed.
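The interaction loop just described can be sketched as follows, with stub classes standing in for the trained V, M, and C models (the toy sizes and the 10-step stub environment are made up purely to make the control flow runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
Z, H, A = 4, 8, 2  # toy latent, hidden, and action sizes

class StubVAE:                       # V-Model: frame -> latent z
    def encode(self, obs): return rng.standard_normal(Z)

class StubRNN:                       # M-Model: (a, z, h) -> next hidden state
    def initial_state(self): return np.zeros(H)
    def forward(self, a, z, h): return np.tanh(h + 0.1 * rng.standard_normal(H))

class StubEnv:                       # toy environment ending after 10 steps
    def reset(self): self.t = 0; return rng.random((8, 8, 3))
    def step(self, a):
        self.t += 1
        return rng.random((8, 8, 3)), 1.0, self.t >= 10

def controller(z, h):                # C-Model stub: random linear policy
    return np.tanh(rng.standard_normal(A))

def rollout(env, vae, rnn):
    """One episode: encode frame, pick action, advance the RNN state."""
    obs = env.reset()
    h = rnn.initial_state()
    done, total = False, 0.0
    while not done:
        z = vae.encode(obs)
        a = controller(z, h)
        obs, r, done = env.step(a)
        total += r
        h = rnn.forward(a, z, h)
    return total

total_reward = rollout(StubEnv(), StubVAE(), StubRNN())
print(total_reward)  # 10.0
```

The key point is the order of operations: the RNN state h is updated *after* acting, so at decision time it summarizes everything up to and including the previous step.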
Evaluation of the World Model Agent
The authors first evaluated the agent on the CarRacing-v0 environment of OpenAI Gym, following the steps described above. However, for the first experiment, they only used the V-Model and fed just the latent vector to the C-Model.
With that, the agent was able to navigate the race track but missed sharp corners, as seen in the image above. This resulted in a score of 632 over 100 trials, in line with other methods such as A3C. Making the C-Model deeper by adding hidden layers improved the score slightly to 788, but that is still not enough to solve the environment, which requires an average score above 900 over 100 consecutive trials.
Next, they tested the full World Model agent including the M-Model. Since the M-Model is trained for only one thing, predicting the next latent vector z_t+1, its hidden state h is a good candidate for the set of learned features we can give our agent as additional information. The hidden state encapsulates what the agent should expect in the next state, providing very important information for the decision process and acting as a sort of intuition about what might happen.
The results show that giving the agent access to both z and h greatly improves its driving capability. The driving is more stable, and the agent is able to attack sharp corners effectively.
Furthermore, we see that when making these fast, reflexive driving decisions during a car race, the agent does not need to plan ahead and roll out hypothetical scenarios of the future. Since h_t contains information about the probability distribution of the future, the agent can just query the RNN instinctively to guide its action decisions.
This led to a final score of 906, effectively solving the task and setting a new state of the art.
One interesting fact about our World Model is that we can now model the future. This makes it possible to come up with hypothetical car racing scenarios on our own. Meaning that we could put the agent into this hallucinated environment generated by the M-Model and let it interact with it.
This begs the question — can we train our agent to learn inside of its own dream, and transfer this policy back to the actual environment?
Accordingly, this was an obvious next step that the authors also tested. Let’s look at how they did it and what the results were…
If our World Model is sufficiently detailed for its function and sufficiently complete for the problem and task in the environment, we should be able to substitute this world model for the real environment. The agent would then not directly perceive reality and the original environment, but only what our world model was able to capture, learn, and compress from it.
This might lead to some inaccuracy compared to reality, but if you compare it to our dreams as humans, those are sometimes also not accurate, for example when we dream about being able to fly or having superhuman traits. Still, we are able to draw conclusions from them and somehow generalize. It would be interesting if our agent could do this as well and thereby improve its learning and generalization.
So how did the authors test this? The environment of their choice was VizDoom Take Cover, where monsters in the distance shoot fireballs at the agent with the intent to kill it. The agent’s simple goal is to stay alive as long as possible. Each step or frame the agent survives counts as a +1 reward, summed over time until the agent dies to give the final cumulative reward. The task is considered solved if the average survival time over 100 consecutive rollouts is greater than 750 time steps.
Compared to the CarRacing-v0 environment, our World Model also changes slightly. Since we want to simulate the real environment, the M-Model will now additionally predict whether the agent dies in the next state (the done flag); because the reward is simply +1 per surviving step, this is enough to recover the full reward signal. This yields a more complete World Model that can closely mimic the real environment designed by a human programmer.
It learns to simulate the essential aspects of the game, be it the game logic, enemy behavior, physics, or the 3D graphics rendering, purely from raw image data collected with a random policy.
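A hypothetical sketch of what such an extra prediction head could look like: a logistic output on top of the RNN’s hidden state, producing the probability that the episode ends at the next step. The weights and sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16  # toy RNN hidden size

# Hypothetical `done` head: a logistic layer on the hidden state h.
w_done = rng.normal(0, 0.1, H)

def predict_done(h, threshold=0.5):
    """Probability that the episode ends next step, thresholded
    to a binary done flag for the dream environment."""
    p = 1.0 / (1.0 + np.exp(-(w_done @ h)))  # sigmoid
    return p > threshold

print(predict_done(np.zeros(H)))  # False (sigmoid(0) = 0.5, not > 0.5)
```

In the dream environment, this flag is what terminates a hallucinated rollout, so the controller experiences episodes of realistic length even though no real game engine is running.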
Since we only train the agent in our hallucinated dream we do not need the V-Model to encode any real pixels. All learning takes place in the latent space.
A good additional feature is that it is possible to add extra uncertainty to the virtual environment. In the real environment this would not be viable, but in the virtual environment we can simply increase the temperature parameter τ and make things more challenging for our agent in its dream. Opponents then behave more randomly and less predictably.
The overall iterative training procedure is the following:
1. Collect several thousand rollouts from the environment with a random policy.
2. Train the VAE (V) to encode each observed frame into a latent vector z.
3. Train the MDN-RNN (M) on the latent sequences to model the distribution of z_t+1 (and the done flag) given a_t, z_t, and h_t.
4. Train the controller (C) entirely inside the dream environment generated by M, maximizing the expected survival time.
5. Transfer the learned policy from the dream back to the actual environment.
But now let’s have a look at the results achieved by hallucinating.
Hallucinating Results
After training solely in the virtual environment, the agent obtained an overall score of 900 there, thus solving the task. But wait, this is only the virtual environment. How will it perform in the real one?
When transferred from the virtual to the real environment to test its performance, the agent achieved a score of 1100 over 100 consecutive trials! That is far more than the required score of 750 and, even more remarkably, higher than its score in the virtual environment! How can this be?
The authors argue that the increased temperature mentioned above adds uncertainty for the agent, which helps it generalize better. In fact, increasing τ helps prevent our controller from taking advantage of the imperfections of our world model.
More precisely, the temperature helps prevent our agent from cheating inside the World Model. Why cheating? Since the agent has direct access to the hidden state, it essentially has access to all of the internal states and memory of the game engine, rather than only the game observations a player gets to see. Our agent can therefore efficiently find ways to directly manipulate the hidden states of the virtual environment in its quest to maximize the expected cumulative reward.
Hence, it is important that we can adjust the temperature parameter τ to control the amount of randomness in the M-Model, consequently controlling the tradeoff between realism and exploitability.
Future Research Directions & Follow-Up Papers
Overall, I am really impressed by the concept and results the authors show in their paper. But one thing made me even more excited and interested: the future research direction. The authors mention plans to incorporate artificial curiosity and intrinsic motivation into the world model agent, and they give one example of how this might be done.
The task of the M-Model is to output a probability distribution over possible next frames; if the model does a poor job, the agent has encountered parts of the environment it is not yet familiar with. Therefore, it might be possible to reuse the training loss of the M-Model to encourage curiosity: we flip the sign of the loss and use it as an additional reward signal, encouraging the agent to explore parts of the environment it does not know well. This, in turn, leads to a more precise and better world model. What a simple but great idea!
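A sketch of this idea, assuming the M-Model’s per-dimension loss is the mixture negative log-likelihood described earlier (the mixture parameters and the scale factor below are made up for illustration):

```python
import numpy as np

def mdn_nll(z_true, pi, mu, sigma):
    """Negative log-likelihood of z_true under a Gaussian mixture --
    the M-Model's training loss for one latent dimension."""
    comp = pi * np.exp(-0.5 * ((z_true - mu) / sigma) ** 2) \
              / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum() + 1e-12)

def curiosity_bonus(z_true, pi, mu, sigma, scale=0.1):
    """Flip the sign of the loss: the worse M predicts the observed
    z, the larger the intrinsic reward for visiting that state."""
    return scale * mdn_nll(z_true, pi, mu, sigma)

# made-up mixture prediction for one latent dimension
pi = np.array([0.7, 0.3])
mu = np.array([0.0, 1.0])
sigma = np.array([0.2, 0.2])

# a well-predicted z earns less bonus than a surprising one
print(curiosity_bonus(0.05, pi, mu, sigma) < curiosity_bonus(3.0, pi, mu, sigma))  # True
```

The added bonus would be summed with the environment reward during controller training, so surprising states become temporarily attractive until M learns to predict them, after which the bonus fades away on its own.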
Now, after this long article, we are coming to an end, and I hope you enjoyed the exploration of World Models as much as I did. For more details, I encourage you to read the original paper: World Models
In the near future, I will add my own implementation of World Models here and show some of the results. I also plan to write about the Dreamer algorithm in the next article, a follow-up algorithm that builds on the world model’s concept of learning inside a hallucinated environment.
In the meantime, if you don’t want to wait, feel free to check out my other articles about Reinforcement Learning, for example: