

Motivation
Model-Free Reinforcement Learning algorithms have achieved impressive results, and researchers keep coming up with new ideas to further improve their performance. But despite all these benefits and recent improvements, it is common consensus that Model-Free algorithms are extremely data inefficient: they require millions of frames or samples to learn good policies and accurate value functions. This makes them poorly suited for real-world applications in industry, for example robotics.
In contrast, Model-Based approaches were introduced that often claim to be much more data efficient than their Model-Free counterparts, since a given or learned model of the environment enables planning, look-ahead search, or data augmentation. But does this data efficiency really hold up?
Recent MB-RL papers clearly show the gain in efficiency from using a model. However, in the analyzed cases the Model-Free algorithms had a much lower Update-To-Data (UTD) ratio, which is the number of updates taken by the agent relative to the number of actual interactions with the environment. For example, the state-of-the-art MB-RL algorithm Model-Based Policy Optimization (MBPO) uses a large UTD ratio of 20–40. This high ratio is possible because the algorithm updates the agent with a mix of real data from the environment and “fake” data generated by its model. In comparison, the state-of-the-art MF-RL algorithm Soft Actor-Critic (SAC) has a UTD ratio of only 1.
High UTD ratios in MF algorithms usually cause instability in training and can lead to convergence problems and overall bad performance. However, a recent paper (Do Recent Advancements in Model-Based Deep Reinforcement Learning Really Improve Data Efficiency?) showed that with fine-tuned and adapted parameters, higher UTD ratios are possible for Model-Free agents. Their agent OTRainbow (OverTrained Rainbow) used 8 network updates per interaction and achieved the same sample efficiency as the Model-Based algorithm it was compared against (SimPLe) on the Atari game environments, while needing much less computation and training much faster (24 hours compared to 3 weeks).
The question is: are even higher UTD ratios possible for MF-RL, ratios close to those of MB-RL? The paper I want to present, Randomized Ensembled Double Q-Learning: Learning Fast Without a Model, gives an answer to that and shows impressive results!
In the paper Randomized Ensembled Double Q-Learning (REDQ), the authors present a simple Model-Free algorithm whose sample efficiency is as good as or even better than that of modern Model-Based algorithms. Remarkably, the algorithm uses no model rollouts and performs all updates on real data obtained from the environment. But how did they achieve that? The authors name 3 main points:
- Using a UTD ratio >> 1
- Having an ensemble of Q-Functions
- Doing in-target minimization across a random subset of Q-Functions from the ensemble
REDQ can easily be applied to current off-policy Model-Free algorithms. In the paper, the authors use SAC as a baseline and modify it accordingly to create the REDQ agent. In short, the three hyperparameters G, N, and M describe all the necessary changes. G is the number of updates executed per environment step (the UTD ratio); in the paper, the authors set it to G=20.
N is the number of Q-Functions in the ensemble; an ensemble of N=10 Q-Functions seems to work best, as the authors suggest. And finally, M is the size of the randomly sampled subset of the ensemble used to compute the Q-target values.
By changing these three key parameters we can basically recover two other MF algorithms that share some similarities. When N = M = 2 and G = 1, REDQ simply becomes the underlying off-policy algorithm, such as SAC. When N = M > 2 and G = 1, REDQ is similar, but not equivalent, to the Maxmin Q-Learning algorithm. Maxmin Q-Learning also uses an ensemble and also minimizes over multiple Q-Functions in the target. However, Maxmin Q-Learning and REDQ differ in several ways. Maxmin Q-Learning, for example, minimizes over the full ensemble in the target, whereas REDQ minimizes over a random subset of Q-Functions. Also, unlike Maxmin Q-Learning, REDQ controls over-estimation bias and the variance of the Q estimate separately, by setting M and N independently.
Having defined the hyperparameters for REDQ as G=20, N=10, and M=2, the update schedule for the Q-Function ensemble looks like the following:
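To make this concrete, here is a minimal, simplified sketch in PyTorch-style Python of what one environment step of REDQ on top of SAC can look like. All names (replay_buffer, q_nets, q_targets, policy, alpha, …) are placeholders of mine, not the authors' implementation, and details such as the entropy term follow the SAC baseline:

```python
import random
import torch
import torch.nn.functional as F

G, N, M = 20, 10, 2     # UTD ratio, ensemble size, in-target subset size
gamma = 0.99            # discount factor

def redq_update(replay_buffer, q_nets, q_targets, q_optims,
                policy, policy_optim, alpha):
    """One environment step worth of REDQ updates (simplified sketch)."""
    for _ in range(G):                          # G critic updates per env step
        s, a, r, s2, done = replay_buffer.sample_batch()

        with torch.no_grad():
            a2, logp_a2 = policy.sample(s2)     # SAC-style stochastic policy
            # In-target minimization over a random subset of M target critics
            subset = random.sample(range(N), M)
            q_min = torch.min(
                torch.stack([q_targets[i](s2, a2) for i in subset]), dim=0
            ).values
            y = r + gamma * (1 - done) * (q_min - alpha * logp_a2)

        # Every Q-Function in the ensemble is trained on the same target y
        for q_net, opt in zip(q_nets, q_optims):
            loss = F.mse_loss(q_net(s, a), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # (Target networks would be Polyak-averaged here, as in SAC.)

    # The policy is updated once per env step, using the average of all N critics
    s, _, _, _, _ = replay_buffer.sample_batch()
    a_new, logp = policy.sample(s)
    q_avg = torch.stack([q(s, a_new) for q in q_nets]).mean(dim=0)
    policy_loss = (alpha * logp - q_avg).mean()
    policy_optim.zero_grad()
    policy_loss.backward()
    policy_optim.step()
```

Note how M only affects the target computation: all N Q-Functions are trained on the same target, and the policy update uses the average over the full ensemble.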
Q-Function Overestimation
As you might know, Reinforcement Learning agents that use a Q-Function for state-action value estimation suffer from overestimation bias. The algorithms mentioned above have their own strategies to overcome this issue: SAC maintains 2 Q-Functions and takes the minimum of their estimates when computing the update target, while Maxmin Q-Learning keeps an ensemble and minimizes over multiple Q-Functions. But why does REDQ, with only slight changes and a much higher UTD ratio, perform so much better than these algorithms?
The authors' analysis shows that REDQ has a very low normalized standard deviation of the bias for most of the training, indicating that the bias across different in-distribution state-action pairs is about the same. Furthermore, throughout most of the training, REDQ has a small and near-constant under-estimation bias.
The two critical components of REDQ, the ensemble of Q-Functions and the in-target minimization, allow it to maintain a stable and near-uniform bias under high UTD ratios.
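To get an intuition for how such a bias can be measured at all, here is a rough sketch: roll out the current policy, use the discounted Monte Carlo return as the "ground-truth" Q-value, and compare it with the ensemble's prediction. The Gym-style environment API, the helper names, and the normalization by the mean absolute return are my own simplifications, not necessarily the exact protocol of the paper:

```python
import numpy as np

def estimate_q_bias(env, policy, q_ensemble, gamma=0.99,
                    n_rollouts=20, horizon=1000):
    """Rough Monte Carlo estimate of the (normalized) Q bias of an ensemble."""
    biases, mc_returns = [], []
    for _ in range(n_rollouts):
        s, _ = env.reset()
        a = policy.act(s)
        # Ensemble prediction for the initial state-action pair
        q_pred = np.mean([q.predict(s, a) for q in q_ensemble])

        # Discounted Monte Carlo return of the actual rollout
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            s, r, terminated, truncated, _ = env.step(a)
            ret += discount * r
            discount *= gamma
            if terminated or truncated:
                break
            a = policy.act(s)

        biases.append(q_pred - ret)
        mc_returns.append(ret)

    norm = np.mean(np.abs(mc_returns)) + 1e-8   # avoid division by zero
    normalized_bias = np.array(biases) / norm
    return normalized_bias.mean(), normalized_bias.std()
```

A mean close to zero (or slightly negative) with a small standard deviation is exactly the behavior the authors report for REDQ.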
Performance
Now let’s see how REDQ actually performs and whether it can keep the authors’ promise of competing with state-of-the-art MB-RL in terms of sample efficiency.
Compared with SAC, REDQ achieves much better sample efficiency on the MuJoCo tasks Humanoid and Ant, reaching a score of 5000 much faster. REDQ and MBPO learn significantly quicker than SAC, with REDQ performing slightly better than MBPO overall. In particular, REDQ learns significantly faster for Hopper and has better asymptotic performance for Hopper, Walker2d, and Humanoid. Averaging across the environments, REDQ performs 1.4x better than MBPO half-way through training and 1.1x better at the end of training.
This clearly shows that a simple MF algorithm can match or exceed the sample efficiency of the state-of-the-art Model-Based algorithm on the MuJoCo environments.
Further, the authors compared the computational resources needed to train REDQ and MBPO, as well as the number of learned parameters.
REDQ uses fewer parameters than MBPO for all four environments, specifically between 26% and 70% as many parameters depending on the environment. The authors additionally measured the runtime on a 2080 Ti GPU and found that MBPO takes roughly 75% longer. Overall, this shows that REDQ is not only at least as sample efficient as MBPO, but also uses fewer parameters and is significantly faster in terms of wall-clock time.
Improving REDQ with Auxiliary Feature Learning
On top of these already impressive results, the authors also tried to further improve the performance of REDQ by incorporating better representation learning. To this end, they added the recently proposed Online Feature Extractor Network (OFENet), which learns representation vectors from environment data and provides them to the agent as additional input.
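As a rough intuition (and explicitly not the authors' code): the extractor maps raw observations to a learned feature vector that is concatenated with the original input before it is passed to the actor and the critics. A toy sketch with hypothetical module names and sizes:

```python
import torch
import torch.nn as nn

class SimpleFeatureExtractor(nn.Module):
    """Toy stand-in for an OFENet-style extractor: maps raw observations
    to a learned feature vector that is appended to the agent's input."""
    def __init__(self, obs_dim, feat_dim=240):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # OFENet trains its features with an auxiliary objective such as
        # predicting the next observation; that prediction head is omitted here.

    def forward(self, obs):
        feats = self.net(obs)
        # The agent receives the raw observation plus the learned features
        return torch.cat([obs, feats], dim=-1)
```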
The authors call the agent with the added OFENet module REDQ-OFE. They found that OFENet did not help much for Hopper and Walker2d, which may be because REDQ already learns very fast, leaving little room for improvement. But the online feature extraction could further improve REDQ's performance on the more challenging environments Ant and Humanoid.
Here, REDQ-OFE achieves an impressive 7x boost over SAC's sample efficiency in reaching a score of 5000 on Ant and Humanoid, and outperforms MBPO with 3.12x and 1.26x its performance at 150K and 300K data points, respectively.
It is also impressive that REDQ-OFE achieves a much stronger result on Humanoid with far fewer parameters.
Conclusion
The authors' proposed algorithm displayed impressive results and also showed that, combined with OFE, REDQ-OFE can learn extremely fast on the challenging environments Ant and Humanoid. Further, the results show that MF algorithms can compete with MB-RL in terms of sample efficiency, making it unnecessary, at least for the MuJoCo benchmark, to use a model of the environment to achieve high sample efficiency.
I hope you enjoyed the read, and I encourage you to check out the original paper for more details about REDQ. In an upcoming article I will take a closer look at the mentioned OFENet representation learning algorithm and write about auxiliary tasks in Reinforcement Learning. In the meantime, if you want to read more about Deep Reinforcement Learning, feel free to check out my other articles… for example: