Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?
One major problem of current state-of-the-art Reinforcement Learning (RL) algorithms is still the need for millions of training examples to learn a good or near-optimal policy for a given task. This is especially critical for real-world industrial applications, be it in robotics or other complex decision-making and optimal-control problems.
Because of this, engineers and researchers are looking for ways to improve sample efficiency, i.e., to speed up learning and reduce the need to gather millions of expensive training examples.
One idea researchers came up with is decoupling representation learning from the actual policy learning of the RL agent. Why would you do that? In current Deep RL algorithms, the policy of an agent maps the raw sensory input directly to an action to be executed. An important part of learning to control, however, is learning how to efficiently extract the task-relevant information from the sensory input that is needed to make informed decisions.
In robotics, and for agents designed to handle a variety of tasks, the incoming stream of sensor observations may or may not be relevant to the specific task at hand. The agent still needs to make sense of all these inputs, and in some cases the observations also require substantial preprocessing.
Keep in mind that most Deep RL agents use neural networks as function approximators that are often only 2–3 layers deep. It is difficult to learn all of the tasks mentioned above, plus the agent's original task of finding an optimal policy, with everything compressed into those 2–3 layers and driven by a single loss function that has to carry all of the training signal.
As you can imagine, this is very hard for the agent, which is why letting it create and learn its own representation offers an opportunity to improve its overall performance and generalization.
Of course, you could rely on an engineer to hand-design a good state estimator that reconstructs the state vector from a set of observations. But ideally, since one of the main goals of machine learning is to automate such processes, the agent learns this estimator on its own as well.
Learning such an observation-to-state mapping, prior to solving the RL problem, is known in the literature as state representation learning and is usually done in an unsupervised manner.
Current Applications
Representation learning has already been applied successfully to several problems, be it for visual inputs or raw digital sensor readings. The representation is usually learned from auxiliary tasks, which allows the state variable to encode prior knowledge of the task domain.
In the following, I want to give you a short overview of some methods for representation learning.
Methods for Representation Learning:
- Autoencoder (AE): compresses observations into a low-dimensional state vector. The autoencoder thereby learns a state representation that captures only the distinctive features of each observation, i.e., how it differs from other observations (see the sketch after this list). [Paper]
- Slow Feature Analysis (SFA): based on the idea that most phenomena in the world change slowly over time. SFA learns a mapping from visual observations to state representations that change only gradually over time. [Paper]
- Model Learning (ML-DDPG): learns a model network based on the concept of predictive priors, which assumes that both the next state representation and the reward should be predictable given the current state and the action taken in it. The model learns the observation-to-state mapping by back-propagating the prediction errors, which yields a state representation that is inherently predictable. The advantage of these predictive priors is that state representation learning turns from an unsupervised into a supervised learning problem. It is also goal-directed: observations that do not correlate with the reward, or that are inherently unpredictable, are not encoded in the state representation. In contrast, AE and SFA do not differentiate between observations that are useful for solving a particular task and observations that are not. [Paper]
- Augmented Temporal Contrast (ATC): trains a convolutional encoder to associate pairs of observations separated by a short time difference, under image augmentations. It uses a contrastive loss instead of relying on the reward signal to learn visual features; reward-driven feature learning can be limiting, especially in sparse-reward settings, and tends to narrow the learned representations to be task-specific rather than general. ATC outperforms other unsupervised encoders for image inputs and shows potential for generalization in multi-task settings. [Paper]
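To make the idea of learning a representation from an auxiliary loss more concrete, here is a minimal PyTorch sketch of the autoencoder approach from the first bullet. The class name, layer sizes, and observation dimension are illustrative choices of mine, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

# Sketch of the AE idea: compress an observation into a low-dimensional
# state vector and train by reconstructing the observation (unsupervised,
# no reward signal involved).
class StateAutoEncoder(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, obs_dim),
        )

    def forward(self, obs):
        z = self.encoder(obs)      # low-dimensional state representation
        recon = self.decoder(z)    # reconstruction, only used for training
        return z, recon

# Training signal: reconstruction error on a batch of raw observations.
ae = StateAutoEncoder(obs_dim=27, latent_dim=8)
obs = torch.randn(64, 27)
z, recon = ae(obs)
loss = nn.functional.mse_loss(recon, obs)
loss.backward()
```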
What all these methods have in common, and what conventional wisdom suggests, is that the lower the dimensionality of the state vector, the faster and better RL algorithms will learn. Learning lower-dimensional representations is motivated by the intuition that the state of a system is the sufficient statistic required to predict its future, and for many physical systems this sufficient statistic is fairly low-dimensional.
Contrary to this intuition, the paper I want to present enlarges the state representation and asks whether RL problems with intrinsically low-dimensional states can benefit from intentionally increasing their dimensionality using a neural network with good feature propagation.
In their paper Can Increasing Input Dimensionality Improve Deep Reinforcement Learning?, the authors introduce a representation learning module called OFENet that learns high-dimensional state features, in contrast to the common practice of compressing the state representation into a lower-dimensional vector. Through a range of experiments, the authors show that agents equipped with OFENet outperform several state-of-the-art algorithms in terms of both sample efficiency and final performance.
But how does OFENet work in detail?
Their OFENet module consists of basically three parts. The first module (state-block) maps an observation o_t to a latent vector z_ot. The second module (action-block) takes that latent vector z_ot together with the action a_t taken in that state and maps them to a new latent vector z_ot,at that combines the state and action representation. These two modules, state-block and action-block, are neural networks based on the MLP-DenseNet architecture: the output of each block is the concatenation of all of its layers' outputs, so the dimensionality grows from input to output. The third part of OFENet is a single prediction layer, which, given the expanded latent vector, predicts the next observation of the system. All three parts are trained jointly on one auxiliary task: predicting the next state.
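Based on this description, here is a condensed PyTorch sketch of what the two MLP-DenseNet blocks and the prediction layer could look like. This is my reading of the architecture, not the authors' code; the class names and the example dimensions (26-dimensional observations, 6-dimensional actions) are illustrative.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """MLP-DenseNet-style block: each layer's output is concatenated with
    its input, so the representation grows with every layer."""
    def __init__(self, input_dim: int, num_layers: int, hidden_units: int):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = input_dim
        for _ in range(num_layers):
            self.layers.append(nn.Linear(dim, hidden_units))
            dim += hidden_units          # concatenation widens the vector
        self.output_dim = dim

    def forward(self, x):
        for layer in self.layers:
            # Swish/SiLU activation, as recommended by the authors
            x = torch.cat([x, nn.functional.silu(layer(x))], dim=-1)
        return x

class OFENet(nn.Module):
    def __init__(self, obs_dim, act_dim, num_layers=8, hidden_units=30):
        super().__init__()
        self.state_block = DenseBlock(obs_dim, num_layers, hidden_units)
        self.action_block = DenseBlock(self.state_block.output_dim + act_dim,
                                       num_layers, hidden_units)
        # single linear prediction layer: predicts the next observation
        self.predictor = nn.Linear(self.action_block.output_dim, obs_dim)

    def forward(self, obs, act):
        z_o = self.state_block(obs)                             # z_ot
        z_oa = self.action_block(torch.cat([z_o, act], dim=-1))  # z_ot,at
        return z_o, z_oa, self.predictor(z_oa)

# Auxiliary task: predict o_{t+1} from (o_t, a_t) with a squared error loss.
ofe = OFENet(obs_dim=26, act_dim=6)
obs, act, next_obs = torch.randn(32, 26), torch.randn(32, 6), torch.randn(32, 26)
_, _, pred = ofe(obs, act)
aux_loss = nn.functional.mse_loss(pred, next_obs)
```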
As you can see, this training objective is completely separate from the actual policy learning, although OFENet is meant to be trained in parallel with the policy. But how does it tie in with, for example, the state-of-the-art off-policy algorithm Soft Actor-Critic (SAC)?
The two networks of SAC now receive the expanded high-dimensional latent vectors as input: z_ot for the policy (actor) and z_ot,at for the Q-function (critic). The more layers the state-block and action-block have, the higher the dimensionality of these input vectors. The authors recommend a total dimensionality increase of 240 per block, with a fixed number of hidden units per layer. For the HalfCheetah-v2 task, for example, they suggest 8 DenseNet layers per block, i.e., 240 / 8 = 30 hidden units per layer, so each block expands the dimensionality by 240 (the initial state size of 26 grows to 266). As the activation function for every layer except the prediction layer, the authors recommend Swish (SiLU).
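To make the dimension bookkeeping concrete, here is a small sketch of how the expanded latent vectors would replace the raw inputs of the SAC networks for HalfCheetah-v2. The actor and critic bodies shown here are generic placeholders of mine, not the authors' exact SAC implementation.

```python
import torch.nn as nn

# Dimension bookkeeping for HalfCheetah-v2: the actor consumes z_ot,
# the critic consumes z_ot,at.
obs_dim, act_dim = 26, 6
num_layers, hidden_units = 8, 30                       # 8 * 30 = 240 extra dims per block

actor_input_dim = obs_dim + num_layers * hidden_units  # 26 + 240 = 266
critic_input_dim = actor_input_dim + act_dim + num_layers * hidden_units  # 272 + 240 = 512

actor = nn.Sequential(nn.Linear(actor_input_dim, 256), nn.ReLU(),
                      nn.Linear(256, 2 * act_dim))     # outputs mean and log-std
critic = nn.Sequential(nn.Linear(critic_input_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1))              # outputs a Q-value
```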
Detailed experiments on how they found the best network architecture and activation function can be found directly in the paper.
Before the agent and OFENet are trained online in parallel, a dataset of random transitions is collected. OFENet is pre-trained on these transitions for a certain number of steps, and only then does the regular training scheme start.
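The overall schedule might look roughly like the outline below. All helpers passed into the function (buffer, update_ofenet, update_agent) are placeholders for the real replay buffer and gradient steps; the sketch only shows the ordering of pre-training and parallel training, not the authors' implementation.

```python
def train(env, ofenet, agent, buffer, update_ofenet, update_agent,
          pretrain_steps=10_000, total_steps=1_000_000):
    # 1) collect random transitions and pre-train OFENet on them
    obs = env.reset()
    for _ in range(pretrain_steps):
        action = env.action_space.sample()
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
    for _ in range(pretrain_steps):
        update_ofenet(ofenet, buffer.sample())        # next-observation prediction loss

    # 2) regular training: OFENet and the RL agent are updated in parallel
    obs = env.reset()
    for _ in range(total_steps):
        action = agent.act(ofenet.state_block(obs))   # agent acts on the expanded features
        next_obs, reward, done, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        update_ofenet(ofenet, buffer.sample())        # auxiliary prediction task
        update_agent(agent, ofenet, buffer.sample())  # RL loss on the expanded features
        obs = env.reset() if done else next_obs
```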
Performance
OFENet was tested on five MuJoCo tasks: Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2, and Humanoid-v2. To substantiate the claim that OFENet has a positive influence on RL algorithms, it was added to three different algorithms (SAC, TD3, and PPO) and also compared to one of the other methods mentioned above (Model Learning, ML).
As the results below show, especially on Walker2d-v2, Ant-v2, and Humanoid-v2, SAC (OFE) significantly outperforms the original SAC in both sample efficiency and final performance. Since TD3 (OFE) and PPO (OFE) also outperform their original counterparts, it can be concluded that OFENet is an effective way to improve deep RL algorithms across various benchmark tasks. OFENet also improves significantly on the other representation learning method, ML.