Author: András Béres
Recently, the topic of self-driving cars has received great attention from both academia and the public. While Deep Learning provides us with tools for processing vast amounts of sensor data, Reinforcement Learning promises the ability to take the right actions in complex interactive environments. Using these tools could be one way to solve the self-driving task; however, some problems make the real-world application of these techniques difficult.
Since letting AI agents collect large amounts of experience by interacting with the real world is usually too expensive, and sometimes even dangerous, we typically train the agents in a simulator and then transfer them to the real world. During my work, I used the Duckietown self-driving platform to train lane-following agents in simulation with reinforcement learning and transferred them to the real world.
Duckietown is an open-source self-driving platform that consists of several main parts. One of these is the Duckiebot, a small autonomy-capable vehicle that is controlled by a Raspberry Pi and equipped with a single camera. Duckiebots are differential-drive vehicles: instead of using a servo motor for steering, they have an independent motor on each side, and they turn by driving the two motors at different speeds.
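As a rough illustration of how differential drive works, the sketch below converts a desired forward speed and turning rate into the two wheel speeds; the wheel-separation value is a made-up placeholder, not the actual Duckiebot geometry.

```python
def differential_drive(linear_speed, angular_speed, wheel_separation=0.1):
    """Convert a desired forward speed (m/s) and turning rate (rad/s) into
    left/right wheel speeds for a differential-drive robot.

    wheel_separation is a hypothetical value, not the real Duckiebot spec.
    """
    left = linear_speed - angular_speed * wheel_separation / 2.0
    right = linear_speed + angular_speed * wheel_separation / 2.0
    return left, right


# Turning in place: equal and opposite wheel speeds.
print(differential_drive(0.0, 1.0))   # (-0.05, 0.05)
# Driving straight: both wheels at the same speed.
print(differential_drive(0.5, 0.0))   # (0.5, 0.5)
```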
Another part of the system is the Duckietown itself, a small-scale, real, physical driving environment in which the Duckiebots can drive, so their performance can be evaluated in the real world.
The last main part is the Duckietown Gym, a self-driving car simulator that implements the OpenAI Gym interface. The simulator contains multiple maps that provide tasks such as lane following (sometimes with other vehicles), navigation at junctions, and pedestrian (duckie) and obstacle avoidance.
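For readers unfamiliar with the OpenAI Gym interface, a minimal interaction loop with the simulator looks roughly like the sketch below; the environment id and the random placeholder policy are assumptions for illustration.

```python
import gym
import gym_duckietown  # assumed import that registers the Duckietown environments

# "Duckietown-loop_empty-v0" is an assumed environment id; gym-duckietown
# ships several maps for lane following, intersections and obstacle avoidance.
env = gym.make("Duckietown-loop_empty-v0")

obs = env.reset()
done = False
while not done:
    # A random policy as a placeholder for the trained RL agent.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
```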
Since our simulators are only imperfect models of reality, the performance of agents is usually reduced after the transfer. This problem is called the Sim-to-Real Gap. One can use Domain Randomization and perturb certain parts of the simulation during training, thereby forcing the agent to be robust against changes in its environment. This makes a successful transfer more probable, usually at the cost of requiring more training samples.
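As a toy illustration of visual domain randomization (the simulator also ships with its own randomization options), one could wrap the environment and perturb the colors of every observation; the jitter ranges below are arbitrary, not the ones used in the actual experiments.

```python
import numpy as np
import gym


class ColorJitterWrapper(gym.ObservationWrapper):
    """Randomly perturb the brightness and per-channel color of each image
    observation, as a simple form of visual domain randomization."""

    def observation(self, obs):
        obs = obs.astype(np.float32)
        brightness = np.random.uniform(0.7, 1.3)             # global brightness
        channel_gain = np.random.uniform(0.8, 1.2, size=3)   # per-channel tint
        obs = obs * brightness * channel_gain
        return np.clip(obs, 0, 255).astype(np.uint8)
```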
A technique that can improve both the speed and the performance of reinforcement learning is Representation Learning. This means that, with the help of either pretraining or a secondary task, we train a feature extractor or encoder that compresses the input images into lower-dimensional, meaningful representations. These encoders are generally implemented as neural networks that downsample their input. By using these representations for reinforcement learning instead of raw images, we make the observation space lower-dimensional, which makes that part of the training easier.
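A typical downsampling encoder might look like the PyTorch sketch below; the layer sizes, the input resolution and the 32-dimensional output are illustrative assumptions, not the exact architecture I used.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """A small convolutional encoder that downsamples an 84x84 RGB image
    into a low-dimensional feature vector. Layer sizes are illustrative."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),    # 84 -> 41
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),   # 41 -> 19
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),  # 19 -> 8
            nn.Flatten(),
        )
        self.fc = nn.Linear(128 * 8 * 8, latent_dim)

    def forward(self, x):
        return self.fc(self.conv(x))
```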
I experimented with two methods for representation learning. One is supervised representation learning, in which we use expert (human) knowledge to tell the encoder exactly what representation to learn. The other is unsupervised representation learning, where the encoder is free to learn any representation that fits its pretraining task.
For supervised representation learning, I chose the following representation: each input image frame is encoded into a tuple of three physical quantities, namely the signed distance of the vehicle from the middle of the lane, the signed angle between the vehicle and the lane, and the curvature of the track in front of the vehicle. I pretrained my feature extractor with regression to predict these quantities for each frame and used the representations of triples of frames as the input of the reinforcement learning agent.
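A possible pretraining step for this regression task, assuming the Encoder class from the sketch above, could look like this; the hyperparameters and shapes are illustrative.

```python
import torch
import torch.nn as nn

# Reusing the Encoder class from the sketch above; the 3 outputs correspond
# to the signed lane distance, the signed angle and the track curvature.
encoder = Encoder(latent_dim=32)
head = nn.Linear(32, 3)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-3)


def pretrain_step(images, targets):
    """One regression step: images is a (B, 3, 84, 84) tensor, targets is a
    (B, 3) tensor of ground-truth (distance, angle, curvature) values."""
    predictions = head(encoder(images))
    loss = nn.functional.mse_loss(predictions, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```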
I generated an offline dataset of 200,000 images with their corresponding ground-truth physical quantities and used it for pretraining. After that, I froze the encoder and used it as an observation wrapper in the Duckietown Gym. Using these representations helped me reach a higher final performance compared to end-to-end reinforcement learning.
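One possible way to plug a frozen encoder into the environment is an observation wrapper along the following lines; the stack size, shapes and preprocessing are assumptions for illustration, not the exact setup I used.

```python
from collections import deque

import gym
import numpy as np
import torch


class EncoderWrapper(gym.ObservationWrapper):
    """Encode each frame with a frozen encoder and stack the representations
    of the last three frames as the RL agent's observation. Assumes frames
    are already resized to the encoder's input resolution."""

    def __init__(self, env, encoder, latent_dim=32, stack=3):
        super().__init__(env)
        self.encoder = encoder.eval()   # frozen: no gradient updates
        self.frames = deque(maxlen=stack)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(stack * latent_dim,), dtype=np.float32)

    def reset(self, **kwargs):
        self.frames.clear()
        return super().reset(**kwargs)

    def observation(self, obs):
        with torch.no_grad():
            x = torch.from_numpy(obs).float().permute(2, 0, 1).unsqueeze(0) / 255.0
            z = self.encoder(x).squeeze(0).numpy()
        self.frames.append(z)
        while len(self.frames) < self.frames.maxlen:
            self.frames.append(z)       # pad at the start of an episode
        return np.concatenate(self.frames)
```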
In some cases, we do not know what the optimal representations are for solving a specific problem, or labeling the data would be too difficult or expensive. In these cases, one can use unsupervised learning to learn the representations.
One popular method, especially in reinforcement learning, is to use a Variational Autoencoder (VAE) to compress the input images into latent representations. Simply put, VAEs encode their inputs into a latent distribution and try to reconstruct them using a sample from this distribution. The distributions are regularized during training to enforce a smooth and compact latent space, which is useful for reinforcement learning.
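Concretely, a VAE is trained with a reconstruction term plus a KL-divergence term that regularizes the latent distribution towards a standard normal prior; a minimal sketch of the loss and of the reparameterization trick could look like this.

```python
import torch
import torch.nn.functional as F


def vae_loss(decoder_output, target, mu, log_var, beta=1.0):
    """Standard VAE objective: reconstruction error plus the KL divergence
    between the approximate posterior N(mu, sigma^2) and the prior N(0, 1).
    beta weights the regularization term."""
    reconstruction = F.mse_loss(decoder_output, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction + beta * kl


def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) in a differentiable way."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)
```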
The method I used combines VAEs with Visual Domain Randomization in an efficient way. My idea was to reconstruct the non-randomized versions of the randomized input images. This made the pretraining task harder and also ensured that the model did not waste latent capacity on the ever-changing visuals of the objects that are altered during domain randomization. This means that the method can be seen as a combination of VAEs and Denoising Autoencoders.
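The only change compared to a plain VAE training step is the reconstruction target: the encoder sees the domain-randomized frame, while the decoder is asked to reconstruct the corresponding non-randomized frame. A sketch, assuming hypothetical vae_encoder and vae_decoder modules and the helper functions from the previous sketch:

```python
# Training step for the "denoising" VAE: the encoder sees the randomized
# frame, but the reconstruction target is the non-randomized version of the
# same frame. Assumes `vae_encoder` (returning mu, log_var), `vae_decoder`
# (returning an image), and `vae_loss` / `reparameterize` from above.
def train_step(randomized_images, canonical_images, optimizer):
    mu, log_var = vae_encoder(randomized_images)
    z = reparameterize(mu, log_var)
    reconstruction = vae_decoder(z)
    loss = vae_loss(reconstruction, canonical_images, mu, log_var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```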
With the unsupervised representation learning technique, I participated in the Student Research Conference of the Budapest University of Technology and Economics and was awarded 1st prize in the Neural Networks section.
With the supervised representation learning technique, I participated in the 5th AI Driving Olympics, where, in the lane-following challenge, the method achieved the 2nd highest performance in simulation and the 1st in the real-world evaluation.