Upon reading the ARS paper, I was immediately compelled to consider how this approach to reinforcement learning might apply to autonomous navigation. Sentdex had provided a framework for gathering RGB camera sensor data from Carla to train a Deep Q Network (DQN), and I saw an opportunity to use his car environment design to test the ARS algorithm from the 2018 study on the same task. I had discovered ARS during a YouTube exploration of DQNs, when I came across an informative video series on reinforcement learning by Skowster the Geek (Colin Skow), who included a short video about ARS because of its relevance to the topic. Captivated by the simplicity of the algorithm, I decided to look into it further, as I was beginning to realize that my hardware might not be capable of achieving meaningful results training a DQN in Carla, having seen the lackluster results Sentdex achieved using far superior hardware over multiple days.
Skow’s video course on RL has a companion GitHub repository with coding examples for all of the material covered, including a framework for running ARS on environments from the Python gym module. First, I tested this code on the BipedalWalker environment to witness its efficacy for myself.
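To give a sense of how little machinery is involved, here is a minimal sketch of the core ARS update loop on a gym environment. This is a simplified illustration rather than Skow’s actual code: the hyperparameter values are placeholders, the environment id may be ‘BipedalWalker-v2’ on older gym versions, and it assumes the older gym API in which reset() returns an observation and step() returns four values.

```python
import gym
import numpy as np

# Environment and a single-layer (linear) policy.
env = gym.make("BipedalWalker-v3")
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]
theta = np.zeros((act_dim, obs_dim))

lr, noise_std, n_deltas, n_best = 0.02, 0.03, 16, 8   # placeholder hyperparameters

def rollout(weights, max_steps=1600):
    """Run one episode with the given weights and return its total reward."""
    state, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = np.clip(weights.dot(state), -1.0, 1.0)
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total

for step in range(1000):
    # Evaluate each random delta in both directions (+ and -).
    deltas = [np.random.randn(*theta.shape) for _ in range(n_deltas)]
    pos = [rollout(theta + noise_std * d) for d in deltas]
    neg = [rollout(theta - noise_std * d) for d in deltas]
    # Keep only the best-performing directions, scale by the reward std deviation.
    best = sorted(range(n_deltas), key=lambda i: max(pos[i], neg[i]), reverse=True)[:n_best]
    sigma_r = np.std([pos[i] for i in best] + [neg[i] for i in best]) + 1e-8
    update = sum((pos[i] - neg[i]) * deltas[i] for i in best)
    theta += lr / (n_best * sigma_r) * update
```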
Below, we can see the curve of episode rewards over training steps for this test. For the first 600 training steps the reward curve is basically flat, but eventually the random deltas start to build the policy in all the right places, creating a steep climb in rewards that levels out around the 800th episode, with occasional mistakes along the way that also disappear around that point. Below, we can see the resulting behavior of this trained policy on the BipedalWalker environment:
When I saw that the algorithm was able to achieve impressive results with computational ease over a reasonable number of episodes, I decided to see if I could splice Sentdex’s Carla vehicle environment into this framework. To improve the input data that the algorithm was trained on, I transformed the raw RGB camera data into a more generalized feature representation by first passing the camera frames through a pretrained Convolutional Neural Network (CNN) called VGG19 (available in the TensorFlow/Keras package) on their way into the ARS algorithm. For this study, the ‘imagenet’ weights of the VGG19 were used. This is an example of Transfer Learning, where we take advantage of the bottom layers of a neural network previously trained on massive amounts of image data in order to apply its generalized edge detection to a different problem. Since we don’t need to train any of these layers, we can simply use the output of their predictions as the input to our single-layer ARS perceptron.
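A rough sketch of this transfer-learning step is shown below. It is an illustration under assumptions rather than the project’s exact code: it drops the VGG19 classification head and flattens the remaining features, and the function name frame_to_observation is hypothetical. Which layer the features are taken from determines the scale of the outputs discussed next.

```python
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input

# Load VGG19 once with the 'imagenet' weights; its layers are never trained here.
feature_extractor = VGG19(weights="imagenet", include_top=False,
                          input_shape=(224, 224, 3))

def frame_to_observation(frame):
    """Turn one 224x224x3 RGB camera frame into a flat feature vector for ARS."""
    x = preprocess_input(frame.astype(np.float32)[np.newaxis, ...])
    features = feature_extractor.predict(x, verbose=0)
    return features.flatten()
```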
The inputs to ARS need to be normalized. Typically this is done by keeping running statistics of each input component and using them to apply a mean/standard deviation filter to future inputs, which lets the algorithm build appropriate normalizing distributions over time as it experiences more and more states, without needing prior knowledge of those distributions. For this study, since the VGG19 prediction outputs always fell on a scale from 0 to 10, this running-statistics method was replaced with simply dividing the inputs by 10. Further research testing the standard filtration method in this context may be warranted, but it seems likely to cause an issue by treating unseen edge cases as extreme outliers when they occur after a considerable number of states have been observed. The scale of the inputs also affects the scale of the weights, so adjustments to this part of the process change which learning rate and delta standard deviation are most appropriate to facilitate learning. This is one reason why normalization of the inputs is so important.
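The two approaches look roughly like this. The divide-by-10 scaling is what was actually used; the RunningStat class is a generic Welford-style sketch of the usual ARS filter, not the repository’s exact implementation.

```python
import numpy as np

# Fixed-scale normalization used in this study: the VGG19 outputs fall roughly
# in [0, 10], so dividing by 10 maps them into [0, 1].
def normalize_fixed(obs):
    return np.asarray(obs) / 10.0

# The usual ARS alternative: a mean/std filter built from running statistics
# that improves as more states are observed.
class RunningStat:
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)   # sum of squared deviations (Welford's method)

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (x - self.mean) / std
```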
The Sentdex vehicle environment needed to be modified to use a continuous action space with continuous control values, and the reward system needed to be adjusted to work more effectively. When self-driving cars are penalized for collisions, they tend to learn to drive in circles to avoid them, so close attention was paid to finding ways to punish extreme or consistent directional steering and to reward moving in straight lines at speed. This took some careful consideration. Since collisions were being penalized, the rewards for speed and straight lines needed to be high enough to counterbalance the abundance of collision penalties sure to be encountered when a car moves faster and turns less; otherwise those penalties would discourage the agent from ever reaching these goals. At the same time, the collision penalty must be high enough if the hope is that the agent will eventually learn not to hit things. It can logically be expected that the edge cases which must be learned to avoid collisions at speed will only be experienced when the workers move about more liberally; in other words, the agent must first learn to move before it can find patterns to avoid while moving. All that being said, without results to compare different reward/punishment systems, this took a bit of guesswork.
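To make the balance concrete, a per-frame reward along these lines captures the idea. The constants are placeholders chosen for readability, not the values actually used, which live in the CarEnv class in the repository.

```python
def frame_reward(speed_kmh, collided):
    """Illustrative per-frame reward balancing collisions, speed, and idling."""
    reward = 0.0
    if collided:
        reward -= 100.0              # collisions end the episode with a penalty
    if speed_kmh < 1.0:
        reward -= 1.0                # sitting still is penalized
    else:
        reward += speed_kmh / 10.0   # moving at speed is rewarded
    return reward
```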
The predictions for throttle, steering, and brake coming from the perceptron were clipped to fit their respective Carla control ranges (0 to +1, -1 to +1, and 0 to +1), which meant the predicted values themselves could fall outside those ranges. For throttle and brake, this issue was ignored. Steering, however, is generally better done with nuance, so each frame was penalized by the absolute value of that frame’s steering control value averaged with the absolute value of the mean steering control value across all frames of the current episode. This way the model learns to prefer keeping the steering control close to zero, since extreme values add up to a very negative score for an episode. Sitting stationary was also penalized, to make sure the cars would not learn to avoid penalties by not moving.
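In code, the clipping and the steering penalty could look something like this sketch (the function and variable names are illustrative, not the repository’s):

```python
import numpy as np
import carla

def apply_action(vehicle, raw_action, steer_history):
    """Clip the perceptron's raw outputs into Carla's control ranges and
    return the steering penalty for this frame."""
    throttle, steer, brake = raw_action
    steer = float(np.clip(steer, -1.0, 1.0))
    vehicle.apply_control(carla.VehicleControl(
        throttle=float(np.clip(throttle, 0.0, 1.0)),
        steer=steer,
        brake=float(np.clip(brake, 0.0, 1.0)),
    ))
    # Penalty: this frame's |steer| averaged with the episode's mean steer value,
    # so extreme or persistent steering accumulates into a very negative score.
    steer_history.append(steer)
    return (abs(steer) + abs(np.mean(steer_history))) / 2.0
```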
The RGB camera was adjusted to capture images at a resolution of 224×224 pixels, since this is the image size that the VGG19 CNN was trained on.
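Setting this up through the Carla Python API looks roughly like the sketch below. The vehicle model, camera placement, and field of view are illustrative choices, and process_frame stands in for whatever callback hands each frame to the VGG19 step.

```python
import carla

# Connect to a locally running Carla server (host/port are the defaults).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprint_library = world.get_blueprint_library()

# Spawn a vehicle to attach the camera to (model choice is illustrative).
vehicle_bp = blueprint_library.filter("model3")[0]
vehicle = world.spawn_actor(vehicle_bp, world.get_map().get_spawn_points()[0])

# RGB camera sized to VGG19's expected 224x224 input.
camera_bp = blueprint_library.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "224")
camera_bp.set_attribute("image_size_y", "224")
camera_bp.set_attribute("fov", "110")
camera_transform = carla.Transform(carla.Location(x=2.5, z=0.7))
camera = world.spawn_actor(camera_bp, camera_transform, attach_to=vehicle)
camera.listen(lambda image: process_frame(image))  # process_frame: hypothetical callback
```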
Below, we can see an example of an image of the Carla world seen through one of these cameras. Now that we have an idea of what a worker takes in as input, we can consider this simplified visual depiction of the data flow through each worker on its way from the camera, through the policy, and to the controls:
After I successfully spliced the Sentdex vehicle environment into Skow’s ARS framework, I realized that training would still be woefully time consuming, because the ARS process has to perform a number of episodes (twice the number of deltas being tested) before it can make each adjustment to the policy. Since each of those episodes lasts some number of seconds (in this case 15) when training a Carla vehicle agent, it would be best to perform these episodes in parallel and pool the results from multiple workers at each update step. Luckily, the authors of the ARS paper provide a GitHub repository of their own, which includes a framework to reproduce their research of training MuJoCo agents with ARS in parallel using a Ray server.
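Greatly simplified, the parallel rollout pattern looks like the sketch below. It is not the ARS repository’s actual code; the Worker actor here runs a placeholder computation where the real code would run a full Carla episode and return its score.

```python
import ray
import numpy as np

ray.init()  # on a cluster, ray.init(address=...) would join an existing head node

@ray.remote
class Worker:
    """Each worker evaluates one +delta and one -delta rollout per update step."""
    def __init__(self, seed):
        self.rng = np.random.RandomState(seed)

    def rollout(self, weights, delta, std):
        # Placeholder: the real code runs a Carla episode with each perturbed
        # policy and returns its average reward per step.
        plus = float(np.sum(weights + std * delta))
        minus = float(np.sum(weights - std * delta))
        return plus, minus

theta = np.zeros(10)
workers = [Worker.remote(seed) for seed in range(4)]
deltas = [np.random.randn(10) for _ in workers]
# All four rollout pairs run in parallel; ray.get blocks until they finish.
results = ray.get([w.rollout.remote(theta, d, 0.03) for w, d in zip(workers, deltas)])
```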
This introduced me to the Ray module, and, like some other advanced packages, I struggled a bit getting it to work on my machine. Ray uses a program called Redis under the hood to coordinate cluster computing at local or network scale. Becoming familiar with it will allow one to scale their research once they achieve results which can justify the expense of employing more computational resources. Fortunately, Ray will pull all the necessary levers in Redis for you, but only when you use it properly. As a Windows 10 user, I found that launching the Ray server in a non-administrator PowerShell window (a terminal run as administrator will NOT work), then running the ars.py file contained in the code folder of the ARS repository (as demonstrated in the README.md file there) in a SEPARATE non-administrator PowerShell window did the trick. Each user will need to set parameters for the Ray server and the ars.py execution that suit the number of CPUs, GPUs, and amount of RAM available on their machine. Detailed instructions on running this process with the Carla ARS agent can be found in the project repository.
Once I was able to train the same BipedalWalker environment using the code in the ARS repository, I knew it was time to splice the Sentdex vehicle environment into this code. This took some doing, as the code in the ARS repo is reasonably complex and was engineered to work specifically with gym environments. It was also not designed to take in a previously saved policy and pick up training where it left off, which I wanted as an option for a task sure to take days’ worth of training that one may wish to periodically interrupt, or recover in the event of an error. This gave me a great opportunity to dig into the nuts and bolts of coding with Ray, and I look forward to training agents on large clusters in the future with that knowledge. On my gaming laptop, I was able to get a local cluster running that could handle 4 car workers in the Carla server at once, which was much better than running one at a time. The resulting code, which trains these Carla vehicle environments using ARS in parallel, can be found in the ARS_Carla folder of the project repository. It is a Frankenstein-esque combination of Sentdex’s CarEnv for Carla, the code from the ARS repository, and my own modifications/augmentations needed for this task. In the next few paragraphs, I will quickly summarize some of my modifications.
As discussed above, the ARS process usually normalizes inputs by keeping running statistics of the mean and standard deviation of the observation space components, so that it can normalize inputs effectively as more and more states are observed, without requiring prior knowledge of the input distributions. This functionality was not necessary for this task, since the output values of the VGG19 CNN are already on a known, shared scale between 0 and 10, so it was only necessary to divide the arrays by 10. Nevertheless, I wished to preserve the functionality for later comparison, so I compartmentalized it and wrapped it in a boolean parameter called ‘state_filter’ which can be passed when the code is run.
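In effect, the observation path branches on that flag, roughly as follows (the parameter name is the project’s; the surrounding code is a sketch that reuses the RunningStat idea from earlier):

```python
import numpy as np

def process_observation(obs, state_filter=False, running_stat=None):
    """Normalize an observation either with running statistics or a fixed scale."""
    if state_filter and running_stat is not None:
        running_stat.push(obs)              # update the running mean/std filter
        return running_stat.normalize(obs)
    return np.asarray(obs) / 10.0           # VGG19 outputs lie roughly in [0, 10]
```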
I also added optional functionality to pass in a pre-existing policy through the new ‘policy_file’ parameter, which takes the location of a .csv or .npz file containing weights and, in the case of the .npz file, possibly the information needed to initialize the observation distributions of the state filter, if the policy was trained with one (in the same format that the .npz files are saved during logging steps, of course!). This allows one to pick up training where they left off at a later time. Further, I built in two additional parameters, ‘lr_decay’ and ‘std_decay’, which reduce the learning rate and the size of the random perturbations applied to the weights over time, allowing for more exploration early on and favoring smaller learning steps once the agent has some training under its belt. Another parameter, ‘show_cam’, accepts an integer value that determines how many worker cameras are made visible during training. For long training sessions, I recommend setting this to zero to save CPU overhead, and turning it on later to watch the performance of a previously trained policy from a first-person perspective. It is always possible to watch the workers from an aerial perspective in the Carla server window, regardless of how many cameras are being shown.
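Resuming from a saved policy and applying the decays could look something like the sketch below. The .npz key name and the decay schedule are assumptions made for illustration, not necessarily the project’s exact format.

```python
import numpy as np

def load_policy(policy_file):
    """Load saved weights (and, for .npz, any stored filter statistics)."""
    if policy_file.endswith(".npz"):
        data = np.load(policy_file)
        weights = data["weights"]                        # assumed key name
        stats = {k: data[k] for k in data.files if k != "weights"}
        return weights, stats
    return np.loadtxt(policy_file, delimiter=","), {}    # plain .csv of weights

def decayed(value, enabled, rate=0.999):
    """Multiplicative decay applied once per update step when enabled."""
    return value * rate if enabled else value
```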
ARS usually works with the sum of the rewards over all steps in an episode (aka rollout). Since each step in our car environment is a frame seen by the camera, this creates something of a challenge: when the workers were made to report how many frames they saw each episode, it turned out that the frames per second any worker sees is inconsistent over time, depending on how well the CPU is performing at that instant. Simply summing the rewards per episode may therefore give an inaccurate picture of a worker’s performance during a rollout, since some workers might see more frames per second than others due to fluctuating computational performance, and therefore have more chances to log rewards or punishments. For this reason, the more appropriate measurement in this context was judged to be the average reward per step in an episode, calculated simply by taking the sum of rewards over all steps of the episode and dividing it by the number of steps the worker saw. This way, the scores are normalized to account for variable frame rates during training.
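In code, an episode’s score is simply its mean per-step reward (a trivial sketch with an illustrative function name):

```python
def episode_score(step_rewards):
    """Average reward per step, so a worker that happens to see more frames per
    second does not get extra chances to accumulate reward or penalty."""
    return sum(step_rewards) / max(len(step_rewards), 1)
```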
Episodes were ended as soon as a worker registered a collision. This created an interesting issue, since Carla drops the vehicles onto the map from a very small height (possibly to keep them from getting stuck in the pavement), and sometimes this creates a small shock to the vehicle which registers as a collision on the first frame of an episode, effectively terminating the episode immediately with a punishment recorded whenever this occurs. For this reason, I instructed the workers to disregard collisions on the first frame of an episode, which fixed the problem and gave each delta a fighting chance to demonstrate its value.
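The fix amounts to a single guard on the first frame, roughly like this (names are illustrative):

```python
def check_collision(collision_history, frame_index):
    """Return True if the episode should end due to a collision, ignoring any
    event registered on the first frame (the spawn drop can trigger one)."""
    if frame_index == 0:
        collision_history.clear()   # discard the spawn-drop "collision"
        return False
    return len(collision_history) > 0
```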
The interested reader can find all of the details of this methodology by reviewing the code and documentation in the repository. The reward system can be found within the ‘CarEnv’ class in the ars.py file. To run the code and train your own Carla ARS agent, go to the repository and follow the instructions in the ‘README.md’ file. Note that Carla is an absolute resource hog and will operate at whatever level it can squeeze out of your machine. This means that many users will need to limit the resources available to it in order to keep their motherboard from melting during multi-day training periods. The easiest way to do this (on PC) is to change the ‘Maximum Processor State’ in your advanced power settings to a level that keeps your CPU running at an acceptable temperature (no higher than 80 degrees C is recommended). The BEST way to handle this would be to allocate a desired amount of resources to a virtual machine and install Carla and the dependencies there, so that the rest of your resources remain available for work on the host machine. For this study, I used a gaming laptop with a not-too-shabby quad-core i7-7700HQ processor and NVIDIA GeForce GTX 1060 graphics card, and ran the training with ‘Maximum Processor State’ set to 83% to prevent overheating.