This is the pseudocode for the algorithm that we just talked about above:
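In case it is easier to read as plain text, it boils down to something like this (my condensed, single-network paraphrase of the classic DQN algorithm; “state_” means the next state and “done” flags the end of an episode):

```
Initialize replay memory D
Initialize action-value function Q with random weights
for each episode:
    initialize state s
    while the episode is not done:
        with probability epsilon pick a random action a,
            otherwise pick a = argmax_a Q(s, a)
        execute a, observe reward r, next state state_ and done
        store (s, a, r, state_, done) in D
        sample a random minibatch of transitions from D
        for each transition, set the target
            y = r                                    if done
            y = r + gamma * max_a' Q(state_, a')     otherwise
        take a gradient descent step on (y - Q(s, a))^2
        s = state_
```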
So let’s start coding! The entire code, ready to run, will be at the end of this article by the way 🙂
To start our DQN, we first need the network itself, which will approximate the action-value function Q(s, a). I’m hoping you read the Gym documentation from earlier, because this part should then be relatively straightforward: most of it is just a standard neural network in PyTorch. This addresses the “initialize action-value function with random weights” part of the pseudocode.
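Here is a minimal sketch of what that network could look like. The class name `DQN`, the two hidden layers of 128 units, and the Adam/MSE choices are my own illustrative picks, not necessarily the exact architecture in the full code at the end:

```python
import torch
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    """A small fully connected network that maps a state to one Q-value per action."""

    def __init__(self, lr, input_dims, n_actions):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, n_actions)
        self.optimizer = optim.Adam(self.parameters(), lr=lr)
        self.loss = nn.MSELoss()
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)  # raw Q-values, one per action
```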
Next, we are going to build the actual memory of our algorithm, the “initialize replay memory” step. This is where we save the state, action, reward, state_ and done info that we will later train on. If you are not familiar with the notation “state_”, it just means the next state. The replay buffer needs a way to add new memories and a way to sample them back out in batches.
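A simple array-based buffer along these lines would do the job. This is a sketch; the names `ReplayBuffer`, `store_transition` and `sample_buffer` are my own hypothetical choices:

```python
import numpy as np

class ReplayBuffer:
    """Fixed-size circular buffer of (state, action, reward, state_, done) transitions."""

    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((max_size, input_dims), dtype=np.float32)
        self.new_state_memory = np.zeros((max_size, input_dims), dtype=np.float32)
        self.action_memory = np.zeros(max_size, dtype=np.int64)
        self.reward_memory = np.zeros(max_size, dtype=np.float32)
        self.terminal_memory = np.zeros(max_size, dtype=np.bool_)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size  # overwrite the oldest memory once full
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = done
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        return (self.state_memory[batch], self.action_memory[batch],
                self.reward_memory[batch], self.new_state_memory[batch],
                self.terminal_memory[batch])
```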
Next up, we need to put these two parts together and build the agent itself. The agent chooses actions using the idea of exploration vs. exploitation: at the beginning it explores a lot and picks random actions, and as time goes on it relies more and more on what the neural network suggests. As for the learning step, we first sample a batch of memories from the replay buffer, then calculate the target using the first formula we talked about, and finally update the network by comparing that target to what we actually predicted.
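Sketched out, an agent that wires the two pieces together might look like this. It reuses the `DQN` and `ReplayBuffer` sketches above, and the hyperparameter names and defaults are illustrative assumptions, not the article’s exact values:

```python
import numpy as np
import torch

class Agent:
    """Combines the Q-network and replay buffer: epsilon-greedy action selection plus learning."""

    def __init__(self, lr, gamma, input_dims, n_actions, batch_size,
                 eps_start=1.0, eps_end=0.01, eps_dec=1e-4, mem_size=100000):
        self.gamma = gamma
        self.epsilon = eps_start
        self.eps_end = eps_end
        self.eps_dec = eps_dec
        self.batch_size = batch_size
        self.n_actions = n_actions
        self.q_net = DQN(lr, input_dims, n_actions)
        self.memory = ReplayBuffer(mem_size, input_dims)

    def choose_action(self, observation):
        # Exploration vs. exploitation: random action with probability epsilon.
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        state = torch.tensor(observation, dtype=torch.float32,
                             device=self.q_net.device).unsqueeze(0)
        return self.q_net(state).argmax(dim=1).item()

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return  # not enough memories to sample a full batch yet

        states, actions, rewards, states_, dones = self.memory.sample_buffer(self.batch_size)
        device = self.q_net.device
        states = torch.tensor(states, device=device)
        actions = torch.tensor(actions, device=device)
        rewards = torch.tensor(rewards, device=device)
        states_ = torch.tensor(states_, device=device)
        dones = torch.tensor(dones, device=device)

        # Q(s, a) the network currently predicts for the actions we actually took.
        q_pred = self.q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Bellman target: r + gamma * max_a' Q(state_, a'), with the bootstrap zeroed on terminal states.
        q_next = self.q_net(states_).max(dim=1)[0]
        q_next[dones] = 0.0
        q_target = rewards + self.gamma * q_next

        loss = self.q_net.loss(q_pred, q_target.detach())
        self.q_net.optimizer.zero_grad()
        loss.backward()
        self.q_net.optimizer.step()

        # Slowly shift from exploration to exploitation.
        self.epsilon = max(self.eps_end, self.epsilon - self.eps_dec)
```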
Now for the main loop, where the agent actually learns. This part is relatively straightforward: we choose an action, receive the resulting state and reward from the environment, save that transition, and then learn from it.
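A training loop along these lines ties everything together. This sketch assumes CartPole-v0 and the older Gym API, where `env.step` returns four values and `env.reset` returns just the observation (newer Gymnasium versions return five and `(obs, info)` respectively); the hyperparameters are illustrative:

```python
import gym
import numpy as np

env = gym.make('CartPole-v0')
agent = Agent(lr=1e-3, gamma=0.99, input_dims=4, n_actions=2, batch_size=64)
scores = []

for episode in range(1000):
    state = env.reset()          # older Gym API: reset() returns only the observation
    done = False
    score = 0
    while not done:
        action = agent.choose_action(state)
        state_, reward, done, info = env.step(action)   # older Gym API: four return values
        agent.memory.store_transition(state, action, reward, state_, done)
        agent.learn()
        state = state_
        score += reward
    scores.append(score)
    if (episode + 1) % 100 == 0:
        print(f'episode {episode + 1}, average score over last 100: {np.mean(scores[-100:]):.1f}')
```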
And that’s it! You’re done! Here’s the entire piece of code:
This is what my graph looked like after one thousand episodes of training:
The y-axis shows the average reward and the x-axis shows the episode number. CartPole-v0 is considered solved once you average a reward of at least 195 over 100 consecutive episodes. So, congrats! You have solved CartPole!