Some Quick Terminology
- Agent: The learner and decision-maker, i.e. the algorithm that we are trying to create
- Environment: Everything the agent interacts with. It defines the states, the available actions, and the rewards
- Actions: All the possible things that an Agent can do in an environment
- Reward: A signal that tells the Agent whether what it is doing is good or bad
- Policy: How the agent determines which action it will take in a given state. Usually, the agent’s goal is to optimize this policy so that it collects as much reward as possible.
Before I get into the details, I would like to note that I got a lot of my information from David Silver’s lectures on YouTube. If you are really interested in getting into Reinforcement Learning, you can check those out here!
What Even Is a Markov?
The Markov Processes are named after the Russian mathematician Andrey Markov. He was born in 1856 and is best known for his work on stochastic processes, which was later named after him: Markov Processes and Markov Chains. A lot of Markov, I know 🙂
There are four related Markov concepts to cover.
- The Markov Property
- The Markov Process
- The Markov Reward Process
- The Markov Decision Process
Each of these builds on the previous one, working toward a model of a problem that can be solved using Reinforcement Learning.
Let Us Begin With The Markov Property
The Markov property states:
“The future is independent of the past given the present”
This means that the next state St+1 relies only on the information in the current state St. Once the current state is known, all information from previous states can be thrown away.
The Markov Process
A Markov Process is essentially a set of states, each satisfying the Markov property, linked together by transition probabilities that give the chance of moving from each state to every other state.
Above is an example of a Markov process with six different states; you can also see a transition matrix that holds all the probabilities of going from one state to another.
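As a minimal sketch (using a made-up three-state chain instead of the six-state one in the figure, to keep the matrix small), a Markov process is just a transition matrix you can sample from:

```python
import numpy as np

# Hypothetical 3-state Markov process. Row i of P holds the probabilities
# of transitioning from state i to each state, so every row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

def sample_episode(P, start, steps, rng):
    """Walk the chain: the next state depends only on the current one."""
    states = [start]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

rng = np.random.default_rng(0)
print(sample_episode(P, start=0, steps=5, rng=rng))
```

Because of the Markov property, `sample_episode` never needs to look at anything but the latest state.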
Now Let’s Add the Rewards (Markov Reward Process)
The Markov Reward Process (MRP) is a Markov Process with added rewards. Simple, right?
An MRP is a tuple (S, P, R, 𝛾):
- S: Represents the state space
- P: Represents the transition probabilities
- R: Represents the immediate reward
- 𝛾: Represents a new variable called the discount factor (or Gamma). This parameter is set between 0 and 1 and controls how much the agent values future rewards compared to immediate ones.
Long Term Rewards
As I just mentioned, the agent has a parameter that controls how much it weighs future rewards, which implies that the agent is somehow accounting for rewards it has not received yet.
The immediate reward and the discounted future rewards are combined into a single return variable called G. The goal of the Reinforcement Learning agent is to maximize this return.
Written out, the return is Gt = Rt+1 + 𝛾Rt+2 + 𝛾²Rt+3 + … Here Rt+1 is the immediate reward and 𝛾Rt+2 is the discounted reward one step later. This keeps going on to infinity, with Gamma being squared, then cubed, and so on at each further time step.
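The return can be computed in a few lines of Python. The reward values here are made up, and in practice the infinite sum is truncated to a finite list of observed rewards:

```python
# G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    # Walk backwards so each reward picks up one more factor of gamma
    # per step of distance from the present.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]   # R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```

Notice that with gamma close to 0 only the first reward matters, while with gamma close to 1 every future reward counts almost equally.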
State Value Function For MRP
The State Value function tells us the value of being in each state. This is determined by the long term reward you can achieve in each state. We can get this from the expected return (Gt) starting at a state (s).
Formally, v(s) = E[Gt | St = s]: the expected value E of the return Gt given that we start in state s (St = s).
The value function can then be rewritten in the form of the Bellman Equation.
Here the value function v(s) is equal to the expected return in state s (E[Gt | St = s]), and that is transformed into the Bellman Equation by expanding Gt into Rt+1 + 𝛾v(St+1), giving v(s) = E[Rt+1 + 𝛾v(St+1) | St = s].
This can all then be represented using matrices.
You have your value vector v, which is equal to the immediate reward vector R at each state plus the discounted values of the next states. The next-state values are multiplied by the transition matrix P, which holds the probability of reaching each state: v = R + 𝛾Pv.
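For a small state space, the matrix form can be solved directly: rearranging v = R + 𝛾Pv gives (I - 𝛾P)v = R. A sketch with an invented three-state MRP:

```python
import numpy as np

# Hypothetical 3-state MRP. State 2 is absorbing with zero reward.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = np.array([-1.0, -2.0, 0.0])  # immediate reward in each state
gamma = 0.9

# Solve (I - gamma*P) v = R instead of inverting the matrix explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```

This direct solve scales cubically with the number of states, which is why large problems fall back on iterative methods instead.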
Example Of Markov Reward Process
Here you can see that there is a value for each individual state. Because Gamma is set to 0 in this example, each state’s value depends only on the immediate reward it would receive after transitioning out of that state.
So in state 1 the state value is -2 because, whether it moves to state 2 or state 3, it receives a reward of -2.
Finally, the Markov Decision Process
Now, this is mainly the same as the MRP; however, we are adding actions!
The tuple containing all of the values now looks like this: (S, A, P, R, 𝛾)
- A: This represents the actions that you can take from each state
- P: The transition probabilities also change a little, because they now give the probability of moving from one state to another given that you take action A
- R: The reward function can also now depend on what action you take.
Everything else is the same as before 🙂
With the addition of actions, we also now make use of a policy. The policy tells the agent which action to take based on the state it is in. This is in contrast to an MRP, where transitions simply happen according to fixed probabilities and the agent makes no choices.
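A policy can be as simple as a table. A minimal sketch (the states, actions, and probabilities here are all invented for illustration):

```python
import numpy as np

# Deterministic policy: one fixed action per state.
deterministic_policy = {0: "left", 1: "right"}

# Stochastic policy: a probability over actions in each state.
stochastic_policy = {
    0: {"left": 0.8, "right": 0.2},
    1: {"left": 0.1, "right": 0.9},
}

def act(policy, state, rng):
    """Sample an action from a stochastic policy in the given state."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choice(actions, p=probs)

rng = np.random.default_rng(0)
print(act(stochastic_policy, 0, rng))
```

Optimizing the policy then means adjusting these probabilities (or the state-to-action mapping) so the expected return goes up.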
In comes the Value Functions
Before, the value function only depended on the expected return in a certain state. With the addition of a policy and actions, the “state-value function” now represents each state’s value when following a certain policy.
In MDPs, we also define a new value function called the Action-Value Function (Q). It represents the value of taking action a in state s and then following the policy, i.e. the expected return Gt from that point on.
Turn It Into Bellman!
Like in the MRP, we can rewrite the state value function into the Bellman expectation equation.
It is decomposed into two different parts.
- The immediate reward
- The discounted value of the next state
Here you can see that the value of being in state s according to the policy (vπ(s)) is equal to the expected value of the immediate reward (Rt+1) plus the discounted value of the next state (𝛾vπ(St+1)).
This can also be done for the Q-value.
The value of taking a certain action in a state while following the policy (qπ(s, a)) is equal to the expected immediate reward (Rt+1) plus the discounted value of the action that would be picked in the next state according to the policy (𝛾qπ(St+1, At+1)).
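The two equations feed each other: qπ gives the value of each action, and averaging over the policy's action probabilities gives vπ back. Repeating that update is iterative policy evaluation. A sketch with a made-up two-state, two-action MDP (all numbers invented):

```python
import numpy as np

n_states, n_actions = 2, 2
P = np.array([              # P[s, a, s'] = probability of s -> s' under action a
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.0, 1.0], [0.5, 0.5]],
])
R = np.array([              # R[s, a] = expected immediate reward
    [1.0, 0.0],
    [0.0, 2.0],
])
pi = np.array([             # pi[s, a] = probability the policy picks a in s
    [0.5, 0.5],
    [0.2, 0.8],
])
gamma = 0.9

v = np.zeros(n_states)
for _ in range(500):
    q = R + gamma * P @ v       # Bellman backup: q_pi(s, a)
    v = (pi * q).sum(axis=1)    # average q over the policy's actions: v_pi(s)
print(v)
```

Because gamma is below 1, each sweep shrinks the error, so v settles on the unique values satisfying the Bellman expectation equation for this policy.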