Markov Chain Explained

In this article I will explain and provide the python implementations of Markov chain. This article will not be a deep dive into the mathematics behind Markov chains, instead it will prioritize the conceptual understanding of how it works and how to implement it with python. I left resources I’ve used and other materials at the bottom of this article which goes into a deep dive in the mathematics behind Markov chains.

A Markov chain is a stochastic model created by Andrey Markov, which outlines the probability associated to a sequence of events occurring based on the state in the previous event. A very common and simple to understand model which is highly used in various industries which frequently deal with sequential data such as finance. The algorithm Google uses on its search engine to indicate which links to show first is called the Page Rank algorithm, it’s a type of Markov chain. Through mathematics, this model will use our observations to predict an approximation of future events!

The main goal of the Markov process is to identify the probability of transitioning from one state to another. This is one of the main appeals to Markov, that the future state of a stochastic variable is only dependent on its present state. An informal definition of a stochastic variable is described as a variable whose values depends on the outcomes of random occurrences.

As stated above, a Markov process is a stochastic process which has memoryless characteristics. The term “memorylessness” in mathematics is a property of probability distributions. It generally refers to scenarios when the time associated to a certain event occurring does not depend on how much time has already elapsed. In other words, when a model has a memoryless property, it implies that the model ‘forgets’ which state the system is in. Hence, probabilities would not be influenced by the previous states of the process. This property of memorylessness is the main characteristic of a Markov process. The predictions associated to a Markov process are conditional on its current state and is independent from past and future states.

This memorylessness attribute is both a blessing and a curse to the Markov model in application. Imagine the scenario that you wish to predict words / sentences based on previously entered text (similar to how google does for gmail). Well the benefit of using the Markov process to do this is that the newly generated predictions would not be dependent on something you wrote paragraphs ago. However, the downfall to this is that you won’t be able to predict text which has context in a previous state of the model. This is a common problem in NLP (natural language processing) and an issue many models face.

A Markov chain model is dependent on two key pieces of information:

Transition Matrix (denoted as P)— this NxN matrix represents the probability distribution of the state’s transitions. The sum of probabilities in each row of the matrix will be 1, implying that this is a stochastic matrix.
Note : A directed, connected graph can be converted into a transition matrix. Each element in the matrix would represent a probability weight associated to an edge connecting two nodes

This graph outlines the probability associated to move from one state to another. For example, there is a 60% chance to move from state B to state A

      +-----+-----+-----+          
|  A  |  B  |  C  |   - Represents the network above
+-----+-----+-----+-----+   - NxN transition matrix 
|  A  |  .2 |  .3 |  .5 |   - element hold probabilities
+-----+-----+-----+-----+   - row sum of probabilities = 1  
|  B  |  .6 |  0  |  .4 |   - .3 is the probability for state A 
+-----+-----+-----+-----+     to go to state B
|  C  |  .1 | .7  |  .2 |   - .7 is the probability for state C
+-----+-----+-----+-----+     to go to state B

Initial State Vector (denoted as S) — this Nx1 vector represents the probability distribution of starting at each of the N possible states. Every element in the vector represents the probability of beginning at that state.

Given these 2 dependencies, you can determine the initial state of the Markov chain by taking the product of P x I. In order to predict the probability of future states occurring you can raise your transition matrix P to the M’th power.

A very common application of Markov chains in data science is text prediction. It’s an area of NLP which is commonly used in the tech industry by companies like Google, LinkedIn and Instagram. When you’re writing emails, google predicts and suggests you words / phrases to autocomplete your email, when you receive messages on Instagram or LinkedIn, the app suggests some potential replies. These are applications of a Markov chain we will explore, albeit in reality the types of models these large scale companies are using in production for these features are definitely more complicated.

Suppose you had a large amount of text associated to some topic, you can imagine each sentence as a sequence of words in that corpus of text. Each word would then be it’s own state and you would associate the probability of moving from one state to another based on the available words you are connected too. This would allow you to transition from one state to another based on the probabilities associated to the transition matrix. This can be visualized below.

This sequence can be envisioned as a connected, directed network

For visualization purposes, this network above has a small amount of words in its corpus, when dealing with large amount of text like the Harry Potter series, this network would become very large and complicated. If you look at the beginning word Hello there are 3 potential other words / symbols following it Everyone , There with their associated probabilities. The simplest way that these probabilities are calculated is by frequency of the word in the corpus.

Hello --> ['Everyone', ',', 'Everyone', 'Everyone', 'There', 'There', 'There', There', There', ...]

If there were 20 words in the list above, where each word is stated after the word Hello then the probability of each word occurring would follow the following formula :

P(word) = Frequency of Word / Total Number of Words in ListP(Everyone) = 9 / 20
P(,) = 1 / 20
P(There) = 10 / 20

Your initial state vector would be associated to the probability of all the word you could’ve started your sentence with. In the example above since Hello is the only word to start with, the initial state vector would be a Nx1 vector with 100% of the probability associated to the word Hello. You can predict the future states through this Markov model. I’ll show you how to implement this in Python next.

I predicted sentences Shakespeare would write in a play. The data I used can be downloaded here. Upon downloading the dependencies, Shakespeare data & updating the file_path to where you saved it, this code can be ran on your computer as well. If you want to use a different text file to train the model on and predict, then just update the file_path to your specified location.

Footer