How to understand and code the bag of words model.
Natural Language Processing (often called NLP) is a foundational part of computer science. In fact, the ability of computers to communicate with humans in a natural language is an ambition that goes back to the very moment computers were invented.
With the development of machine learning, progress in this field has skyrocketed. Just look at the increase in the number of NLP papers published over the last couple of years (see graph below). However, to make natural language fit for machine learning, some work has to be done.
Natural language is different from other types of data: it's unstructured. Therefore it can't be interpreted by machine learning algorithms the way humans interpret it. To tackle this problem, natural language has to be converted into a more structured form of data, and in this article we'll explore one of the most foundational models for doing this: bag of words.
The bag of words model is fairly simple and intuitive; it really only requires one thing: a vocabulary. This is the set of words that the model can recognize in a piece of natural language. Every word that does not exist in the vocabulary will be ignored by the model. So let's declare a simple vocabulary of five words (keep in mind that in practice you'll probably want a much larger vocabulary than this).
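In Python, such a vocabulary can be a plain list. The five words below are an illustrative choice on my part (the article doesn't name them); I've picked words that appear in the example sentence used in a moment.

```python
# A simple five-word vocabulary. The words themselves are an
# illustrative choice, picked to match the example sentence below.
vocabulary = ["i", "get", "news", "internet", "blogs"]
```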
After declaring the vocabulary, we can take an arbitrary piece of natural language and convert it into a more structured form of data that can be used for machine learning. For our model we'll take the sentence "I get most of my news from the internet, especially blogs."
The way we do this conversion is by counting, for each word in the vocabulary, how many times it occurs in the sentence. This outputs an array with one occurrence count per vocabulary word.
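A minimal sketch of that counting step (the vocabulary and the regex-based tokenization are my own illustrative choices):

```python
import re

vocabulary = ["i", "get", "news", "internet", "blogs"]
sentence = "I get most of my news from the internet, especially blogs."

# Lowercase the sentence and split it into word tokens,
# stripping punctuation along the way.
tokens = re.findall(r"[a-z]+", sentence.lower())

# For each vocabulary word, count how often it occurs in the sentence.
bag = [tokens.count(word) for word in vocabulary]
print(bag)  # every vocabulary word occurs once here → [1, 1, 1, 1, 1]
```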
This array is the output of the model and a structured representation of the sentence we started with. From here on out we can let machine learning do what it does best: predicting one set of numbers from another set of numbers. All based on natural language 🙂
A limitation of the bag of words model is that it can't capture context very well. Sometimes the order in which words appear determines the meaning of the text. For example, the sentences "This is very bad. Not good!" and "This is not bad. Very good!" have completely different meanings, but the bag of words model would produce the same array as output.
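You can see this limitation directly in code. In the sketch below (the helper function and tokenization are my own illustrative choices), both sentences contain exactly the same words, just in a different order, so they map to the same array:

```python
import re

# A vocabulary covering the words in both example sentences.
vocabulary = ["this", "is", "very", "bad", "not", "good"]

def bag_of_words(text):
    # Lowercase, tokenize, and count each vocabulary word.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in vocabulary]

a = bag_of_words("This is very bad. Not good!")
b = bag_of_words("This is not bad. Very good!")
print(a == b)  # True: word order is lost, so both sentences
               # produce the identical array
```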
The bag of words model can easily be created with code. This example is written in Python.