This article is a comprehensive overview of some numerical representations of text data for Machine Learning algorithms.
Building machine learning models is not only restricted to numbers, we might want to be able to work with text as well. However, those models can only be fed with numbers. To bridge this gap a lot of research has gone into creating numerical representation for text data. In this article, we will explore some of them : One-hot encoding, Count Vectors, Tf-Idf, Co-occurrence vectors and Word2Vec. We will also discuss about their advantages and drawbacks.
Further in the article, context and neighboring words will be used to mean the same thing.
Disclaimer: advanced preprocessing technics will not be performed since the goal is to really understand the concepts behind each algorithm. But this could be the topic for another post.
This algorithm is used to generate a vector with length equal to the number of categories in your dataset, a category being a single distinct word.
Say for example that we want the one hot encoding representation for each of the three following documents corresponding reviews about a restaurant.
The one-hot encoding representation of each document is done following these steps:
- Step 1: Create a set of all the words in the corpus
- Step 2: Determine the presence or absence of a given word in a particular review. The presence is represented by 1 and the absence represented by 0. Each review will be then represented as a tuple of 0, 1 elements.
At the end of the two steps, we can finally get the one-hot encoding representation of all our three reviews (R1 to R3). This technic seems straightforward, but presents the following drawbacks:
- Real world vocabulary tends to be huge, so the size of the vector representing each document will consequently also be huge, no mater the number of word in a given document.
- We completely lose the order in which the words appear in the review/document, which unfortunately leads to lost of context.
- The frequency information of words is lost due to the binary representations. For example the words great appears twice in the first review also the word better appears twice in the second review, but there is no way to express that, all we know is their existence.
This algorithm is very similar to the on-hot encoding, but it has the advantage of identifying the frequency/counts of the words in the documents they appear. We can apply the count vectors to our previous corpus following these steps:
Step 1: Convert each document into a sequence of words containing that document.
- Step 2: From the set of all the words in the corpus, count how often the word occurs in the document.
From the previous table, we can notice that:
- The word great appears twice in the sequence of words for the first review, so its value is represented by 2. Same for the other words, 1 for restaurant, 1 for service, 0 for better, etc.
- The word better also appears twice in the sequence of words for the second review, then its value is represented by 2, etc.
- The same analysis goes for the third review.
As you can notice, the drawbacks of the count vectors are similar to one-hot encoding in terms of:
- The size of the vector representing each document. But a commonly used technic to mitigate this issue is to only choose the top n words based on their frequency. n being the count of each word.
- No context capturing of the words.
- Lost of semantics and relationships between words.
This algorithm is an improvement of the count vectors, and is widely used in the search technologies. Tf-Idf stands for Term frequency-Inverse document frequency. It tends to capture :
- How frequently a word/term Wi appears in a document dj . This expression can be mathematically represented by Tf(Wi, dj)
- How frequently the same word/term appears across the entire corpus D. This expression can be mathematically represented by df(Wi, D).
But where does the term inverse come from?
We’ve all been told since high school that the inverse of something (except 0) is represented by 1 over that thing. The same applies here:
- Idf measures how infrequently the word Wi occurs in the corpus D.
With that additional information, we can compute the Tf-Idf using the product of the tf and idf values using the following formula:
Each document dj will then be represented by the Tf-Idf score of each word in that document like this:
Tf gives more importance (weight) to the words appearing more frequently in a single document. On the other hand, Idf will try to down-weight the words that occur multiple time across the entire corpus, for that we can think of words such as “the”, “this”, “a”, ‘’an’’ etc. Then putting them together (Tf-Idf) helps in capturing the rare words that do not appear frequently in the document.
- Captures both the relevance and frequency of a word in a document.
- Each word is still captured in a standalone manner, thus the context in which it occurs is not captured.
This algorithm is based on the principle that similar words will occur together and will also have similar context. To make it easy to understand, we are going to simultaneously apply the context window and build the co-occurrence matrix. But before diving into the process, let understand the context window and co-occurrence matrix.
E.1) Context window
For a given word of interest we identify its neighboring words based on the size of the window. These neighboring words can be considered as:
- the words in the left of the word of interest
- the words in the right of the word of interest
- the words around the word of interest
E.2) Co-occurrence matrix
Before talking about co-occurrence matrix, let see what the co-occurrence of two words is. The co-occurrence of two words W1 and W2 corresponds to the number of time these two words occurred together in the context window.
From there, we can then build the co-occurrence matrix which is a NxN matrix, N being the total number of vocabulary in the entire corpus. So each document will have a size of NxN.
E.3) Step-by-step building process of the co-occurrence matrix/vectors
For educational purpose, we are going to apply the context window of 2 with a sliding size of 1 and we will be considering only one document which is given below.
document = “There are many untapped potentials in Africa”.
In our case N=7, but in real life we can end up with a huge set of vocabulary.
- Create a matrix of size NxN with the set of vocabulary in row and column
- Initialize the matrix with zeros.
- From left to right, identify word of interest and its neighboring words.
- Count how many time the neighboring word appears with the word of interest.
- Increment the cell corresponding to the word of interest and each one of the neighboring words with their corresponding number of occurrence.
- Captures the semantic relationship between the words of the same document.
- It provides and efficient computation, because it can be easily reused.
- It can provide more accurate representation for dimensionality reduction tasks using technics like PCA (Principal Component Analysis).
- We get a huge matrix size which of course requires a lot of memory.
This algorithm is even better than the previous ones to represent text in numeric form, because not only we can have a low dimensionality representation of a word we can also capture their meaning and the context (neighboring words) in which it occurs.
The big assumption with word2vec is that the meaning of a word can be inferred by the company it takes, which might be similar to the following you might hear at least once: “tell me who your friends are, and I will tell you who you are”. Word2vec comes with two neural network architectures: CBOW and Skip-gram (we will not cover them in depth in this article).
- CBOW takes as input the neighboring words and predicts the word of interest.
- Skip-gram in the other hand performs the opposite task. It takes as input a word of interest, and predict its neighboring words.
Yes, I got the general idea of Word2Vec, But how is the semantic relationship captured from vectors?
Once the word2vec model has been trained in a given large corpus of data, it is then possible to get the numeric representation of the input word. The following image illustrates how the meaning and semantic representation is captured:
- Abidjan, Bamako and Dakar are important cities in Côte d’Ivoire, Mali and Senegal respectively
- With a large enough training text corpus, the model would have been able to figure out the underlying cities semantic meaning by assigning numeric representations to the Abidjan, Bamako and Dakar that are very similar in value to each other as we can see in the output.
- Word2vec captures the meaning of words.
- It captures the relationship between words (“Abidjan” ~ “Côte d’Ivoire” == “Bamako” ~ “Mali”), meaning that the subtraction between the vector representations of “Abidjan” and “Côte d’Ivoire” is very close to the result of the subtraction of the vector representations of “Bamako” and “Mali”
- It captures very interesting information from the large training data with very low dimensionality.
I hope you enjoyed your journey through this article, and that it will help you in making the right choice to encode your text data for machine learning tasks. If you have any question or remark, I will be glad to welcome it for further discussions.
For further readings do not hesitate to consult the following links:
Bye for now 🏃🏾