Without sophisticated programs, machines cannot process human language the way our brains do. Computers are, however, extremely good at processing numbers and anything with mathematical structure. In a Natural Language Processing pipeline, it is therefore important to transform raw text into a numerical representation so it can be ingested by a model downstream.
As with any machine learning task, it’s important to first clean and normalize the raw text before converting it to numerical values. There are some relevant terms to be aware of:
- Corpus: a corpus can be a single computer-readable text/document or a collection of them.
- Normalization: the task of converting words/tokens to a standard format, choosing a single normal form for words with multiple forms.
- Lemmatization: a normalization method that determines the root form (lemma) shared by multiple words. For example, ‘did’, ‘done’, and ‘does’ all share the lemma ‘do’.
- Stemming: a type of normalization technique in which suffixes are stripped from the end of a tokenized word. For example, depending on the stemming method, stemming the word ‘butterflies’ may return ‘butterfly’ or ‘butterfl’.
- Tokenization: the process of separating out words in a text.
- Vocabulary: the set of distinct words in a corpus; its size is the vocabulary size.
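To make a couple of these terms concrete, here is a minimal sketch of tokenization and vocabulary extraction on a toy two-document corpus (a deliberately naive whitespace tokenizer; real pipelines also handle punctuation, casing rules, and so on):

```python
# Toy corpus: two short documents
corpus = ["the cat sat on the mat", "the dog chased the cat"]

# Naive tokenization: lowercase, then split on whitespace
tokens = [doc.lower().split() for doc in corpus]

# The vocabulary is the set of distinct words across the corpus
vocabulary = sorted({word for doc in tokens for word in doc})
print(vocabulary)       # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(len(vocabulary))  # vocabulary size V = 7
```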
One of the simplest ways to encode raw text data is Bag-of-Words, which builds on the idea of one-hot encoding and transforms each document in a corpus into a 1 x V vector, where V is the vocabulary size and each element holds the count of the corresponding word. This process is illustrated in the figure below:
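As a hand-rolled sketch of the idea (a toy corpus and helper function of my own, not a library implementation), each document maps to a 1 x V count vector like so:

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# Shared vocabulary across the whole corpus, in a fixed order
vocabulary = sorted({w for doc in corpus for w in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Encode a document as a 1 x V vector of word counts."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vector = bag_of_words(corpus[0], vocabulary)
print(vector)  # one count per vocabulary word; 'the' appears twice
```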
Bag-of-Words is simple and easy to implement. However, there are some pitfalls:
- Word order: Bag-of-Words discards the original word order, which can lose much of a sentence’s meaning. As a result, it’s not practical to compare texts encoded this way to determine semantic similarity.
- Sparse Vector: the length of a document’s vector representation grows linearly with vocabulary size. This results in sparse vectors, which are inefficient for computation and modelling.
- Word Scoring: a simple scoring method like raw word counts emphasizes highly frequent words that do not necessarily carry much informational content. For example, the word ‘the’ may be used extensively in a corpus without conveying any information. One could mitigate this by removing stop words, but a stop-word list might not capture every high-frequency term with little informational content.
A more advanced version of Bag-of-Words is called Term Frequency-Inverse Document Frequency, or TF-IDF for short. While it does not address the word-order and sparse-vector problems mentioned above, it does introduce a better scoring method than Bag-of-Words. It is based on the following two assumptions:
- Words that occur frequently within a document are more relevant to that document’s content.
- Words that occur frequently across all documents are less relevant to any single document’s content.
In essence, TF-IDF rewards frequency within a document, and penalizes frequency in the corpus. These are referred to as the Term Frequency and the Inverse Document Frequency, respectively.
Term frequency uses the same counting scheme as Bag-of-Words, but it also accounts for document length. Concretely, term frequency assigns each term t in a document d a weight equal to the number of occurrences of term t in the document, divided by the total number of words in the document:

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{\sum_{i=1}^{k} \mathrm{count}(t_i, d)}$$

where k is the number of distinct words in document d.
As the name implies, Inverse Document Frequency takes the ratio of the total number of documents N to the number of documents containing term t:

$$\mathrm{idf}(t) = \frac{N}{\mathrm{df}_t}$$
This penalizes words that occur frequently across the entire corpus. For example, imagine a library of 1000 machine learning publications, 800 of which contain the term ‘learning’, which carries little distinguishing information. The corresponding IDF would be 1000/800 = 1.25. In contrast, a term such as ‘sentiment’ may appear in only 10 documents, which gives an IDF of 100.
This can quickly become a problem as the number of documents increases, since the ratio can grow out of control. Therefore, IDF normally takes the log of the ratio, and the formula becomes:

$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}_t}$$
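To see why the log helps, compare the raw ratio with its log-scaled version on a hypothetical, much larger corpus (one million documents, with made-up document frequencies for a common and a rare term):

```python
import math

N = 1_000_000                      # hypothetical corpus size
df_common, df_rare = 800_000, 10   # document frequencies of a common vs rare term

# Raw ratios span several orders of magnitude
print(N / df_common)               # 1.25
print(N / df_rare)                 # 100000.0

# The log compresses the range into manageable weights
print(math.log(N / df_common))     # ~0.22
print(math.log(N / df_rare))       # ~11.51
```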
To put the two terms together, TF is simply multiplied by IDF to give:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$
The multiplication produces the following effects:
- Large TF: the term is relatively common within the document.
- Small TF: the term is relatively rare within the document.
- Large IDF: the term is rare across the corpus.
- Small IDF: the term is common across the corpus.
In summary, TF-IDF emphasizes terms that are relatively rare in the corpus but are used repeatedly in the document, which makes intuitive sense.
TF-IDF is not a complex encoder to implement. Conveniently, the Python package scikit-learn can do all of the above in a few lines of code:
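For instance, a sketch using scikit-learn’s `TfidfVectorizer` on a toy corpus of my own (note that scikit-learn’s defaults apply a smoothed IDF and L2 normalization, so its weights differ slightly from the plain formulas above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "sentiment analysis with machine learning",
]

# fit_transform learns the vocabulary and returns a sparse TF-IDF matrix,
# one row per document and one column per vocabulary word
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print(tfidf_matrix.shape)              # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
```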
And each component can be implemented easily by hand, as in the following code snippet:
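A bare-bones manual version of the formulas above might look like this (a sketch with a naive whitespace tokenizer; it uses the plain log-based IDF from this post rather than scikit-learn’s smoothed, normalized variant):

```python
import math
from collections import Counter

def term_frequency(term, document):
    """tf(t, d): occurrences of t divided by the total token count of d."""
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """idf(t) = log(N / df_t), where df_t counts documents containing t."""
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / df)  # assumes t occurs in at least one document

def tf_idf(term, document, corpus):
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

corpus = ["the cat sat on the mat", "the dog chased the cat"]
# 'the' appears in every document, so its IDF (and TF-IDF) is zero
print(tf_idf("the", corpus[0], corpus))  # 0.0
# 'mat' appears in only one document, so it gets a positive weight
print(tf_idf("mat", corpus[0], corpus))
```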
For more details on how this manual implementation can be used, check out the following GitHub repo.
In this post, I went over Bag-of-Words and TF-IDF, two simple ways to encode raw text data. The key takeaway is that TF-IDF’s scoring method emphasizes words that are common within a document but rare across all documents in a corpus.