The term word embedding has been floating around for years now, so you have most likely come across it already. This post in the ML ED (Machine Learning, Easily Digestible) series explains the main principles behind this exciting methodology: what an embedding is in general, what kind of information it captures about a word, and what needs to be taken into account when deploying embeddings. In the next part, we will give an overview of the different kinds of word embedding models and explore their strengths and weaknesses. Enjoy reading!
Embeddings are representations of things in another world.
Musical notes, for example, are defined on musical staves, but if you want to, you can embed them into colour-space, by assigning each note to a specific colour. In doing so, a mapping is created, holding a colour-representation for each note.
But why would we even want to do this? Well, some problems are very hard, or even unsolvable in their initial world. However, if you have the possibility to see them from another viewpoint — having them embedded into a different space — the task might be far easier.
For example, if you give a deaf person a music sheet and try to make them experience the music, they might have a hard time. But if you transform the notes into colours, and the musical piece into a kind of painting, you make the music ‘seeable’ and therefore understandable and interpretable for those without hearing, enabling them to feel the goosebumps.
Machine learning models are deaf and blind to everything except numbers. Words, pictures, recordings: all nonsense to the machine as long as they are not converted to numbers. This means that if we want to do reasoning, natural language processing, question-answering, all this very cool machine learning stuff, we first need to embed the words we want to work with into vector space, meaning that we assign each word a sequence of numbers: a vector.
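To make "assigning each word a sequence of numbers" a bit more tangible, here is a minimal sketch of an embedding as a plain lookup table. The table, its numbers, and the helper `embed` are made up for illustration; a real model learns these values from text, and real vectors usually have far more than three dimensions.

```python
# Hypothetical toy embedding table: the numbers are invented for illustration,
# they do not come from any trained model.
embedding_table = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "apple": [0.1, 0.2, 0.2],
}


def embed(sentence, table):
    """Map each whitespace-separated word to its vector.

    Words that are missing from the table get an all-zero vector
    (one simple convention among several possible ones).
    """
    dim = len(next(iter(table.values())))
    unknown = [0.0] * dim
    return [table.get(word.lower(), unknown) for word in sentence.split()]


vectors = embed("King Queen banana", embedding_table)
for word, vector in zip("King Queen banana".split(), vectors):
    print(word, vector)
```

Note that the machine learning model only ever sees the lists of numbers; the word itself is gone after the lookup.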
To make things a little more complicated, ‘embedding’ is an ambiguous word: it describes both the process of creating a representation of an entity in another world as well as the representation itself. Language is complicated.
Going back to our example of transforming musical notes into colours: the helpfulness of the new representation depends heavily on the mapping itself. If the notes of a musical scale are assigned to the colours of the rainbow in order, our deaf ‘listeners’ will easily recognise an ascending tone sequence or a coloratura. However, if the colours are mapped randomly, Beethoven’s perfectly smooth Moonlight Sonata will end up as a Jackson Pollock painting. Working with these skewed, non-meaningful representations of musical pieces is difficult, as the characteristics got lost in translation, and our hearing-impaired person might interpret the Pollock painting as a “Klavierstück” by Karlheinz Stockhausen rather than the Moonlight Sonata.
The same principle applies to machine learning models, words, and their vector representations: if we do not capture the meaning of a word in its vector, the machine learning model will have a hard time understanding it.
But how do we keep the meaning of a word when embedding it?
Fortunately, we don’t have to worry too much about how to do that, as there are well-established word embedding models (word2vec, GloVe, fastText, ELMo, and BERT, to name the most prominent ones) which do this automatically during their learning process. We feed them our texts and they analyse them, word by word, and try to understand how the words in these texts are used: in which contexts they appear, which words are used in similar contexts (e.g. synonyms), which words are related, and how they are related. At the end of the training, the models have packed all of the information they collected about a word into a compact vector.
At this point you might ask “but how is all this information kept? For me, the representation of the word Man is just a sequence of numbers…”. Well, this is the point where maths comes into play and the magic happens:
Imagine you have your model trained on your data, and now you look at the embeddings of 4 words: Man, Woman, King and Queen. They might look like this:
At first sight, they might seem like arbitrary arrows, but they actually encode the meaning of the words. For example, how do we get from King to Queen? Semantically, we have to change the gender, meaning that we have to ‘subtract the man-part’ and ‘add a woman-part’. Magic, magic: if we subtract the Man vector from King and afterwards add the Woman vector, we actually end up at Queen! Or in other words: the step from Man to King is the same as the step from Woman to Queen, so the distance between Man and King equals the distance between Woman and Queen.
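The King, Man, Woman, Queen arithmetic can be sketched in a few lines. The 2-D vectors below are hand-picked so that the ideal case works out exactly (one axis loosely standing for "royalty", the other for "gender"); they are an assumption for illustration, not the output of a trained model.

```python
# Hand-picked 2-D toy vectors (assumption): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "man":   [1.0, 1.0],
    "woman": [1.0, 3.0],
    "king":  [4.0, 1.0],
    "queen": [4.0, 3.0],
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]


# King - Man + Woman
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(result)                      # [4.0, 3.0]
print(result == vectors["queen"])  # True

# The step from Man to King equals the step from Woman to Queen.
print(sub(vectors["king"], vectors["man"]))     # [3.0, 0.0]
print(sub(vectors["queen"], vectors["woman"]))  # [3.0, 0.0]
```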
Distance in the vector space represents the semantic relationship of the words
So we can see, the meaning is captured in the vectors, while their relationship can be derived from the distance between them! Similar words will end up close in the vector space, while very unrelated words will be farther apart.
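As a rough sketch of "similar words end up close", the snippet below looks up the nearest neighbour of a word by Euclidean distance. The three toy vectors and the helper `nearest` are invented for illustration; in practice, cosine similarity is a common alternative to Euclidean distance for comparing word vectors.

```python
import math

# Invented 2-D toy vectors: "cat" and "dog" are placed close together on purpose.
toy = {
    "cat": [0.9, 0.8],
    "dog": [0.8, 0.9],
    "car": [-0.7, 0.1],
}


def nearest(word, table):
    """Return the other word whose vector has the smallest Euclidean distance."""
    return min(
        (other for other in table if other != word),
        key=lambda other: math.dist(table[word], table[other]),
    )


print(nearest("cat", toy))  # dog
print(math.dist(toy["cat"], toy["dog"]) < math.dist(toy["cat"], toy["car"]))  # True
```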
If you would like to explore the capability of word embeddings, the TurkuNLP group has published a website of already trained embeddings to play around with: http://bionlp-www.utu.fi/wv_demo/.
Accuracy. Unfortunately, the example shown above resembles the ideal case rather than a real-world scenario. In reality, the vectors will not fit perfectly, and in some cases, they will not fit at all. However, even small deviations from the ideal case can have an impact. The picture below shows slightly changed vectors for each of the four words from our example above. If we again do our King–Man+Woman calculation, we do not end up exactly at Queen, but only near it.
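Here is a small sketch of this non-ideal case, again with invented 2-D numbers. The vectors are slightly "off", so King–Man+Woman misses Queen; what we can still do is ask which known word lies closest to the point we landed on. The extra words "prince" and "apple" are hypothetical additions so that the search has something to choose from.

```python
import math

# Invented, slightly imperfect toy vectors (assumption, not trained values).
vectors = {
    "man":    [1.0, 1.0],
    "woman":  [1.2, 2.9],
    "king":   [3.9, 1.3],
    "queen":  [4.3, 2.8],
    "prince": [3.5, 1.0],
    "apple":  [-2.0, 0.5],
}

# King - Man + Woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Exclude the three input words, then pick the closest remaining word.
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
closest = min(candidates, key=lambda w: math.dist(target, vectors[w]))

print(closest)                              # queen
print(math.dist(target, vectors["queen"]))  # small, but not zero
```

Excluding the input words from the search is a design choice of this sketch: without it, one of the inputs can easily be the closest point to the result.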
So to face reality: with the word embeddings that you created from your text resources, even if your data quality is extremely high, you will most probably not be able to do nice vector arithmetic like King–Man+Woman=Queen, and even something like King–Man+Woman≈Queen will be very hard to reach. However, this by no means implies that the embeddings are useless! You will still be able to cluster your data and derive important semantic information from it.
Data. How well the model actually performs depends to a great extent (besides some hyperparameter settings) on its main source of knowledge: the data it was trained on. It is important to understand that quantity alone is not enough; quality is the most crucial factor. Even if you have huge amounts of data, if it is just garbage, the produced embeddings will be useless. But what is good and bad quality? Well, low quality does not only mean obvious errors such as typos or misinformation. Just because you can read the text does not mean your model can read it. For example, headers, footnotes, and especially vertically written text on the side of your PDF document can mess up the text flow when the document is converted to text. Structural elements such as bullet points or tabular text are also a major challenge for your model. As your data is the knowledge source for your machine learning model, you might want to clean it before feeding it to your model, especially when your amount of data is limited. Cleaning your data can take a lot of time, depending on the data sources, but it might also greatly improve your outcomes! The nice thing, however, is that no prior labelling is necessary when working with word embeddings! Take your emails, documents, and website content and let your model learn.
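To give a feeling for what such cleaning can look like, here is a deliberately small sketch that drops page markers and strips bullet markers from text extracted from a document. The two patterns and the sample text are assumptions made up for this example; real documents are messier, and the right rules depend entirely on your own sources.

```python
import re

# Hypothetical patterns for this example only; adapt them to your own documents.
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")  # "- ", "• ", "2. ", "3) "
PAGE_MARK = re.compile(r"^\s*(?:page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_lines(raw):
    """Drop empty lines and page markers, strip bullet markers and extra spaces."""
    cleaned = []
    for line in raw.splitlines():
        if not line.strip() or PAGE_MARK.match(line):
            continue
        line = BULLET.sub("", line)
        cleaned.append(re.sub(r"\s+", " ", line).strip())
    return cleaned


raw = """Page 3 of 10
• The tiger lives in Asia.
  -   The lion   lives in Africa.

12
2. The panda lives in Asia.
"""

for sentence in clean_lines(raw):
    print(sentence)
```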
Semantic similarity. One final thing. In the previous section, we said that embeddings have the characteristic that similar words are close to each other. But what is ‘similar’? This very much depends on the context, and it is important to understand that the context is derived from the text corpus alone! For example, when I have a training corpus that looks like this: “The tiger lives in Asia. The lion lives in Africa. The giraffe lives in Africa. The panda lives in Asia. The rhino lives in Africa. The cougar lives in America. The capybara lives in America […]”, it is very likely that the embedding for lion is very close to giraffe and rhino (other animals from Africa). However, even though lion is semantically similar to other big cats like tiger or cougar, their embeddings will be far apart, because this relationship is not reflected in your data! So the take-home message here is: know your data, and do not expect your embedding model to know more than what can be read from your data!
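The animal example can be reproduced with a very crude stand-in for an embedding model: simply count which words appear together in a sentence and compare those counts. This is not how word2vec or BERT work internally; it is only a sketch under simplifying assumptions (a hand-written stop word list, sentence-level co-occurrence, cosine similarity) to show that the vectors can only reflect what is in the corpus.

```python
import math
from collections import Counter, defaultdict

corpus = (
    "The tiger lives in Asia. The lion lives in Africa. "
    "The giraffe lives in Africa. The panda lives in Asia. "
    "The rhino lives in Africa. The cougar lives in America. "
    "The capybara lives in America."
)

# Assumption: words shared by every sentence carry no information here.
STOPWORDS = {"the", "lives", "in"}

# For every word, count the other words it shares a sentence with.
cooc = defaultdict(Counter)
for sentence in corpus.split("."):
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    for word in words:
        for other in words:
            if other != word:
                cooc[word][other] += 1


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


print(cosine(cooc["lion"], cooc["giraffe"]))  # 1.0 (both only ever appear with "africa")
print(cosine(cooc["lion"], cooc["tiger"]))    # 0.0 (no shared context in this corpus)
```

Nothing in this corpus says that lions and tigers are both big cats, so no counting or training scheme could recover that relationship from it.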
TL;DR (too long; didn’t read): Generally, embeddings are representations of things in another world. Word embeddings are words represented as vectors, where the distance between vectors corresponds to the semantic relationship between the words (e.g., similar terms are close together). Word vectors only approximate these semantic relationships, and their quality depends heavily on the quality of the data. Word embeddings can only model relations that are present in the data. … But seriously, read the article 🙂