Word to Vector: Understanding the Concept
NLP is a buzzword, and there are plenty of problem statements to experiment with. The deeper you go, the more insight you gain into your data, and creative ways of exploring data often lead to new solutions for existing problems. We already covered feature engineering in NLP using TF-IDF and PMI in an earlier blog (link below). Now it is time to move on to the next step in feature engineering.
Feature engineering in NLP can also be seen as vector compression: the idea is to obtain a less sparse vector and better performance. Other dimensionality reduction techniques, such as SVD, can be applied as well. Now let us move ahead and learn about word embeddings — word2vec, CBOW and skipgram.
A word embedding gives us a denser vector, representing words in a way that bridges the gap between human-readable text and the machine's interpretation. In simple words, word embedding is the process of representing text in an n-dimensional space. This is one of the important steps in solving NLP-related problems.
One-hot encoding also represents text in vector form, but it is very sparse. A word embedding, by contrast, is much denser and can be called a dense representation.
As we can see, the output of one-hot encoding is a sparse vector full of zeros. After multiplying the one-hot vector by the word embedding matrix, we get a denser and more useful output.
The final output is simply the embedding row selected by multiplying the one-hot vector with the embedding matrix. If this is applied to all the given words, you end up with a very compact, dense matrix. Such a matrix is easy to pass to a neural network as context, and the more compact input generally gives better output.
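Here is a minimal NumPy sketch of that lookup, using a made-up five-word vocabulary and random embedding values purely for illustration:

```python
import numpy as np

# Hypothetical toy vocabulary and a 3-dimensional embedding space.
vocab = ["the", "cat", "sat", "on", "mat"]
embedding_matrix = np.random.rand(len(vocab), 3)   # each row is a dense word vector

# One-hot vector for "cat": sparse, a single 1 and the rest zeros.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("cat")] = 1.0

# Multiplying the one-hot vector by the embedding matrix just selects the dense row for "cat".
dense_vector = one_hot @ embedding_matrix
print(one_hot)        # [0. 1. 0. 0. 0.]
print(dense_vector)   # a 3-dimensional dense vector
```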
Word embeddings can be generated using various methods such as neural networks, co-occurrence matrices, probabilistic models, etc.
Above, we have seen how to convert words/text into vectors. The following steps outline how to achieve this (a minimal training sketch follows the list):
- Define the task that we want to predict
- Go through each sentence and create the task's inputs/outputs
- Iterate through the task's inputs/outputs, passing the inputs through the word embeddings and the model to create predictions
- Measure the cost between the predicted and expected output
- Update the embedding weights accordingly (backprop)
- Repeat steps 3–5 until the desired result is reached.
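Below is a minimal sketch of this loop in NumPy on a toy "predict the next word" task; the corpus, dimensions and learning rate are made up, and a real word2vec implementation adds many refinements (sub-sampling, negative sampling and so on):

```python
import numpy as np

# Toy corpus and vocabulary, purely for illustration.
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 10, 0.1

W_in = np.random.randn(V, D) * 0.01    # embedding weights (input -> hidden)
W_out = np.random.randn(D, V) * 0.01   # output weights (hidden -> vocab scores)

for epoch in range(100):                               # step 6: repeat
    for cur, nxt in zip(corpus[:-1], corpus[1:]):      # step 2: build input/output pairs
        x, y = word2idx[cur], word2idx[nxt]
        h = W_in[x]                                    # step 3: look up the embedding
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                           # softmax prediction
        loss = -np.log(probs[y])                       # step 4: cross-entropy cost
        grad_scores = probs.copy()
        grad_scores[y] -= 1.0                          # gradient of the loss w.r.t. scores
        grad_h = W_out @ grad_scores
        W_out -= lr * np.outer(h, grad_scores)         # step 5: update output weights
        W_in[x] -= lr * grad_h                         # step 5: update embedding weights (backprop)
```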
Word2vec consists of models for generating word embeddings. These are very simple neural network models with an input layer, a hidden layer and an output layer. There are two architectures to choose from: CBOW and skipgram. The core idea is to generate the embeddings and then retrieve words that fit a given context.
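A quick way to try both architectures is the gensim library. This is a hedged sketch assuming gensim 4.x is installed; the toy sentences are invented, and a useful model needs a far larger corpus:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 -> CBOW
skip_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skipgram

print(cbow_model.wv["cat"][:5])            # first few values of the dense vector for "cat"
print(skip_model.wv.most_similar("cat"))   # words closest to "cat" in the embedding space
```

The `sg` flag switches between the two architectures: 0 selects CBOW and 1 selects skipgram.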
Challenges/Limitations
- It cannot handle new words; it does not generalize to unknown situations or OOV (out-of-vocabulary) words (see the small example after this list)
- No shared representations at the sub-word level
- A new problem statement requires new embedding matrices to produce good results, which is logical but adds cost
- Cannot be used to initialize state-of-the-art architectures
- It has trouble handling multiple word senses (polysemy and homonymy) in a structured manner.
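To make the OOV limitation concrete, here is a small hedged example (again assuming gensim 4.x, with made-up training sentences); asking for a word that was never seen in training simply fails:

```python
from gensim.models import Word2Vec

model = Word2Vec([["the", "cat", "sat", "on", "the", "mat"]],
                 vector_size=50, window=2, min_count=1)

try:
    model.wv["dinosaur"]          # never seen during training
except KeyError:
    print("OOV word: no vector available for 'dinosaur'")
```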
CBOW stands for continuous bag of words. As the name suggests, it predicts a word given n context words. You may have noticed this while writing mails and messages, where you are offered multiple word suggestions to choose from.
Here, the structure is made up of an input layer of context words and an output layer containing the current word, i.e. the word to be predicted. The hidden layer has as many units as the number of dimensions we want for representing the current word in the output layer.
In simple words, you pass the context words to the model and the model produces the expected word, which you can also call the current word, as output. A harder version is a fill-in-the-blank problem, where the context consists of the n previous and n next words. CBOW can handle such problems as well.
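A minimal sketch of a single CBOW forward pass, with a toy vocabulary and random weights just to show the shapes, might look like this:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word2idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

W_in = np.random.randn(V, D) * 0.01   # input layer -> hidden layer (embeddings)
W_out = np.random.randn(D, V) * 0.01  # hidden layer -> output layer (scores per word)

context = ["the", "sat"]                                 # n words around the blank
h = W_in[[word2idx[w] for w in context]].mean(axis=0)    # hidden layer: average of context vectors
scores = h @ W_out
probs = np.exp(scores - scores.max()); probs /= probs.sum()
print(vocab[int(np.argmax(probs))])   # the model's guess for the current (missing) word
```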
Skipgram can be seen as the opposite of CBOW, which makes it easy to understand: given the current word, it finds the context words. The model again consists of an input layer, a hidden layer and an output layer. The input layer contains the current word; as before, the hidden layer has the number of dimensions we want for representing that word; and the output layer holds the context words. You can relate this to multiple use cases.
As we can see in the diagram above, we pass the current word and get multiple context words before and after it as output.
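And here is the mirrored skipgram forward pass, again with made-up values: the current word goes in, and every vocabulary word receives a probability of being one of its context words:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
word2idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

W_in = np.random.randn(V, D) * 0.01   # input layer -> hidden layer (embedding of the current word)
W_out = np.random.randn(D, V) * 0.01  # hidden layer -> output layer (context word scores)

h = W_in[word2idx["sat"]]             # the hidden layer is just the current word's vector
scores = h @ W_out
probs = np.exp(scores - scores.max()); probs /= probs.sum()
for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1])[:3]:
    print(word, round(float(p), 3))   # most likely context words for "sat"
```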
This has been a good overview of an important step in NLP: converting words to vectors. Knowing the two key methods, CBOW and skipgram, helps you understand the topic in depth, and you can easily relate your own problem statement to these terms and come up with sound solutions. Going deeper, there are other pieces such as forward propagation, backward propagation and the loss function involved in model training.
To solve our problem statement, we can also explore pre-trained word embeddings and get good output. But pre-trained embeddings have their own benefits and limitations that need to be considered; first and foremost, they are not easy to interpret. Overall, many other options can be explored and learnt. This blog should have given you a good grounding in these new terms. You can apply this in your own code and build a model to practise on. NLP is growing, and learning the basic terms will always help you understand the breadth of the field. Happy Learning and Coding!
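As a parting pointer, here is what loading pre-trained embeddings can look like with gensim's downloader (a hedged sketch: it assumes gensim 4.x plus internet access, and the model name is one of gensim's publicly listed pre-trained options):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use
print(glove.most_similar("king")[:3])        # nearest neighbours in the pre-trained space
```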