A clear understanding of when to use which type
In my last blog about NLP I covered Bag of Words, tokenization, TF-IDF and Word2Vec. All of these have a problem: they do not store semantic information.
By semantics here I mean the sequence of words, i.e. which word was spoken before or after which word.
This information is important to keep in language processing if we have to interpret text the right way.
For example, if I say “You are beautiful but not intelligent” and we are not able to keep the semantics, the sentence may mean something quite different.
In an RNN we keep the sequence information so that a meaningful sentence can be made, translated or interpreted.
Problems with a standard neural network
1) Inputs and outputs can be of different length in different examples.
2) Learning is not shared across different positions of the text, so neither sequence information nor context is maintained. This is necessary for speech recognition, text generation, and semantic recognition of text or voice.
Types of Recurrent Neural Network
1) Many to Many
a) With same number of input and outputs
b) With different number of inputs and outputs
2) Many to One
3) One to Many
4) One to One
Many to Many Architecture of RNN (Equal Inputs and Outputs)
Examples of problems using such architecture are:
1) Video classification where we wish to label each frame of the video
2) Named entity recognition
In this architecture the sequence of inputs is maintained and an output is produced at every step. The series x is basically a sentence; each word is fed into the network, which produces an output y as well as a context output o that is fed into the next step of the network to carry information about what is being talked about.
So the prediction at each step is based on the words that have come before it in the sequence.
Here in forward propagation the outputs y can be represented by:
y11 = f(w1*o1 + by)
y12 = f(w1*o2 + by)
y13 = f(w1*o3 + by)
y14 = f(w1*o4 + by)
where w1 is the weight assigned to the context output o, by is the bias of y, and f is any activation function like ReLU, tanh, etc.
An example is identifying named entities in a sentence like “Adam lists his Manhattan house for $37.5 million”: here Adam is a name, Manhattan is a location and $37.5 million is a monetary value.
The context outputs o are represented by the following equations:
o1 = f(x11*w + o0*w' + bo)
o2 = f(x12*w + o1*w' + bo)
o3 = f(x13*w + o2*w' + bo)
o4 = f(x14*w + o3*w' + bo)
Here again f is an activation function, w is the weight given to the input x (x11 being the first word fed in), w' is the weight given to the previous context output, and bo is the bias applied to the context output o. The initial context o0 is typically just a vector of zeros.
Many to Many Architecture of RNN (Unequal Inputs and Outputs)
Example of problem using such architecture is:
- Language translation
This architecture is typically used in language translation tasks, where the number of inputs and outputs can differ. For example:
The English sentence “How are you doing?” can be translated to French as “Comment allez-vous ?”.
Notice the difference in the number of words: the English sentence has four words while the French one has three.
So in architecture above, n is not equal to m.
While the sequence of input words and their context is maintained by the encoder and decoder, the number of inputs and outputs can differ.
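For a feel of how such a setup looks in code, below is a minimal encoder-decoder sketch in Keras. The vocabulary sizes, hidden size and choice of SimpleRNN cells are assumptions for illustration, not the exact model described above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes for illustration
src_vocab, tgt_vocab, hidden = 5000, 6000, 128

# Encoder: reads the English sentence and keeps only its final context state
enc_in = keras.Input(shape=(None,), name="english_tokens")
enc_emb = layers.Embedding(src_vocab, hidden)(enc_in)
_, enc_state = layers.SimpleRNN(hidden, return_state=True)(enc_emb)

# Decoder: generates the French sentence, starting from the encoder's context
dec_in = keras.Input(shape=(None,), name="french_tokens")
dec_emb = layers.Embedding(tgt_vocab, hidden)(dec_in)
dec_out = layers.SimpleRNN(hidden, return_sequences=True)(dec_emb, initial_state=enc_state)
dec_pred = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = keras.Model([enc_in, dec_in], dec_pred)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

Because the decoder runs for as many steps as the target sentence has tokens, the number of outputs is decoupled from the number of inputs.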
Many to One Architecture of RNN (Many inputs but one output)
Example of problem using such architecture is:
1) Sentiment Analysis
In this architecture the sequence and context of the words are taken into consideration to decide the sentiment, for example whether a sentence is negative or positive.
If someone reviews a movie as “The movie was not at all good”, this sentence has to be interpreted in sequence and its context has to be taken into consideration to get the sentiment right. The network does not need to give many outputs, just one: the sentiment. Hence the many-to-one relation.
The above architecture can be explained as:
o1=f (x11 *w + o0 * w’)
o2=f (x12 *w + o1*w’)
o3=f (x13 *w + o2*w’)
o4=f (x14 *w + o3*w’)
Loss=y-y^
here o1 , o2… are context outputs which have sequence of words maintained, w is weight given to input while w’ is weight given to each output o, f is an activation function.
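As an illustration, a many-to-one sentiment classifier of this kind could be sketched in Keras as below; the vocabulary size and layer widths are assumed values:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, hidden = 10000, 64   # assumed sizes for illustration

model = keras.Sequential([
    layers.Embedding(vocab_size, 32),            # word ids -> word vectors
    # return_sequences=False keeps only the final context output o4 (many-to-one)
    layers.SimpleRNN(hidden, return_sequences=False),
    layers.Dense(1, activation="sigmoid"),       # single output: positive/negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```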
Weight Updation w’’, in Backward Propagation at time t4 will be:
W’’new= W’’ –dL/dw’’
By Chain rule w’’ is dependent on y^
W’’ new= W’’ –(dL/dy^ * dy^/dw’’)
Weight Updation w w.r.t x14 in Backward Propagation at time t4
Wnew= W –dL/dw
By Chain Rule w is dependent on o4 which in turn is dependent on y^
W new=W –(dL/dy^ * dy^/do4 * do4/dw)
Weight Updation w’ w.r.t o3 in Backward Propagation at time t3
W’new= W’ –dL/dw’
By Chain Rule w is dependent on o4 which in turn is dependent on y^
W’ new=W’ –(dL/dy^ * dy^/do4 * do4/dw’)
Because so many derivatives, each of them a very small value, are multiplied together, this can result in the vanishing gradient problem.
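A quick numerical sketch of why these repeated multiplications shrink the gradient: the derivative of the sigmoid is at most 0.25, so a chain of such factors over many time steps quickly approaches zero. The weights and pre-activations below are random placeholders, purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # maximum value is 0.25, at z = 0

# Simulate the chain-rule product of derivative factors across many time steps,
# with an illustrative recurrent weight w' close to 1.
rng = np.random.default_rng(42)
w_rec = 0.9
grad = 1.0
for t in range(50):                   # 50 time steps
    z_t = rng.normal()                # pre-activation at step t (placeholder)
    grad *= sigmoid_derivative(z_t) * w_rec
print(f"gradient factor after 50 steps: {grad:.3e}")  # an extremely small number, effectively zero
```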
One to Many
Example of problem using such architecture is:
1) Music generation
2) Image captioning takes an image and outputs a sentence of words
In the architecture above only one input is given which generates a series of outputs.
An example is music generation: only the first note is given, and from it a whole series of outputs is generated, with the output at each step fed back in as the input for the next step.
Similarly, if an image is given to be captioned, the output can be many words, like “a person riding a bike on a sunny day”.
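A minimal NumPy sketch of this feed-the-output-back loop, assuming the “note” is just a small vector and the weights are random placeholders:

```python
import numpy as np

def generate_sequence(first_note, w_in, w_rec, w_out, steps=8):
    """One-to-many generation: one input, a sequence of outputs.
    At every step the previous output is fed back in as the next input."""
    hidden = np.zeros(w_rec.shape[0])
    x = first_note
    generated = []
    for _ in range(steps):
        hidden = np.tanh(x @ w_in + hidden @ w_rec)   # update the context
        y = np.tanh(hidden @ w_out)                   # produce the next "note"
        generated.append(y)
        x = y                                         # feed the output back as input
    return np.array(generated)

rng = np.random.default_rng(1)
note_dim, hidden_dim = 4, 16
sequence = generate_sequence(
    first_note=rng.normal(size=note_dim),
    w_in=rng.normal(size=(note_dim, hidden_dim)) * 0.5,
    w_rec=rng.normal(size=(hidden_dim, hidden_dim)) * 0.5,
    w_out=rng.normal(size=(hidden_dim, note_dim)) * 0.5,
)
print(sequence.shape)  # (8, 4): eight generated notes from one input
```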
One to One
Example of problem using such architecture is:
1) Image classification
In this type of architecture only one input is given, for example an image of a dog or a cat, and the network has to classify which it is.
The diagram is simple to understand: x is the input image and y is the classification given out, after the model has been trained on dog/cat classification.
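For completeness, a one-to-one model is just an ordinary feed-forward network with no recurrence. A minimal Keras sketch of a dog/cat classifier (the image size and layer widths are assumed for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),        # one image in
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # one output: dog vs cat
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```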
Advantages of RNN:
1) It can process inputs of any length.
2) The model size does not increase with the size of the input.
3) It takes historical information into account.
4) Weights are shared across time steps.
The shortcomings or disadvantages of RNN are:
1) Vanishing gradient problem: as the network gets deeper in time, the derivatives, which for sigmoid mostly lie between 0 and 0.25, are multiplied many times. This results in ever smaller values that do not update the new weights significantly, so there is no learning and no convergence to the global minimum, i.e. the point of minimum loss.
2) If we are using functions like ReLU, the problem will instead be exploding gradients, which means the gradients become so large that the weights keep jumping around and never converge to the global minimum. This can however be resolved by gradient clipping, a technique by which a maximum value for the gradient is capped so that the updates stay under control in practice (see the sketch after this list).
3) A simple RNN used for text generation can only consider the context of the words that came before a given position, not the words that come after it. This is resolved by a bi-directional RNN (also shown in the sketch after this list).
4) Computation can be slower.
5) It has difficulty accessing information from many time steps ago.
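Two of the shortcomings above map to standard remedies in Keras: gradient clipping via the optimizer's clipnorm argument (for exploding gradients) and the Bidirectional wrapper around a recurrent layer (for context from both directions). A minimal sketch, with assumed sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, hidden = 10000, 64   # assumed sizes for illustration

model = keras.Sequential([
    layers.Embedding(vocab_size, 32),
    # Bidirectional wrapper: context from words before AND after each position
    layers.Bidirectional(layers.SimpleRNN(hidden)),
    layers.Dense(1, activation="sigmoid"),
])

# clipnorm caps the norm of each gradient, guarding against exploding gradients
optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="binary_crossentropy")
model.summary()
```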
Thanks for reading!