Consider the task of simple Image Classification: recognising the contents of an image and assigning it a label. For example, a picture of the Taj Mahal could be assigned the label “monument”.
In the earlier examples of applying kernels to images and performing convolutions, we used hand-crafted kernels such as edge detectors and sharpeners to extract features from images.
Instead of using handcrafted kernels, can we let the model decide on the best kernels for a given input image? Can we enable the model to learn multiple kernels on its own, in addition to learning the weights of the classifier?
Convolutional Neural Networks aim to achieve exactly this. Kernels can be treated as parameters and learnt, in addition to the weights of the classifier, using backpropagation.
But how is this different from a regular feed-forward neural network?
A feed-forward neural network classifying an image of a digit.
Consider the network shown above. A 4 × 4 pixel image can be flattened into a linear array of 16 input nodes for a neural network.
We observe that there are a lot of dense connections, which not only lead to heavier computation but also to a loss of consistency in the extracted features.
Contrast this to the case of convolution.
A convolution operation leads to sparser connections.
Convolution takes advantage of the structure of the image: interactions between neighbouring pixels are far more significant for determining what the entire picture represents than interactions between distant ones.
Moreover, convolution leads to sparse connectivity which reduces the number of parameters in the model.
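To make the sparse-connectivity point concrete, here is a minimal NumPy sketch (the 4 × 4 image and 3 × 3 kernel are made-up illustrative values): each output value is computed from only a small local neighbourhood of the input, not from every pixel.

```python
import numpy as np

# A toy 4x4 grayscale image and a 3x3 kernel (illustrative values only)
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[0., 1., 0.],
                   [1., -4., 1.],
                   [0., 1., 0.]])  # a simple Laplacian-style edge detector

# "Valid" convolution (really cross-correlation, as in most deep learning
# libraries): slide the kernel over the image and take local dot products.
out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
output = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]        # only a 3x3 neighbourhood...
        output[i, j] = np.sum(patch * kernel)  # ...influences this output

print(output.shape)  # (2, 2): each of the 4 outputs touches just 9 inputs
```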
But is sparse connectivity really a good thing? Aren’t we losing information by dropping interactions between some input pixels?
- Well, not really. If anything, losing these direct interactions can prove beneficial deeper in the network.
- Consider the neurons x1 and x5. They don’t interact with each other directly in layer 1.
- However, they do interact at layer 2, where the features computed from them are more abstract.
Weight Sharing is another advantage of Convolutional Neural Networks. Each kernel is applied at every location in the image, so its weights are shared across all locations. Instead of learning a separate set of weights for each location again and again, the model learns a single kernel per feature map, which makes the job of learning the parameters (kernels) much easier.
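A quick way to see the effect of weight sharing is to compare parameter counts. The sketch below (using PyTorch purely for illustration) contrasts a fully connected layer on the flattened 4 × 4 image with a single 3 × 3 convolutional kernel shared across all locations:

```python
import torch.nn as nn

# Fully connected: every one of the 16 inputs connects to every one of the
# 16 outputs, so no weights are shared across locations.
dense = nn.Linear(16, 16)

# Convolutional: one 3x3 kernel is reused at every location in the image.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

n_dense = sum(p.numel() for p in dense.parameters())  # 16*16 + 16 = 272
n_conv = sum(p.numel() for p in conv.parameters())    # 3*3 + 1    = 10
print(n_dense, n_conv)
```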
Here’s what a complete Convolutional Neural Network looks like:
A complete Convolutional Neural Network with alternating convolution and pooling layers.
What does a pooling layer do?
Max-pooling takes the largest value covered by the filter over the feature map.
As shown in the GIF above, pooling reduces the size of the feature map obtained after a convolution operation. Max-pooling outputs the maximum value overlapped by the kernel at each position.
Average pooling takes the average of all values overlapped by the kernel.
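For illustration, here is a minimal PyTorch sketch of both pooling operations on a toy 4 × 4 feature map (the values are made up):

```python
import torch
import torch.nn.functional as F

# A toy feature map of shape (batch, channels, height, width) = (1, 1, 4, 4)
fmap = torch.tensor([[[[1., 3., 2., 0.],
                       [5., 6., 1., 2.],
                       [0., 2., 4., 3.],
                       [1., 1., 0., 7.]]]])

# 2x2 pooling with stride 2 halves each spatial dimension
print(F.max_pool2d(fmap, kernel_size=2))  # [[6., 2.], [2., 7.]]
print(F.avg_pool2d(fmap, kernel_size=2))  # [[3.75, 1.25], [1.0, 3.5]]
```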
How do we train a convolutional neural network?
Convolution ensures sparse connections and fewer parameters.
A CNN can be trained as a regular feedforward neural network, wherein only a few weights are active (in colour).
The rest of the weights (in gray) are zero, and the final outcome is a neural network consisting of sparse connections.
Thus, we can train a convolutional neural network using backpropagation by thinking of it as a feedforward neural network with sparse connections.
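As a minimal sketch (the model size, optimiser settings, and data below are placeholders, not a prescription), training a CNN in PyTorch looks exactly like training any other feedforward network:

```python
import torch
import torch.nn as nn

# A tiny CNN: alternating conv + pool layers followed by a linear classifier
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),            # e.g. 28x28 inputs, 10 classes
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch (random data, for illustration only)
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad()
loss.backward()        # backpropagation through convolutions and pooling
optimizer.step()
```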
Visualizing patches which maximally activate a neuron
Tracing back to aspects of an image which stand out
- Consider some neurons in a given layer of a CNN.
- We can feed in images to this CNN and identify the images which cause these neurons to fire.
- We can then trace back to the patch in the image which causes these neurons to fire (a code sketch of this procedure follows the list below).
- In an experiment conducted in 2014, scientists considered neurons in the pool5 layer and found patches which caused the neurons to fire.
- One neuron fired for people’s faces
- One neuron fired for dog snouts
- Another fired for flowers, and so on.
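One way to implement this tracing (a sketch only; `model`, `layer`, and `dataset` are hypothetical placeholders) is to register a forward hook on the layer of interest, record its activations for every image, and keep the images that activate a chosen neuron most strongly:

```python
import torch

def top_activating_images(model, layer, dataset, neuron_index, k=5):
    """Return the k images in `dataset` that most strongly activate one neuron."""
    activations = []

    def hook(module, inputs, output):
        # Record the response of the chosen neuron (averaged spatially if the
        # layer output still has height/width dimensions).
        act = output[:, neuron_index]
        activations.append(act.flatten(1).mean(dim=1) if act.dim() > 1 else act)

    handle = layer.register_forward_hook(hook)
    images = torch.stack([img for img, _ in dataset])
    with torch.no_grad():
        model(images)
    handle.remove()

    scores = torch.cat(activations)
    top = scores.topk(k).indices          # indices of the k strongest responses
    return images[top]
```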
So how do we visualize filters in the first place?
Recall that we’d done something similar with autoencoders. We’re interested in finding an input which maximally excites a neuron.
It turns out that the input which maximally activates a neuron is the normalized version of the filter, obtained by solving the following optimization problem:
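Written out (this is the standard formulation, with w denoting the filter weights and x the input patch):

```latex
\max_{x} \; w^{T}x \quad \text{subject to} \quad \|x\|^{2} = x^{T}x = 1
\qquad \Longrightarrow \qquad
x^{*} = \frac{w}{\sqrt{w^{T}w}} = \frac{w}{\|w\|}
```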
The denominator of the solution is the norm of ‘w’.
As mentioned earlier, we think of CNNs as feed-forward neural networks with sparse connections and weight sharing. Hence, the solution is the same here as well, since the parameter weights are nothing but the filters.
Thus, filters can be thought of as pattern detectors.
- Typically, we’re interested in understanding which portions of the image are responsible for maximizing the probability of a certain class.
- We could occlude (gray out) different patches in the image and see the effect on the predicted probability of the correct class.
- For example, these heatmaps show that occluding the main features of an image results in a huge drop in prediction probability.
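A minimal sketch of such an occlusion experiment (the model, patch size, stride, and gray fill value are arbitrary placeholder choices):

```python
import torch

def occlusion_map(model, image, target_class, patch=16, stride=8, fill=0.5):
    """Slide a gray patch over the image and record the drop in class probability."""
    _, h, w = image.shape
    heatmap = torch.zeros((h - patch) // stride + 1, (w - patch) // stride + 1)
    with torch.no_grad():
        base = model(image.unsqueeze(0)).softmax(dim=1)[0, target_class]
        for i, y in enumerate(range(0, h - patch + 1, stride)):
            for j, x in enumerate(range(0, w - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill  # gray out a patch
                prob = model(occluded.unsqueeze(0)).softmax(dim=1)[0, target_class]
                heatmap[i, j] = base - prob   # large drop => important region
    return heatmap
```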
So how can we gauge the influence of input pixels?
- We can think of images as grids of (m x n) pixels.
- We’re interested in finding the influence of each of these inputs (xi) on a given neuron (hj).
- In other words, we turn to gradients to understand the extent of dependency on certain input pixels.
- We could compute the partial derivatives of a neuron’s activation at a middle layer w.r.t. the input pixels and visualize the resulting gradient matrix as an image.
The gradient matrix highlights the pixels which have the greatest influence on the classification.
Hence, we can conclude:
Gradients can be computed by representing CNNs as feedforward neural networks and using backpropagation.
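A corresponding sketch (again with placeholder `model` and `image` names): compute the gradient of the class score with respect to the input pixels and visualize its magnitude as a saliency map.

```python
import torch

def saliency(model, image, target_class):
    """Gradient of the class score w.r.t. input pixels (a simple saliency map)."""
    image = image.clone().unsqueeze(0).requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()                         # backpropagate down to the input
    # Take the largest absolute gradient across colour channels for each pixel
    return image.grad.abs().squeeze(0).max(dim=0).values
```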
Computer Vision is a field that thrives on the use of Convolutional Neural Networks. There are many more concepts, such as Guided Backpropagation, Deep Art, deepfakes, and DeepDream among many others, which are worth surfing the web for.
This issue marks the end of the Deep Learning month from FACE. We’ll be back next week with a new issue on a new domain. Stay tuned!