I’m sure many of us are curious about the mathematics behind such algorithms: how exactly does mathematics factor into them, and how can manipulating mathematical systems produce results as striking as detecting COVID-19?
Although the mathematical terms surrounding deep learning, such as “gradient descent”, “backpropagation”, “matrix multiplication” and so on may sound pretty intimidating at first glance, fret not! We will break down these terms by first explaining the goal of every deep learning task and introducing concepts including weight initialization, activation functions, minimizing loss through gradient descent and backpropagation to allow the neural network to learn.
We will then cover the details of Convolutional Neural Networks (CNNs), delving into the individual layers (the convolution layer, the pooling layer and the fully connected layer) and tying them all together to ultimately arrive at a successful COVID-19 diagnosis.
If the mathematical notation scares you, don’t worry. You don’t need to understand what every single line means; just follow along with the text and hopefully, by the end of this post, you will have a better idea of how exactly these seemingly disparate mathematical concepts come together to drive deep learning and enable the detection of COVID-19!
The goal of deep learning is to approximate some function f* where y=f*(X) maps an input X to a category y. We typically define the input as a vector X ∈ ℝⁿ where each entry xᵢ of the vector represents a particular feature. For example, the features of an image are usually the values of the pixels in the image. During training, the neural network defines a mapping y = f(X; W) and learns the value of the parameters W that results in the best approximation of f*. The output y will be either a 0 or a 1, representing whether the patient is COVID-19 negative (0) or positive (1), y ∈ {0, 1}.
In order to learn this mapping, we first propagate information through a neuron, the fundamental building block of neural networks. Each of the inputs x₁, x₂, … xₙ is paired with a corresponding weight w₁, w₂, … wₙ, where W ∈ ℝⁿ. The dot product of X and W is then passed through a nonlinear activation function, g, that produces the predicted output ŷ. In addition to the weighted inputs, there is also a bias term w₀, which allows us to shift the activation function to the left or to the right regardless of the inputs.
Putting this together, the output of a single neuron is:

ŷ = g(w₀ + XᵀW)

An equivalent representation in summation form would be:

ŷ = g(w₀ + Σᵢ₌₁ⁿ xᵢwᵢ)
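To make this concrete, here is a minimal sketch of a single neuron in Python; the feature values, weights and bias are made up purely for illustration, and the sigmoid used as the activation g is introduced in the next section:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation function g (covered in the next section)."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.2, 0.8, 0.5])   # hypothetical feature vector, X ∈ R^3
W = np.array([0.1, -0.4, 0.7])  # randomly chosen weights for illustration
w0 = 0.05                       # bias term

z = w0 + np.dot(X, W)           # weighted sum w0 + X^T W
y_hat = sigmoid(z)              # nonlinear activation g
print(y_hat)                    # predicted output between 0 and 1
```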
The goal of deep learning is therefore to find the optimal set of weights W that will yield the mapping y = f(X; W) that best approximates f*.
Activation Function
We apply a non-linear activation function because the weighted sum w₀ + XᵀW is not bounded between 0 and 1, whereas our prediction must be. We therefore use the sigmoid function, g(z) = 1/(1 + e⁻ᶻ), which transforms any real number into a scalar output between 0 and 1. Following this scaling, the prediction is defined as:

ŷ = g(w₀ + XᵀW) = 1/(1 + e^−(w₀ + XᵀW))
Another purpose of the activation function is to introduce non-linearities into the network. This is important because, in real life, most of the data we encounter are non-linear, i.e. the classes are not separable by a straight line. An activation function therefore allows a non-linear mapping between X and y that better approximates f*.
Some commonly used activation functions include the sigmoid, tanh and ReLU functions, all of which serve to introduce non-linearities into the network.
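As a quick, minimal sketch (the sample pre-activation values below are made up), here is how these three functions transform the same inputs in Python:

```python
import numpy as np

# Made-up pre-activation values to compare the three activation functions.
z = np.array([-2.0, 0.0, 2.0])

sigmoid = 1.0 / (1.0 + np.exp(-z))  # squashes values into (0, 1)
tanh = np.tanh(z)                   # squashes values into (-1, 1)
relu = np.maximum(0.0, z)           # zeroes out negative values

print(sigmoid, tanh, relu)
```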
Minimizing Loss
After calculating ŷ for a particular input x⁽ᶦ⁾ and the current weights W, how do we find the optimal mapping f? This is where the “learning” part comes in. Firstly, we have to tell the network when it is wrong. This is done by quantifying the error, known as the loss, by taking the prediction ŷ and comparing it to the true answer, f*(X). The empirical loss, J(W), measures the average loss over the entire dataset consisting of m samples:

J(W) = (1/m) Σᵢ₌₁ᵐ L(f(x⁽ᶦ⁾; W), y⁽ᶦ⁾)

where L measures how far the prediction f(x⁽ᶦ⁾; W) is from the true label y⁽ᶦ⁾ of the i-th sample.
The goal of “learning” is therefore to find the optimal weights W* that will yield the smallest loss and hence the closest approximation of f*.
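For a binary diagnosis like ours, one common choice for the per-sample loss L is the binary cross-entropy; this is an assumption on my part, since any suitable loss works. A minimal sketch with made-up labels and predictions:

```python
import numpy as np

def empirical_loss(y_true, y_pred, eps=1e-12):
    """Empirical loss J(W): binary cross-entropy averaged over all m samples."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical labels (0 = negative, 1 = positive) and network predictions.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.6, 0.4])
print(empirical_loss(y_true, y_pred))
```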
Gradient Descent
At the start of training, we first initialize the weights randomly. In order to move towards the smallest loss, we calculate the gradient of the loss with respect to each of the weights to understand the direction of maximum ascent. With this information, we then take a step in the opposite direction of the gradient to move towards the point with a lower loss. We then repeat this process until we converge to a local minimum.
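In symbols, each step updates every weight against its gradient, W ← W − η ∇J(W), where η is the learning rate discussed below. Here is a minimal sketch of this update rule on a made-up one-dimensional loss:

```python
import numpy as np

# Toy loss J(w) = (w - 3)^2 with a single weight; its gradient is 2 * (w - 3).
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w = np.random.randn()   # initialize the weight randomly
eta = 0.1               # learning rate

for step in range(100):
    w = w - eta * gradient(w)   # step in the opposite direction of the gradient

print(w, loss(w))   # w converges towards the minimum at w = 3
```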
In practice, a neural network comprises more than one neuron. It is the composition of multiple neurons that makes the neural network so powerful.
So how can we compute the gradient through multiple neurons? Through backpropagation, which uses the chain rule!
Backpropagation
Take, for instance, a neural network with two neurons arranged sequentially: the input x is connected to a hidden neuron with output z₁ through the weight w₁, and z₁ is connected to the output ŷ through the weight w₂.
If we change w₂ slightly, it will change the output ŷ. In order to calculate the gradient, we then apply the chain rule as such:

∂J(W)/∂w₂ = ∂J(W)/∂ŷ × ∂ŷ/∂w₂
If we were to change w₁ instead, then we would perform the following:

∂J(W)/∂w₁ = ∂J(W)/∂ŷ × ∂ŷ/∂z₁ × ∂z₁/∂w₁
For deeper neural networks, we simply keep applying the chain rule from the output back to the input to compute the gradient and quantify how a change in each weight affects the loss.
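Here is a minimal sketch of one forward and backward pass through the two-neuron example above; the input, label, weights and the squared-error loss are all made up purely to show the mechanics of the chain rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up values for the two-neuron chain x -> z1 -> y_hat.
x, y = 0.5, 1.0          # input and true label
w1, w2 = 0.3, -0.2       # randomly chosen weights

# Forward pass.
z1 = sigmoid(w1 * x)
y_hat = sigmoid(w2 * z1)
loss = 0.5 * (y_hat - y) ** 2   # a simple squared-error loss for illustration

# Backward pass (chain rule), from the output back to the input.
dL_dyhat = y_hat - y
dyhat_dw2 = y_hat * (1 - y_hat) * z1
dyhat_dz1 = y_hat * (1 - y_hat) * w2
dz1_dw1 = z1 * (1 - z1) * x

dL_dw2 = dL_dyhat * dyhat_dw2
dL_dw1 = dL_dyhat * dyhat_dz1 * dz1_dw1
print(dL_dw1, dL_dw2)
```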
After computing the gradient, how do we decide how big a step to take in the opposite direction of the gradient? That is determined by the learning rate η. Setting the learning rate is crucial in training neural networks: a small learning rate converges extremely slowly and might get stuck in false local minima, while a large learning rate could overshoot and never converge. Therefore, we often do not use a fixed learning rate but instead an adaptive learning rate that “adapts” to the loss landscape depending on factors such as how fast the learning is happening, the size of particular weights, and so on.
We have now completed our overview of how training happens in neural networks. Next, let us move our focus to how exactly we can use deep learning to diagnose COVID-19 from a chest X-ray scan.
When dealing with image data like X-ray scans, it works better to treat the input X as a matrix of size b × h, because this is how an image is represented to a computer: as a two-dimensional matrix of numbers. If we were to pass in X as a one-dimensional vector of pixel values, we would lose all the spatial information available in the original picture. Thus, we tweak the original architecture of the neural network into a convolutional neural network (CNN), which preserves the spatial structure of the input. This is done by connecting patches of the input image to each neuron, such that each neuron is only connected to one region of the input. The input image passes through several convolutional layers, each consisting of a convolution operation, followed by an activation function and lastly a pooling operation, all of which are detailed below.
Convolution Layer
Firstly, a kernel is slid across the input image of size b × h. Kernels are matrices of size bₖ × hₖ (where bₖ < b and hₖ < h) which contain the weights W, so a kernel of size bₖ × hₖ has bₖ × hₖ different weights. We apply this same kernel to each bₖ × hₖ patch of the input, starting from the top-left corner and moving to the next patch through a sliding window.
The kernel extracts features from the image which can inform the network about the diagnosis of the patient. These features are extracted through the convolution operation, which involves an element-wise multiplication of every weight in the kernel with the corresponding pixel in the bₖ × hₖ patch of the input image. At each position of the sliding window, we then sum up the resulting values to obtain one entry of the output. The image that has undergone a convolution is known as the feature map.
Different kernels contain different sets of weights, so with multiple kernels, different features can be extracted from the image. If k kernels were applied to the image, then the resulting feature map would be of size b × h × k. These features can include sharp edges, curves, texture and so on. What makes neural networks so powerful is that these features are not hard-coded by humans. Instead, through the training process of backpropagation to reduce the loss, the neural network finds the optimal weights for each kernel and therefore extracts the features that are the most important for diagnosing the patient.
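Here is a minimal sketch of the convolution operation described above; the 5×5 image values are random and the 3×3 kernel is a hand-picked edge detector purely for illustration (in a trained CNN, the kernel weights are learned):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) convolution: slide the kernel over every patch,
    multiply element-wise and sum to produce the feature map."""
    b, h = image.shape
    bk, hk = kernel.shape
    out = np.zeros((b - bk + 1, h - hk + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + bk, j:j + hk]
            out[i, j] = np.sum(patch * kernel)
    return out

# Made-up 5x5 "image" and a 3x3 vertical-edge kernel for illustration.
image = np.random.rand(5, 5)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
print(convolve2d(image, kernel))
```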
Pooling Layer
After the convolution operation, the feature map is then passed through an activation function. With the non-linear output, the feature map then undergoes pooling, which reduces the size of the image. The motivation behind this is to enable the neural network to learn features that are invariant to small translations of the input. This is important because not all X-ray scans may be taken in the exact same orientation. Small differences in the position of the patient while the X-ray is taken, or even small variations between different scanners, may result in slightly different scans. Pooling therefore allows the neural network to be invariant to these tiny differences, making it applicable to a larger range of scans. This is done by preserving only the maximum value in each particular patch of the image, known as maxpooling. Take, for instance, an input image of size 4×4: in each 2×2 patch of the image, the largest value is kept and constitutes the new feature map, as shown below:
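Here is a minimal sketch of that 2×2 maxpooling in Python; the 4×4 values are made up purely for illustration:

```python
import numpy as np

# A made-up 4x4 feature map to be maxpooled with 2x2 patches.
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 2],
                        [7, 2, 9, 1],
                        [3, 1, 4, 8]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        patch = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled[i, j] = patch.max()   # keep only the largest value in each patch

print(pooled)   # [[6, 5], [7, 9]]
```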
This allows for invariance because if the image were to be slightly translated, the maximum value will remain approximately constant.
For instance, in the image below, despite all the pixels being shifted by one unit to the right, such that the new image differs from the initial image by four values, the feature map after maxpooling differs from the original by only one value (italicized).
Fully Connected Layer
After maxpooling is performed, the last stage is to connect the feature map to a fully connected layer consisting of multiple neurons. The convolutional layers (convolution, activation, pooling) provide meaningful, invariant features extracted from the image, and the final fully connected layer learns a non-linear function that maps these extracted features to y, the diagnosis of the patient. The final feature map after multiple convolutional layers, say of size b₁ × h₁ × k, is flattened into a 1D feature vector I of length b₁ × h₁ × k, similar to the X ∈ ℝⁿ we encountered earlier. I has its own set of corresponding weights W, and we compute the dot product of I and W, followed by a nonlinear activation function, to produce the final prediction ŷ.
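Here is a minimal sketch of this final step; the feature map size (b₁ = 4, h₁ = 4, k = 8) and the weights are made up, since in practice both are determined by the architecture and by training:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical final feature map produced by the convolutional layers.
b1, h1, k = 4, 4, 8
feature_map = np.random.rand(b1, h1, k)

I = feature_map.flatten()           # 1D feature vector of length b1 * h1 * k
W = np.random.randn(I.size)         # corresponding weights (learned in practice)
w0 = 0.0                            # bias term

y_hat = sigmoid(w0 + np.dot(I, W))  # final prediction between 0 and 1
print(y_hat)
```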
With this modified architecture, the network then learns via gradient descent, and through that process we will hopefully obtain a model that classifies X-ray scans with high accuracy.
So there you have it! I’ve provided you with a high-level overview of the mathematics behind these COVID-19 classification systems. Hopefully you’ll have learnt a bit and will go on to explore these concepts in greater depth!
In the meantime, let’s all stay safe and healthy~