The Essence of Deep Learning

The process of adjusting weights, which is what allows neural networks to learn, is simple yet powerful. Consider the diagram below:

Image supplied by the author: Weights and Biases in a Neural Network.

Each neuron has a number inside it, called the bias. The bias can be any real number, positive or negative, and the higher it is, the more readily the neuron activates.

The activation of a neuron is influenced by this bias, as well as by the inputs coming in. Each input is multiplied by its own weight. The larger the weight, the more that input influences the result.

The resulting value is then passed through an activation function (which we cover later) that produces the final output from the neuron.

So, for each neuron, the following calculation is performed:

Output = activation function (bias + sum of each weight * its corresponding input)
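To make the calculation concrete, here is a minimal Python sketch of a single neuron. The function names (`sigmoid`, `neuron_output`) and all the numbers are made up for illustration, and the sigmoid is just one possible activation function (it is described later in this article).

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Bias plus the weighted sum of the inputs, passed through the activation."""
    weighted_sum = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical values chosen only for illustration.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]
bias = 0.1

print(neuron_output(inputs, weights, bias))
```

Here the weighted sum is 0.1 + 0.4 - 0.2 - 1.0 = -0.7, and the sigmoid turns that into roughly 0.33.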

The diagram below shows a complete neural network:

Image supplied by the author: Weights and Biases in a Neural Network.

This is a small, fully-connected neural network of four layers. The input layer is where we feed in our data, in the form of examples from which the neural network learns.

The output layer is where we get the network’s prediction of the target value, that is, of whatever the neural network is trying to predict (or learn) given the input. All layers in between are called hidden layers.

All these concepts are represented mathematically in a neural network, primarily with matrices from linear algebra. Briefly, the input is represented as a matrix of numbers, and it is multiplied and transformed, producing the output.

In essence, a neural network is just a series of matrix operations and activation functions.
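As a rough sketch of that idea, the snippet below pushes an input through two layers using plain Python lists in place of matrices. The layer sizes, weights, and helper names are all hypothetical, and the hidden layer uses ReLU (clip negatives to zero, described later in this article) as its activation. Real libraries do the same thing with optimized matrix routines.

```python
def relu(z):
    """Keep positive values, send negative values to 0."""
    return max(0.0, z)

def identity(z):
    """No activation: pass the value through unchanged."""
    return z

def layer_forward(inputs, weight_matrix, biases, activation):
    """One layer: a matrix-vector product plus biases, then the activation."""
    outputs = []
    for row, b in zip(weight_matrix, biases):
        z = b + sum(w * x for w, x in zip(row, inputs))
        outputs.append(activation(z))
    return outputs

def network_forward(inputs, layers):
    """Feed the output of each layer into the next one."""
    activations = inputs
    for weight_matrix, biases, activation in layers:
        activations = layer_forward(activations, weight_matrix, biases, activation)
    return activations

# Hypothetical tiny network: 2 inputs -> 3 hidden neurons -> 1 output.
hidden = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 0.0], relu)
output = ([[1.0, 1.0, 1.0]], [0.0], identity)

prediction = network_forward([2.0, 1.0], [hidden, output])
print(prediction)  # [2.5]
```

The hidden layer produces 1.0, 1.5 and 0.0 (the third neuron's raw value is exactly 0), and the output neuron simply adds them up.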

The math that powers this is linear algebra, which is very common in computer science, and a topic we will cover in the math behind deep learning.

Next, let’s understand activation functions because they allow us to overcome a key challenge.

Let’s say we are trying to encode a relationship between two variables, X and Y. For example, how does height (X) affect weight (Y)? Linear functions, however fancy, can only represent linear relationships. That means they do poorly in the situation below:

Image supplied by the author: linear relationship vs non-linear relationship

In this diagram, we see that the curve does much better than the line at representing the relationship between X and Y. Activation functions allow neural networks to encode non-linear relationships, such as that curve.
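One way to see why the activation function matters: stacking linear steps on top of each other never gets us beyond a straight line. The small sketch below, with made-up weights and function names, checks that two linear layers behave exactly like one, and that putting a simple non-linear step between them (here, clipping negatives to zero) breaks that equivalence.

```python
def linear_layer(x, weight, bias):
    """A one-input, one-output linear step."""
    return weight * x + bias

def two_linear_layers(x):
    hidden = linear_layer(x, 2.0, 1.0)
    return linear_layer(hidden, 3.0, -4.0)

def single_equivalent_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1: still a straight line.
    return linear_layer(x, 6.0, -1.0)

def two_layers_with_bend(x):
    # Same weights, but negatives are clipped to zero between the layers.
    hidden = max(0.0, linear_layer(x, 2.0, 1.0))
    return linear_layer(hidden, 3.0, -4.0)

for x in [-2.0, 0.0, 1.5]:
    print(x, two_linear_layers(x), single_equivalent_layer(x), two_layers_with_bend(x))
```

For x = -2.0 the two purely linear versions agree on -13.0, while the version with the bend gives -4.0, so it can no longer be replaced by a single line.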

In neural networks, there is a linear operation in each layer (transforming the input into the output), and then a non-linear activation function is applied which renders the output non-linear. Two examples follow:

Image supplied by the author: Two common activation functions

One of the most common activation functions is the Rectified Linear Unit, or ReLU. It leaves the input untouched unless it is negative. If the input is negative, it sends it to 0. This is extremely simple yet surprisingly effective.

Another common activation function is the Sigmoid, which squishes the input to between 0 and 1: large negative values map close to 0, large positive values map close to 1, and values near zero land around the middle (an input of 0 gives exactly 0.5).
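Both functions fit in a couple of lines. The sketch below uses only Python's standard `math` module; the sample inputs are arbitrary.

```python
import math

def relu(z):
    """Leave positive inputs untouched, send negative inputs to 0."""
    return z if z > 0 else 0.0

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"z={z:+.1f}  relu={relu(z):.3f}  sigmoid={sigmoid(z):.3f}")
```

This simple form of the sigmoid is fine for small inputs like these. For very large negative inputs `math.exp` can overflow, which is why numerical libraries use a more careful formulation.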

Now, we have all the components that make up a neural network. But how do they actually learn? How do we train them?

The goal is a network whose outputs closely match the desired, or true, outputs across a range of inputs. We move towards this optimized network with backpropagation.

This involves some calculus, but don’t worry, we will keep it very simple.

Backpropagation

In the simplest setup, neural networks are trained one example at a time (in practice, examples are often grouped into small batches). For each input example, the network produces an output. Based on this output, the network is given feedback indicating how well it performed. This helps it learn.

This general idea is applied in various kinds of neural network training, but we will focus on supervised learning, which is the most common and simplest form.

Supervised learning involves data with labels. This means that for each example, we know the correct answer. A common example is identifying the species of an animal from a given image.

Now suppose we have 10 animal species in total. For each species, the neural network will make a prediction between 0 and 1, indicating how likely it thinks it is that the input belongs to an animal from that species.

Suppose it says that it is 60% likely to be a cat, and 40% a dog, and that cat is the correct answer. Then, all the weights that contributed to the 60% likelihood of cat get strengthened, whereas all the weights that contributed to the 40% likelihood of a dog get weakened. And since all the other predictions are correctly 0%, the weights behind them need little or no change.

In addition, the weights get altered in proportion to how much they contributed. Suppose one neuron’s output heavily contributed to the dog prediction. Then the weight going from that neuron into the neuron that listened to it will be significantly reduced.
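The sketch below illustrates this proportionality for the dog output neuron only. All values are hypothetical, including the `learning_rate`, and the update is deliberately simplified: it ignores the slope of the activation function, which full backpropagation also takes into account.

```python
# The network said 40% dog, but the correct answer for dog is 0%.
predicted_dog = 0.4
desired_dog = 0.0
error = predicted_dog - desired_dog   # positive: the network over-predicted "dog"

# Outputs of three hidden neurons feeding the dog neuron (made-up values).
hidden_outputs = [0.9, 0.1, 0.0]
learning_rate = 0.5                   # assumed step size

# Each weight's adjustment is proportional to the error and to the
# output of the neuron that the weight listens to.
adjustments = [-learning_rate * error * h for h in hidden_outputs]

for h, delta in zip(hidden_outputs, adjustments):
    print(f"hidden output {h:.1f} -> weight change {delta:+.3f}")
```

The neuron that shouted loudest (0.9) has its weight reduced the most, the quiet one (0.1) only slightly, and the silent one (0.0) not at all.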

More formally, we first calculate the distance (difference) between the given and desired output, known as the error. Then, we go back one layer at a time, and penalize or reward, proportionally, the neurons that contributed to or mitigated that distance.

This is called backpropagation.

Because neural networks are non-linear, backpropagation isn’t a simple linear calculation. This is where we need calculus: derivatives tell us in which direction, and by how much, to adjust each weight.
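As a hedged illustration of that calculus, here is backpropagation written out by hand for about the smallest network possible: one input, one sigmoid hidden neuron, and one linear output neuron, with half the squared error as the measure of distance. The structure, names, and numbers are all assumptions made for this sketch. The last few lines compare the result against a brute-force slope estimate (nudge a weight slightly and see how the error changes).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """hidden = sigmoid(w1 * x), output = w2 * hidden."""
    hidden = sigmoid(w1 * x)
    output = w2 * hidden
    return hidden, output

def loss(x, target, w1, w2):
    """Half the squared distance between the output and the target."""
    _, output = forward(x, w1, w2)
    return 0.5 * (output - target) ** 2

def backward(x, target, w1, w2):
    """Apply the chain rule from the output back to each weight."""
    hidden, output = forward(x, w1, w2)
    d_output = output - target                          # how the loss changes with the output
    grad_w2 = d_output * hidden                         # blame for the output weight
    d_hidden = d_output * w2                            # pass the error back one layer
    grad_w1 = d_hidden * hidden * (1.0 - hidden) * x    # through the sigmoid's slope
    return grad_w1, grad_w2

# Hypothetical example and weights.
x, target, w1, w2 = 1.5, 1.0, 0.4, -0.3
grad_w1, grad_w2 = backward(x, target, w1, w2)

# Brute-force check: nudge each weight and measure the change in loss.
eps = 1e-6
numeric_w1 = (loss(x, target, w1 + eps, w2) - loss(x, target, w1 - eps, w2)) / (2 * eps)
numeric_w2 = (loss(x, target, w1, w2 + eps) - loss(x, target, w1, w2 - eps)) / (2 * eps)

print(grad_w1, numeric_w1)
print(grad_w2, numeric_w2)
```

The two numbers on each line should agree to several decimal places, which is a handy sanity check whenever gradients are derived by hand.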

Gradient Descent

One broad approach is called gradient descent. It is easiest to picture as its mirror image, hill climbing, because walking downhill on a landscape of error is the same as walking uphill on a landscape of accuracy. Imagine an uneven landscape where you are trying to find the highest point (‘hill’). Imagine also that you are blindfolded.

The strategy is to check every direction and walk in the direction with the steepest upward slope. You keep doing this until every direction points downwards. In other words, you have climbed the hill.

However, since you are blindfolded, you don’t know if this is the highest hill or just a small one. To gauge that, you may want to record this height and its coordinates (neural networks can do this by assessing accuracy) and then walk some more.

After enough exploration, the highest hill you found becomes the trained neural network. There are algorithms for deciding how to explore this landscape, and the point at which we stop can vary too.

Putting these together, we have a formula for training neural networks: give it an example, measure its error, tell it in which direction the hill lies, climb, try again. After enough iterations, we stop, concluding that our network is trained.
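The blindfolded walk can be sketched in a few lines. The landscape below (`height`) is invented for this example and has one dimension only, with a small hill near x = -1 and a taller one near x = 1. The step size and starting points are likewise assumptions.

```python
def height(x):
    """A made-up landscape: a small hill near x = -1, a taller one near x = 1."""
    return -(x * x - 1.0) ** 2 + 0.3 * x

def climb(start, step=0.01, max_steps=10_000):
    """Walk uphill in small steps until neither neighbour is higher."""
    x = start
    for _ in range(max_steps):
        here, left, right = height(x), height(x - step), height(x + step)
        if left <= here and right <= here:
            break                      # every direction points downwards
        x = x - step if left > right else x + step
    return x

# Explore from two different starting points and remember the best hill found.
starts = [-2.0, 0.5]
peaks = [climb(s) for s in starts]
best = max(peaks, key=height)

for s, p in zip(starts, peaks):
    print(f"start {s:+.1f} -> peak near x={p:+.2f}, height {height(p):.3f}")
print(f"best hill found near x={best:+.2f}")
```

Starting from -2.0, the walker gets stuck on the small hill. Only the second start finds the taller one, which is why recording heights and exploring further pays off.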
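Here is that loop as a minimal sketch, using the simplest possible "network": a single neuron with one weight, one bias, and no activation function. The data, learning rate, and number of passes are all assumed values. Note that the code steps downhill on the error, which is the mirror image of climbing the hill.

```python
# Hypothetical training data generated from y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weight, bias = 0.0, 0.0      # start from an arbitrary guess
learning_rate = 0.05         # step size (an assumed value)

for epoch in range(500):                         # try again, many times
    for x, target in examples:                   # give it an example
        prediction = weight * x + bias
        error = prediction - target              # measure its error
        grad_weight = error * x                  # slope of half the squared error
        grad_bias = error                        # with respect to each parameter
        weight -= learning_rate * grad_weight    # step downhill on the error
        bias -= learning_rate * grad_bias

print(round(weight, 3), round(bias, 3))
```

After training, the weight and bias should end up very close to 2 and 1, the values used to generate the data.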

We have now prepared all the ingredients to finish making our recipe. Let’s bake!

Neural networks, fundamentally, are a bunch of mathematical transformations that take an input and produce an output. These transformations are arranged into a structure called an architecture.

The task of machine learning is to build the correct architecture for these neural networks and tune various levers so that they are effective at transforming the input into the desired output across a range of inputs.

Several key ingredients are needed. Neural networks themselves are structured like graphs inspired by the brain (with nodes and edges), which is really a convenient way of representing matrix operations. These matrix operations are linear.

Then, in each layer, we need non-linearity, which is added with an activation function. So, now we have a neural network, and its architecture, including the activation functions.

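Next, we need to adjust its weights and biases to create the desired kind of network, which consistently gives us good results. This occurs through backpropagation, usually paired with gradient descent: backpropagation works out how much each weight contributed to the error, and gradient descent uses that information to update the weights.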

So, these are the key ingredients behind a neural network, presented without any math. And don’t worry, we will cover the math, and do it painlessly.

Read more Deep Learning articles at https://deeplearningdemystified.com
