The job of finding the best set of weights is carried out by the optimiser. In neural networks, the most common optimisation method is stochastic gradient descent (or one of its variants).
In each epoch (a full pass over the training data), the stochastic gradient descent algorithm repeats a set of steps in order to find the best weights:
- Start with some initial value for the weights
- Repeatedly update the weights in the direction that reduces the cost function
- Stop when we have reached the minimum error on our dataset
Gradient descent requires a differentiable cost function, because to find the minimum value we calculate the gradient at our current position and then decide which direction to move in to reach a gradient of 0. For a convex error function, the point at which the gradient is equal to 0 is the minimum point on the curve, as the diagrams below show.
The update we iterate over, step 2 of our gradient descent algorithm, takes each current weight and subtracts from it the derivative of the cost function with respect to that weight, multiplied by what is called a learning rate. The size of the learning rate determines how quickly we converge to the minimum value, or whether we overshoot and diverge from it. I explain the process of gradient descent in greater detail in my article on Linear Regression.
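As a minimal sketch of this update rule, here is plain gradient descent on a made-up one-dimensional cost function (the cost function, starting weight and learning rate are all illustrative assumptions, not values from a real network):

```python
# Hypothetical cost function J(w) = (w - 3)^2, whose derivative is 2(w - 3);
# the minimum sits at w = 3, where the gradient equals 0.
def cost(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)

w = 10.0             # step 1: start with some initial value for the weight
learning_rate = 0.1  # controls how quickly we converge (or diverge, if too large)

for epoch in range(50):
    # step 2: new weight = current weight - learning rate * gradient
    w = w - learning_rate * gradient(w)

print(w, cost(w))    # w ends up very close to 3, the minimum of the cost curve
```

With a learning rate of, say, 1.5 instead of 0.1, each update would overshoot the minimum by more than it corrects, and w would diverge rather than converge.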
Over- and Underfitting
Overfitting and underfitting are two of the most important concepts in machine learning, because they give you an idea of whether your model is capable of fulfilling its true purpose: being unleashed on the world and making predictions on new, unseen data.
In practical terms, overfitting is the situation where the accuracy on your training data is significantly higher than the accuracy on your testing data. Underfitting is generally defined as poor performance on both the training and testing data.
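As a rough sketch of how you might check for this in practice (the accuracy values and both thresholds below are arbitrary assumptions, not standard cut-offs):

```python
def diagnose(train_acc, test_acc, gap_threshold=0.10, low_threshold=0.70):
    """Crude rule of thumb: a large train/test gap suggests overfitting,
    while low accuracy on both sides suggests underfitting."""
    if train_acc - test_acc > gap_threshold:
        return "likely overfitting"
    if train_acc < low_threshold and test_acc < low_threshold:
        return "likely underfitting"
    return "looks reasonable"

print(diagnose(train_acc=0.98, test_acc=0.75))  # likely overfitting
print(diagnose(train_acc=0.60, test_acc=0.58))  # likely underfitting
```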
So what do these two actually tell us about our model? In the case of overfitting, we can infer that our model does not generalise well to unseen data. Instead of finding the complex, sophisticated relationships we are looking for, it has built a rigid framework around the observed behaviour, taking the training data as gospel. Such a model has little predictive power, because it has attached itself too strongly to the initial data it was provided with, instead of trying to generalise and adapt to slightly different datasets.
In the case of underfitting, we find the opposite, that our model has not attached itself to the data at all. Similar to before, the model has been unable to find strong relationships, but in this case, it has generated loose rules to provide crude estimations of the data, rather than anything concrete. An underfit model will therefore also perform poorly on training data because of its lack of understanding of the relationships between the variables.
Avoiding underfitting is generally more straightforward than avoiding overfitting, because an underfit model is usually one that simply isn't complex enough. We can address it by adding layers, neurons or features to our model, or by increasing the training time.
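For example, with a Keras-style model (the layer sizes, input shape and epoch counts here are purely illustrative assumptions), increasing the model's capacity might look like this:

```python
import tensorflow as tf

# An underfit model might be as small as a single, narrow layer.
small_model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),   # 20 input features, an arbitrary assumption
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Adding layers and neurons gives the model more capacity to learn
# complex relationships between the features and the output.
larger_model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Increasing the training time simply means training for more epochs, e.g.
# larger_model.fit(X_train, y_train, epochs=100) instead of epochs=10.
```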
Some of the methods used to avoid overfitting are simply the direct opposites of those used to avoid underfitting. We can remove some features, particularly those that are strongly correlated with others already present in the dataset or that have very little correlation with our output. Stopping training earlier also ensures that we capture a more general model, instead of allowing it to over-analyse our data.
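As a sketch of the early-stopping idea, assuming a Keras model trained with a held-out validation split (the patience value is an arbitrary choice):

```python
import tensorflow as tf

# Stop training once the validation loss has not improved for 5 consecutive
# epochs, and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

# Hypothetical usage; model, X_train and y_train are assumed to exist already.
# model.fit(X_train, y_train, validation_split=0.2, epochs=200,
#           callbacks=[early_stop])
```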
In some cases, overfitting may occur due to a model’s over-reliance on a certain set of weights, or path in our neural network. The model may have found, during training, that a certain set of weights in a section of our neural network provide a very strong correlation with the output, but this is more a coincidence than the discovery of an actual relationship. If this occurs, then when presented with testing data, the model will not be able to deliver the same level of accuracy.
Our solution here is to introduce the concept of dropout. The idea behind dropout is essentially to exclude a randomly chosen section of the network at every step of the training process. This helps us generate weights that are more evenly spread across the entire network and ensures that our model is not too reliant on any one subsection.
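In a Keras-style model this is a single extra layer. A minimal sketch (the dropout rate of 0.5, the layer sizes and the input shape are illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    # Randomly ignore 50% of the previous layer's neurons at each training
    # step, so the network cannot rely on any one subsection of weights.
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Dropout is only applied during training; at prediction time the full
# network is used.
```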