Training of a 1000 epochs begins with a single gradient descent
It is fascinating that so many proverbs have endured the test of time and are still used in literature and daily conversations. The beauty of proverbs is that so many people can relate to them. This can be observed both from the abundance of synonymous proverbs and the number of proverbs that have spread across cultures and languages.
As a research student who spends time with neural networks, I thought it would be fun to rephrase some of these well-known proverbs using AI terminology, and see how well they preserve the meaning of the original. The hope is that this will make neural network jargon more relatable and approachable. The meaning of the proverb is written beneath each translation, so whether or not you are knowledgeable in the field, you can try to guess the original proverb from the description.
All the below translations are my own work. Feel free to use these quotes in your daily life (at your own personal risk of scaring away your friends), but please reference/link this article if you want to use any of these in written form.
Disclaimer: some of the quotes below are not, technically speaking, always true, but the same goes for proverbs, so please take it easy. I’d be open to any suggestions for improvements:)
1. Training of a 1000 epochs begins with a single gradient descent
Even the longest and most difficult ventures have a starting point; something which begins with one first step. —[see answer]
An epoch refers to one cycle through the full training dataset. Usually each epoch is further broken down into several mini-batches. A neural network is trained by applying gradient descent to its parameters for every mini-batch.
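To make the jargon concrete, here is a minimal sketch of that training loop in plain Python: one epoch is one full pass over the data, each epoch is split into mini-batches, and every mini-batch triggers one gradient-descent update. All names and numbers here are illustrative, not from any particular library.

```python
# Toy sketch: fit a single weight w so that w * x approximates y = 2x,
# using mini-batch gradient descent on a mean-squared-error loss.

data = [(x, 2.0 * x) for x in range(1, 9)]   # tiny dataset: y = 2x
w = 0.0                                      # the single trainable parameter
lr = 0.01                                    # learning rate
batch_size = 4

for epoch in range(100):                     # one epoch = one full pass
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]       # one mini-batch
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad                       # a single gradient-descent step

print(round(w, 2))                           # w should approach 2.0
```

One hundred epochs here means two hundred gradient-descent steps — and every one of them started from that very first step.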
2. All that has a derivative of zero is not a global optimum
Not everything that looks precious or true (or optimal) turns out to be so. — [see answer]
The goal of machine learning is to find a set of parameters which optimises an objective function. At a global optimum (the best possible solution), the derivative of the objective function becomes zero, but that is also true for local minima, maxima and saddle points.
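A one-line function is enough to see this. The cubic below is my own illustrative example: its derivative vanishes at a point that is neither a minimum nor a maximum, only a saddle point.

```python
# A zero derivative does not imply a global optimum.
# f(x) = x**3 has derivative f'(x) = 3*x**2, which is zero at x = 0,
# yet x = 0 is a saddle point: f keeps increasing through it.

def f(x):
    return x ** 3

def df(x):
    return 3 * x ** 2

assert df(0) == 0            # the derivative vanishes at x = 0...
assert f(-1) < f(0) < f(1)   # ...but x = 0 is not an optimum of any kind
```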
3. Set a stupid objective function, get a stupid prediction
If one asks a strange or nonsensical question, the listener will probably respond with a similarly strange or nonsensical answer. — [see answer]
The objective function should define the problem an AI should solve; otherwise the prediction that the AI makes will be meaningless.
4. Bad gradients propagate fast
Bad news circulates quickly because people often spread it everywhere. — [see answer]
Out-of-distribution training samples, often known as outliers, will most likely result in large losses and gradients. Regularisation techniques and dropout may be good measures to counter overfitting to these samples.
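One common safeguard against such exploding updates (an addition of mine, alongside the regularisation and dropout mentioned above) is gradient clipping: cap the norm of the gradient so no single outlier can dominate an update. A minimal sketch:

```python
# Sketch: clip the gradient's L2 norm so outlier samples cannot
# produce arbitrarily large parameter updates.

def clip_gradient(grad, max_norm=1.0):
    """Scale a list of gradient components so its L2 norm is <= max_norm."""
    norm = sum(g * g for g in grad) ** 0.5
    if norm > max_norm:
        grad = [g * max_norm / norm for g in grad]
    return grad

clipped = clip_gradient([3.0, 4.0])   # norm 5.0, rescaled down to norm 1.0
```

Small gradients pass through unchanged; only the "bad" ones get reined in before they propagate.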
5. Interpretability lies in the eyes of the researcher
Different people have different views on what is beautiful (or interpretable). — [see answer]
It is often not clear how a neural network is making a prediction just by inspecting its parameters and intermediate outputs. Many methods have been developed to make it more interpretable, but it remains an actively investigated topic in AI research.
6. Convergence of loss comes to those who wait
A patient person will be satisfied in due time; patience is a virtue. — [see answer]
Big neural networks can take a very long time to converge, but when they do, they often outperform smaller ones.
7. A watched plot never improves
A process appears to go more slowly if one waits for it rather than engaging in other activities. — [see answer]
Here, a plot means a graph that shows the loss over training.
8. Data leakage leads to overfitting
One does not profit by cheating. — [see answer]
In machine learning, it is common practice to split data into training, validation and test datasets, and only use the training dataset for training. One could cheat by leaking information from the validation and test datasets into the training dataset, but then the model will most likely fail in real-world settings because it has overfitted to the dataset.
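A leak-free split is simple to do and simple to check: shuffle once, cut the data into three disjoint sets, and verify that nothing overlaps. The 70/15/15 proportions below are just a common illustrative choice.

```python
# Sketch: a leak-free train/validation/test split.
# If the three sets share samples, information "leaks" and the
# validation/test scores stop reflecting real-world performance.

import random

samples = list(range(100))        # stand-in for a real dataset
random.seed(0)                    # fixed seed for reproducibility
random.shuffle(samples)

train = samples[:70]
val = samples[70:85]
test = samples[85:]

# the three sets must be pairwise disjoint
assert not set(train) & set(val)
assert not set(train) & set(test)
assert not set(val) & set(test)
```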
9. Don’t delete your checkpoints
Don’t do something which forces you to continue with a particular course of action, making it impossible to return to an earlier situation. — [see answer]
Checkpoints are network parameters that are saved periodically during training, typically every few epochs.
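In a sketch, checkpointing is just "periodically write the parameters somewhere you can load them back from". Real frameworks have their own save formats; here the parameters are a plain dict written as JSON, and every name is illustrative.

```python
# Sketch: save a checkpoint every few epochs so training can be
# resumed (or rolled back) from any saved point.

import json
import os
import tempfile

checkpoint_dir = tempfile.mkdtemp()      # throwaway directory for the demo
params = {"w": 0.0}                      # stand-in for real model parameters

for epoch in range(1, 11):
    params["w"] += 0.1                   # pretend training update
    if epoch % 5 == 0:                   # checkpoint every 5 epochs
        path = os.path.join(checkpoint_dir, f"ckpt_epoch{epoch}.json")
        with open(path, "w") as f:
            json.dump(params, f)

# because we never deleted it, the earlier checkpoint is still there
with open(os.path.join(checkpoint_dir, "ckpt_epoch5.json")) as f:
    old = json.load(f)                   # old["w"] is the epoch-5 state
```

Delete those files and the only way back to epoch 5 is to retrain from scratch.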
10. There’s no point crying over killed processes
To worry about unfortunate events which have already happened and which cannot be changed. — [see answer]
Always save your intermediate results since you have to start from scratch if your process dies mid-training:(
11. Don’t change model architectures in mid-training
To change one’s plan or approach when an effort is already underway or at another inopportune time. — [see answer]
Usually, once you define your model, you won’t be able to change it mid-training without starting from scratch again. (Instead, you can freeze and unfreeze training for sub-components of your model.)
12. Don’t judge a network by the number of parameters
One shouldn’t prejudge the worth or value of something by its outward appearance alone. — [see answer]
Unfortunately, size does matter for neural networks, but that’s not all. A cleverly designed architecture with shared parameters and additional constraints can go a long way, and also makes the network more robust and generalisable.
13. Don’t put all your weights on one feature
To make everything dependent on only one thing; to place all one’s resources in one place, account, etc. — [see answer]
Typically you would have many channels in a neural network layer, each channel computing some kind of feature. A robust network will not rely on one channel or feature alone, but will make its decisions based on a combination of different features. Dropout is one strategy to ensure such robustness.
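Dropout is simple enough to sketch in a few lines. During training, each feature is randomly zeroed with probability p, so the network literally cannot put all its weights on one feature; in the "inverted" variant shown here, survivors are scaled up by 1/(1-p) so the expected activation is unchanged.

```python
# Sketch of inverted dropout over a feature vector.

import random

def dropout(features, p=0.5, rng=None):
    """Zero each feature with probability p; scale survivors by 1/(1-p)
    so the expected value of each feature stays the same."""
    rng = rng or random.Random(0)   # fixed seed keeps the sketch reproducible
    keep = 1.0 - p
    return [f / keep if rng.random() < keep else 0.0 for f in features]

out = dropout([1.0, 1.0, 1.0, 1.0], p=0.5)
# roughly half the features are zeroed; the survivors become 2.0
```

At test time dropout is switched off and all features are used as-is.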
14. Don’t filter out the signal together with noise
To discard, especially inadvertently, something valuable while in the process of removing or rejecting something unwanted. — [see answer]
15. Don’t try to segment before you can classify
You must master a basic skill before you are able to learn more complex things. — [see answer]
Image classification is simpler and is a natural stepping stone to image segmentation, the problem of classifying every pixel in an image instead of just assigning one label to the entire image.
16. Reinitialise pre-trained embeddings
To spoil one’s plans or hope of success. — [see answer]
Pre-trained embeddings are learnt features that can help accelerate downstream tasks.
17. Returns are maximised by agents who self-play
You cannot depend solely on divine help, but must work yourself to get what you want. — [see answer]
A return is a discounted sum of future rewards. In reinforcement learning, the objective is to find a strategy to maximise the expected return. In the famous example of AlphaZero, an AI learnt to play Go, chess and Shogi at a super-human level, just by playing against itself.
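The return has a neat recursive form: folding the reward sequence from the end with G = r + gamma * G gives the discounted sum in one pass. A minimal sketch, with gamma = 0.9 as an illustrative discount factor:

```python
# Sketch: the return G = r_0 + gamma*r_1 + gamma**2 * r_2 + ...
# computed by folding the rewards from the last step backwards.

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(round(discounted_return([1.0, 1.0, 1.0]), 2))  # 1 + 0.9 + 0.81 = 2.71
```

A self-playing agent generates its own reward signal by competing against itself, then adjusts its strategy to push this expected return up.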
18. It’s all neural network parameters to me
A way of saying that something is difficult to understand. — [see answer]
19. No model can be optimised for two objective functions
You cannot work for two different people, organisations, or purposes in good faith, because you will end up favouring one over the other. — [see answer]
An objective function defines the problem the AI must solve. If you give it two objectives that contradict each other, the AI can’t give an optimal solution to both simultaneously.
20. The gradient is always steeper on the other side
People always think they would be happier in a different set of circumstances. — [see answer]
Steep gradients are good because they mean there is more scope to improve your parameters and reach a better solution (as long as they are not too steep).
21. Adversary and loss make a network wise
We gain wisdom faster in difficult times than in prosperous times. — [see answer]
Defining a loss as an objective to minimise is a typical way to train a neural network. A family of networks called Generative Adversarial Networks also employs an adversary during training. A typical application is realistic image generation.
22. GPT-3 wasn’t trained in a day
It takes a lot of time to achieve something important. — [see answer]
GPT-3 is a massive state-of-the-art network that can perform a variety of language-related tasks.
23. Multi-head attention is better than single-head
It is better to have the power of two people’s minds to solve a problem or come up with an idea than just one person on their own. — [see answer]
Multi-head attention is a mechanism used in Transformers, a neural network architecture that has been shown to be highly effective at capturing the complexity of language. GPT-3 uses Transformers as its building block.
24. Where there’s a gradient, there’s a loss
Every rumor has some foundation; when things appear suspicious, something is wrong. — [see answer]
25. Don’t put a fully-connected layer before a convolutional layer
Do not do things in the wrong order. — [see answer]
Convolutional layers typically appear in computer vision tasks. It is very likely that a fully-connected layer comes after a convolutional layer, not before (although nothing is impossible).