Questions
- What do you need for deep learning? Not much maths, not much data, no expensive computers, and no PhD.
- Name five areas where deep learning is now the best in the world. Natural language processing — answering questions, speech recognition, summarising documents. Computer vision. Medicine. Biology. Recommendation systems.
- What was the name of the first device that was based on the principle of the artificial neuron? The Perceptron.
- Based on the book of the same name, what are the requirements for parallel distributed processing (PDP)? A set of processing units that have a state of activation; an output function for each unit; a pattern of connectivity among units; a propagation rule for propagating patterns of activities through the network of connectivities; an activation rule for combining the inputs impinging on a unit with the current state of that unit to produce an output for the unit; a learning rule whereby patterns of connectivity are modified by experience; and an environment within which the system must operate.
- What were the two theoretical misunderstandings that held back the field of neural networks? The first was the claim that a single layer of perceptrons could not learn some simple but critical mathematical functions (such as XOR). This was true, but the limitation could be overcome by using multiple layers of the devices, a point largely ignored at the time. The second came when researchers settled on networks of just two layers. In theory two layers are enough to approximate any function, but in practice such networks had to be impractically large and were slow to train. Using many more layers of neurons gives far better practical performance, and this is what "deep" in deep learning refers to.
- What is a GPU? Graphics Processing Unit (GPU): Also known as a graphics card. A special kind of processor in your computer that can handle thousands of single tasks at the same time, especially designed for displaying 3D environments on a computer for playing games. These same basic tasks are very similar to what neural networks do, such that GPUs can run neural networks hundreds of times faster than regular CPUs. All modern computers contain a GPU, but few contain the right kind of GPU necessary for deep learning.
- Why is it hard to use a traditional computer program to recognise images in a photo? We would have to write down the exact steps necessary. Since the steps we take to recognise an image happen subconsciously, we can't actually write them down.
- What did Samuel mean by “weight assignment”? Weights are variables, and a weight assignment is a particular choice of values for those variables. The program’s inputs are values that it processes in order to produce its results. The program’s weight assignments are other values that define how the program will operate.
- What term do we normally use in deep learning for what Samuel called “weights”? Parameters.
- Draw a picture that summarises Samuel’s view of a machine learning model.
10. Why is it hard to understand why a deep learning model makes a particular prediction? This is a highly researched topic known as interpretability of deep learning models. Deep learning models are hard to understand in part due to their depth. Think of a linear regression model: we have some input variables that are multiplied by weights to give an output, and we can tell which variables are more or less important simply by comparing their weights. Similar logic might apply to a small neural network with one to three layers. However, deep neural networks can have hundreds of layers, and the neurons interact with each other, with the outputs of some neurons feeding into other neurons, so it is difficult to determine which factors were important in producing the final output. Altogether, due to this complexity, it is very difficult to explain why a neural network makes a given prediction.
However, recent research has made it easier in some cases to understand a neural network's predictions. For example, we can analyse the weights to determine what kinds of features activate particular neurons, and when applying CNNs to images we can visualise which parts of an image most strongly activate the model.
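The linear-regression intuition above can be made concrete. A minimal sketch (plain Python, with made-up coefficients for a hypothetical house-price model) showing how a linear model's weights directly expose feature importance, in a way that has no simple analogue in a deep network:

```python
# A linear model's prediction is a weighted sum of its inputs,
# so each weight directly states how much its feature matters.
weights = {"sq_metres": 3000.0, "bedrooms": 10000.0, "age_years": -500.0}
bias = 50000.0

def predict_price(features):
    """Linear model: output = bias + sum(weight_i * feature_i)."""
    return bias + sum(weights[name] * value for name, value in features.items())

house = {"sq_metres": 80, "bedrooms": 3, "age_years": 20}
price = predict_price(house)  # 50000 + 240000 + 30000 - 10000 = 310000.0

# Feature "importance" is just the weight magnitude -- readable at a glance.
ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
```

In a deep network the equivalent weights are spread across many interacting layers, which is why no single number summarises a feature's influence.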
11. What is the name of the theorem that shows that a neural network can solve any mathematical problem to any level of accuracy? The universal approximation theorem. It shows that a neural network is a function flexible enough to approximate any continuous function to any desired level of accuracy, just by varying its weights. Finding the right weights is a separate matter: the process used in practice is stochastic gradient descent (SGD), which can find the required weights automatically.
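A minimal sketch of the idea behind SGD (plain Python, a one-weight model fitted to made-up data; real training involves many weights and automatic differentiation):

```python
# Fit y = w * x to data generated from y = 2x, by gradient descent.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]

w = 0.0    # initial weight guess
lr = 0.02  # learning rate (step size)

for _ in range(200):
    for x, y in data:              # one example at a time: the "stochastic" part
        pred = w * x
        grad = 2 * (pred - y) * x  # derivative of squared error w.r.t. w
        w -= lr * grad             # step downhill on the loss

# w has now converged close to the true value 2.0
```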
12. What do you need in order to train a model? Labelled data.
13. How could a feedback loop impact the rollout of a predictive policing model? A predictive policing model is created based on where arrests have been made in the past. In practice, this is not actually predicting crime, but rather predicting arrests, and is therefore partially simply reflecting biases in existing policing processes. Law enforcement officers then might use that model to decide where to focus their police activity, resulting in increased arrests in those areas. Data on these additional arrests would then be fed back in to retrain future versions of the model.
14. Do we always have to use 224×224-pixel images with the cat recognition model? No; this size is used mainly for historical reasons (older pretrained models required it). If you increase the size, the results of the model will usually improve, at the price of speed and memory consumption.
15. What is the difference between classification and regression? A classification model attempts to predict a class or category; that is, it predicts from a number of discrete possibilities, such as dog or cat. A regression model attempts to predict one or more numeric quantities, such as location or temperature.
16. What is a validation set? What is a test set? Why do we need them? The training set is used to fit the model. The validation set is a randomly held-out portion of the data used to measure the accuracy of the model during development, so that we can detect overfitting to the training data. The test set is held back even from ourselves: because we repeatedly tune hyperparameters based on validation results, we risk overfitting to the validation set too, so the test set is used only once, for the final evaluation.
17. What will fastai do if you don't provide a validation set? It defaults to valid_pct = 0.2, randomly holding out 20% of the data as a validation set.
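A sketch of what an 80/20 random hold-out amounts to (plain Python; fastai does something equivalent internally when you rely on the default valid_pct):

```python
import random

def split_data(items, valid_pct=0.2, seed=42):
    """Randomly hold out valid_pct of the items as a validation set."""
    rng = random.Random(seed)  # fixed seed -> the same split every run
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    return shuffled[n_valid:], shuffled[:n_valid]  # train, valid

train, valid = split_data(list(range(100)))
```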
18. Can we always use a random sample for a validation set? Why or why not? No. A random sample is often fine (and setting the random seed to the same value every time means we get the same validation set on every run, so if we change our model and retrain it, any differences are due to the changes to the model, not to a different random split). But for some data a random sample is the wrong choice: with time-series data, for example, a random sample scatters validation points throughout history, letting the model train on data from after some of the dates it is validated on. Instead, the validation set should be the most recent period, matching how the model will actually be used.
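One case where a purely random split misleads is time-series data, where a random sample lets the model "peek" at the future. A sketch contrasting the two splits (plain Python, made-up daily observations):

```python
import random

# Ten days of observations, ordered oldest to newest (day 9 is the most recent).
days = list(range(10))

# Time-based split: train on the past, validate on the most recent days --
# this matches how the model will actually be used after deployment.
train_time, valid_time = days[:8], days[8:]

# Random split: validation days end up scattered through history, so the
# training set almost always contains days *later* than some validation day.
rng = random.Random(0)
shuffled = days[:]
rng.shuffle(shuffled)
train_rand, valid_rand = shuffled[2:], shuffled[:2]

# "Future leakage": training on information from after the validation dates.
leaks = any(t > v for t in train_rand for v in valid_rand)
```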
19. What is overfitting? Provide an example. Overfitting is when a model makes great predictions on the training set but poor ones on the validation set, because it has memorised specifics of the training data rather than learning patterns that generalise. For example, an image classifier might reach near-perfect accuracy on its training images but perform much worse on new images it has never seen.
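A deliberately extreme sketch of overfitting (plain Python, made-up labels): a "model" that simply memorises its training labels scores perfectly on the training set but falls back to blind guessing on unseen inputs.

```python
train_set = {"img1": "cat", "img2": "dog", "img3": "cat"}
valid_set = {"img4": "cat", "img5": "dog"}

memory = dict(train_set)  # "training" = memorise every example verbatim

def predict(item):
    # Unseen items: blindly guess the majority class from training.
    return memory.get(item, "cat")

def accuracy(dataset):
    correct = sum(predict(item) == label for item, label in dataset.items())
    return correct / len(dataset)

train_acc = accuracy(train_set)  # perfect on data it has seen: 1.0
valid_acc = accuracy(valid_set)  # only as good as the blind guess: 0.5
```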
20. What is a metric? How does it differ from “loss”? A metric is a function that measures the quality of the model’s predictions using the validation set, and is printed at the end of each epoch. A good metric is one that is easy to understand. The purpose of a loss is to define a measure of performance that the training system can use to update weights automatically. In other words, a good choice for a loss is one that is easy for stochastic gradient descent to use.
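The distinction can be seen on a single prediction: accuracy (a metric) is flat almost everywhere, while a loss such as cross-entropy changes smoothly as the predicted probability changes, which is what gradient-based training needs. A sketch (plain Python, one binary cat-vs-not prediction):

```python
import math

def accuracy(prob_cat, label_is_cat):
    """Metric: 1 if the hard decision is right, else 0 -- easy to read,
    but unchanged by small improvements in prob_cat."""
    return int((prob_cat > 0.5) == label_is_cat)

def cross_entropy(prob_cat, label_is_cat):
    """Loss: penalises a wrong answer smoothly, so a tiny weight change
    always nudges it -- exactly what SGD needs to make progress."""
    p = prob_cat if label_is_cat else 1 - prob_cat
    return -math.log(p)

# Improving a wrong-but-close prediction moves the loss but not the metric:
acc_before, loss_before = accuracy(0.40, True), cross_entropy(0.40, True)
acc_after, loss_after = accuracy(0.45, True), cross_entropy(0.45, True)
```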
21. How can pre-trained models help? A pre-trained model sets the weights in your model to values that have already been trained by experts on a different dataset. You should use one because it means your model is already capable before you've shown it any of your data, so it needs less data and less training time to reach good results.
22. What is the "head" of a model? When using a pre-trained model, cnn_learner will remove the last layer (typically customised to the original training task) and replace it with one or more new layers, with randomised weights, of an appropriate size for the dataset. This new part is called the head.
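A toy sketch of what "replacing the head" means (plain Python, with lists standing in for layers; real code operates on actual network modules):

```python
# Pretend a pretrained model is just an ordered list of layers.
pretrained_model = [
    "edge-detector layers (keep: general-purpose)",
    "shape/texture layers (keep: general-purpose)",
    "1000-class ImageNet classifier (task-specific)",
]

def replace_head(model, n_classes):
    """Drop the task-specific final layer and attach a fresh, randomly
    initialised head sized for the new dataset."""
    body = model[:-1]  # keep the pretrained body
    new_head = f"randomly initialised {n_classes}-class classifier"
    return body + [new_head]

# E.g. fine-tuning for a 37-breed pets dataset:
pet_model = replace_head(pretrained_model, n_classes=37)
```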
23. What kinds of features do the early layers of a CNN find? How about the later layers? The early layers find diagonal, horizontal, and vertical edges, as well as various gradients. Middle layers combine these into corners, repeating lines, circles, and other simple patterns. Later layers find higher-level components built from those patterns.
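Edge detection in the early layers boils down to convolving the image with small weight kernels. A sketch (plain Python, a hand-crafted 3×3 vertical-edge kernel applied to a tiny made-up image; in a CNN such kernels are learned rather than written by hand):

```python
# A tiny grayscale image: dark left half, bright right half -> one vertical edge.
image = [
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
]

# Early-layer filters learned by a CNN often resemble edge kernels like this one.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def apply_kernel(img, kernel, row, col):
    """Sum of elementwise products of the kernel and the 3x3 patch at (row, col)."""
    return sum(
        kernel[i][j] * img[row + i][col + j]
        for i in range(3)
        for j in range(3)
    )

flat_response = apply_kernel(image, vertical_edge_kernel, 0, 0)  # inside the dark region
edge_response = apply_kernel(image, vertical_edge_kernel, 0, 2)  # patch straddles the edge
```

Flat regions produce zero response; the kernel only "fires" where brightness changes left to right.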
24. What is an "architecture"? A general template for how that kind of model works internally; in other words, the functional form of the model.
25. What is segmentation? Creating a model that can recognise the content of every individual pixel in an image.
26. What is y_range used for? When do we need it? We use it when we're predicting a continuous number rather than a category, to tell fastai what range our target has. For example, if we're predicting movie ratings on a scale of 0.5 to 5.0, we would set y_range = (0.5, 5.5).
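Under the hood, constraining outputs to a range is typically done by squashing the network's raw output through a sigmoid and rescaling it; a sketch of that idea in plain Python (the slightly-too-high 5.5 upper bound leaves headroom so that a rating of 5.0 doesn't require an infinitely large raw output):

```python
import math

def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def sigmoid_range(x, low, high):
    """Map any raw model output x into the interval (low, high)."""
    return sigmoid(x) * (high - low) + low

lo = sigmoid_range(-100, 0.5, 5.5)  # very negative raw output -> near 0.5
mid = sigmoid_range(0, 0.5, 5.5)    # raw output 0 -> midpoint 3.0
hi = sigmoid_range(100, 0.5, 5.5)   # very positive raw output -> near 5.5
```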
27. What are “hyperparameters”? We rarely build a model just by training its weight parameters once. Instead, we are likely to explore many versions of a model through various modelling choices regarding network architecture, learning rates, data augmentation strategies, and other factors we will discuss in upcoming chapters. Many of these choices can be described as choices of hyperparameters. The word reflects that they are parameters about parameters, since they are the higher-level choices that govern the meaning of the weight parameters.
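Hyperparameter choice can be pictured as an outer loop around training: the inner loop learns weight parameters, while the outer loop compares validation scores across settings we pick by hand. A sketch (plain Python, made-up data, with the learning rate as the hyperparameter):

```python
# Inner loop: learn the weight w. Outer loop: choose the learning rate.
train_data = [(1, 2), (2, 4), (3, 6)]   # generated from y = 2x
valid_data = [(4, 8), (5, 10)]

def train(lr, epochs=100):
    """Fit y = w * x by gradient descent; w is a (weight) parameter."""
    w = 0.0
    for _ in range(epochs):
        for x, y in train_data:
            w -= lr * 2 * (w * x - y) * x  # gradient step on squared error
    return w

def valid_error(w):
    return sum((w * x - y) ** 2 for x, y in valid_data)

# The learning rate is a hyperparameter: chosen by us, not learned by SGD.
results = {lr: valid_error(train(lr)) for lr in [0.001, 0.01, 0.05]}
best_lr = min(results, key=results.get)  # too-small lr hasn't converged yet
```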
28. What’s the best way to avoid failures when using AI in an organisation? Using a test set as well as a validation set. The problem is that even though the ordinary training process is only looking at predictions on the training data when it learns values for the weight parameters, the same is not true of us. We, as modellers, are evaluating the model by looking at predictions on the validation data when we decide to explore new hyperparameter values! So subsequent versions of the model are, indirectly, shaped by us having seen the validation data. Just as the automatic training process is in danger of overfitting the training data, we are in danger of overfitting the validation data through human trial and error and exploration.
The solution to this conundrum is to introduce another level of even more highly reserved data, the test set. Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts. In effect, we define a hierarchy of cuts of our data, based on how fully we want to hide it from training and modelling processes: training data is fully exposed, the validation data is less exposed, and test data is totally hidden. This hierarchy parallels the different kinds of modelling and evaluation processes themselves — the automatic training process with backpropagation, the more manual process of trying different hyperparameters between training sessions, and the assessment of our final result.
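The hierarchy described above corresponds to a simple three-way partition of the data (plain Python sketch; the proportions are illustrative):

```python
import random

def three_way_split(items, valid_pct=0.2, test_pct=0.1, seed=42):
    """Partition data into training (fully exposed), validation (less
    exposed), and test (hidden until the very end)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_pct)
    n_valid = int(len(shuffled) * valid_pct)
    test = shuffled[:n_test]                   # touched only once, at the end
    valid = shuffled[n_test:n_test + n_valid]  # used to compare hyperparameters
    train = shuffled[n_test + n_valid:]        # used by SGD to fit the weights
    return train, valid, test

train, valid, test = three_way_split(list(range(100)))
```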