Popular Machine Learning Interview Questions

Get ready for your next job interview requiring domain knowledge in machine learning with answers to these eleven common questions.

Interviews are hard and stressful enough, and my goal here is to help you prepare for ML interviews. This list is not exhaustive, nor is it guaranteed to help you pass the interview. It’s a list of questions I gathered from sitting in on many interviews as an interviewer.

The expected answer should mention supervised, unsupervised, and reinforcement learning.

Supervised Learning: You give the algorithm labeled data, and the algorithm has to learn from it and figure out how to solve similar problems in the future. Think of it as giving the algorithm problems together with their answers. The algorithm has to learn how these problems were solved in order to solve future problems in a similar manner. This is like a bank learning from your habits which credit card transactions are legitimate and which are fraudulent.

Unsupervised Learning: You give the algorithm a problem without any labeled data or any prior knowledge of what the answer could be. Think of it as giving the algorithm problems without any answers. The algorithm has to find the best answer by deriving insights from the data. This is similar to a bank clustering its customers according to various parameters and deciding who’s eligible for a credit card offer, who’s eligible for a line of credit offer, and who isn’t eligible for any offers. This is often done using a Machine Learning method called K-Means.

Reinforcement Learning: This is when the algorithm learns from its own experience using reward and punishment. The easiest example is self-driving cars, where there is an agent that learns from each move it makes. A positive move toward the target earns the agent a reward, while a negative move away from the target earns the agent a punishment.

Here I usually expect to hear three words: classification, regression, and clustering. These are some of the most popular and basic uses for Machine Learning.

Classification and Regression mainly use supervised learning, and the candidate can give an example showing how historical data is used to train the model.

For example, if someone steals your credit card and makes an online transaction, you will probably get an email or text from your bank asking you to verify this transaction. Otherwise, the bank will consider it fraud. Your bank’s algorithm learned your credit card purchasing habits through your purchase history, and when an abnormal transaction was detected, the bank suspected fraud. This is a form of Machine Learning, and it could well be decision tree classification.

Another example is a car company trying to predict sales for next year based on this year’s numbers and historical data, which is a form of Machine Learning and could be linear Regression.
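To make the regression example concrete, here is a minimal sketch of a least-squares line fit in plain Python. The sales figures are made up for illustration, and `fit_line` is a hypothetical helper, not something from a particular library.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: year index -> units sold (in thousands).
years = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, sales)
next_year_forecast = slope * 6 + intercept  # prediction for year 6
print(slope, intercept, next_year_forecast)
```

The historical years play the role of labeled training data, which is why this counts as supervised learning.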

Clustering mainly uses unsupervised learning, where there is no labeled historical data. A simple example is an email filter where the algorithm examines different parts of all incoming emails and groups similar emails together into clusters such as spam and ham. (In practice, many spam filters are trained as supervised classifiers on emails that users have already labeled.)

The answer should center on overfitting.

It seems the model is memorizing the exact training examples, noise included, rather than capturing the general patterns in the data. This is called overfitting the model. The model is probably too complex in comparison to the dataset, for example by having more layers and neurons than needed.

Depending on the situation, there are several ways to fix this overfitting model. The most common are early stopping and dropout regularization.

Early stopping is what it sounds like: stop the training early, once you start seeing the accuracy on the validation data drop (or stop improving) while the training accuracy keeps going up. Dropout regularization randomly drops the outputs of some nodes during training. Thus, the remaining nodes can’t rely on any single neighbor and have to do extra work to capture the characteristics.
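Early stopping is easy to sketch in a few lines. The version below is a simplified illustration that watches validation loss instead of accuracy. The loss values are made up, and `early_stopping_epoch` with its `patience` parameter is a hypothetical helper (deep learning libraries ship their own versions of this idea).

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Validation loss never stalled long enough: train to the end.
    return len(val_losses) - 1


# Made-up validation losses: they improve, then start getting worse.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.65, 0.72]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)
```

In a real training loop you would also keep a copy of the weights from the best epoch and restore them after stopping.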

This is kinda related to the previous question. The answer should include simple models that underfit, complex models that overfit, and the fact that bias and variance can’t both be minimized at the same time.

High bias means the model is too simple and can’t capture many features during the training phase, aka an underfitting model. High variance means the model is so complex that it is not only capturing the real features but also learning noise that is specific to the training set, which is also referred to as overfitting.

[Figure: bias and variance versus model complexity. Image by Author.]

As you can see, there is a sweet spot in the middle that balances both bias and variance. If your model shifts to the right side, then it’s getting more complicated, thus increasing variance and resulting in overfitting. If your model shifts to the left, then it’s getting too simple, thus increasing bias and resulting in underfitting.

A good data scientist knows how to trade off bias and variance by tuning the model’s hyperparameters, thus achieving optimum model complexity.

A simple model means a small number of neurons and fewer layers, while a complex model means a large number of neurons and several layers.

The Confusion Matrix is used to assess the performance of supervised classification models only and can’t be used with unsupervised models.

Confusion Matrix. Source: Everything You Wanted to Know about Machine Learning but Were Too Afraid to ask.

Confusion Matrix is a way to present the 4 outcomes of the model: True Positive, False Positive, False Negative, and True Negative. Recall, Precision, Accuracy, and F1 can all be calculated from the Confusion Matrix.

Type 1 error (a false positive) is when your algorithm makes a positive prediction, but in fact, it’s negative. For example, your algorithm predicted a patient has cancer, but in fact, they don’t.

Type 2 error (a false negative) is when your algorithm makes a negative prediction, but in fact, it’s positive. For example, your algorithm predicted a patient doesn’t have cancer, but in fact, they do.
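Here is a small sketch of how the four Confusion Matrix outcomes can be counted from a list of actual and predicted labels. The labels are made up, and `confusion_counts` is a hypothetical helper written for this example.

```python
def confusion_counts(actual, predicted):
    """Count the four confusion-matrix outcomes for binary labels (1 = positive)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # true positive
        elif p == 1 and a == 0:
            fp += 1  # false positive (Type 1 error)
        elif p == 0 and a == 1:
            fn += 1  # false negative (Type 2 error)
        else:
            tn += 1  # true negative
    return tp, fp, fn, tn


# Made-up labels for eight patients.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, tn, accuracy, f1)
```

Every other metric in this family is just a different ratio of these four counts.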

The learning rate is a hyperparameter that determines the step size of each weight update during model training. The step size controls how much you change your neurons’ weights in response to the estimated error. The weights are updated using gradients computed by backpropagation: the input flows from the input nodes of your model through the neurons to the output nodes, then the error is measured and propagated backward to work out how each weight should change. How big a step each update takes is set by the learning rate.

Image by Author.

If the learning rate is high, the model weights change by large amounts at each update. Your model may converge fast, but it may also overshoot the error minimum, bounce around it, or even diverge. This means a faster but potentially less accurate model.

If the learning rate is low, the model weights change by small amounts at each update. Your model will take a long time to converge, but it is much less likely to overshoot the error minimum. This means a slower but usually more accurate model.
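You can see both behaviors on a toy problem. The sketch below minimizes f(w) = w**2 (my choice of function, just for illustration) with plain gradient descent at three learning rates. The specific rates are assumptions picked to show slow convergence, good convergence, and divergence.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2 with plain gradient descent; return every value of w."""
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2 * w  # derivative of w**2
        w = w - learning_rate * grad
        path.append(w)
    return path


slow = gradient_descent(0.01)     # tiny steps: still far from 0 after 20 updates
good = gradient_descent(0.1)      # reasonable steps: close to the minimum at 0
too_high = gradient_descent(1.1)  # each step overshoots 0 and lands farther away

print(abs(slow[-1]), abs(good[-1]), abs(too_high[-1]))
```

With the highest rate, every update jumps past the minimum to a point farther from it than before, so the error grows instead of shrinking.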

This question is related to the previous one. Here I expect a quick explanation of gradient descent and how backpropagation affects it.

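Think of gradient descent as the method used to update your neural network’s weights: backpropagation computes the gradient of the error with respect to each weight, from the output nodes back to the input nodes, and each weight is then nudged in the direction that reduces the error. Think of the activation function as the equation tied to each neuron in your model. This equation decides whether, and how strongly, the neuron should be activated depending on how relevant the neuron’s input is to the model prediction.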

In some cases, when you have a deep neural network with several layers, and depending on your choice of activation function (along with other hyperparameters), the gradients will become very small and may vanish while backpropagating from the output to the input nodes through the layers of the network. The problem here is that the weights of the neurons in the early layers won’t get updated (or will get updated with very small values). Thus, your model won’t learn (or will learn very little). This is a clear case of the vanishing gradient problem.
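A rough way to see why this happens: with a sigmoid activation, the derivative is at most 0.25, and backpropagation multiplies one such factor per layer. The sketch below assumes a toy chain with one neuron per layer, a weight of 1.0, and a pre-activation of 0.0 everywhere. Those are simplifying assumptions of mine, not a real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0


def backprop_scale(num_layers, pre_activation=0.0, weight=1.0):
    """Factor by which a gradient shrinks after passing back through
    `num_layers` sigmoid layers in a toy one-neuron-per-layer chain."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= weight * sigmoid_derivative(pre_activation)
    return scale


for depth in (1, 5, 20):
    print(depth, backprop_scale(depth))
```

Even in this best case for the sigmoid, twenty layers shrink the gradient by a factor of 0.25 to the power of 20, which is why the earliest layers barely learn.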

I’m personally surprised by how many candidates confuse these two. The answer should state the fact that KNN (K-Nearest Neighbors) is a supervised model used mainly for classification, and K-means is an unsupervised model used for clustering. Then the candidate should give an example of classification and another of clustering.
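The difference is easiest to see side by side. Below is a minimal sketch of both on one-dimensional data: `knn_predict` needs labels, while `kmeans_1d` only gets the points. Both functions, the data, and the fixed starting centroids are hypothetical simplifications for this example (K-means is usually started from random centroids).

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: vote among the labels of the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: abs(pair[0] - query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def kmeans_1d(points, centroids, iterations=10):
    """Unsupervised: repeatedly assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]

predicted_label = knn_predict(points, labels, query=2.2, k=3)  # uses labels
centroids, clusters = kmeans_1d(points, centroids=[0.0, 10.0])  # no labels
print(predicted_label, centroids, clusters)
```

Note that K means something different in each: the number of neighbors that vote in KNN, and the number of clusters in K-means.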

This is another easy one where the answer should include testing the model on new data that the model has never seen before. The best example is when you use scikit-learn (or any other library) to split your data into a training set and a test set. The test set is held out during training and used only afterward to evaluate your model, so you can assess how well it performs on unseen data.
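The idea behind the split fits in a few lines of plain Python. `split_train_test` below is a hypothetical helper written for this example; in practice you would more likely reach for scikit-learn’s `train_test_split`.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows, then hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 labeled examples
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

The key property is that no example ends up in both sets, so the test score reflects data the model has never seen.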

Precision: This is the answer to: out of all the times the model said positive, how many were really positive? You care about precision when false positives are costly.

Precision = TP / (TP + FP)

Let’s say you’re a small company and you send samples to potential customers who might buy your product. You don’t want to send samples to customers that will never buy your product no matter what. The customer who gets a sample but doesn’t buy your product is a false positive because you predicted they would buy your product (Predicted = 1), but actually, they never will (Actual = 0). In such cases, you want to decrease the FP as much as you can in order to have high precision.

Recall = TP / (TP + FN)

Recall: This is the answer to: out of the actual positives, how many were classified correctly? You care about recall when false negatives are costly. Let’s take an example of your credit card. Someone stole your credit card number and used it to purchase stuff online from a sketchy website that you never visit. That’s clearly a fraudulent transaction, but unfortunately, your bank’s algorithm didn’t catch it. What happened here is that your bank predicted it’s not a fraud (Predicted = 0), but it was actually a fraud (Actual = 1). In such a case, your bank should develop a fraud detection algorithm that decreases the FN, thus increasing the recall.
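Here is a quick worked example of both metrics. The counts are made up to match the two stories above: a company that sends 100 samples, and a bank facing 50 fraudulent transactions.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of everything actually positive, the fraction the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up numbers: 100 samples sent, 80 recipients bought (TP), 20 did not (FP).
sample_precision = precision(tp=80, fp=20)

# Made-up numbers: 50 fraudulent transactions, 40 flagged (TP), 10 missed (FN).
fraud_recall = recall(tp=40, fn=10)

print(sample_precision, fraud_recall)
```

Cutting the wasted samples from 20 to 5 would lift precision to about 0.94, and catching 5 more frauds would lift recall to 0.9.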

This is when your dataset has too many features, so it’s hard for your model to learn from them and extract meaningful patterns.

Two main things could happen:

· There are more features than observations, which raises the risk of overfitting the model

· With too many features, observations become harder to cluster. Too many dimensions cause every observation in the dataset to appear equidistant from all others, so no meaningful clusters can be formed
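The "everything looks equidistant" effect can be checked with a small experiment. The sketch below draws random points in a unit hypercube (an assumption of mine, chosen for simplicity) and compares the nearest and farthest distances from one query point. `relative_contrast` is a hypothetical helper, and `math.dist` needs Python 3.8 or newer.

```python
import math
import random


def relative_contrast(dimensions, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dimensions)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


low_dim = relative_contrast(2)     # nearest neighbor is far closer than the farthest
high_dim = relative_contrast(500)  # nearest and farthest are almost the same distance
print(low_dim, high_dim)
```

When the contrast gets close to zero, "nearest" stops meaning much, which is exactly what hurts distance-based methods like clustering.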

The main technique to solve this problem is Principal Component Analysis (PCA).

PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.
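To give a feel for what a component is, here is a minimal sketch that finds only the first principal component of some made-up 2-D points. It uses power iteration on the covariance matrix, which is my choice for keeping the example dependency-free. Libraries typically use an eigendecomposition or SVD instead, and `first_principal_component` is a hypothetical helper.

```python
import math


def first_principal_component(points, iterations=100):
    """Direction of greatest variance for 2-D points, found by power
    iteration on the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix.
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)

    # Repeatedly multiply a vector by the matrix and renormalize it.
    vx, vy = 1.0, 0.0
    for _ in range(iterations):
        nx = cxx * vx + cxy * vy
        ny = cxy * vx + cyy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    return (vx, vy), centered


# Made-up points that lie close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
component, centered = first_principal_component(points)

# Reduce 2 features to 1 by projecting each point onto the component.
projected = [x * component[0] + y * component[1] for x, y in centered]
print(component, projected)
```

Because the two original features move together, one composite feature along the diagonal keeps almost all of the variability.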

Finally, I hope these sample questions and answers help you prepare for your upcoming interview. They could also serve as a refresher for your Machine Learning knowledge.
