This article provides insight into the basics of deep metric learning methods and how they help achieve state-of-the-art results for tasks like face verification and face recognition. During a past internship I worked on these tasks, and seeing the power of deep metric learning methods for them motivated me to write this article. Besides face recognition/verification, there are a number of other applications where deep metric learning has proven quite effective; anomaly detection and three-dimensional (3D) modelling are a few of them.
- Prerequisites
- A tour of the basics
  * Face Verification, Face Recognition
  * ANNs (Training phase and Inference phase)
  * Image Classification for face verification?
  * One-shot learning
- Metric
- Metric Learning
  * Mahalanobis Distance Metric
- Deep Metric Learning
  * Contrastive Loss — Siamese Networks
  * Triplet Loss — Triplet Networks
  * Softmax Loss
  * A-Softmax Loss
  * Large Margin Cosine Loss (LMCL)
  * ArcFace Loss
- References
To follow this article, the reader should have a good knowledge of linear algebra and familiarity with the basic concepts of machine learning. I hope you will enjoy learning deep metric learning.
Keywords: Metric learning, Triplet loss, softmax loss, Face recognition, Face verification, DCNN, Arcface, Sphereface, Cosface.
Let us first understand a few basic terms and establish solid ground on which to build our understanding of deep metric learning.
Face verification is the task of determining whether a given pair of images belongs to the same person or not. In simple words, given an image, we try to answer the question: "Is that you?" It is a 1:1 authentication problem.
Face recognition, on the other hand, is a combination of two tasks: face identification and face verification. Face identification is the task of recognizing a person from a database of images: given an image database (say, a gallery of k persons), we try to answer the question, "Who are you?" So, face recognition is a 1:k authentication problem.
It is quite helpful to know the basic training and inference process of a simple Artificial neural network. {please skip this part if you are familiar with the topic}.
TRAINING PHASE
- We randomly initialize the weights and biases according to some probability distribution.
- Feed the training data into the untrained neural network architecture.
- Forward propagation: each hidden layer accepts the input from the previous layer, applies an activation function, and passes the activations on to the next layer. We propagate the results forward until we obtain the predicted output.
- After generating the predictions, we calculate the loss by comparing the predicted results to the ground-truth labels.
- Backward propagation: we compute the gradients of the loss with respect to the weights and biases, and adjust them by subtracting a small quantity proportional to the gradient.
- We repeat steps 2 to 5 over the entire training set — this is one epoch. We repeat for more epochs until the error is minimized or the prediction score is maximized.
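The training steps above can be sketched end to end with a toy two-layer network in NumPy. The architecture, learning rate, and XOR dataset below are illustrative assumptions, not anything a real face recognition system would use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: randomly initialize weights and biases
W1 = rng.normal(0, 0.5, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

# Step 2: a toy dataset (XOR), fed to the untrained network
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
losses = []
for epoch in range(2000):
    # Step 3: forward propagation through the hidden layer
    h = sigmoid(X @ W1 + b1)          # hidden activations
    p = sigmoid(h @ W2 + b2)          # predictions

    # Step 4: compare predictions with the ground-truth labels
    loss = np.mean((p - y) ** 2)
    losses.append(loss)

    # Step 5: backward propagation (gradients of the loss w.r.t. parameters)
    dp = 2 * (p - y) / len(X) * p * (1 - p)
    dW2 = h.T @ dp; db2 = dp.sum(axis=0)
    dh = dp @ W2.T * h * (1 - h)
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)

    # Adjust parameters by subtracting a quantity proportional to the gradient
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(losses[0], losses[-1])  # the loss shrinks over the epochs
```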
INFERENCE PHASE
In this phase, no gradients are computed and no parameters are adjusted. The network uses the same set of learned weights and biases to evaluate an unseen test dataset.
Deep neural networks are the go-to models for achieving state-of-the-art performance on computer vision tasks. Learning invariant and discriminative features from data is the fundamental goal for achieving good results on any computer vision task. Deep learning methods have proven quite effective for feature learning, the reason being their ability to learn hierarchical feature representations, building high-level features from low-level ones.
Can we use image classification to solve the task of face verification? We might be able to train a fairly robust deep convolutional neural network that performs excellently at classifying all the employee images in an organization, accounting for factors like pose, expression, and illumination.
But this is usually achievable only when we have a good amount of data: thousands of examples for each class/employee. In image classification, if the number of data points per class is small, training is prone to overfitting and yields very poor results. Image classification also generally works well only when the number of classes is small.
This is generally not the case with person/face verification tasks. In fact, it is quite the opposite: we usually have a very large number of classes, and the number of examples per class is quite small. This is where one-shot learning comes into the picture.
One-shot learning: a classification problem that aims to learn about object categories from one or a few training examples/images [Wikipedia]. In simple words, given just one example/image of a person, you need to recognize him/her. To build a face recognition system, we need to solve this one-shot learning problem.
But deep neural nets usually require vast amounts of training data to excel at a particular task, and such data is not always available. Deep learning models won't work well with just one training example {the one-shot learning problem}. How do we address this issue? We learn a similarity function, which helps us solve the one-shot learning problem.
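As a sketch of what a similarity function buys us: at verification time we compare the embedding of a probe image against the single enrolled example and threshold the distance. The embedding vectors and the threshold value below are made-up illustrations; in practice both come from a trained model and a tuned validation set:

```python
import numpy as np

def same_person(emb_a, emb_b, tau=0.6):
    """Verify identity with a similarity function: declare 'same person'
    if the distance between the two face embeddings is below tau.
    (tau = 0.6 is a hypothetical value chosen for this toy example.)"""
    return np.linalg.norm(emb_a - emb_b) < tau

enrolled = np.array([0.2, 0.7, 0.1])       # embedding of the one stored example
probe_same = np.array([0.25, 0.68, 0.12])  # a new photo of the same person
probe_other = np.array([0.9, 0.1, 0.4])    # a photo of someone else

print(same_person(enrolled, probe_same))   # close embedding -> accepted
print(same_person(enrolled, probe_other))  # distant embedding -> rejected
```

Note that adding a new person only requires storing one embedding; no retraining is needed, which is exactly why the similarity-function view fits the one-shot setting.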
Let us start to understand deep metric learning by understanding the basics.
A metric is a non-negative function between two points x and y {say d(x, y)} that captures the notion of 'distance' between these two points. A metric must satisfy several properties:
- Non-negativity and identity => d(x, y) ≥ 0, and d(x, y) = 0 iff x = y.
- Triangle inequality => d(x, y) ≤ d(x, z) + d(z, y).
- Symmetry => d(x, y) = d(y, x).
EXAMPLES:
A. The Euclidean metric: in a d-dimensional vector space, the distance between points x and y is d(x, y) = √((x₁ − y₁)² + (x₂ − y₂)² + … + (x_d − y_d)²).
B. The Discrete metric: d(x, y) = 0 if x = y, and d(x, y) = 1 otherwise.
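A quick numeric check of the two examples, with arbitrarily chosen points, confirms that both functions satisfy the metric axioms listed above:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean metric: square root of the sum of squared differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def discrete(x, y):
    """Discrete metric: 0 iff the points coincide, 1 otherwise."""
    return 0.0 if np.array_equal(x, y) else 1.0

x = np.array([0.0, 0.0]); y = np.array([3.0, 4.0]); z = np.array([1.0, 1.0])

print(euclidean(x, y))                     # a 3-4-5 triangle: 5.0

for d in (euclidean, discrete):
    assert d(x, x) == 0                    # identity
    assert d(x, y) == d(y, x)              # symmetry
    assert d(x, y) <= d(x, z) + d(z, y)    # triangle inequality
```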
What does a basic machine learning algorithm do? — Given the data and the corresponding output labels, the goal is to come up with a set of rules or some complex function that maps those inputs to the corresponding output labels.
One of the simplest machine learning algorithms that captures distance information is k-Nearest Neighbours (KNN): to classify a new data point, we find its k nearest training points and assign the new point to the class to which the majority of those k points belong.
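The KNN idea fits in a few lines of NumPy. The two clusters below are a made-up toy dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    votes = y_train[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]              # majority class among the k

# Two well-separated toy clusters
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.5, 0.5])))  # near the first cluster
print(knn_predict(X, y, np.array([5.5, 5.5])))  # near the second cluster
```

Notice that the entire algorithm hinges on the choice of distance, which is precisely the quantity metric learning sets out to improve.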
Similarly, the goal of metric learning is to learn a similarity function from data. Metric Learning aims to learn data embeddings/feature vectors in a way that reduces the distance between feature vectors corresponding to faces belonging to the same person and increases the distance between the feature vectors corresponding to different faces.
Euclidean distances are less meaningful in high dimensions. They fail to capture nonlinearity in data because of a number of reasons:
- They represent an isotropic (same in every direction) distance metric.
- These distances don’t capture the class structure.
One can use non-isotropic distances that use properties of the data to capture the structure of class relationships. The Mahalanobis distance is one such metric. The distance between two samples x and y in the metric space is given by d_M(x, y) = √((x − y)ᵀ M (x − y)).
Here, M is the inverse of the covariance matrix Σ and acts as a weight on the squared Euclidean distance. The Mahalanobis distance is a dissimilarity measure between two random vectors x and y that follow the same distribution with covariance matrix Σ.
If M = I (the identity matrix), we recover the Euclidean distance as a special case of the Mahalanobis distance, which exposes an underlying assumption of the Euclidean distance: that the features are independent of each other.
Because M is positive semi-definite, it can be decomposed as M = WᵀW, so the Mahalanobis distance can be interpreted as a linear projection into Euclidean space: d_M(x, y) = ‖Wx − Wy‖₂. In other words, the Euclidean distance in the transformed space equals the Mahalanobis distance in the original space.
This distance metric provides a new representation of the data in the transformed space, in which items of different classes are easier to distinguish. This linear transformation shows the inter-class separability power the approach possesses. So, metric learning is basically the problem of learning this transformation matrix W.
Metric learning is an approach based directly on a distance metric that aims to establish similarity or dissimilarity between images. Deep Metric Learning on the other hand uses Neural Networks to automatically learn discriminative features from the images and then compute the metric.
Why is there a need for deep metric learning?
- Faces of the same person presented in different poses, expressions, illuminations, etc. can fool a face verification/face recognition system quite easily.
- The task of building an efficient, large-scale face verification system is fundamentally a task of designing appropriate loss functions that best discriminate the classes under study. In recent years, metric/distance learning using deep learning has been shown to produce highly satisfying results for many computer vision tasks such as face recognition, face verification, image classification, anomaly detection, etc.
- Metric Learning only has a limited capability to capture non-linearity in the data.
Deep Metric Learning helps capture Non-Linear feature structure by learning a non-linear transformation of the feature space.
There are two ways in which we can leverage deep metric learning for the task of face verification and recognition:
1. Designing appropriate loss functions for the problem.
Most widely used loss functions for deep metric learning are the contrastive loss and the triplet loss.
** Contrastive Loss — Siamese Networks:
One of the fundamental models in which explicit metric learning is performed is the Siamese network. A Siamese network is a symmetric neural network architecture consisting of two identical subnetworks that share the same set of parameters (and hence compute the same function). Each subnetwork receives a distinct input, and the network learns by computing a metric between the highest-level feature encodings produced by the two subnetworks.
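The contrastive loss that trains such a network can be sketched for a single pair of embeddings. The margin value and the embedding vectors below are illustrative assumptions:

```python
import numpy as np

def contrastive_loss(e1, e2, same, margin=1.0):
    """Contrastive loss for one pair of embeddings.

    same = 1 -> pull the pair together (penalize any distance)
    same = 0 -> push the pair at least `margin` apart
    """
    d = np.linalg.norm(e1 - e2)
    return same * d**2 + (1 - same) * max(0.0, margin - d) ** 2

a = np.array([0.1, 0.9])
b = np.array([0.2, 0.8])   # close to a (e.g. same identity)
c = np.array([0.9, 0.1])   # far from a (e.g. different identity)

# Similar pair: small distance -> small loss
print(contrastive_loss(a, b, same=1))
# Dissimilar pair already farther apart than the margin -> zero loss
print(contrastive_loss(a, c, same=0))
```

The margin term is what stops the network from collapsing all embeddings to a single point: dissimilar pairs only contribute to the loss while they sit closer than the margin.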