This article is part of a series I am writing on using Deep Learning in NLP. I originally started an article on an example of text classification using a perceptron, but I thought it would be better to first review some basics, such as activation and loss functions.
A loss function, also called an objective function, is one of the main building blocks of supervised machine learning algorithms, which are trained on labeled data. The loss function guides the training algorithm to update the parameters in the right direction. Put simply, a loss function takes a truth (y) and a prediction (ŷ) as input and produces a real-valued score. This score indicates how close the prediction is to the truth: the higher the value, the worse the model's prediction, and vice versa.
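To make this definition concrete, here is a minimal sketch (using Pytorch's MSE loss, presented below) that scores a prediction close to the truth and one far from it; the closer prediction gets the lower loss value:
import torch
import torch.nn as nn

loss_fn = nn.MSELoss()
truth = torch.tensor([1.0, 2.0, 3.0])
close_prediction = torch.tensor([1.1, 2.1, 2.9])  # close to the truth
far_prediction = torch.tensor([4.0, 0.0, -1.0])   # far from the truth
print(f'close prediction loss : {loss_fn(close_prediction, truth)}')  # small value
print(f'far prediction loss : {loss_fn(far_prediction, truth)}')      # much larger value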
In this article, I present three of the most used loss functions.
The Mean Squared Error loss function, known as MSE, is mostly used in regression problems, where the target (y) and prediction (ŷ) values are continuous. MSE is the average of the squares of the differences between the target and the predicted values. There are alternatives to MSE, such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE), but all of these functions are based on computing a real-valued distance between the targets and the predictions (outputs).
The mathematical formula of MSE is :
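MSE(y, ŷ) = (1/n) Σᵢ (yᵢ − ŷᵢ)², where n is the number of samples, yᵢ the true value and ŷᵢ the predicted value.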
And here is an example of implementation using Pytorch :
import torch
import torch.nn as nn

# Mean Squared Error Loss
mse_loss = nn.MSELoss()
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)
loss = mse_loss(outputs, targets)
print(f'Mean Squared Error loss : {loss}')

# Output
# Mean Squared Error loss : 3.128143787384033
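As mentioned above, MAE and RMSE are common alternatives to MSE. As a minimal sketch, reusing the outputs and targets defined above: MAE is available in Pytorch as nn.L1Loss, and RMSE can be obtained by taking the square root of the MSE value.
# Mean Absolute Error (MAE), implemented in Pytorch as L1Loss
mae_loss = nn.L1Loss()
print(f'Mean Absolute Error loss : {mae_loss(outputs, targets)}')

# Root Mean Squared Error (RMSE) : the square root of the MSE value
rmse = torch.sqrt(mse_loss(outputs, targets))
print(f'Root Mean Squared Error loss : {rmse}')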
The Categorical Cross-Entropy loss function is commonly used in multiclass classification, in which the outputs (ŷ) are the probabilities of the target classes. The target truth (y) is a vector of n elements representing the true multinomial distribution. This requires two properties of the (y) values: the sum of all elements is equal to one, and all elements are positive. If exactly one class is correct, (y) is a one-hot vector. The predicted output (ŷ) has the same properties as (y).
The mathematical formula of Cross-Entropy Loss is :
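CE(y, ŷ) = − Σᵢ yᵢ log(ŷᵢ), where the sum runs over the classes.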
To better use Cross-Entropy Loss, you need to understand three mathematical aspects:
- There is a limit to how small or how large a floating-point number can be. To avoid hitting these limits, you can apply a scaling function to your outputs and/or inputs (e.g. sklearn.preprocessing.StandardScaler).
- If the input to the exponential function used in the softmax formula is a negative number, the result is an exponentially small number, and if it's a positive number, the result is an exponentially large number.
- And the log function is the inverse of the exponential function, which means that log(exp(x)) is equal to x.
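These three points explain why it is preferable to work with raw scores in log space rather than with explicit softmax probabilities. As a minimal sketch of the last two points:
import torch

x = torch.tensor([-1000.0, 0.0, 1000.0])
print(torch.exp(x))  # tensor([0., 1., inf]) : underflow and overflow at the float limits
print(torch.log(torch.exp(torch.tensor(3.0))))  # tensor(3.) : log(exp(x)) gives back x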
So, to get an optimized probability distribution with Cross-Entropy Loss in Pytorch, you should not apply the softmax function yourself during the training phase: nn.CrossEntropyLoss expects raw scores (logits) and applies log-softmax internally, which is numerically more stable. Then, once the model is trained, you can apply the softmax function to the outputs to get the prediction probabilities.
Finally, here is an example implementation of Cross-Entropy Loss using Pytorch :
import torch
import torch.nn as nn

# Cross-Entropy Loss
ce_loss = nn.CrossEntropyLoss()
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.tensor([1, 0, 3], dtype=torch.int64)
loss = ce_loss(outputs, targets)
print(f'Cross Entropy Loss : {loss}')

# Output
# Cross Entropy Loss : 1.7309303283691406
In this example, we assume that each input has exactly one correct class. This is why the targets vector has three integer elements, representing the index of the correct class for each input.
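As noted above, the softmax function is applied only after training to turn raw scores into probabilities. Here is a minimal sketch, reusing the outputs tensor from the example above:
probabilities = torch.softmax(outputs, dim=1)  # each row now sums to 1
predicted_classes = torch.argmax(probabilities, dim=1)
print(f'Probabilities : {probabilities}')
print(f'Predicted classes : {predicted_classes}')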
The Binary Cross-Entropy Loss function is used in classification problems that involve discriminating between two classes, known as binary classification.
The mathematical formula is :
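BCE(y, ŷ) = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]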
And here is an implementation example using Pytorch :
import torch
import torch.nn as nn

# Binary Cross-Entropy Loss
bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()
probabilities = sigmoid(torch.randn(4, 1, requires_grad=True))
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
loss = bce_loss(probabilities, targets)
print(f'This is probabilities : {probabilities}')
print(f'bce loss : {loss}')

# Output
# This is probabilities : tensor([[0.8276],
#         [0.4056],
#         [0.4190],
#         [0.5984]], grad_fn=<SigmoidBackward>)
# bce loss : 0.6229268312454224
In the example above, we create a binary probability vector, probabilities, by applying the sigmoid activation function to random values. Next, we instantiate a target vector of 0's and 1's, representing the two target classes. Finally, we use those two variables, probabilities and targets, to compute the loss value with the Binary Cross-Entropy loss function.
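Note that, similarly to what was said about Cross-Entropy Loss and softmax, Pytorch also provides nn.BCEWithLogitsLoss, which combines the sigmoid and the binary cross-entropy in a single, more numerically stable function. A minimal sketch of the same computation from raw scores:
bce_logits_loss = nn.BCEWithLogitsLoss()
logits = torch.randn(4, 1, requires_grad=True)  # raw scores, no sigmoid applied
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4, 1)
loss = bce_logits_loss(logits, targets)
print(f'bce with logits loss : {loss}')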
In this article, I presented three loss functions. Note also that Pytorch implements more loss functions in its nn package, which you can explore at this link: https://pytorch.org/docs/stable/nn.html#loss-functions
Each loss function is recommended for certain cases. However, don't hesitate to experiment with other loss functions when possible.
References :
- “Natural Language Processing with Pytorch” Book (https://www.amazon.fr/Natural-Language-Processing-Pytorch-Applications/dp/1491978236)