I typically use the tokenizer.encode_plus() function to tokenize my input, but there is another function that can be used to tokenize input, and this tokenizer.encode(). Here is an example of this:
encoding = tokenizer.encode(text, return_tensors = "pt")
The main difference between tokenizer.encode_plus() and tokenizer.encode() is that tokenizer.encode_plus() returns more information. Specifically, it returns the actual input ids, the attention masks, and the token type ids, and it returns all of these in a dictionary. tokenizer.encode() only returns the input ids, and it returns this either as a list or a tensor depending on the parameter, return_tensors = “pt”.
Masked Language Modeling
Masked Language Modeling is the task of decoding a masked token in a sentence. In simple terms, it is the task of filling in the blanks.
Instead of just getting the best candidate word to replace the mask token, I will demonstrate how you can take the top 10 replacement words for the mask token, and here is how you can do this:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torchtokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
word = tokenizer.decode([token])
new_sentence = text.replace(tokenizer.mask_token, word)
print(new_sentence)
Hugging Face is set up such that for the tasks that it has pre-trained models for, you have to download/import that specific model. In this case, we have to download the Bert For Masked Language Modeling model, whereas the tokenizer is the same for all different models as I said in the section above.
Masked Language Modeling works by inserting a mask token at the desired position where you want to predict the best candidate word that would go in that position. You can simply insert the mask token by concatenating it at the desired position in your input like I did above. The Bert Model for Masked Language Modeling predicts the best word/token in its vocabulary that would replace that word. The logits are the output of the BERT Model before a softmax activation function is applied to the output of BERT. In order to get the logits, we have to specify return_dict = True in the parameters when initializing the model, otherwise, the above code will result in a compilation error. After we pass the input encoding into the BERT Model, we can get the logits simply by specifying output.logits, which returns a tensor, and after this we can finally apply a softmax activation function to the logits. By applying a softmax onto the output of BERT, we get probabilistic distributions for each of the words in BERT’s vocabulary. Word’s with a higher probability value will be better candidate replacement words for the mask token. In order to get the tensor of softmax values of all the words in BERT’s vocabulary for replacing the mask token, we can specify the masked token index, which we get using torch.where(). Because in this particular example I am retrieving the top 10 candidate replacement words for the mask token(you can get more than 10 by adjusting the parameter accordingly), I used the torch.topk() function, which allows you to retrieve the top k values in a given tensor, and it returns a tensor containing those top k values. After this, the process becomes relatively simple, as all we have to do is iterate through the tensor, and replace the mask token in the sentence with the candidate token. Here is the output the code above compiles:
The capital of France, paris, contains the Eiffel Tower.
The capital of France, lyon, contains the Eiffel Tower.
The capital of France, lille, contains the Eiffel Tower.
The capital of France, toulouse, contains the Eiffel Tower.
The capital of France, marseille, contains the Eiffel Tower.
The capital of France, orleans, contains the Eiffel Tower.
The capital of France, strasbourg, contains the Eiffel Tower.
The capital of France, nice, contains the Eiffel Tower.
The capital of France, cannes, contains the Eiffel Tower.
The capital of France, versailles, contains the Eiffel Tower.
and you can see that Paris is indeed the top candidate replacement word for the mask token.
If you want to only get the top candidate word, you can do this:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)
text = "The capital of France, " + tokenizer.mask_token + ",
contains the Eiffel Tower."
input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
logits = model(**input)
logits = logits.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_word = torch.argmax(mask_word, dim=1)
print(tokenizer.decode(top_word))
Instead of using torch.topk() for retrieving the top 10 values, we just use torch.argmax(), which returns the index of the maximum value in the tensor. The rest of the code is pretty much the same thing as the original code.
Language Modeling
Language Modeling is the task of predicting the best word to follow or continue a sentence given all the words already in the sentence.
import transformersfrom transformers import BertTokenizer, BertLMHeadModel
import torch
from torch.nn import functional as F
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertLMHeadModel.from_pretrained('bert-base-uncased',
return_dict=True, is_decoder = True)
text = "A knife is very "
input = tokenizer.encode_plus(text, return_tensors = "pt")
output = model(**input).logits[:, -1, :]
softmax = F.softmax(output, -1)
index = torch.argmax(softmax, dim = -1)
x = tokenizer.decode(index)
print(x)
Language Modeling works very similarly to Masked language modeling. To start off, we have to download the specific Bert Language Model Head Model, which is essentially a BERT model with a language modeling head on top of it. One additional parameter we have to specify while instantiating this model is the is_decoder = True parameter. We have to specify this parameter if we want to use this model as a standalone model for predicting the next best word in the sequence. The rest of the code is relatively the same as the one in masked language modeling: we have to retrieve the logits of the model, but instead of specifying the index to be that of the masked token, we just have to take the logits of the last hidden state of the model(using -1 index), compute the softmax of these logits, find the largest probability value in the vocabulary, and decode and print this token.
Next Sentence Prediction
Next Sentence Prediction is the task of predicting whether one sentence follows another sentence. Here is my code for this:
from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
from torch.nn import functional as Ftokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')prompt = "The child came home from school."next_sentence = "He played soccer after school."encoding = tokenizer.encode_plus(prompt, next_sentence, return_tensors='pt')
outputs = model(**encoding)[0]
softmax = F.softmax(outputs, dim = 1)
print(softmax)
Next Sentence prediction is the task of predicting how good a sentence is a next sentence for a given sentence. In this case, “The child came home from school.” is the given sentence and we are trying to predict whether “He played soccer after school.” is the next sentence. To do this, the BERT tokenizer automatically inserts a [SEP] token in between the sentences, which represents the separation between the two sentences, and the specific Bert For Next Sentence Prediction model predicts two values of whether the sentence is the next sentence. Bert returns two values in a tensor: the first value represents whether the second sentence is a continuation of the first, and the second value represents whether the second sentence is a random sequence or not a good continuation of the first. Unlike Language Modeling, we don’t retrieve any logits because we are not trying to compute a softmax on the vocabulary of BERT; we are simply trying to compute a softmax on the two values that BERT for next sentence prediction returns so that we can see which value has the highest probability value, and this will represent whether the second sentence is a good next sentence for the first. Once we get the softmax values, we can simply look at the tensor by printing it out. Here are the values that I got:
tensor([[0.9953, 0.0047]])
Because the first value is considerably higher than the second index, BERT believes that the second sentence follows the first sentence, which is the correct answer.
Extractive Question Answering
Extractive Question Answering is the task of answering a question given some context text by outputting the start and end indexes of where the answer lies in the context. Here is my code for extractive question answering:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-
uncased')
question = "What is the capital of France?"
text = "The capital of France is Paris."
inputs = tokenizer.encode_plus(question, text, return_tensors='pt')
start, end = model(**inputs)
start_max = torch.argmax(F.softmax(start, dim = -1))
end_max = torch.argmax(F.softmax(end, dim = -1)) + 1 ## add one ##because of python list indexing
answer = tokenizer.decode(inputs["input_ids"][0][start_max : end_max])
print(answer)
Similar to the other three tasks, we begin by downloading the specific BERT model for Question Answering, and we tokenize our two inputs: the question and the context. Unlike the other models, the process is relatively straightforward for this model as it outputs the values for each word in the tokenized input. As I mentioned before, the way extractive question answering works is by computing the best start and end indexes for where the answer is located in the context. The model returns values for all of the words in context/input corresponding to how good they would be a start value and end value for the given question; in other words, each of the words in the input receives a start and end index score/value representing whether they would be a good start word for the answer or a good end word for the answer. The rest of this process is fairly similar to what we did on the other three programs; we compute the softmax of these scores to find the probabilistic distribution of values, retrieve the highest values for both the start and end tensors using torch.argmax(), and find the actual tokens that correspond to this start : end range in the input and decode them and print them out.
Using BERT for any task you want
Although Text Summarization, Question answering, and a basic Language Model are especially important, often, people want to use BERT for other unspecified tasks, especially in research. The way that they do this is by taking the raw outputs of the stacked encoders of BERT, and attaching their own specific model to it, most commonly a linear layer, and then fine-tuning this model on their specific dataset. When doing this in Pytorch using the Hugging Face transformer library, it is best to set this up as a Pytorch deep learning model like such:
from transformers import BertModel
class Bert_Model(nn.Module):
def __init__(self, class):
super(Bert_Model, self).__init__()
self.bert = BertModel.from_pretrained('bert-base-uncased')
self.out = nn.Linear(self.bert.config.hidden_size, classes)
def forward(self, input):
_, output = self.bert(**input)
out = self.out(output)
return out
As you can see, instead of downloading a specific BERT Model already designed for a specific task like Question Answering, I downloaded the raw pre-trained BertModel, which does not come with any heads attached to it.
To get the size of the raw BERT outputs, simply use self.bert.config.hidden_size, and attach this to the number of classes you want your linear layer to output.
To use the code above for sentiment analysis, which is surprisingly a task that does not come downloaded/already done in the hugging face transformer library, you can simply add a sigmoid activation function onto the end of the linear layer and specify the classes to equal 1.
from transformers import BertModel
class Bert_Model(nn.Module):
def __init__(self, class):
super(Bert_Model, self).__init__()
self.bert = BertModel.from_pretrained('bert-base-uncased')
self.out = nn.Linear(self.bert.config.hidden_size, classes)
self.sigmoid = nn.Sigmoid() def forward(self, input, attention_mask):
_, output = self.bert(input, attention_mask = attention_mask)
out = self.sigmoid(self.out(output))
return out