In the previous part of this series we looked at transformers. Transformers have become the workhorse of Natural Language Processing tasks. They are built on attention, which lets the model use the entire input text while maintaining focus on its most relevant parts. Because transformers use feed-forward layers rather than the sequential computation of the Recurrent Neural Networks (RNNs) previously used for text processing, they scale well on parallel processing infrastructure.
A word embedding is a vector representation of a word that Machine Learning algorithms can operate on. In this article we will look at BERT (Bidirectional Encoder Representations from Transformers), a word embedding model built using transformers, and how BERT can be used for Question Answering.
BERT uses fine-tuning to adapt to a specific text processing task
BERT uses a fine-tuning based pre-training approach. The model is first pre-trained on a large data set, without targeting any specific task. The pre-trained weights are then fine-tuned for a specific downstream task.
Input Embedding for the BERT model
The BERT model can take either a pair of sentences or a single sentence as its input. The input text is encoded using WordPiece token embeddings with a 30,000-token vocabulary. WordPiece uses sub-word tokens instead of whole words as the vocabulary, and selects the tokens so that they help build a better language model. The input begins with a special token [CLS], and sentences are separated by a special token [SEP]. A segment embedding denotes whether a token belongs to sentence A or sentence B, and a position embedding represents the position of the token in the sequence. The sum of the token, segment and position embeddings is used as input to the BERT model.
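The sum of the three embeddings can be sketched in a few lines of numpy. This is a minimal illustration, not the real model: the token ids are made up, the embedding tables are random, and only the sizes (30,000-token vocabulary, 768-dimensional hidden size as in BERT-base) follow the paper.

```python
import numpy as np

# Hypothetical sizes for illustration (BERT-base: vocab ~30000, hidden 768)
VOCAB_SIZE, MAX_POS, HIDDEN = 30000, 512, 768
rng = np.random.default_rng(0)

token_emb = rng.normal(size=(VOCAB_SIZE, HIDDEN))    # one row per WordPiece token
segment_emb = rng.normal(size=(2, HIDDEN))           # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(MAX_POS, HIDDEN))    # one row per position

# Example input: [CLS] questionA ... [SEP] sentenceB ... [SEP] (ids are made up)
token_ids   = np.array([101, 2054, 2003, 102, 1037, 3160, 102])
segment_ids = np.array([0,   0,    0,    0,   1,    1,    1])
positions   = np.arange(len(token_ids))

# The BERT input is the element-wise sum of the three embeddings
input_emb = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_emb.shape)  # (7, 768)
```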
Pre-training the BERT model
BERT uses the encoder from the transformer. The transformer is bi-directional. The BERT model is pre-trained on the following tasks:
- Masked Language Model (MLM): 15% of the input tokens are selected at random, and the model is trained to predict them. A selected token is a) replaced with the [MASK] token 80% of the time, b) replaced with a random token 10% of the time, and c) left unchanged 10% of the time.
- Next Sentence Prediction (NSP): Recall that the input to the BERT model is two sentences A and B separated by a [SEP] token. In this task the model is trained to predict whether B is the actual sentence that follows A in the training corpus.
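The MLM masking scheme above can be sketched as a small Python function. This is a toy version on string tokens with a made-up vocabulary, just to make the 80/10/10 split concrete; the real implementation works on token ids.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "run", "tree", "blue"]  # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select ~15% of tokens as prediction targets; of those,
    80% -> [MASK], 10% -> a random token, 10% left unchanged."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok              # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = MASK            # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # else: 10% left unchanged
    return out, labels
```

Note that positions with a `None` label are ignored by the loss; the model is only trained to predict the selected 15% of tokens.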
The BERT model is pre-trained on the BookCorpus and English Wikipedia data.
Once the BERT model is pre-trained, we can fine-tune it for specific tasks. The input sentences A and B are replaced with task-specific input, and a task-specific output layer is added.
Fine Tuning BERT for Question Answering
Let us look at how the pre-trained BERT model is fine-tuned for the Question Answering task. As mentioned earlier, BERT takes a pair of sentences as input. For Question Answering, the first sentence is the question and the second is the context (the passage that contains the answer). The BERT layer converts the question and context into embeddings, and a linear layer followed by a softmax is used to compute the start-span and end-span probabilities.
We learn a start vector S whose dot product with the embedding of the token at position i gives a score for position i being the start of the answer span. A softmax over these scores for all tokens yields a normalized start probability for each position. Similarly, we learn an end vector E whose dot product with the embedding of the token at position i gives a score for position i being the end of the span, again normalized with a softmax. The score of a candidate span from position i to position j is the sum of the normalized probability of i being the start token and j being the end token. The highest-scoring span with j ≥ i is chosen as the model's prediction.
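The span-selection step described above can be sketched in numpy. This assumes the token embeddings have already been produced by BERT; the tiny identity-matrix example in the comments is purely illustrative.

```python
import numpy as np

def best_span(token_embs, S, E):
    """Pick the answer span: softmax the start/end scores over all positions,
    then return the (start, end) pair with end >= start and the highest score."""
    start_logits = token_embs @ S        # dot product S . T_i for each token i
    end_logits   = token_embs @ E        # dot product E . T_i for each token i

    start_p = np.exp(start_logits - start_logits.max())
    start_p /= start_p.sum()             # softmax over all token positions
    end_p = np.exp(end_logits - end_logits.max())
    end_p /= end_p.sum()

    best, best_score = (0, 0), -np.inf
    n = len(token_embs)
    for i in range(n):
        for j in range(i, n):            # enforce j >= i
            score = start_p[i] + end_p[j]  # sum of normalized probabilities
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Toy check: with identity "embeddings", S peaks at position 1 and E at position 3,
# so the predicted span is (1, 3)
print(best_span(np.eye(4), np.array([0., 10., 0., 0.]), np.array([0., 0., 0., 10.])))
```

In practice the double loop is restricted to spans up to a maximum length, and the whole computation is vectorized, but the logic is the same.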
Now that we have covered the background, let us get to some code. In the next part of this series we will look at the code to build a Question Answering system using a BERT-based deep learning model.