After loading the BERT-large model for Q&A, printing the model shows all of its layers and sub-layers. BERT-large has 24 BertLayer blocks, one BertEmbeddings layer at the beginning, and a qa_outputs layer at the end.
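As a minimal sketch of this (the exact checkpoint name below is an assumption on my part; any BertForQuestionAnswering checkpoint will print the same layout):

```python
# Minimal sketch: load a BERT-large Q&A model and inspect its layers.
# The checkpoint name is an assumption; any BertForQuestionAnswering
# checkpoint will show the same overall structure.
from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)
print(model)  # BertEmbeddings, 24 x BertLayer, then the qa_outputs linear head
```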
In this blog I will talk about the sections inside the embedding layer rather than the BERT layers, as I am assuming readers already know about self-attention, attention heads, query, key, value, layer norm, and finally the feed-forward output layer. Now let's understand the sub-layers inside BertEmbeddings.
word_embeddings : At the beginning of the embedding layer there is a section called word_embeddings which holds a matrix of size (30522, 1024). This matrix can be thought of as a dictionary where the keys are token ids and the values are 1024-dimensional (for BERT-large) vectors. Note that this matrix is trainable: while fine-tuning, the model updates (by back-propagation) only the vectors corresponding to tokens that actually appear in the training data.
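A quick sketch of this lookup, reusing the model loaded above and assuming the matching uncased tokenizer:

```python
import torch
from transformers import BertTokenizer

# Assumes the tokenizer that matches the checkpoint loaded above
tokenizer = BertTokenizer.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad"
)

word_emb = model.bert.embeddings.word_embeddings
print(word_emb.weight.shape)          # torch.Size([30522, 1024])
print(word_emb.weight.requires_grad)  # True: updated during fine-tuning

# Dictionary-style lookup: token id -> stored 1024-d vector
token_id = tokenizer.convert_tokens_to_ids("puppy")
vector = word_emb(torch.tensor([token_id]))
print(vector.shape)                   # torch.Size([1, 1024])
```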
Remember that this embedding matrix is just a lookup table that returns the same vector for the same token every time; it is the output of the last hidden layer where we can expect different embeddings for the same word, embeddings that capture context information.
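A small sketch of that difference (the example sentences are my own): looking up the same token id in word_embeddings always returns the same row, while the last hidden state for that token changes with the sentence around it.

```python
sent1 = tokenizer("I sat by the river bank", return_tensors="pt")
sent2 = tokenizer("I deposited the cheque at the bank", return_tensors="pt")

with torch.no_grad():
    hidden1 = model.bert(**sent1).last_hidden_state  # (1, seq_len, 1024)
    hidden2 = model.bert(**sent2).last_hidden_state

bank_id = tokenizer.convert_tokens_to_ids("bank")
pos1 = (sent1["input_ids"][0] == bank_id).nonzero().item()
pos2 = (sent2["input_ids"][0] == bank_id).nonzero().item()

# Static lookup: always the same vector for "bank"
static_same = torch.equal(word_emb(torch.tensor(bank_id)),
                          word_emb(torch.tensor(bank_id)))
# Contextual output: different vectors for "bank" in the two sentences
contextual_same = torch.allclose(hidden1[0, pos1], hidden2[0, pos2])
print(static_same, contextual_same)  # True False
```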
position_embeddings : Both BERT-base and BERT-large accept sequences of at most 512 tokens, so before feeding inputs to the model make sure the token length is less than or equal to 512. Unlike an LSTM or RNN, here we feed the entire sequence to the model in one go, so the model needs some way to capture sequential information (which word comes before and which comes after). To do so, the model adds a positional embedding to each token. In BERT this positional matrix is itself trainable, an Embedding(512, 1024) whose vectors are learned during pre-training rather than fixed sinusoids as in the original Transformer. So the input to this layer is the output of the word_embeddings layer, a 1024-dimensional vector for each input token, and position_embeddings adds another 1024-dimensional vector to every token to hold the sequential information.
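A quick check on the loaded model (again a sketch reusing model from above):

```python
pos_emb = model.bert.embeddings.position_embeddings
print(pos_emb.weight.shape)          # torch.Size([512, 1024]) -- one vector per position
print(pos_emb.weight.requires_grad)  # True: learned during pre-training

# Inside BertEmbeddings the three lookups are summed element-wise per token:
#   embeddings = word_embeddings(input_ids)
#              + position_embeddings(position_ids)
#              + token_type_embeddings(token_type_ids)
```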
token_type_embeddings : Remember that the BERT model is pre-trained on a large dataset using masked language modelling and next sentence prediction. So by default the model expects two sentences as input, separated by a [SEP] token, even though tasks like classification and sentiment analysis do not feed two sentences. In a task like classification we feed an additional attention mask that tells the model which tokens come from the sentence and which are just padding tokens added to match the length. In a task like QnA, where we have to feed two sequences, the question and the passage, we must also provide segment ids (0's for the first segment and 1's for the second segment). We will see this in detail when we pass an input to our Q&A model.
So what is this token_type_embeddings : Embedding(2, 1024) that we see when we print the model?
We feed these segment ids along with the token ids, but the token embeddings are 1024-dimensional vectors, so how are the segment ids used? token_type_embeddings is a small trainable matrix with one learned 1024-dimensional vector for segment id 0 and another for segment id 1, and the appropriate row is added to each token's embedding.
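A sketch to verify this on the loaded model: both segment ids map to learned rows of the Embedding(2, 1024) matrix, not fixed vectors of zeros and ones.

```python
seg_emb = model.bert.embeddings.token_type_embeddings
print(seg_emb.weight.shape)             # torch.Size([2, 1024])

vec_seg0 = seg_emb(torch.tensor([0]))   # learned vector added to first-segment tokens
vec_seg1 = seg_emb(torch.tensor([1]))   # learned vector added to second-segment tokens
print(torch.equal(vec_seg0, vec_seg1))  # False: two distinct trained rows
```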
Before talking about the last linear layer, let's take our first example to test the BERT-large question answering model. BertTokenizer provides a built-in method called encode that takes a question/passage pair and performs three tasks. First it tokenizes the entire input into tokens from the vocabulary. Then it places a [CLS] token at the start, a [SEP] token at the end of the first sequence, and another [SEP] token after the end of the last sequence. Finally it converts all the tokens to their token ids.
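A small sketch of what encode produces (the question/passage pair here is just an illustrative example of my own):

```python
question = "What is the capital of France?"
passage = "Paris is the capital and most populous city of France."

input_ids = tokenizer.encode(question, passage)
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(tokens)
# ['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'france', '?', '[SEP]',
#  'paris', 'is', 'the', 'capital', 'and', 'most', 'populous', 'city',
#  'of', 'france', '.', '[SEP]']
```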