Understanding Bert Usage

In this article, I will show the basic usage of Bert and visualize Bert embeddings without any training. There are plenty of nice tutorials for learning Bert, so I will only try to visualize some aspects and assume you already have basic knowledge of Bert. In the next article I will fine-tune Bert and try to show what changes after training. (Next article: Link)

The code is on GitHub: Link
Use nbviewer to see the colors better: Link

Now let’s summarize the road to Bert very briefly:

First we had Word2vec and GloVe, which generate fixed embeddings. (They use the surrounding words during training, but afterwards a word has the same embedding whatever sentence it appears in.)
At some point, with ELMo (bi-LSTM), we began to use the context of the sentence for word embeddings. (It concatenates the embeddings from the left-to-right and right-to-left passes.)
Then Bert came along, using left and right context at the same time through the Transformer mechanism. The Transformer was a big step for NLP.

We can use Transformers very efficiently for language translation (English -> German): the Encoder processes the input (English), and the Decoder processes the Encoder's output together with the translation generated so far. Bert is a language model: a pre-trained model that generates vectors to be used in downstream tasks. So we only need the Encoder part.

Bert is a Masked Language Model. In auto-regressive language models, each word is predicted conditioned on the previous words. In Bert, the prediction of a word is conditioned on all the other words. Bert replaces some tokens with the special token [MASK] and tries to predict them. The output is no longer auto-regressive (left -> right); it is computed for the whole sentence at once, conditioned on the non-masked tokens. (***This has a drawback: the masked tokens are predicted independently of each other.)
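To make the masking step concrete, here is a minimal sketch in plain Python. The 15% selection rate and the 80/10/10 split (mask / random token / unchanged) follow the original Bert paper; the function name `mask_tokens` and the toy vocabulary are made up for illustration, and this is not the code used to pre-train Bert.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels); labels[i] holds the original token
    where the model must predict it, and None everywhere else."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected for prediction.
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels

tokens = ["[CLS]", "i", "eat", "apple", "[SEP]"]
vocab = ["i", "we", "eat", "drink", "apple", "bread"]  # toy vocabulary
# mask_prob is raised to 0.5 here only so that a 3-word sentence gets masked.
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
```

Every position with a label is predicted from the same corrupted sentence, which is why two masked tokens cannot see each other's predictions.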

How to use a Trained network for our problems?
Bert is a huge network that generates embeddings. Can we use it in any task?
Bert is trained on 2 tasks, Masked Language Modeling and Next Sentence Prediction. So always keep in mind that if your task is neither of these, using Bert outputs directly may not give very good results. In this article I will use it directly, and in a later article I will fine-tune Bert.

The Bert architecture has a BertEmbeddings layer whose 1st element is “Embedding(30522, 768, padding_idx=0)”. 30522 is the vocabulary size and 768 is the standard Bert embedding vector size. There are also “position_embeddings” and “token_type_embeddings”, which are looked up from the position and the segment id of each token. When we supply these 3 inputs, the three embeddings are summed into a [batch_size x seq_len x 768] tensor for the later layers.
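A toy sketch of how the three lookups combine into one tensor. The tables here are random and tiny (hidden size 4 instead of 768), the helper names `make_table` and `embed` are hypothetical, and the LayerNorm and dropout that the real BertEmbeddings layer applies after the sum are left out.

```python
import random

def make_table(rows, cols, rng):
    """Random stand-in for a learned embedding table (rows x cols)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def embed(input_ids, token_type_ids, word_emb, pos_emb, type_emb):
    """Sum the word, position and token-type embedding of every token."""
    return [
        [w + p + t for w, p, t in zip(word_emb[tok], pos_emb[pos], type_emb[seg])]
        for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids))
    ]

rng = random.Random(0)
# Toy sizes; Bert-base uses 30522, 512, 2 and 768.
VOCAB, MAX_POS, TYPES, HIDDEN = 10, 8, 2, 4
word_emb = make_table(VOCAB, HIDDEN, rng)
pos_emb = make_table(MAX_POS, HIDDEN, rng)
type_emb = make_table(TYPES, HIDDEN, rng)

batch_ids = [[1, 5, 6, 7, 2], [1, 5, 6, 8, 2]]  # 2 sequences of 5 made-up token ids
batch_types = [[1] * 5, [1] * 5]                # 1s everywhere, as in the article
batch = [embed(ids, types, word_emb, pos_emb, type_emb)
         for ids, types in zip(batch_ids, batch_types)]
shape = (len(batch), len(batch[0]), len(batch[0][0]))  # (batch_size, seq_len, hidden)
```

At this stage the same token at the same position gets the same vector in both sequences; the vectors only become context dependent in the Transformer layers that follow.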

After the 1st layer, Bert has 12 Transformer layers called “BertLayer”. ( Source Code). As in any neural network, earlier layers learn basic features and later layers learn features specific to the training objective. Below we will demonstrate the outputs of all these layers. Basically, we can think of the BertLayers as layers that consecutively produce vectors of size 768 for each token.

BertLayer = BertAttention + BertIntermediate + BertOutput

Then we have the Pooler at the end: a linear layer followed by tanh, applied to the vector of the [CLS] token. This gives us a useful vector for classification problems.

Below we can see that the Bert model itself has 110 million parameters. In the 2nd article we will add 1 linear layer on top of this Bert and calculate the number of parameters again. You will see that it only adds about 2000 new parameters. Then we will train the network, and the weights will change according to our objective.
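The 110 million figure can be checked with simple arithmetic. The sketch below assumes the bert-base-uncased configuration (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types). The 3-class head at the end is only an assumption to show the order of magnitude; the article does not state the number of classes.

```python
VOCAB, MAX_POS, TYPES, H, FF, LAYERS = 30522, 512, 2, 768, 3072, 12

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * H  # scale and shift vectors

# BertEmbeddings: word + position + token-type tables, then LayerNorm
embeddings = VOCAB * H + MAX_POS * H + TYPES * H + layer_norm

# One BertLayer = BertAttention + BertIntermediate + BertOutput
attention = 4 * linear(H, H) + layer_norm  # query, key, value, output projection
feed_forward = linear(H, FF) + linear(FF, H) + layer_norm
encoder = LAYERS * (attention + feed_forward)

pooler = linear(H, H)
total = embeddings + encoder + pooler
print(f"{total:,}")  # 109,482,240

# Assumption: a classification head with 3 output classes.
head = linear(H, 3)
print(head)  # 2307
```

Almost all of the parameters sit in the 12 Transformer layers and in the word embedding table, so a linear head on top is negligible in comparison.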

The data I am using here is a simple set of 36 English sentences, as above.

Sample Encoding with Bert
To use Bert in a simple way, you can check the code below. The sample sentence is “i eat apple”. First we add the special tokens ([CLS] at the start, [SEP] at the end). Then we split the text into tokens. Then we convert those tokens to indexed_tokens in the Bert vocabulary. Then we create the segment vectors. (For a simple 1-sentence encoding it is 1s everywhere.)
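As a minimal sketch of these four steps (not the notebook code itself), the snippet below uses a tiny made-up vocabulary instead of the real Bert vocabulary file. The ids of the special tokens match bert-base-uncased; the ids of the three words and the helper name `encode` are hypothetical.

```python
# Toy vocabulary: special-token ids as in bert-base-uncased,
# word ids invented for this sketch.
toy_vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "i": 1000, "eat": 1001, "apple": 1002}

def encode(sentence, vocab):
    # 1. add the special tokens
    marked = "[CLS] " + sentence.lower() + " [SEP]"
    # 2. split into tokens (whitespace only; real Bert uses WordPiece)
    tokens = marked.split()
    # 3. map tokens to their ids, falling back to [UNK]
    indexed_tokens = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 4. segment ids: 1s everywhere, as in the article
    #    (Hugging Face tokenizers return 0s for a single sentence)
    segment_ids = [1] * len(tokens)
    return tokens, indexed_tokens, segment_ids

tokens, indexed_tokens, segment_ids = encode("i eat apple", toy_vocab)
print(tokens)          # ['[CLS]', 'i', 'eat', 'apple', '[SEP]']
print(indexed_tokens)  # [101, 1000, 1001, 1002, 102]
print(segment_ids)     # [1, 1, 1, 1, 1]
```

With a real tokenizer the two id lists are turned into tensors and passed to the model, which returns one 768-dimensional vector per token.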

The result of the above code is as below. You can see the input vectors and the output vectors.

Word Embedding with Bert
When Bert generates vectors, it works with word pieces (WordPiece tokens), because Bert has a limited vocabulary (about 30k entries, which is small for NLP and keeps the embedding table small). So when you have mistyped or unknown (rare) words, they are split into several pieces, and what Bert returns will differ from what you expect: you get one vector per piece, not one per word. So always return token-vector pairs.
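The splitting can be illustrated with a minimal greedy longest-match-first sketch. The vocabulary is a toy one, the function name `wordpiece` is hypothetical, and details of the real tokenizer (lower-casing, punctuation splitting, a maximum word length) are left out.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right.
    Pieces that do not start the word carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"eat", "apple", "app", "##p", "##le", "##ing", "##s"}

print(wordpiece("apple", toy_vocab))   # ['apple']
print(wordpiece("eating", toy_vocab))  # ['eat', '##ing']
print(wordpiece("appple", toy_vocab))  # ['app', '##p', '##le']
print(wordpiece("qqq", toy_vocab))     # ['[UNK]']
```

The mistyped “appple” comes back as three pieces and therefore three vectors, which is why it helps to keep each vector paired with the token it belongs to.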

To use individual word embeddings, you can try to visualize the embeddings Bert creates for your problem. Check the 12 Transformer layers, check their combinations, or use all of them (mean or sum).

Below I will try a simple visualization of word embeddings. Since Bert generates vectors based on context, the vector for “I” in “I eat apple” and in “I eat bread” must be different.
“apple-eating I” is different from “bread-eating I”.

Or we can check “eat”:
“action of 1st person singular on apple”
“action of 1st person singular on bread”
Even though these are the same word “eat”, the actions have different word vectors. This means Bert is catching context. (If we were using GloVe, it would return the same vector for “eat”.)

Eat vector for 12 sentences

Above you can see clusters of “can”, “want” and “verb only”. You can also see sub-clusters of “i” and “we”. This means that even though we have an “eat” vector in all 12 sentences, the vector changes according to the sentence.

I filtered the 36 sentences by the verb “eat”, which leaves 12 sentences, so I have 12 “eat” vectors. We can check the cosine similarity of these vectors. The table below shows the cosine between all pairs of vectors. You can see that sentences with the same modal verb or the same subject get the highest scores, because they have very similar contexts. The lowest scores are for modal verb vs plain verb sentences (I can eat bread x we eat apple). As you see, the context is very different. (Open the image in a new tab if it is blurred, or check GitHub for the image.)
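The cosine between two vectors is their dot product divided by the product of their lengths. A minimal sketch of the pairwise table follows; the 3-dimensional vectors are invented stand-ins for the real 768-dimensional “eat” vectors, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_table(vectors):
    """Cosine between every pair of named vectors."""
    keys = list(vectors)
    return {(a, b): cosine(vectors[a], vectors[b]) for a in keys for b in keys}

# Invented values, chosen only to mimic the pattern in the article's table.
eat_vectors = {
    "i can eat bread": [0.9, 0.1, 0.3],
    "we can eat apple": [0.8, 0.2, 0.4],
    "we eat apple": [0.1, 0.9, 0.4],
}
table = cosine_table(eat_vectors)
for (a, b), score in table.items():
    print(f"{a:18s} x {b:18s} {score:.3f}")
```

A score near 1 means the two vectors point in almost the same direction, and the diagonal of the table is always 1.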

Cosine of Each Vector

Sentence Embedding with Bert

There are lots of ways to generate sentence embeddings with Bert. In fact, there are even experiments comparing them on different tasks; you can search for them on the net. Here I try to test some practical use cases.

CLS: Special token whose vector represents the whole sentence. To use it, we take position 0 of the final layer (last hidden state) returned by the model.
Pooler Output: This is the output of the last layer of the Bert architecture. At the end of the network there is a linear layer (with tanh) on top of the CLS vector, meant for classification tasks.
Hidden Layers = 1st embedding layer + 12 hidden layers. The important thing about these layers is that they are trained for a specific problem and probably will not fit your problem perfectly.

Since in any network the first layers capture low-level features and later ones capture high-level ones (semantic, shape, group..), the last layers of Bert need a bit of fine-tuning for your task. But since Bert is trained on a large corpus, it is still a very good vectorizer for nearly all NLP problems.

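On the internet you will mostly see people using the last 4 layers of Bert. Each of these 4 layers has 768 dimensions per token. You can concatenate them (768×4), sum them (768) or take their mean (768). To get one sentence vector you then pool over the tokens; summing over tokens is not a good idea, because longer sentences generate bigger sums, so the mean is safer.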
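A small sketch of these options with toy numbers (hidden size 2 instead of 768). The helper names `combine_layers` and `mean_pool` are hypothetical; with real Bert outputs the same operations are usually done on tensors.

```python
def combine_layers(layers, how="mean"):
    """Combine the vectors that several layers produced for ONE token."""
    if how == "concat":
        return [x for layer in layers for x in layer]
    summed = [sum(values) for values in zip(*layers)]
    if how == "sum":
        return summed
    if how == "mean":
        return [s / len(layers) for s in summed]
    raise ValueError(f"unknown method: {how}")

def mean_pool(token_vectors):
    """Average the token vectors of a sentence into one sentence vector."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# Last 4 layers for a single token, hidden size 2.
last_four = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(combine_layers(last_four, "concat"))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(combine_layers(last_four, "sum"))     # [16, 20]
print(combine_layers(last_four, "mean"))    # [4.0, 5.0]

# Two tokens of one sentence, pooled into a sentence vector.
print(mean_pool([[1, 1], [3, 5]]))          # [2.0, 3.0]
```

Concatenation keeps the most information but makes the vector four times larger, while the mean keeps the size at 768 and stays on the same scale for short and long sentences.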

If your problem is complicated and these do not work for you, you can try CNN-style convolution over the word vectors. It can give very good results. Just as images are 2D arrays, texts are word (rows) x embedding (columns) 2D structures.
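One common way to do this, sketched here under my own assumptions, is a kernel that spans the full embedding width and slides only along the word axis, followed by max-over-time pooling. The function names and the toy numbers are made up, and a real model would learn many such kernels.

```python
def conv1d_over_words(sentence, kernel, bias=0.0):
    """Slide a (window x embedding) kernel down the word axis.

    sentence: list of word vectors (words x embedding)
    kernel:   list of weight rows  (window x embedding)
    Returns one ReLU activation per window position.
    """
    window = len(kernel)
    feature_map = []
    for start in range(len(sentence) - window + 1):
        total = bias
        for row, weights in zip(sentence[start:start + window], kernel):
            total += sum(x * w for x, w in zip(row, weights))
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map

def max_over_time(feature_map):
    """Keep the strongest response, wherever in the sentence it occurred."""
    return max(feature_map)

# 4 words x 2 embedding dimensions, and a kernel looking at 2 words at a time.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]
feature_map = conv1d_over_words(sentence, kernel)
print(feature_map)                 # [2.0, 1.0, 1.0]
print(max_over_time(feature_map))  # 2.0
```

Each kernel acts like a detector for a short word pattern, and the pooled values of all kernels form the sentence vector that goes into the classifier.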

Now I will try these on my small dataset. The dataset is about 3 verbs (eat, drink, read) with 6 objects (eat -> [bread, apple], drink -> [water, beer], read -> [book, newspaper]) and 2 subjects (I, we)…
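A dataset of this shape can be generated in a few lines. The three templates are an assumption: the article only names the “can”, “want” and plain-verb groups, so the exact wording of the “want” sentences is a guess.

```python
subjects = ["i", "we"]
objects_by_verb = {
    "eat": ["bread", "apple"],
    "drink": ["water", "beer"],
    "read": ["book", "newspaper"],
}
# Assumed templates: plain verb, "can" and "want".
templates = ["{s} {v} {o}", "{s} can {v} {o}", "{s} want to {v} {o}"]

sentences = [
    template.format(s=subject, v=verb, o=obj)
    for verb, objs in objects_by_verb.items()
    for obj in objs
    for subject in subjects
    for template in templates
]
eat_sentences = [s for s in sentences if " eat " in f" {s} "]

print(len(sentences))      # 36
print(len(eat_sentences))  # 12
```

3 verbs x 2 objects x 2 subjects x 3 templates gives the 36 sentences, and filtering on one verb leaves 12 of them.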

Now I will generate word and sentence vectors with this dataset and check the quality of the generated vectors. As stated, there are multiple ways to create vectors; below I show 3 of them. I think “Sentence by Hidden Layers” gives a good distribution for these sentences. Even though nothing is trained yet, it separates read from (eat, drink) nicely. (I don’t blame the other vectors: they do not have to separate according to this task, because they were not trained for it.)

The above code generates 3 graphs, each showing the distribution for one of the mentioned sentence-embedding methods.

Bert Hidden Layers

Now let’s check what all the intermediate steps learned. Below I dump the Embedding layer and the 12 Transformer layers. If you check, Embedding, Transformer1, Transformer2 and Transformer3 have a good distribution according to the verb clusters, but then the groups start to mix (because Bert was not trained for my dataset). This shows how the first layers of a network extract basic features.

Above we can see what the individual layers of Bert have learned. If you apply this logic to your problem, you will get a basic understanding of what Bert can do for you.

In this article, I summarized the architecture very simply. Then we went over word and sentence embeddings. Finally, we checked what the hidden layers learned. In the next article I will train Bert, so you can see what changes when we train.
