
We will present three binary text classification models using CNN, LSTM, and BERT.
Data Preprocessing
Because our data comes from social networks such as Twitter and Facebook, the original dataset contains a lot of useless or noisy text. Before feeding the data into an NLP model for training, we first need to clean it. The steps we followed are listed below, and a small sketch implementing a few of them follows the list; you can modify any of these rules to suit your own cleaning needs.
- drop null rule
- retweet rule
- hashtag rule
- markup rule
- url rule
- email rule
- number rule
- remove punctuation
- remove nonprintable
- remove non-ASCII characters
- lemmatize rule
- tokenize rule
- vectorization rule
- etc.
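As a rough illustration, here is a minimal sketch of how some of these rules could be implemented with regular expressions; the exact patterns are assumptions and should be adapted to your own data (lemmatization, tokenization, and vectorization are left to whatever NLP toolkit you use).

```python
import re
import string

def clean_text(text):
    """Apply a handful of the cleaning rules above (illustrative patterns only)."""
    if text is None:                                   # drop null rule
        return ""
    text = re.sub(r"^RT\s+@\w+:?\s*", "", text)        # retweet rule
    text = re.sub(r"#\w+", "", text)                   # hashtag rule
    text = re.sub(r"<[^>]+>", "", text)                # markup rule
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # url rule
    text = re.sub(r"\S+@\S+", "", text)                # email rule
    text = re.sub(r"\d+", "", text)                    # number rule
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = "".join(ch for ch in text if ch.isprintable())             # remove nonprintable
    return text.lower().strip()
```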
Experiments with CNN and LSTM
The CNN (Convolutional Neural Network) model is normally used for image tasks, but we use it here as a baseline to establish a lower bound on performance. The model definition is shown below.
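A minimal Keras sketch of such a 1D-convolution baseline might look like the following; the vocabulary size, embedding dimension, and filter settings are assumptions rather than the exact values used in the experiment.

```python
import tensorflow as tf

vocab_size = 10000  # assumed vocabulary size
embed_dim = 128     # assumed embedding dimension

cnn_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
```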
The LSTM (Long Short-Term Memory) model is a Recurrent Neural Network (RNN) variant with gated feedback connections that help it retain long-range context. For NLP, an LSTM or RNN makes more sense than a CNN, because words later in a sentence can change how earlier words should be interpreted. The LSTM model definition is shown below.
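A comparable Keras sketch for the LSTM baseline is shown next; again, the layer sizes are assumed, and the bidirectional wrapper is one way to let later words influence the representation of earlier ones.

```python
import tensorflow as tf

vocab_size = 10000  # assumed vocabulary size
embed_dim = 128     # assumed embedding dimension

lstm_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # context in both directions
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary output
])
```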
Experiment using BERT
Transfer learning is a very powerful technique in machine learning.
It focuses on storing knowledge gained while solving one problem and applying it to a different but related problem (Reference 2).
BERT (Bidirectional Encoder Representations from Transformers) is a deep bidirectional Transformer pre-trained for language understanding. We can simply add a couple of layers on top of the pre-trained BERT model. In our case, on top of the sequence_output layer, we add 4 more dense layers with dropout and a regularizer. In the last layer, because we are doing binary classification, num_classes is 2. The full BERT-based model is shown below.
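Below is a sketch of how such a model could be assembled with TensorFlow Hub and Keras; the hub URL/version, dense-layer sizes, dropout rate, and L2 strength are assumptions, and the exact input/output signature depends on the hub module version you load.

```python
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import regularizers

max_len = 128        # assumed maximum sequence length
num_classes = 2      # binary classification

# Pre-trained BERT encoder from TF Hub (URL/version assumed).
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/4",
    trainable=True)

input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
input_mask     = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
input_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_type_ids")

outputs = bert_layer({"input_word_ids": input_word_ids,
                      "input_mask": input_mask,
                      "input_type_ids": input_type_ids})
sequence_output = outputs["sequence_output"]   # (batch, max_len, 1024)

# Four extra dense layers with dropout and an L2 regularizer (sizes assumed).
x = sequence_output[:, 0, :]                   # [CLS] token representation
for units in (512, 256, 128, 64):
    x = tf.keras.layers.Dense(units, activation="relu",
                              kernel_regularizer=regularizers.l2(1e-4))(x)
    x = tf.keras.layers.Dropout(0.3)(x)

predictions = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

bert_model = tf.keras.Model(
    inputs=[input_word_ids, input_mask, input_type_ids],
    outputs=predictions)
```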
CNN vs LSTM vs BERT
For all three models, we calculate performance metrics such as Precision, Recall, AUC, and Accuracy. Each model was trained for 15 epochs.
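For illustration, the metrics and the 15-epoch training run could be wired up roughly as follows; model, train_ds, and val_ds are placeholder names for one of the models above and your own data pipelines, and while the sigmoid-output CNN/LSTM models use binary_crossentropy, the two-class softmax BERT head would use sparse_categorical_crossentropy instead.

```python
import tensorflow as tf

# Metric set matching the numbers reported in the comparison.
metrics = [
    tf.keras.metrics.Precision(name="precision"),
    tf.keras.metrics.Recall(name="recall"),
    tf.keras.metrics.AUC(name="auc"),
    tf.keras.metrics.BinaryAccuracy(name="accuracy"),
]

# model, train_ds, and val_ds are placeholders for one of the models above
# and your own tf.data pipelines.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy",
              metrics=metrics)

history = model.fit(train_ds, validation_data=val_ds, epochs=15)
```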
We find that BERT has more than 167 times as many parameters as the other models; it takes much longer to train but achieves better performance. The BERT model we use is bert_en_uncased_L-24_H-1024_A-16.