Arabic NLP tutorial on creating Arabic Sentence Embeddings with Multi-Task Learning for fast and efficient Semantic Textual Similarity tasks.
In the first article of this Arabic natural language processing (NLP) series, I introduced a transformer language model named AraBERT (Arabic Bidirectional Encoder Representations from Transformers), released by Antoun et al. (2020), which performs exceptionally well on a variety of Arabic NLP benchmarks. As is typical of state-of-the-art language models, AraBERT is quite large: the base model has 110 million parameters and the large model has 340 million. When one considers the size of these language models, it becomes evident that an accessibility gap exists between the pragmatic researcher and state-of-the-art NLP tools.
Depending on one’s access to resources, the superior results of cutting-edge language models can be overshadowed by a host of practical considerations. For example, researchers might be limited by the availability of computing resources, an expense that inevitably has to be balanced against tradeoffs between money and time. Prohibitive costs aside, social research greatly benefits from the application of NLP, especially since a data-driven approach provides alternative angles from which to investigate common problems. This is particularly relevant for under-researched regions like the Middle East and North Africa (MENA), where the inclusion of Arabic text is not only essential for accountability but also helps avoid Western bias.
In an article written last year, I discussed my interest in transformer sentence embeddings, an idea I encountered in a research paper that details the training of efficient sentence embeddings from transformer language models. The paper described how an NLP task like finding the most similar pair among 10,000 sentences, as determined by assessing semantic textual similarity (STS), would require roughly 50 million inference computations with a transformer language model like AraBERT. This would take roughly 65 hours to complete, making AraBERT ill-suited for semantic similarity search or unsupervised tasks such as clustering. The clever solution devised by researchers at the Ubiquitous Knowledge Processing (UKP) Lab was to train transformer sentence embeddings using a siamese network architecture; this allowed the aforementioned STS task to be completed in about 5 seconds.
Their creation, Sentence-BERT (SBERT), is well suited to STS tasks, and this tutorial with code outlines an extension of this NLP tool to Arabic using multi-task learning (MTL). I was inspired by the example script shared in UKP Lab’s sentence-transformers repository (created and maintained by SBERT author Nils Reimers), which implemented MTL to train an English sentence embedding model. In this tutorial, I first provide some background on STS and sentence embeddings, followed by a discussion of MTL. Next, I describe the experimental setup used to train an Arabic sentence embedding model (which I named SAraBERT), followed by a full code walkthrough of the MTL training with intuitive explanations of the process. Lastly, I evaluate the trained model on an Arabic STS benchmark and offer ideas for utilizing this tool for social research.
STS concerns the similarity of meaning between a pair of sentences, and it can be measured with metrics such as cosine similarity or Manhattan/Euclidean distance. Intuitively, sentence embeddings can be understood as a document processing method that maps sentences to vectors, representing text with real numbers suitable for machine learning. Given a collection of sentences, sentence embeddings can be used to transform the text units into fixed-sized output vectors that represent features, which are then comparable across the sentence collection. A comparison is possible because sentences are mapped to a vector space such that semantically similar sentences are closer together. Evaluating semantic similarity this way is useful for a variety of NLP tasks, including information retrieval, paraphrase identification, duplicate question detection and extractive text summarization.
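To make this concrete, here is a minimal sketch that encodes a few sentences with a pretrained sentence embedding model and computes their pairwise cosine similarities; the multilingual model name is only an illustrative choice and is not the Arabic model trained later in this tutorial.

```python
from sentence_transformers import SentenceTransformer, util

# Any pretrained sentence embedding model works for this illustration; the
# multilingual model below is an arbitrary choice, not the model trained in this tutorial.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["The weather is lovely today.",
             "It is beautiful outside today.",
             "He drove his car to the market."]

embeddings = model.encode(sentences)                 # one fixed-size vector per sentence
similarities = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarity matrix
print(similarities)  # the first two sentences should score closer to each other than to the third
```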
Prior to the release of SBERT, other less effective methods were used to create fixed-sized sentence embeddings from a transformer model like BERT. Most commonly, mean pooling was used to average BERT’s output layer, or alternatively, the output of the [CLS] token (the first token in BERT embeddings, the classification token) was taken as the sentence representation. The SBERT authors show that both types of sentence embeddings perform poorly on STS tasks, often worse than averaged GloVe word embeddings (an unsupervised algorithm that uses global co-occurrence statistics to learn word vector representations). This is noteworthy since word embeddings from BERT-type models, compared to GloVe word embeddings, have significantly higher scores on most NLP benchmarks. The poor performance of these pooled embeddings, however, suggests that pooling alone does not produce sentence embeddings that are well suited to sentence-level similarity tasks.
This motivates dedicated sentence embedding models like SBERT, which uses a siamese architecture in which two networks share tied weights. For the best performance on STS tasks, SBERT starts from a pre-trained BERT model, which the siamese set-up then fine-tunes: first on a natural language inference (NLI) dataset and then further on an STS dataset.
In this siamese set-up, when fine-tuning on an NLI dataset, two sentence embeddings u and v are produced from pooled BERT word embeddings. They are concatenated with the element-wise difference |u − v| and then multiplied by a trainable weight matrix Wₜ ∈ ℝ³ⁿˣᵏ, where n is the dimension of the sentence embeddings and k is the number of labels. This gives the following classification objective function, which is optimized with cross-entropy loss: o = softmax(Wₜ · (u, v, |u − v|)).
When fine-tuning on an STS dataset, a regression objective function is used, whereby the cosine similarity between the two sentence embeddings u and v is calculated and mean-squared error is used as the loss function. The regression objective function is represented as cosine_sim(u,v) and the loss function is ||input_label − cos_score_transformation(cosine_sim(u,v))||², where the default cos_score_transformation is a simple identity function that causes no change (source: sentence-transformers).
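As a rough, plain-PyTorch illustration of what this regression objective computes (not the library’s internal implementation), the sketch below scores a toy batch of embedding pairs with cosine similarity and takes the mean-squared error against gold similarity labels.

```python
import torch

def cosine_regression_loss(u, v, gold_labels):
    """Mean-squared error between gold similarity labels and cosine_sim(u, v),
    using the default identity score transformation described above."""
    cos_sim = torch.nn.functional.cosine_similarity(u, v, dim=-1)
    return torch.mean((gold_labels - cos_sim) ** 2)

# toy batch of four 256-dimensional sentence embedding pairs
u = torch.randn(4, 256)
v = torch.randn(4, 256)
gold_labels = torch.tensor([1.0, 0.0, 1.0, 0.0])  # gold similarity scores
print(cosine_regression_loss(u, v, gold_labels))
```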
MTL relies on inductive transfer of knowledge: rather than training a model for a single task in isolation, a model can be trained on several related tasks in parallel so that learned information is shared between the tasks. To quote Caruana (1998) via Sebastian Ruder, “MTL improves generalization by leveraging domain-specific information contained in the training signals of related tasks.” For an accessible overview of multi-task learning for deep neural networks, I suggest Ruder’s blog post on the topic; for brevity, I limit the details in this article and focus on the high-level concepts.
Essentially, this learning approach takes advantage of the commonalities and differences between tasks to train models that generalize better. Typical regularization prevents overfitting by penalizing complexity uniformly; with MTL, in contrast, regularization is induced by requiring good performance on a related task. In other words, there is an inductive bias whereby the model favours hypotheses that explain more than one task, a preference that improves generalization. This inductive bias means that MTL is particularly effective on small datasets or when class labels are undersampled. Along with regularization, MTL also introduces implicit data augmentation and attention focusing. In a sense, MTL implicitly increases the sample size, allowing the model to learn a more general representation, because jointly learning two tasks averages out the data-dependent noise patterns of the different tasks. Attention focusing arises because the model has additional evidence, from the other task, about which features are relevant or irrelevant, which helps it highlight the important ones.
A survey of the available literature shows that, since 2018, MTL has been applied several times in Arabic NLP research, notably for translation models concerned with Arabic dialects and for offensive speech detection on social media. In 2019, Abdul-Mageed et al. used a sentence-level BERT model and sentence-level MTL models to classify age and gender from a dataset of annotated Arabic tweets, a study in which they found that MTL two-task models were inferior to single-task BERT models. The researchers describe their models as language-agnostic, since their models were built on a sentence-level multilingual BERT (mBERT) model and fine-tuned for classification tasks on Arabic text samples. I have chosen a different approach: I create a sentence-level Arabic BERT model (SAraBERT) that has been fine-tuned with Arabic NLI and STS data. This is because I am interested in Arabic-specific sentence embeddings that can be used for STS tasks such as text summarization. Additionally, the current state-of-the-art approach for Arabic text classification tasks is to use AraBERT word embeddings (released in 2020).
Experimental Setup
In this tutorial, MTL is used to fine-tune a sentence-level AraBERT (SAraBERT) model on two datasets (NLI and STS) jointly, rather than sequentially fine-tuning on a single task at a time for each dataset. A classification objective function with cross-entropy loss is used for the NLI dataset, and a regression objective function with mean-squared error loss is used for the STS dataset. During joint learning, training alternates between the two tasks in a round-robin fashion: at each step a batch is drawn from each task and its corresponding objective is optimized, while the shared model weights carry information across both tasks. The intuition is that the shared information between tasks will improve generalization. I use the previously mentioned sentence-transformers Python library from the UKP Lab, and for this tutorial I use version two of the non-segmented AraBERT model, “bert-base-arabertv02”, which is available through Hugging Face models.
Since no open-source Arabic-specific NLI dataset is available, I partitioned out the 2,490 Arabic sentence pairs from Facebook’s Cross-Lingual NLI Corpus (XNLI) to serve as the NLI data. These Arabic sentence pairs are labeled for textual entailment: each premise/hypothesis pair is labeled as “entailment”, “contradiction” or “neutral”, based on the relationship between the hypothesis and the premise. An NLI dataset is used specifically because logical entailment is different from simple equivalence and provides more signal for learning complex semantic representations.
The most popular Arabic STS benchmark, SemEval-2017 STS, has only 1,081 sentence pairs, all of which were translated from English. Due to concerns over translation quality and dataset size, I chose instead to use a more recently released Arabic semantic similarity dataset consisting of question pairs. The Arabic Semantic Question Similarity (SQS) dataset from the Workshop on NLP Solutions for Under Resourced Languages 2019 contains 12,000 question pairs, each labeled “yes” or “no” for semantic similarity between the two questions. The SQS dataset is much larger than SemEval-2017 STS, and it has the advantage of being originally written in Arabic.
All code for this tutorial is in Python using the PyTorch framework. I suggest using a Google Colab notebook to take advantage of the free-tier GPU instance to speed up the training. The first step is to import the required packages, which are listed in the code snippet below.
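Below is a sketch of the imports this walkthrough relies on; it assumes the sentence-transformers, pandas and scikit-learn packages are already installed (for example with pip in Colab).

```python
# install first if needed (e.g. in Colab): pip install sentence-transformers pandas scikit-learn
import math

import pandas as pd
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, InputExample, models, losses, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
```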
The next step is to load the XNLI (nli_data) and the Arabic SQS (sts_data) training datasets. From XNLI we separate out the Arabic sentence pairs (arabic_nli_data), and we split the Arabic SQS train set to obtain validation data for use during training (sts_data_train, sts_data_test), while reserving the actual test set as holdout data to evaluate the final model.
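The sketch below reconstructs this loading step; the file names and column names are assumptions, so adjust them to match your local copies of the XNLI and SQS files.

```python
# file paths and column names are assumptions; point them at your local copies
# of the XNLI TSV and the Arabic SQS question-pair files
nli_data = pd.read_csv("xnli.dev.tsv", sep="\t")
arabic_nli_data = nli_data[nli_data["language"] == "ar"]  # keep the 2,490 Arabic pairs

sts_data = pd.read_csv("sqs_train.csv")
# split the released train set to get validation data for use during training;
# the official test set stays untouched as holdout data for the final evaluation
sts_data_train, sts_data_test = train_test_split(sts_data, test_size=0.1, random_state=42)
```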
Next we set “bert-base-arabertv02” as the model_name and set the output path for saving the model during training. Here the batch size is set to 16, which runs easily on a 16GB GPU, the default size of Colab’s free-tier cloud GPU. The model is built from three separate modules: a word embedding layer, a mean-pooling layer and a dense layer, stacked such that each module takes the output of the previous one as its input. The max sequence length is set to 256 tokens, and the dense layer’s output is likewise set to 256 features, so the resulting sentence embeddings are fixed 256-dimensional vectors.
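A sketch of the model construction follows; the Hugging Face model id aubmindlab/bert-base-arabertv02 and the output path are assumptions on my part.

```python
model_name = "aubmindlab/bert-base-arabertv02"  # AraBERT v02 on the Hugging Face hub
model_save_path = "output/sarabert-mtl"
batch_size = 16

# module 1: transformer word embeddings, truncating inputs at 256 tokens
word_embedding_model = models.Transformer(model_name, max_seq_length=256)

# module 2: mean pooling over the token embeddings to get one vector per sentence
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode="mean")

# module 3: dense layer projecting the pooled vector to a fixed 256 dimensions
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(),
                           out_features=256,
                           activation_function=nn.Tanh())

# stack the modules so each one consumes the previous module's output
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```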
A label2int dictionary maps the string labels “contradiction”, “entailment” and “neutral” to 0, 1 and 2, respectively. An empty list is created to hold the samples, which are appended iteratively from the arabic_nli_data dataframe. Each sample pairs [‘sentence1’, ‘sentence2’] with its label_id, and these samples are then loaded into a PyTorch DataLoader. Lastly, we set the loss function corresponding to the classification objective described earlier, seen here as SoftmaxLoss.
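The corresponding code might look like this sketch, where the gold_label, sentence1 and sentence2 column names are assumptions based on the XNLI distribution.

```python
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}

nli_train_samples = []
for _, row in arabic_nli_data.iterrows():
    if row["gold_label"] in label2int:  # skip any pair without a usable gold label
        nli_train_samples.append(
            InputExample(texts=[row["sentence1"], row["sentence2"]],
                         label=label2int[row["gold_label"]]))

nli_dataloader = DataLoader(nli_train_samples, shuffle=True, batch_size=batch_size)

# classification objective over (u, v, |u - v|), optimized with cross-entropy loss
nli_loss = losses.SoftmaxLoss(model=model,
                              sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                              num_labels=len(label2int))
```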
For the STS data, two empty lists are needed, one for the train samples and the other for the development samples. The samples are added iteratively to their respective lists, each pairing [‘question1’, ‘question2’] with its similarity label. As with the NLI data, the train samples are loaded into a PyTorch DataLoader. CosineSimilarityLoss is used as the loss function, optimizing the regression objective described earlier via the mean-squared error of the cosine similarity scores. Additionally, we create an evaluator that tests the similarity of the embeddings on the development samples, a process which also allows us to select the best model during training.
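A sketch of this step is shown below; the question1, question2 and label column names, and the mapping of yes/no to 1.0/0.0, are assumptions about the SQS files.

```python
# SQS labels are "yes"/"no"; map them to 1.0/0.0 similarity scores
def to_sts_example(row):
    score = 1.0 if str(row["label"]).strip().lower() == "yes" else 0.0
    return InputExample(texts=[row["question1"], row["question2"]], label=score)

sts_train_samples = [to_sts_example(row) for _, row in sts_data_train.iterrows()]
sts_dev_samples = [to_sts_example(row) for _, row in sts_data_test.iterrows()]

sts_dataloader = DataLoader(sts_train_samples, shuffle=True, batch_size=batch_size)

# regression objective: mean-squared error on the cosine similarity of u and v
sts_loss = losses.CosineSimilarityLoss(model=model)

# evaluator run periodically during training, used to keep the best checkpoint
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    sts_dev_samples, batch_size=batch_size, name="sqs-dev")
```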
Lastly, we set a few parameters for the training: the number of epochs is set to 4, 10% of the training data is used for warm-up, and evaluation is set to happen every 1,000 steps. The train objectives argument is a list of tuples, each pairing a DataLoader with its loss function, with one tuple per task. When fitting the model, evaluation happens every 1,000 steps throughout the 4 epochs, and the best-performing model from these evaluation stages is saved.
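The training call might look like the following sketch, which reuses the DataLoaders, losses and evaluator defined above; computing the warm-up steps from the STS DataLoader is my own choice here.

```python
num_epochs = 4

# warm up the learning rate over roughly 10% of the training steps
warmup_steps = math.ceil(len(sts_dataloader) * num_epochs * 0.1)

model.fit(train_objectives=[(nli_dataloader, nli_loss), (sts_dataloader, sts_loss)],
          evaluator=dev_evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
```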
After training, the last step is to evaluate the model on the holdout data in the STS test set. An empty list is created for the test samples and populated in the same way as the STS training data, pairing [‘question1’, ‘question2’] with the similarity label. The trained model is then loaded and a test evaluator is created to test the similarity of the embeddings of the test samples.
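A sketch of this final evaluation is below, again with an assumed file name and the to_sts_example helper defined earlier.

```python
# file path and column names are again assumptions; to_sts_example is the helper defined above
sts_holdout = pd.read_csv("sqs_test.csv")
test_samples = [to_sts_example(row) for _, row in sts_holdout.iterrows()]

trained_model = SentenceTransformer(model_save_path)  # reload the best saved checkpoint
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    test_samples, batch_size=batch_size, name="sqs-test")
test_evaluator(trained_model, output_path=model_save_path)
```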
The evaluator reports results for cosine similarity, Manhattan distance, Euclidean distance and dot-product similarity, as measured by both the Pearson and Spearman correlation metrics. In a 2016 paper titled “Task-oriented intrinsic evaluation of semantic textual similarity”, Reimers et al. concluded that intrinsic evaluations with Pearson correlation are misleading and that the Spearman correlation metric is better suited for evaluating STS tasks. By this metric, SAraBERT achieves a score of 83.94%; the Spearman correlation reflects how accurately the model can determine whether two questions are similar.
Next steps include experimenting with other methods for training Arabic sentence embeddings, and assessing how well SAraBERT is able to summarize Arabic text or perform a semantic search. Training SAraBERT in Colab took less than five minutes, and even on a personal PC it is possible to produce Arabic sentence embeddings for a large corpus fairly quickly.
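As a quick illustration of this kind of downstream use, the sketch below loads the saved model and runs a small semantic search over a placeholder corpus with util.semantic_search; the corpus and query strings are stand-ins for real Arabic text.

```python
from sentence_transformers import SentenceTransformer, util

sarabert = SentenceTransformer("output/sarabert-mtl")  # path used when saving the model above

corpus = ["first Arabic sentence of a large corpus ...",
          "second Arabic sentence ...",
          "third Arabic sentence ..."]
corpus_embeddings = sarabert.encode(corpus, convert_to_tensor=True)

query_embedding = sarabert.encode("an Arabic query sentence", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # the two corpus sentences closest to the query in embedding space
```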
Final thoughts
If the past few years are any indication, the trend of increasingly large language models will continue to dominate the NLP field, making it necessary for a researcher to focus on accessibility. I view resource limitations as a creativity challenge, and luckily there is a large open-source community that shares both code and pre-trained models. Despite the availability of these models, however, the practicality of implementation hinges on correctly aligning tools with tasks. Fast and efficient Arabic sentence embeddings with SAraBERT make it possible to quickly utilize a variety of NLP techniques that rely on semantic search. For my research, this allows for easy experimenting with text summarization and information retrieval for policy evaluation. Furthermore, it is possible to use SAraBERT for unsupervised tasks, like clustering based on the semantic similarity of sentence embeddings, which provides a deep learning alternative to traditional statistical models such as Latent Dirichlet Allocation, for the task of topic identification.
MTL is an interesting ML paradigm, and I suspect there are better options for training sentence embeddings for STS tasks, an avenue I intend to explore in the future. In my opinion, SAraBERT is a low-resource method: it is built on a low-resource-language NLP tool (AraBERT) and trained in a low-resource setting (free GPU and less than five minutes of training time). My hope is that this cobbled-together creation, born out of necessity, will provide utility for other researchers interested in applying Arabic NLP techniques to social research. I welcome questions and feedback; please feel free to connect with me on LinkedIn. Finally, many thanks to the UKP Lab and Nils Reimers for open-sourcing the resources that made this tutorial possible.