For my final project in my Artificial Intelligence class for my Data Science Master's, I chose to compare two models for Natural Language Generation: one built on Markov principles and the other a deep learning model created by OpenAI.
Natural Language Processing (NLP) has seen some exciting advances over the last five years, largely credited to hardware improvements that have sped up computation and enabled the rise of deep learning in NLP. One area that has advanced is Natural Language Generation (NLG), which is currently used in tasks such as:
- paraphrasing and summarization
- question answering for chatbots
- text suggestion/completion, etc.
Early NLG methods were rule-based: rules dictated the outline of the messages, and statistical methods improved paraphrasing, syntax, and semantic similarity. One of the first major advances came in the form of neural networks, including shallow networks that created new ways to represent textual information by transforming large bodies of text into a lower-dimensional latent space without sacrificing much information. Methods like word2vec, GloVe, and Seq2Seq generate word embeddings as vectors, over which one can compute distance-based measures such as Euclidean distance or cosine similarity. Most recently, deep learning models built on the transformer architecture, which embeds attention mechanisms in its layers and facilitates transfer learning, have become the state of the art in NLP and NLG. The main difference from RNNs such as LSTM networks is that RNNs process data sequentially, whereas transformer models process the input sequence in parallel through encoders and decoders, whose outputs are transformed into probabilities with a softmax activation.
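As a toy illustration of the distance calculations mentioned above, here is a short Python sketch. The three-dimensional "embeddings" are invented for the example; real word2vec or GloVe vectors have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d embeddings; real ones would come from a trained model.
senator = [0.9, 0.8, 0.1]
politician = [0.85, 0.75, 0.2]
banana = [0.1, 0.05, 0.9]

print(cosine_similarity(senator, politician))  # close to 1: similar words
print(cosine_similarity(senator, banana))      # much lower: unrelated words
```

The useful property is that semantically related words end up with a similarity near 1, which is exactly what later metrics like BERTScore exploit.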
I was curious to compare one of the latest and greatest NLG models, GPT-2, against a much simpler model based on Markov principles, on the task of creating the next tweet for a few politicians: Joe Biden, Kamala Harris, and Elizabeth Warren. I chose these Twitter accounts because they had been tweeting heavily in the preceding months and their tweets are usually on the longer side and coherent. Using the Tweepy package, I extracted each account's last 100 tweets as of 12/7/2020 to serve both as the corpus fed into the NLG models and as the reference for NLG evaluation.
For those who would rather watch a 10-minute video, here is the video presentation:
A Markov model is a stochastic model: depending on the current state, which in NLG is the current word or words, it transitions, either randomly or according to probabilities, to the next word(s). I used the Markovify package with a state size of 2, meaning it changes states by randomly picking a next word only if that word had previously followed the prior pair of words somewhere in the corpus. Since I was dealing with tweets, which are under 280 characters, I knew I had to keep the state size small: the larger the state size, the harder it is to find a next word that commonly follows the sequence of n words. A nice feature of the Markovify package is that you can specify the minimum and maximum length of the generated text. Seeing that most of these politicians' tweets ranged from 100 to 280 characters, I passed those limits as parameters to the Markov model.
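Markovify handles all of this internally, but the core idea can be sketched in a few lines of plain Python. This is a simplified illustration of a state-size-2 chain, not markovify's actual implementation, and the two example "tweets" are made up:

```python
import random
from collections import defaultdict

def build_chain(corpus, state_size=2):
    # Map each tuple of `state_size` consecutive words to every word
    # that followed that tuple somewhere in the corpus.
    chain = defaultdict(list)
    for sentence in corpus:
        words = sentence.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain

def generate(chain, seed_state, max_words=30, rng=random):
    # Walk the chain: pick a random successor, then slide the state window.
    words = list(seed_state)
    state = seed_state
    while len(words) < max_words:
        successors = chain.get(state)
        if not successors:
            break  # dead end: nothing ever followed this state
        nxt = rng.choice(successors)
        words.append(nxt)
        state = (*state[1:], nxt)
    return " ".join(words)

tweets = [
    "we need to protect the affordable care act",
    "we need to act on climate change now",
]
chain = build_chain(tweets, state_size=2)
print(generate(chain, ("we", "need")))
```

Because successors are stored with repetition, words that follow a state more often are picked more often, which is the probabilistic transition described above.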
The GPT-2 model from OpenAI, with 1.5 billion parameters, was trained to predict the next word on 40 GB of text from 8 million web pages. Prior models were trained on a single domain of text, for example just news articles, but GPT-2 was trained on a diverse dataset so that it could be applied to a wide range of NLG tasks. The dataset compiled for GPT-2 was based on outbound links from Reddit that had received a certain threshold of engagement. I used OpenAI's smaller released model, '124M', which according to Radford et al. is comparable to BERT's larger models due to its large number of parameters. Unlike the Markovify model, one needs a GPU to leverage GPT-2 and fine-tune it on one's own corpus; my plan was to fine-tune 3 GPT-2 models, one for each reference corpus of the last 100 tweets per politician. One of the reasons I chose a GPT-2 model and a Markov model was to compare the generated tweets between the two and see which model performed better. It would be interesting to see whether a simpler model could outperform GPT-2 at generating shorter text, because judging from the paper Language Models are Unsupervised Multitask Learners, GPT-2 does fairly well with longer outputs and still has room for improvement on summarization and question answering.
This brings me to the evaluation of NLG outputs. There are many different metrics that measure different things, such as grammar, factuality, similarity, and coherence. Çelikyilmaz et al. group these NLG evaluation metrics into three categories:
- Human-Centric Evaluation Methods
- Untrained Automatic Evaluation Metrics
- Machine Learned Evaluation Metrics
Given the time available for this project, I had to choose between Untrained Automatic Evaluation Metrics and Machine-Learned Evaluation Metrics, since Human-Centric Evaluation Methods would be too time-consuming or costly; although many researchers do leverage Amazon Mechanical Turk for Human-Centric Evaluation Methods.
I chose a Machine-Learned Evaluation Metric because, according to Çelikyilmaz et al., it is a blend of the first two categories.
For example, models optimized for perplexity can generate coherent responses, but those responses are not diverse, since the models are optimized for predicting unseen test data. Perplexity can be thought of as the exponentiation of entropy, which is why you optimize for a lower perplexity score rather than a higher one. I decided to use BERTScore because it has been shown to correlate well with human judgments in sentence-level evaluation. It matches the words in the candidate and reference sentences by cosine similarity after creating contextual word embeddings. You can pass lists into BERTScore, so for each politician I passed a list of the 5 generated tweets from each of the 3 model runs, along with the list of 100 reference tweets to cross-reference against.
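BERTScore itself uses BERT's contextual embeddings, but its greedy matching step can be illustrated with toy vectors: each candidate token takes its best cosine match among the reference tokens, and those similarities are averaged into precision/recall-style scores. The 2-d "embeddings" below are invented purely for the example:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def greedy_score(cand, ref):
    # Precision: each candidate token's best match among reference tokens.
    # Recall: each reference token's best match among candidate tokens.
    # F1: harmonic mean of the two, as in BERTScore.
    precision = sum(max(cos(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cos(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented token embeddings; BERTScore would use BERT's contextual vectors.
candidate = [[1.0, 0.1], [0.2, 1.0]]
reference = [[0.9, 0.2], [0.1, 0.9], [0.5, 0.5]]
p, r, f1 = greedy_score(candidate, reference)
print(p, r, f1)
```

A candidate identical to the reference scores an F1 of 1, and unrelated text drifts toward 0, which is what makes the score usable as a ranking signal.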
The results were a bit surprising if I considered only the average BERTScore of the Markov model versus the GPT-2 runs. However, the standard deviations made more sense after reading OpenAI's initial release notes, which state that it can take several tries, especially if the model is not familiar with the content it is being fine-tuned on. Still, I was pleasantly surprised that, even with a small reference dataset (100 tweets), GPT-2 was able to maintain a tone and content similar to the fine-tuning dataset, on top of its massive pre-training on 40 GB of text. OpenAI acknowledges that among the tries there will be some poor samples, with repeated, false, or incoherent text, alongside some great ones, which explains the large standard deviation of the BERTScore.
Below is the highest-scoring tweet by BERTScore for each model, respectively: the Markov model, GPT-2 after 30 runs, and GPT-2 after 90 runs, for each politician. As you can see, the Markov model does not do too badly, and the GPT-2 model after many runs does exceptionally well, though there are some repeated words. All 5 of GPT-2's generated tweets per politician were better when it was fine-tuned with 90 runs versus 30 runs. The improvement from the 80th run to the 90th was smaller, but it does make a difference to fine-tune longer.
So, in the end, which was the better model? It depends on what you are trying to do. A Markovify model is a lot leaner and smaller than a GPT-2 model, which requires a GPU to train on (though I was able to access one for free on Google Colab). GPT-2 takes more time to train and generate than a Markovify model, and it needs further optimization, since its generated tweets can leave sentences unfinished, repeat words, and make false claims. But if you can set a threshold based on an NLG evaluation metric such as BERTScore, it can help identify the better generated text(s) among the candidates.
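That thresholding step is simple to apply once you have one evaluation score per generated tweet. A minimal sketch, where the scores are made-up stand-ins for BERTScore F1 values:

```python
def filter_candidates(candidates, scores, threshold=0.80):
    # Keep only generated texts whose score clears the threshold,
    # returned best-first so the top candidate is first.
    kept = [(s, c) for s, c in zip(scores, candidates) if s >= threshold]
    return [c for s, c in sorted(kept, reverse=True)]

generated = ["tweet A", "tweet B", "tweet C"]
fake_scores = [0.83, 0.74, 0.91]  # stand-ins for real BERTScore F1 values
print(filter_candidates(generated, fake_scores))  # ['tweet C', 'tweet A']
```

In practice you would generate many samples per model, score them all against the reference corpus, and keep only those above the cutoff.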
I want to conclude by saying that this project was not meant to improve or support the creation of fake impersonators; it was a project to determine how different generative text models perform on short text like tweets, and what the weaknesses and strengths of each are. All of this is for research purposes. I was inspired by some of the fun projects on the Markovify page, such as a Twitter bot that tweeted haiku poems and another that impersonated Homer Simpson.
You can find the code for my final project here. If you have any questions or feedback, please leave them on Medium below!
 Çelikyilmaz, A., Clark, E., & Gao, J. (2020). Evaluation of Text Generation: A Survey. ArXiv, abs/2006.14799.
 Kim, A. (2020, February 10). Perplexity Intuition (and Derivation). Retrieved December 08, 2020, from https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3
 Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, & Yoav Artzi (2020). BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
jsvine. (n.d.). markovify [Computer software]. GitHub. Retrieved December 08, 2020, from https://github.com/jsvine/markovify
 Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners
 Better Language Models and Their Implications. (2019, February 14). Openai.Com. https://openai.com/blog/better-language-models/