So it's been a while since my last article, apologies for that. Work and then the pandemic threw a wrench in a lot of things, so I thought I would come back with a little tutorial on text generation with GPT-2 using the Huggingface framework. This will be a TensorFlow-focused tutorial, since most tutorials I have found on Google tend to be PyTorch-focused or light on details around using it with TensorFlow. If you don't want to read the whole post and just want to see how it works, I have a Colab notebook that serves as an outline for people to reference here. This post basically walks through what's in the notebook, so it should be easy to reference back and forth.
In my last tutorial I used Markov chains to learn n-gram probabilities from presidential speeches and used those probabilities to generate similar text output given new starting input. Now we will go a step further and use a more state-of-the-art architecture to create text output that should be more accurate and realistic. If you haven't already heard about GPT-2, it's a language model from OpenAI trained on a massive amount of data from the web using an architecture called the Transformer. Here is a good visual overview of the Transformer architecture used by GPT-2 that should help give you some intuition on how it works. GPT-2 is not the most advanced version of OpenAI's language models, but it is one that has many reference implementations and frameworks available compared to the newer GPT-3 model. It is also a version of the model that can run on Colab and is fairly straightforward to set up, and hopefully even easier after this tutorial 🙂
Let's talk about the data
For our task we will create a model to generate financial article titles. If we were training the language model from scratch we would need lots and lots of examples (GPT-2 was trained on 8 million web pages). Fine-tuning from the pre-trained model means we don't need nearly as much data to get decent results on our specific task.
The plan is to get a decent amount of examples, a couple hundred thousand, and then split them into train and eval sets. I decided to grab submission titles from the /r/investing subreddit and titles extracted from the US Financial News Articles dataset on Kaggle. Some of the examples from the combined dataset are not strictly finance related, since many financial news sites also report on non-financial events and the subreddit data is a mix of investing advice and questions.
The titles pulled from reddit submissions number about 100k, and the titles extracted from the Kaggle dataset add about another 179k. That should be enough examples to avoid overfitting on our task and give us a rich set of possible text to generate from within the “financial” domain.
Data format
I have found that the format of the data can make or break the training and output of these models. For GPT-2, if you just want to generate a whole bunch of text, say a book or articles, you can throw all the examples into a single document with no special tokens between them. However, if you want to generate output that follows a certain pattern or prompt, you should add special tokens to the dataset to make it clearer what pattern GPT-2 should learn to output. Below is the basic format for an example in the dataset for our title generation task.
<|title|>Some title about finances or other things<|endoftext|>
Each example is then concatenated together into one long string. We don't have to add a start token for training, since GPT-2 only needs the ‘<|endoftext|>’ token to split examples, but with this leading token we can have the model generate new random output on each run by prompting it with “<|title|>” first. You can set the start token to whatever you want, or use none at all, but I have found that choosing tokens that are unlikely to show up in the vocabulary of the data makes it easier to generate coherent text, and you are less likely to fall into a repetitive cycle.
To create our train and eval sets we read the dataset in line by line, prepend the <|title|> token to each example, join the examples with <|endoftext|>, and write them back out to their respective files (a sketch of this step is shown below). Once these two files are written out to the Colab environment, we can use the Huggingface training script to fine-tune the model for our task.
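Here is a minimal sketch of that preprocessing step, assuming the raw titles live in a single titles.txt file with one title per line. The input file name and the 90/10 split ratio are my own assumptions; the output file names match the training script call further down.

import random

# Read the raw titles, one per line, skipping blanks.
with open("titles.txt", "r", encoding="utf-8") as f:
    titles = [line.strip() for line in f if line.strip()]

# Prepend the start token to each example.
examples = ["<|title|>" + t for t in titles]

# Shuffle and split into train/eval sets (90/10 is an assumed ratio).
random.shuffle(examples)
split = int(len(examples) * 0.9)

# Join examples with GPT-2's <|endoftext|> token and write them back out.
with open("train_tmp.txt", "w", encoding="utf-8") as f:
    f.write("<|endoftext|>".join(examples[:split]))

with open("eval_tmp.txt", "w", encoding="utf-8") as f:
    f.write("<|endoftext|>".join(examples[split:]))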
How to fine-tune GPT-2
For fine-tuning GPT-2 we will be using Huggingface's provided script run_clm.py, found here. I tried to find a way to fine-tune the model via TF model calls directly, but had trouble getting it to work easily, so I defaulted to using the provided script. Some models, like classifiers, can be trained directly via standard TF API calls, but the language models did not seem to be fully supported when I started this work. It's possible newer versions of Huggingface support this.
python run_clm.py \
    --model_type gpt2-medium \
    --model_name_or_path gpt2-medium \
    --train_file "train_tmp.txt" \
    --do_train \
    --validation_file "eval_tmp.txt" \
    --do_eval \
    --per_gpu_train_batch_size 1 \
    --save_steps -1 \
    --num_train_epochs 5 \
    --fp16 \
    --output_dir=<directory of saved model>
The script above runs the fine-tuning process using the medium-sized GPT-2 model, though if you are using standard Colab you might only be able to run the small GPT-2 model due to resource limits on the VM. I am using Colab Pro, which gives me access to more powerful base machines and GPUs. Depending on your use case, regular Colab may be sufficient, or you can use GCP if you really need access to more powerful GPU instances for longer periods. Transformer models are very computationally expensive due to their architecture, so training on a GPU can easily take hours or days with a large enough dataset.
For the investing title dataset, 5 epochs on a P100 took 3–4 hours, while on a V100 it only took 1.5 to 2 hours depending on the settings I used. It seems to be down to luck which GPU you get when starting up your Colab instance; I found I was usually able to get a V100 every other day after a multi-hour training session. One thing to call out in the script call above is that I am using mixed precision training via the --fp16 argument. Using mixed precision shaved about 30 minutes off training time with no noticeable drop in model performance compared to a single-precision trained model on our data.
At the end of training there is an eval step that reports the model's perplexity. Our title generation GPT-2 model gets a perplexity score of around 10.6, which isn't bad considering it only ran for 5 epochs.
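For context, perplexity here is just the exponential of the average cross-entropy loss reported by the eval step, so you can sanity-check the number yourself. The loss value below is illustrative rather than the exact figure from my run.

import math

eval_loss = 2.36  # illustrative eval loss; exp(2.36) is roughly 10.6
print(math.exp(eval_loss))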
So now that we have trained our new language model to generate financial news titles, let's give it a try! We will use the path to the directory where the script saved the model and load it up to see if it will output some great new finance article / reddit titles for us!
To load the model into TF we import TFGPT2LMHeadModel and call from_pretrained, making sure to set the from_pt flag to True. This loads the PyTorch model weights into TF-compatible tensors. We will also use the pre-trained GPT-2 tokenizer to create the input sequence for the model.
The pre-trained tokenizer takes the input string and encodes it for our model. When using the tokenizer, be sure to set return_tensors="tf"; if we were using the default PyTorch backend we would not need to set this. With these two pieces loaded we can set up our input to the model and start generating text output. A minimal loading and encoding sketch is shown below.
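The model path in this sketch is an assumption; point it at whatever directory you passed to --output_dir in the training script.

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

model_path = "/content/model_output"  # assumed path; use your --output_dir value

# from_pt=True converts the PyTorch weights saved by run_clm.py into TF-compatible tensors.
model = TFGPT2LMHeadModel.from_pretrained(model_path, from_pt=True)

# The pre-trained GPT-2 tokenizer handles encoding our prompt text.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

# return_tensors="tf" gives TensorFlow tensors instead of the PyTorch default.
input_ids = tokenizer.encode("<|title|>", return_tensors="tf")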
After creating the input we call the model's generate function. Huggingface has a great blog post that goes over the different parameters for generating text and how they work together here; I suggest reading through it for a more in-depth understanding. The parameters below are ones I found to work well for this dataset, arrived at through trial and error over many rounds of generating output. The thing with language models is that you have to try a number of different parameter options before you start to see good output, and even then it can take many runs to get output that fits your task, so don't be surprised if initial results are less than stellar.
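Here is a sketch of the generate call. The sampling parameters shown (do_sample, top_k, top_p, max_length, num_return_sequences) are reasonable starting points rather than the exact values from my notebook, so treat them as assumptions to tune for your own data.

import tensorflow as tf

tf.random.set_seed(42)  # optional: makes the sampled output repeatable

outputs = model.generate(
    input_ids,
    do_sample=True,          # sample from the distribution instead of greedy decoding
    max_length=64,           # cap on the total generated sequence length
    top_k=50,                # keep only the 50 most likely next tokens at each step
    top_p=0.95,              # nucleus sampling threshold
    num_return_sequences=7,  # generate several candidate titles per run
)

# Decode and print each generated title.
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output, skip_special_tokens=True)}")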
Below is some of the output that was generated by our investing title model given the “<|title|>” token as the prompt.
0: <|title|>Tesla's stock jumps 9% after Musk tweets it will hit $1,000 per share
1: <|title|>Avis Budget Group to Announce Fourth Quarter and Full Year 2017 Financial Results on February 27, 2018
2: <|title|>BRIEF-India's Bajaj Finance Dec Qtr Profit Falls
3: <|title|>BRIEF-Dunkin' Brands Reports Q4 Adjusted Earnings Per Share $0.06
4: <|title|>BRIEF-UAE's National Investment Posts FY Profit Before Tax Of RMB8.2 Mln
5: <|title|>BRIEF-Cogint Announces $8 Mln Bought Deal Financing
6: <|title|>Question about stock splits.
The generated examples above look like believable article and reddit titles. Still, sometimes you can get some funny output like the one below.
<|title|>Noob
Well, that was maybe a bit long of a post, but hopefully you found it useful for learning how to use Huggingface to fine-tune a language model and generate text using a TensorFlow backend. With these techniques you can start to come up with different tasks and models for your own work and interests. For instance, after building this title model I decided to see if I could generate a title and then use that title to generate some sort of article, with varying degrees of success. Try experimenting for yourself and see what you can come up with!
Thanks for reading!
Link to colab gist: https://gist.github.com/GeorgeDittmar/5c57a35332b2b5818e51618af7953351