
Gated Recurrent Units (GRUs) — Gated Recurrent Units were introduced in 2014 by Kyunghyun Cho et al. Like LSTMs, GRUs are a gating mechanism for RNNs designed to handle sequence data. However, to simplify the calculation process, GRUs use only two gates: (i) a reset gate and (ii) an update gate. GRUs also merge the cell state and the hidden state into a single hidden state. Figure 5 shows the inner structure of a Gated Recurrent Unit:
In this case study, our goal is to train an RNN capable of generating meaningful text from characters. An RNN can generate text from words as well as from characters, and we choose characters as the generation unit for this case study.
When we build a new RNN with no training, it outputs a jumble of characters that mean nothing. However, if we feed our RNN a lot of text data, it starts to imitate the style of those texts and to generate meaningful text, character by character.
So, if we feed the model a lot of didactic text, it will generate educational material. If we feed it lots of poems, it will start generating poems, and we end up with an artificial poet. These are all viable options, but we will feed our model something else: a long text dataset containing Shakespeare's writings. Therefore, we will create a Shakespeare bot.
Shakespeare Corpus
The Shakespeare corpus is a text file containing 40,000 lines of Shakespeare's writing, cleaned and prepared by Andrej Karpathy and hosted by the TensorFlow team. I strongly recommend you take a look at the .txt file to understand the text we are dealing with. The file contains conversational content where each character's name is placed before the corresponding lines, as shown in Figure 7.
Initial Imports
In this case study, the required libraries are TensorFlow, NumPy, and os, which we can import with the following code:
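A minimal sketch of these imports:

```python
import tensorflow as tf
import numpy as np
import os
```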
To be able to load a dataset from an online directory, we can use the util module of the Keras API in TensorFlow. For this task, we will use the get_file() function, which downloads a file from a URL if it is not already in the cache, with the following code:
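The sketch below follows the standard TensorFlow setup; the URL points to the copy of the corpus hosted by the TensorFlow team.

```python
# Download the corpus (cached after the first call)
path_to_file = tf.keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
```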
After downloading our file, we can read it from the cache with a few lines of Python, which saves the entire corpus in the Colab notebook's memory as a variable. Let's also see how many characters there are in the corpus and what the first 100 characters look like, with the code below:
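A sketch of the read-and-inspect step; the variable name text is the one used throughout the rest of this post.

```python
# Read the downloaded file and decode it as UTF-8 text
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

print(f'Length of text: {len(text)} characters')
print(text[:100])
```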
Our entire corpus is accessible via a Python variable named text, and now we can start vectorizing it.
Text vectorization is a fundamental NLP method for transforming text data into a meaningful vector of numbers that a machine can understand. There are various approaches to text vectorization. In this case study, this is how we go about it, step by step:
- Give an index number to each unique character;
- Run a for loop in the corpus and index every character in the whole text.
To give an index number to each unique character, we first have to find all the unique characters in the text file. This is very easy with the built-in set() function, which converts a list object to a set object. The difference between the set and list data structures is that lists are ordered and allow duplicates, while sets are unordered and don't allow duplicate elements. So, when we run the set() function, as shown in the code below, it returns a set of the unique characters in the text file.
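A sketch of this step; sorting the set makes the character indices reproducible across runs.

```python
# The unique characters in the corpus
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')
```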
Output:
We also need to give each character an index number. The following code assigns a number to each set item and creates a dictionary of the set items with their given numbers. We also keep a copy of the unique set elements as a NumPy array, for later use in decoding the predictions. Then, we can vectorize our text with a simple for loop, going through each character in the text, looking up its corresponding index value, and saving all the index values as a new list:
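A sketch of the vectorization step, producing the char2idx, idx2char, and text_as_int objects referenced below.

```python
# Map each unique character to an index number
char2idx = {u: i for i, u in enumerate(vocab)}
# A NumPy copy of the unique characters, for decoding predictions later
idx2char = np.array(vocab)
# Replace every character in the text with its index value
text_as_int = np.array([char2idx[c] for c in text])
```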
At this point, we have our char2idx dictionary to vectorize the text and idx2char to de-vectorize (i.e., decode) the vectorized text. Finally, we have text_as_int as our vectorized NumPy array. We can now create our dataset.
Firstly, we will use the from_tensor_slices method from the Dataset module to create a TensorFlow Dataset object from our text_as_int object, and then we will split it into batches. The length of each input of the dataset is limited to 100 characters. We can achieve all of this with the following code:
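A sketch, assuming the 100-character sequence length stated above; we batch sequences of 101 characters so we can derive input/target pairs in the next step.

```python
seq_length = 100  # maximum length of an input sequence

# Create a Dataset object from the vectorized text
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# Cut the character stream into sequences of seq_length + 1
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
```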
Our sequences object contains sequences of characters, but we still have to turn each sequence into an (input, target) tuple to feed into the RNN model. We can achieve this with the custom mapping function below:
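A sketch of the mapping function: the input is the sequence without its last character, and the target is the same sequence shifted one character to the right.

```python
def split_input_target(chunk):
    # For "Hello", the input is "Hell" and the target is "ello"
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
```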
The reason we generated these tuples is that for the RNN to work, we need to create a pipeline, as shown in Figure 10:
Finally, we shuffle our dataset and split it into batches of 64 sentences with the following lines:
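A sketch; the shuffle buffer size of 10,000 is an illustrative choice.

```python
BATCH_SIZE = 64
BUFFER_SIZE = 10000  # assumed shuffle buffer size

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
```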
Our data is ready to be fed into our model pipeline. Let’s create our model. We would like to train our model and then make new predictions. Firstly, let’s set some parameters with the following code:
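A sketch of the parameters; the embedding dimension of 256 is an assumed value, while the 1024 GRU units are described below.

```python
vocab_size = len(vocab)   # number of unique characters
embedding_dim = 256       # assumed embedding size
rnn_units = 1024          # number of GRU units
```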
Now, what is important here is that our training pipeline feeds 64 sentences per batch. Therefore, we need to build our model so that it accepts 64 input sentences at a time. However, after we train our model, we would like to feed in single sentences to generate new text. So, we need different batch sizes for the pre-training and post-training models. To achieve this, we create a function that allows us to reproduce the model for different batch sizes. The following code does this:
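A sketch of such a function, wiring up the three layers described next; the stateful and initializer arguments follow common tutorial settings.

```python
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
```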
There are three layers in our model:
- An Embedding layer: This layer serves as the input layer, accepting input values (in number format) and converting them into dense vectors.
- A GRU layer: An RNN layer filled with 1024 Gated Recurrent Units.
- A Dense layer: To output the results, with vocab_size outputs.
Now we can create our model for training, with the following code:
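A sketch of the call, using the parameters defined earlier:

```python
model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
```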
Here is the summary of our model in Figure 11:
To compile our model, we need to configure our optimizer and loss function. For this task, we select Adam as our optimizer and the sparse categorical crossentropy function as our loss function.
Since our output is always one of the 65 unique characters, this is a multiclass classification problem. Therefore, we have to choose a categorical crossentropy function. However, in this example, we select a variant of categorical crossentropy: sparse categorical crossentropy. Even though both compute the same loss, they expect different label formats. Remember that we vectorized our text as integers (e.g., [0], [2], [1]), not in one-hot encoded format (e.g., [0,0,1], [0,1,0], [1,0,0]). Since our labels are integers, we must use the sparse categorical crossentropy function.
To be able to customize our loss function, we create a basic function wrapping the sparse categorical crossentropy loss:
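A sketch of the wrapper; from_logits=True is needed because our Dense layer outputs raw logits rather than probabilities.

```python
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
```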
Now we can set our loss function and optimizer with the following code:
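A sketch of the compile call:

```python
model.compile(optimizer='adam', loss=loss)
```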
To be able to load our weights and save our training progress, we need to set up and configure a checkpoint directory with the following code:
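A sketch; the directory and file-name pattern are illustrative.

```python
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name pattern of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, 'ckpt_{epoch}')

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
```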
Our model and checkpoint directory are configured. We will train our model for 30 epochs and save the training history to a variable called history, with the following code:
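A sketch of the training call:

```python
EPOCHS = 30

history = model.fit(dataset,
                    epochs=EPOCHS,
                    callbacks=[checkpoint_callback])
```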
Thanks to the simplicity of the model and the way we encoded our text, our training does not take too long (around 3–4 minutes). Now we can use the saved weights to build a custom model that accepts a single input to generate text.
To be able to view the location of our latest checkpoint, we need to run the following code:
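A one-line sketch:

```python
tf.train.latest_checkpoint(checkpoint_dir)
```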
Now we can use the custom function we created earlier to build a new model with batch_size=1, load the weights saved at the latest checkpoint with load_weights, and call the build function to build the model based on the input shapes it will receive (i.e., [1, None]). We can achieve all of this, and summarize() the new model, with the code below:
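A sketch of the rebuild step:

```python
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
```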
Our model is ready to make predictions, and all we need is a custom function to prepare our input for the model. We have to set the following:
- The number of characters to generate,
- Vectorizing the input (from string to numbers),
- An empty variable to store the result,
- A temperature value to manually adjust variability of the predictions,
- Devectorizing the output and also feeding the output to the model again for the next prediction,
- Joining all the generated characters to have a final string.
The custom function below does all of this:
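A sketch of such a function; the default of 1,000 generated characters and the categorical-sampling details follow the standard tutorial approach.

```python
def generate_text(model, start_string, num_generate=1000, temperature=1.0):
    # Vectorize the input string (from characters to numbers)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    text_generated = []  # empty list to store the results

    model.reset_states()
    for _ in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension: shape becomes (seq_len, vocab_size)
        predictions = tf.squeeze(predictions, 0)

        # A higher temperature flattens the distribution, making
        # less likely characters more probable
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(
            predictions, num_samples=1)[-1, 0].numpy()

        # Feed the prediction back in as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
        # De-vectorize the prediction and store it
        text_generated.append(idx2char[predicted_id])

    # Join all the generated characters into a final string
    return start_string + ''.join(text_generated)
```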
It returns our final prediction, and we can easily generate and print a text with the built-in print function, using the following code:
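A sketch, with an illustrative start string:

```python
print(generate_text(model, start_string=u'ROMEO: '))
```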
Using Gated Recurrent Units and the Shakespeare corpus, you built yourself a Shakespearean bot capable of generating text of any length.
Note that our model works with characters, so the miracle of the model is that it has learned to create meaningful words from characters. So, do not think that it merely strings a bunch of unrelated words together. It goes over thousands of words, learns the relationships between different characters, and learns how those characters are used to create meaningful words. Then it replicates this and returns sentences made of meaningful words.
Feel free to play around with the temperature to see how the output shifts from proper words to distorted ones. A higher temperature value increases the chance that our function chooses less likely characters; add them all up and we get less meaningful results. A lower temperature, on the other hand, causes the function to generate simpler text that is more of a copy of the original corpus.
Besides my latest content, I also share my Google Colab notebooks with my subscribers, containing full codes for every post I published. If you liked this post, consider subscribing to the newsletter:
Since you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin
If you like this article, check out my other NLP articles: