Keras has three built-in RNN layers: SimpleRNN, LSTM, and GRU.
Starting with a vocabulary size of 1000, a word can be represented by a word index between 0 and 999. For example, the word “fantastic” can be encoded as the integer 361. In the code example below, the embedding layer takes a sequence of word indexes, representing a text, and transforms it into a sequence of 64-D vectors.
Next, an LSTM layer converts this sequence of vectors into a single 128-D vector. Finally, a dense layer converts it into a 10-D vector for classification, with one prediction for each class.
Here is a summary of the model.
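A minimal sketch of the model described above, written with the tf.keras Sequential API. The vocabulary size, embedding width, LSTM width, and number of classes follow the text; the explicit Input layer and the random toy batch are assumptions added so the snippet runs on its own.

```python
import tensorflow as tf

layers = tf.keras.layers

model = tf.keras.Sequential([
    # Variable-length sequences of integer word indexes.
    tf.keras.Input(shape=(None,), dtype="int32"),
    # Map each word index (0..999) to a 64-D vector.
    layers.Embedding(input_dim=1000, output_dim=64),
    # Keep only the hidden state of the last timestep: one 128-D vector.
    layers.LSTM(128),
    # One score per class.
    layers.Dense(10),
])
model.summary()

# Assumed toy batch: 2 texts of 7 word indexes each.
batch = tf.random.uniform((2, 7), minval=0, maxval=1000, dtype=tf.int32)
logits = model(batch)
print(logits.shape)  # (2, 10)
```

The summary should list the Embedding, LSTM, and Dense layers in that order, with the sequence dimension collapsed after the LSTM.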
For reference, the diagram below shows an LSTM cell. In the code example above, the LSTM returns the hidden state of the last timestep (a 128-D vector) as its output.
In the example below, we replace the LSTM module with a GRU. If we set return_sequences to True, the GRU returns the hidden states from every timestep instead of only the last one. In the diagram below, each hidden state of the GRU is then fed into the corresponding input of the SimpleRNN layer. We take the last hidden state of the SimpleRNN and feed it into a dense layer for classification.
Here is the corresponding code
and the model summary.
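A sketch of the stacked GRU and SimpleRNN model described above, using the functional API. The layer widths (256 for the GRU, 128 for the SimpleRNN) and the 10 output classes are assumptions chosen for illustration; the text does not specify them.

```python
import tensorflow as tf

layers = tf.keras.layers

inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(input_dim=1000, output_dim=64)(inputs)
# return_sequences=True: one hidden state per timestep, (batch, timesteps, 256).
x = layers.GRU(256, return_sequences=True)(x)
# SimpleRNN consumes the whole sequence and keeps its last hidden state, (batch, 128).
x = layers.SimpleRNN(128)(x)
outputs = layers.Dense(10)(x)

model = tf.keras.Model(inputs, outputs)
model.summary()

# Assumed toy batch: 3 texts of 6 word indexes each.
batch = tf.random.uniform((3, 6), minval=0, maxval=1000, dtype=tf.int32)
print(model(batch).shape)  # (3, 10)
```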
return_sequences
As shown before, if we want to output all the hidden states of an LSTM or a GRU layer, we set return_sequences to True.
By setting return_state to True below, the LSTM returns the output together with the hidden state and the cell state of the last timestep. In this example, “output” is identical to the last hidden state state_h, so it is redundant. But if return_sequences is also True, “output” contains the hidden states of all timesteps, not just state_h from the last one.
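The relationship between “output”, state_h, and state_c can be checked with a short sketch. The input shape and the 16 units are arbitrary assumptions for the demonstration.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# Assumed toy input: batch of 2, 5 timesteps, 8 features.
x = tf.random.normal((2, 5, 8))

# return_state=True: output plus the last hidden state and cell state.
lstm = layers.LSTM(16, return_state=True)
output, state_h, state_c = lstm(x)
print(output.shape, state_h.shape, state_c.shape)  # (2, 16) each
print(np.allclose(output.numpy(), state_h.numpy()))  # output duplicates state_h

# Adding return_sequences=True: output now holds every timestep's hidden state.
lstm_seq = layers.LSTM(16, return_sequences=True, return_state=True)
all_h, last_h, last_c = lstm_seq(x)
print(all_h.shape)  # (2, 5, 16)
print(np.allclose(all_h[:, -1].numpy(), last_h.numpy()))  # last step equals state_h
```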
The initial_state argument supplies the hidden state and the cell state that are fed into the first timestep. By default, the initial states in LSTM and GRU are zero-filled. But in an encoder-decoder architecture, we can use the last hidden state and the cell state of the encoder (state_h and state_c) to initialize the decoder.
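A minimal sketch of handing the encoder states to a decoder through initial_state. Both layers are called directly on random tensors; the shapes and the 16 units are assumptions, and a real encoder-decoder model would add embeddings and an output layer.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# Assumed toy data: batch of 2, 8 features per timestep.
encoder_inputs = tf.random.normal((2, 6, 8))  # 6 source timesteps
decoder_inputs = tf.random.normal((2, 4, 8))  # 4 target timesteps

encoder = layers.LSTM(16, return_state=True)
decoder = layers.LSTM(16, return_sequences=True)

# Keep only the encoder's final hidden state and cell state.
_, state_h, state_c = encoder(encoder_inputs)

# Start the decoder from the encoder's final states instead of zeros.
decoder_outputs = decoder(decoder_inputs, initial_state=[state_h, state_c])
print(decoder_outputs.shape)  # (2, 4, 16)

# For comparison: the same decoder started from the default zero state.
zero_start_outputs = decoder(decoder_inputs)
```

The two decoder runs share weights and inputs, so any difference between them comes only from the initial state.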
By default, the initial states of the RNN cells are reset for every batch of samples. However, there are situations where we want to keep the states between batches. For example, in meta-learning, we keep learning from previous experience and we don’t want to reset the experience. In other cases, the input sequence may be too long, so we break it up into sub-sequences during training. In this situation, we do not reset the state between sub-sequences. To keep the state of a cell between batches, we set stateful=True. To reset it, we call lstm_layer.reset_states().
Here is an example in which we treat three paragraphs as a single sample. We keep the cell states throughout the process and reset them only when the sample is done.
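A sketch of the stateful pattern just described. The three random tensors stand in for three consecutive paragraphs of the same 20 samples; their shapes and the 64 units are assumptions. The reset call uses reset_states() as named in the text; newer Keras releases may also expose it under a slightly different name.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# Assumed stand-ins for three consecutive chunks of the same 20 sequences:
# 10 timesteps per chunk, 50 features per timestep.
paragraph1 = tf.random.normal((20, 10, 50))
paragraph2 = tf.random.normal((20, 10, 50))
paragraph3 = tf.random.normal((20, 10, 50))

lstm_layer = layers.LSTM(64, stateful=True)

# The state left by each call becomes the initial state of the next call.
out1 = lstm_layer(paragraph1)
out2 = lstm_layer(paragraph2)
out3 = lstm_layer(paragraph3)

# The sample is done: go back to the zero state.
lstm_layer.reset_states()

# After the reset, the first chunk produces the same output as before.
out1_again = lstm_layer(paragraph1)
print(np.allclose(out1.numpy(), out1_again.numpy(), atol=1e-5))
```

With stateful=True the batch size has to stay the same across calls, because sample i of one batch continues sample i of the previous batch.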
The following diagram shows a bidirectional RNN, which contains a forward LSTM and a backward LSTM. For each timestep, we merge the results of the forward pass and the backward pass to generate an output. There are different options for how the merge is done, for example concatenation, addition, or multiplication.
Here is the code for constructing a classifier using bidirectional layers.
The first bidirectional LSTM has an input shape of (None, 5, 10). With return_sequences=True, it outputs 5 hidden states, one for each timestep. By default, a bidirectional LSTM concatenates the results of the forward and backward passes (merge_mode='concat'). Hence, the output shape of the first layer is (None, 5, 128), which doubles the 64-D output dimension of a single forward LSTM layer.
For the second bidirectional layer, we take only the final output (by default, return_sequences=False). The output of the bidirectional layer merges the last outputs of the forward pass and the backward pass. Again, the default is concatenation, so the output shape is (None, 64), since the forward and backward LSTMs each output a 32-D vector.
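The shapes discussed above can be reproduced with a short sketch. The LSTM widths (64 and 32) and the input shape follow the text; the batch size of 2, the final Dense(10) layer, and the extra merge_mode="sum" variant are assumptions added for illustration.

```python
import tensorflow as tf

layers = tf.keras.layers

# Assumed toy batch: 2 samples, 5 timesteps, 10 features.
x = tf.random.normal((2, 5, 10))

bi1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))
bi2 = layers.Bidirectional(layers.LSTM(32))
dense = layers.Dense(10)

h1 = bi1(x)         # forward 64 + backward 64, for every timestep
h2 = bi2(h1)        # forward 32 + backward 32, last outputs only
logits = dense(h2)
print(h1.shape, h2.shape, logits.shape)  # (2, 5, 128) (2, 64) (2, 10)

# A different merge option: summing keeps the width of a single LSTM.
bi_sum = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True), merge_mode="sum"
)
print(bi_sum(x).shape)  # (2, 5, 64)
```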
Here is a model example of using RNN to classify the sentiment of a movie review (positive or negative).
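One plausible sketch of such a sentiment classifier is shown here. The 10,000-word vocabulary, the layer sizes, the bidirectional LSTM, and the random toy batch are all illustrative assumptions, not values taken from the text.

```python
import tensorflow as tf

layers = tf.keras.layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    # Assumed vocabulary of 10,000 words, embedded into 64-D vectors.
    layers.Embedding(input_dim=10000, output_dim=64),
    # Read each review in both directions and keep the merged last outputs.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    # A single probability: close to 1 for positive, close to 0 for negative.
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Assumed toy batch: 4 reviews of 20 word indexes each.
reviews = tf.random.uniform((4, 20), minval=1, maxval=10000, dtype=tf.int32)
probabilities = model(reviews)
print(probabilities.shape)  # (4, 1)
```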
In TensorFlow 2, the built-in LSTM and GRU layers will leverage CuDNN kernels by default when an Nvidia GPU is available. Nevertheless, if any of the default configurations below are changed, CuDNN will not be used. So be aware of the performance impact of choosing a non-standard configuration.
- Change from the tanh activation function.
- Change the recurrent_activation function from sigmoid.
- Change recurrent_dropout from 0.
- Change unroll from False.
- Change use_bias from True.
- If masking is used, change from right padding (discussed later).
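The configuration-related items in the list above can be turned into a quick check. The helper below, uses_cudnn_defaults, is hypothetical (it is not part of Keras); it only inspects the layer configuration, ignores the masking condition, and does not verify which kernel actually runs.

```python
import tensorflow as tf

layers = tf.keras.layers


def uses_cudnn_defaults(layer):
    """Hypothetical helper: True if the layer keeps the defaults listed above."""
    cfg = layer.get_config()
    return bool(
        cfg["activation"] == "tanh"
        and cfg["recurrent_activation"] == "sigmoid"
        and cfg["recurrent_dropout"] == 0
        and not cfg.get("unroll", False)
        and cfg["use_bias"]
    )


default_lstm = layers.LSTM(64)
# Any of the listed changes, here recurrent_dropout, takes the layer off the defaults.
custom_lstm = layers.LSTM(64, recurrent_dropout=0.2)

print(uses_cudnn_defaults(default_lstm))  # True
print(uses_cudnn_defaults(custom_lstm))   # False
```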
RNN, LSTM, and GRU layers in TF handle variable-length sequences without extra coding. You can call model(input) with “input” having a different number of timesteps (sequence length) each time. The real issue is in training. Training takes an input tensor of shape (None, None, embedding_dim), in which the first dimension is the batch size and the second dimension is the sequence length. Within one batch, every sequence must therefore have the same length.
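A small sketch of the point about inference: the same layer accepts inputs with different numbers of timesteps. The feature size of 8 and the 16 units are arbitrary assumptions.

```python
import tensorflow as tf

layers = tf.keras.layers

lstm = layers.LSTM(16)

# Two assumed batches with different sequence lengths (3 and 7 timesteps).
short_batch = tf.random.normal((2, 3, 8))
long_batch = tf.random.normal((4, 7, 8))

short_out = lstm(short_batch)
long_out = lstm(long_batch)
print(short_out.shape, long_out.shape)  # (2, 16) (4, 16)
```

Each call works because all sequences inside a given batch share one length; only the length across calls differs.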
Padding
Unless you have a batch size of one in training, you need to pad the input to have a fixed length, like the code below.
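A minimal padding sketch using pad_sequences from tf.keras. The three word-index sequences are made-up examples; padding="post" appends zeros on the right, which is the right padding mentioned earlier.

```python
import tensorflow as tf

# Assumed toy data: three texts of different lengths, as word indexes.
raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# Pad with zeros on the right up to the length of the longest sequence.
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, padding="post"
)
print(padded_inputs)
# [[ 711  632   71    0    0    0]
#  [  73    8 3215   55  927    0]
#  [  83   91    1  645 1253  927]]
```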
The mask_zero flag in Embedding instructs the layer to treat the value 0 as padding and ignore the corresponding input.
If mask_zero is True, the Embedding layer also generates a separate mask tensor, masked_output._keras_mask, for the corresponding input. This mask tensor will be propagated to the next layer.
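A sketch of the mask produced by an Embedding layer with mask_zero=True. It reads the mask through compute_mask, which returns the same information that the text describes as masked_output._keras_mask; the vocabulary size, embedding width, and input values are assumptions.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# Assumed right-padded word indexes; 0 is the padding value.
padded_inputs = tf.constant(
    [
        [711, 632, 71, 0, 0, 0],
        [73, 8, 3215, 55, 927, 0],
        [83, 91, 1, 645, 1253, 927],
    ],
    dtype="int32",
)

embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)
print(masked_output.shape)  # (3, 6, 16)

# True marks real words, False marks padding.
mask = embedding.compute_mask(padded_inputs)
print(np.array(mask))
# [[ True  True  True False False False]
#  [ True  True  True  True  True False]
#  [ True  True  True  True  True  True]]
```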
Here is the code for setting up an embedding layer with masking in a Sequential Model.
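A sketch of a Sequential model whose Embedding layer creates the mask and whose LSTM consumes it. The layer sizes and the sample sequence are assumptions. Comparing a padded input with its unpadded version shows that the padded timesteps are skipped.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    # mask_zero=True: index 0 is padding and produces a False mask entry.
    layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    # The LSTM receives the mask and skips the masked timesteps.
    layers.LSTM(32),
])

padded = tf.constant([[711, 632, 71, 0, 0, 0]], dtype="int32")
unpadded = tf.constant([[711, 632, 71]], dtype="int32")

out_padded = model(padded)
out_unpadded = model(unpadded)

# The trailing zeros do not change the result.
print(np.allclose(out_padded.numpy(), out_unpadded.numpy(), atol=1e-5))
```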
Custom Layer
The mask is passed to a layer through the “mask” argument of “call”. In the code below, before computing the softmax value, the layer sets all the scores corresponding to padded inputs to zero.
However, by default, the mask will be passed only once and it will be destroyed after this layer. To pass the mask on to the next layer, set supports_masking to True.
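A sketch of a custom layer that consumes a mask in call and keeps it alive with supports_masking. The layer name TemporalSoftmax, the 2-D score layout (batch, timesteps), and the sample values are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers


class TemporalSoftmax(layers.Layer):
    """Softmax over the time axis that ignores padded timesteps."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Let the incoming mask flow on to the next layer.
        self.supports_masking = True

    def call(self, inputs, mask=None):
        # inputs: (batch, timesteps) scores; mask: (batch, timesteps) booleans.
        if mask is None:
            return tf.nn.softmax(inputs, axis=-1)
        float_mask = tf.cast(mask, inputs.dtype)
        # Zero out the exponentiated scores of padded positions.
        exp_scores = tf.exp(inputs) * float_mask
        return exp_scores / tf.reduce_sum(exp_scores, axis=-1, keepdims=True)


scores = tf.constant([[1.0, 2.0, 3.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
mask = tf.constant([[True, True, True, False], [True, True, False, False]])

weights = TemporalSoftmax()(scores, mask=mask)
print(weights.numpy())
```

Each row sums to 1 over its unmasked positions, and the padded positions receive a weight of exactly 0.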
Generate mask in a custom layer
A layer can not only consume a mask but also create one. This is done by implementing “compute_mask”.
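A sketch of a layer that creates a mask by implementing compute_mask. OneHotWithMask is a hypothetical layer invented for this example: it one-hot encodes word indexes and marks index 0 as padding. The mask is handed to the LSTM explicitly here so that the sketch does not depend on how a particular Keras version propagates masks.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers


class OneHotWithMask(layers.Layer):
    """Hypothetical layer: one-hot encodes word indexes and masks index 0."""

    def __init__(self, depth, **kwargs):
        super().__init__(**kwargs)
        self.depth = depth

    def call(self, inputs):
        return tf.one_hot(inputs, depth=self.depth)

    def compute_mask(self, inputs, mask=None):
        # True for real words, False for the padding index 0.
        return tf.not_equal(inputs, 0)


layer = OneHotWithMask(depth=10)
word_ids = tf.constant([[3, 7, 0, 0], [5, 0, 0, 0]], dtype="int32")

encoded = layer(word_ids)
mask = layer.compute_mask(word_ids)
print(encoded.shape)   # (2, 4, 10)
print(np.array(mask))
# [[ True  True False False]
#  [ True False False False]]

# A downstream layer can consume the generated mask.
lstm_out = layers.LSTM(8)(encoded, mask=mask)
print(lstm_out.shape)  # (2, 8)
```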
TensorFlow tutorial
TensorFlow guide