If you listened to the music produced by the Markov chain, I’m sure you were not very captivated. To move a step closer to the music in your playlists, recurrent neural networks (RNNs) offer a faster, more versatile path. Before we talk about RNN models, we must understand how an artificial neural network (ANN) functions. When provided with a matrix of data, the neural network runs it through a series of hidden layers made up of nodes and connections. These hidden layers contain parameters (weights) that control the strength of the connection between two nodes.
The diagram above shows an input being multiplied by a weight and a bias being added to produce the final output. These two trainable parameters, the weight and the bias, develop over time through back-propagation: going back through the neural network and adjusting the parameters, via gradient descent, based on the error in the output. Each iteration of data is treated completely independently of the last. After countless iterations, the model becomes a better and better approximation of the situation we are trying to teach it. While this ANN is useful in many scenarios, music is made of sequential values rather than independent ones. To combat the independence of each iteration in the ANN, a deep RNN feeds previous abstractions of the data back into the model. This recurrence allows the network to keep track of a sequence of values and makes each new calculation dependent on the previous ones.
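To make that difference concrete, here is a rough sketch in Python with NumPy. Nothing here comes from a specific library or paper: the layer sizes, variable names, and random “notes” are all invented purely for illustration.

```python
import numpy as np

def feed_forward(x, W, b):
    # Plain ANN step: the output depends only on the current input.
    return np.tanh(W @ x + b)

def recurrent_step(x, h_prev, W_x, W_h, b):
    # RNN step: the previous hidden state h_prev is fed back in,
    # so the new output depends on everything seen so far.
    return np.tanh(W_x @ x + W_h @ h_prev + b)

# Toy dimensions: 4 input features, 3 hidden units (arbitrary choices).
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)                        # initial hidden state
for note in rng.normal(size=(5, 4)):   # a stand-in "sequence of notes"
    h = recurrent_step(note, h, W_x, W_h, b)
```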
The implementation of RNNs in music composition can be traced back to Peter M. Todd, who, in 1989, attempted to generate a piece and documented the whole process in “A Connectionist Approach to Algorithmic Composition.”
As we discussed earlier, we can see through the feedback loop that Peter M. Todd is using a recurrent network to produce a sequential chain of notes (Note N, Note N + 1, and so on). Taking influence from his work, Michael C. Mozer composed a piece, “After Bach”, which can be seen below.
This piece was a major step up from the music created by the Markov chains. It almost sounds like it could be part of a Mario game. Although it may sound better, RNNs are limited by their short-term memory. When faced with a long piece of music, the RNN will struggle to retain the beginning of the song. For example, if a Mozart piece has a mellow beginning, the RNN may forget that and focus only on the characteristics of the last segment. This happens due to the vanishing gradient problem, which occurs in a neural network when the gradient, the derivative of the sigmoid function, is too small.
Although we do not have the time to go too deep into this problem, a sigmoid function is an activation function used in machine learning for its ability to map values to probabilities between 0 and 1. During back-propagation of a neural network, we use the derivative of the sigmoid function to make edits to the weights. If you look at the domain outside of [-4, 4], you can see that these gradient values approach 0. The effect of the gradient slowly “vanishes”, preventing the RNN from actually learning and adjusting its nodes. As a result, the model can only learn a certain amount of music before the problem starts to arise.
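If you want to see the numbers for yourself, here is a tiny sketch (plain Python/NumPy, values rounded; nothing model-specific) showing how quickly the sigmoid’s derivative shrinks outside of roughly [-4, 4]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0, 2, 4, 6, 10]:
    print(x, round(float(sigmoid_grad(x)), 5))
# 0  0.25
# 2  0.10499
# 4  0.01766
# 6  0.00247
# 10 0.00005

# Back-propagating through many time steps multiplies many such values together,
# e.g. 0.25 ** 20 is already about 9e-13 -- the gradient has effectively vanished.
```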
Long Short-Term Memory
The limitations of a basic RNN model can be heard in the monotonous tone of “After Bach”, the piece produced by Michael C. Mozer. To combat this problem, Douglas Eck replaced the nodes in the RNN with “Long Short-Term Memory” (LSTM) cells. You can view his documentation of the project here. Rather than just accepting that this method works, let’s talk in depth about what an LSTM actually is. With the RNN, we saw that the vanishing gradient problem arose because multiplying so many small gradients together leads to a gradual halt in learning. LSTM cells restructure the network so that the gradient is preserved inside each cell’s state, which lets the network back-propagate efficiently over long sequences.
Above is a very abridged version of an LSTM. When we contrast it with a standard recurrent neural network, the difference primarily comes in the “state” portion of the cell. An LSTM, in essence, uses this state along with the previous output and the new input to teach the network. The outputs we receive from the network are then used as future inputs, while the state is modified during every iteration.
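To make the role of the state and its gates a bit more tangible, here is a heavily simplified single LSTM cell in NumPy. The gate names (forget, input, output) follow the standard formulation, but the sizes and everything else are made up for this sketch and are not taken from Eck’s implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    # Each gate sees the new input x and the previous output h_prev.
    z = np.concatenate([x, h_prev])
    f = sigmoid(params["W_f"] @ z + params["b_f"])   # forget gate: how much old state to keep
    i = sigmoid(params["W_i"] @ z + params["b_i"])   # input gate: how much new info to write
    o = sigmoid(params["W_o"] @ z + params["b_o"])   # output gate: how much state reaches the output
    g = np.tanh(params["W_g"] @ z + params["b_g"])   # candidate values for the state
    c = f * c_prev + i * g      # the state is updated additively rather than by repeated
    h = o * np.tanh(c)          # multiplication, which is what lets gradients survive long sequences
    return h, c

# Toy sizes: 4 input features, 3 hidden units (arbitrary).
rng = np.random.default_rng(0)
params = {f"W_{k}": rng.normal(size=(3, 7)) for k in "fiog"}
params.update({f"b_{k}": np.zeros(3) for k in "fiog"})

h, c = np.zeros(3), np.zeros(3)
for note in rng.normal(size=(5, 4)):   # a stand-in "sequence of notes"
    h, c = lstm_step(note, h, c, params)
```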
Another variation that appears in the LSTM is the set of input and output gates. These gates manage the behavior of the network using a combination of weights and biases. For example, the orange gate on the left of the diagram manages how much the state value will actually affect the output. The gates on the right manage how much the output and the state are varied in the current iteration. Although talking conceptually about LSTM cells gives you a good idea of how they work, the mathematics behind the system is enough for a whole new article. The implementation of LSTM systems by Douglas Eck can be seen below in his “Blues Improvisation” piece.
Although it isn’t necessarily something you would ever want to listen to in your free time, when we put this piece in contrast with “After Bach” and “Pithoprakta”, we see ourselves moving toward more idealistic compositions of music and away from a random assortment of notes.