In this experiment, instead of using a premade architecture, I wanted to play around with a neural network architecture that felt more biologically inspired. I hadn’t seen much modern literature on inhibitory neurons in AI, or on concepts like neuroplasticity, so I set about seeing if I could create an architecture that included some of these core ideas from neuroscience.
After wrestling with the code and finally getting it to learn, I did some googling, and it turns out, unsurprisingly, that these types of networks have been known for almost four decades. I found this paper and many more by Erol Gelenbe from the ’80s on RNNs (not recurrent neural nets, but random neural nets, although they are recurrent), which are very close to what I’m doing in this Medium article, with some minor changes related to reinforcement learning.
From here on, I’ll talk about my implementation of these ideas. You can think of the architecture as a collection of neurons, a certain percentage of which are inhibitory and a certain percentage of which are excitatory. Unlike fully connected DNN layers, not every neuron needs to be connected to every other neuron. One can define a hyperparameter that biases how many outbound connections a neuron has. In this implementation, there are times when no neuron accumulates enough input to meet its firing threshold, so the network stalls. If the agent stays stuck in this state for too long, the system calls a “neuroplasticity” function, which re-initializes some of the connections that were deleted earlier in the learning process.
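To make that concrete, here’s a stripped-down sketch of the connection setup and the neuroplasticity reset. The class, parameter names, and numbers here are illustrative rather than lifted verbatim from my code:

```python
import random

class Network:
    def __init__(self, n_neurons, inhibitory_ratio=0.2, connection_prob=0.3):
        self.n = n_neurons
        # Each neuron is either excitatory (+1) or inhibitory (-1).
        self.sign = [-1 if random.random() < inhibitory_ratio else 1
                     for _ in range(n_neurons)]
        # Firing thresholds, and sparse outbound connections with random weights.
        self.threshold = [random.uniform(0.5, 1.5) for _ in range(n_neurons)]
        self.weights = {}  # (src, dst) -> connection strength
        for src in range(n_neurons):
            for dst in range(n_neurons):
                if src != dst and random.random() < connection_prob:
                    self.weights[(src, dst)] = random.uniform(0.0, 1.0)

    def neuroplasticity(self):
        # Called when the network stalls: re-grow a handful of random
        # connections, including ones pruned earlier during learning.
        for _ in range(self.n):
            src, dst = random.sample(range(self.n), 2)
            self.weights.setdefault((src, dst), random.uniform(0.0, 1.0))
```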
Notice there is no concept of layers in this system. The input and output neurons are excitatory. The blue hidden neurons are also excitatory; the red ones are inhibitory. The opacity of a connection represents its weight or, perhaps more intuitively, its strength. The size of a neuron represents the threshold required to fire its outbound connections. The weights and connections are randomly initialized at runtime. My naïve inspiration for this architecture was just observing how our own brain seems to fire.
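Continuing that sketch, a single update step just accumulates weighted input and fires whichever neurons cross their threshold, with excitatory sources adding to a target’s potential and inhibitory sources subtracting from it (injecting input and reading the output neurons is left out here):

```python
def step(net, activation):
    # activation maps neuron index -> accumulated potential this tick.
    # A neuron fires once its potential meets its threshold.
    fired = [i for i, a in activation.items() if a >= net.threshold[i]]
    new_activation = {i: 0.0 for i in range(net.n)}
    for src in fired:
        for (s, dst), w in net.weights.items():
            if s != src:
                continue
            # Excitatory sources push the target up, inhibitory ones pull it down.
            new_activation[dst] += net.sign[src] * w
    return fired, new_activation
```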
To make it learn, I simply introduced reward and punishment functions that adjust both the weights and the thresholds by a learning rate. We achieve a recency bias by pushing the most recent firings into an array each epoch and only rewarding/punishing the connections and associated neurons in that array. I could probably get a cleaner recency bias by applying some sort of decay over past epochs, but I didn’t get around to it.
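In code, reward and punishment are just small nudges applied to the connections and neurons recorded in that array. The exact update rule below is illustrative; note that a connection whose weight drops to zero gets deleted, which is what the neuroplasticity function later undoes:

```python
def reward(net, recent_firings, lr=0.05):
    # recent_firings: the (src, dst) connections that fired this epoch.
    for src, dst in recent_firings:
        # Strengthen the connection and make the source easier to fire.
        net.weights[(src, dst)] = net.weights.get((src, dst), 0.0) + lr
        net.threshold[src] = max(0.0, net.threshold[src] - lr)

def punish(net, recent_firings, lr=0.05):
    for src, dst in recent_firings:
        # Weaken the connection and make the source harder to fire;
        # connections that hit zero are deleted (pruned).
        w = net.weights.get((src, dst), 0.0) - lr
        if w <= 0:
            net.weights.pop((src, dst), None)
        else:
            net.weights[(src, dst)] = w
        net.threshold[src] += lr
```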
Let’s see how we do with a basic task. Here I reward the agent when it outputs numbers (0–9) without 1s in them and punish it otherwise. At first, it just spits out random numbers, but eventually it converges to a solution:
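Behind that demo, the judging step is about as simple as it gets (how the output neurons get decoded into a digit is glossed over here):

```python
def judge_digit(digit, net, recent_firings):
    # Reward any output without a 1 in it, punish everything else.
    if '1' not in str(digit):
        reward(net, recent_firings)
    else:
        punish(net, recent_firings)
```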
I noticed that it converges at different speeds and in different ways depending on the number of hidden neurons and the ratio of inhibitory to excitatory neurons. I didn’t have to adjust the learning rate at all for this toy problem.
Let’s look at a different task: drawing. Here I give the agent a blank 5 x 5 canvas to draw on. I wrote a reward function that rewards the agent when it draws on a cell with an even x-coordinate. Notice how it converges on a solution:
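The reward rule behind this demo is just as trivial; only the x-coordinate of the drawn cell matters:

```python
CANVAS_SIZE = 5

def judge_drawing(x, y, net, recent_firings):
    # Reward drawing on a cell with an even x-coordinate, punish otherwise.
    # The y-coordinate plays no role in this particular rule.
    if 0 <= x < CANVAS_SIZE and x % 2 == 0:
        reward(net, recent_firings)
    else:
        punish(net, recent_firings)
```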
I did have to increase the learning rate upon the neuroplasticity event to get this to converge. The problem is that there isn’t much room for generality or exploration here, as the network will only converge to a solution that meets the specification of our reward function. These hand-written reward functions drive pure exploitation; there’s no real exploration.
What if we learn the reward function itself? Sure, the computation would take more time, but the hope would be that we’d get a more general and creative system, which is the goal of AGI. Consider young children playing. They often choose goals and tasks that have no association with their lower-level reward system, yet they are still innately motivated by these underlying limbic system processes (they get hungry, angry, tired, etc.).
With a system like this, we would be guiding the reward system to learn when to reward or punish the agent, as opposed to directly rewarding the agent ourselves. This system should be able to create goals at its leisure that have nothing to do with its lower-level goals, much like children do when playing.
I accomplished this rather lazily by instantiating the same architecture a second time. This second object (the reward model) gets extra input neurons that fire whenever the agent’s output neurons fire, and its only two output neurons either reward or punish the agent. The hand-written reward function is then run against the reward model to guide the system.
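The wiring looks roughly like this. The run_network() helper below is a stand-in for whatever loop propagates activity through one of these networks and returns which output neurons fired plus the trace of connections to reward or punish, and the agreement rule in the last step is just one simple way of letting the hand-written reward function supervise the reward model:

```python
# Placeholder indices for the reward model's two output neurons.
REWARD_NEURON, PUNISH_NEURON = 0, 1

def train_with_learned_reward(agent, reward_model, hand_written_reward,
                              run_network, epochs=1000):
    # Assumed signature: run_network(net, extra_inputs=()) -> (fired_outputs, trace)
    for _ in range(epochs):
        # 1. Run the agent and note which of its output neurons fired.
        agent_out, agent_trace = run_network(agent)

        # 2. The reward model sees those firings on its extra input neurons;
        #    whichever of its two output neurons fires is its verdict.
        rm_out, rm_trace = run_network(reward_model, extra_inputs=agent_out)
        says_reward = REWARD_NEURON in rm_out

        # 3. The reward model's verdict trains the agent...
        if says_reward:
            reward(agent, agent_trace)
        else:
            punish(agent, agent_trace)

        # 4. ...while the hand-written reward function trains the reward model,
        #    reinforcing verdicts that agree with it and punishing the rest.
        if hand_written_reward(agent_out) == says_reward:
            reward(reward_model, rm_trace)
        else:
            punish(reward_model, rm_trace)
```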
This did work and converged without any problems. As expected, it took a lot longer to do so, but I didn’t see the creativity I had hoped for. That’s probably because my reward function stayed the same and wasn’t complex enough to produce any real difference in terms of creativity and exploration. I’m going to keep playing with that aspect of the system to see what emerges.
That’s all I have for now. Thanks for reading.