
Preface note
This story is part of a series I am creating about neural networks; it is the second part of a chapter dedicated to the perceptron network (you can read part 1 here). In this article I will apply the concepts explained in my last article and implement a perceptron network in Python.
Perceptron in a nutshell
We have seen in my previous articles that the perceptron is a type of neural network that uses the threshold function as an activation function. The output y is obtained by multiplying a set of weights W by a set of inputs X, adding the bias nodes (b) and then applying the threshold function f(z) to this result.
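Here the threshold function is simply f(z) = 1 if z ≥ 0 and f(z) = 0 otherwise, which is exactly what the activation code further below implements.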
Revisiting my last article, we have also seen how the W matrix would look with a more complex architecture, with m input nodes and n neurons. However, we have not yet seen what impact more complex architectures have on the notation used in the feedforward equation to obtain the network’s predictions.
(Note: If what comes next does not make much sense to you, I suggest you take a look at basic linear algebra concepts, namely the transpose and multiplication of matrices.)
Let’s assume m=3 and n=5, as in the last article. That means our W matrix will have 3 rows and 5 columns. If we transform the scalar operation to matrix notation, our feedforward equation becomes something like this:
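y = f(Wᵀ ∙ X + b), where Wᵀ is the transposed weight matrix (5 rows and 3 columns), X is the input column vector (3 rows), b is the bias column vector (5 rows) and y is the resulting output column vector (5 rows).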
From this “new” feedforward equation, some changes can be highlighted:
- The input X is now a matrix with 3 rows and a single column. It has 3 rows because we assumed m=3 (m is the number of input neurons). Each column represents one example, so it is also possible to inject all examples into X and get all the predictions in one shot by placing each sample of the dataset in its own column;
- The bias nodes b are now a matrix with 5 rows and a single column. This is because we need a bias node per neuron and we defined the number of neurons n as 5;
- y will have to be the same size as b because if there are 5 neurons, there will be 5 outputs as well.
The whole equation described above is normally written compactly as y = f(Wᵀ ∙ X + b). However, these two operations can be merged into a single one by placing b as an extra row of W and adding a special input node to X with a constant value of 1; the result is a mathematically equivalent expression with a single matrix multiplication.
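A quick way to check this equivalence numerically is with a minimal sketch, using hypothetical random values for the m=3, n=5 example from above:
import numpy as np

m, n = 3, 5
W = np.random.rand(m, n)         # weight matrix, one column per neuron
X = np.random.rand(m, 1)         # a single input example
b = np.random.rand(n, 1)         # one bias per neuron

# Original formulation: Wᵀ ∙ X + b
z1 = np.dot(W.T, X) + b

# Augmented formulation: b becomes an extra row of W,
# and X gets an extra input node fixed to 1
W_aug = np.vstack([W, b.T])      # shape (m + 1, n)
X_aug = np.vstack([X, [[1.0]]])  # shape (m + 1, 1)
z2 = np.dot(W_aug.T, X_aug)

print(np.allclose(z1, z2))       # True: both give the same pre-activation values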
Feedforward mechanism
In order to implement the feedforward mechanism we need three components:
- A bias node adder. It should add an extra row to X (with the value of 1) to accommodate the additional row for the biases in the weight matrix;
- An activation function. It should receive an input z and apply the threshold function in order to provide y;
- A feedforward function. It should receive one or more examples (our X), each one filling the m entries of the network’s input layer, and a weight matrix that describes the connections’ weights. Its output should be y.
import numpy as np

def activation(z):
    # Threshold function: 1 if z >= 0, 0 otherwise
    return np.where(z >= 0, 1, 0)

def add_bias_node(X):
    # Append a row of ones to X to match the bias row in the weight matrix
    return np.vstack([X, np.ones(X.shape[1])])

def feed_forward(W, X):
    # y = f(Wᵀ ∙ X), with the biases already folded into W and X
    return activation(np.dot(np.transpose(W), add_bias_node(X)))
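As a quick sanity check (a sketch with hypothetical random values for the m=3, n=5 architecture discussed earlier), the output has one row per neuron and one column per example:
W_demo = np.random.rand(3 + 1, 5)  # (m + 1) x n, the extra row holds the biases
X_demo = np.random.rand(3, 10)     # 10 examples, one per column
y_demo = feed_forward(W_demo, X_demo)
print(y_demo.shape)                # (5, 10): 5 outputs for each of the 10 examples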
Training a perceptron
The perceptron training algorithm is an iterative process that aims to find the set of network connection weights that minimises the overall error of the network. A single weight of the network can be updated using the following rule: wᵢⱼ ← wᵢⱼ − Δᵢⱼ, with Δᵢⱼ = µ ∙ xᵢ ∙ eⱼ.
At each iteration (also called an epoch), we evaluate the errors (e) by calculating the difference between the predicted and the actual outputs, and use them together with the input and a learning rate (µ) coefficient to compute how much each weight is going to change (also known as a delta, or Δ).
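As a small worked example: with a learning rate µ = 0.1, an input value xᵢ = 2 and an error eⱼ = 1 (the network predicted 1 while the target was 0), the corresponding weight would be decreased by Δ = 0.1 ∙ 2 ∙ 1 = 0.2.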
The pseudocode of the training process is described in the example below:
W = initialize weight matrix with dimensions (M+1, N)
for each epoch (e):
    for each example (x):
        y = calculate network prediction using W and x
        error = calculate deviation between y and target (t)
        delta = calculate weight variations based on the error and learning rate
        update W with deltas
return W
After materialising this into Python code, we get something like this:
def train(W, inputs, targets, epochs, lr):
    for epoch in range(epochs):
        for x, t in zip(inputs.transpose(), targets):
            # Predict the output for this single example
            y = feed_forward(W, x.reshape(1, -1).transpose())
            # Error between prediction and target
            errors = np.subtract(y, t)
            # Delta per weight: learning rate * input (with bias node) * error
            x_with_bias = add_bias_node(x.reshape(-1, 1))
            deltas = lr * np.dot(x_with_bias, errors.transpose())
            # Update the weights
            W = np.subtract(W, deltas)
    return W
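As a quick sanity check of the training loop, here is a sketch on a hypothetical toy problem, the OR gate, which is linearly separable, so the perceptron should end up classifying all four inputs correctly:
X_or = np.array([[0, 0, 1, 1],
                 [0, 1, 0, 1]])
t_or = np.array([0, 1, 1, 1])

W_or = train(np.random.rand(2 + 1, 1), X_or, t_or, 50, 0.1)
print(feed_forward(W_or, X_or).flatten())  # expected: [0 1 1 1] once the weights have converged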
Applying the perceptron to an ML problem
Now that we have built the foundations of the perceptron neuron, we can apply it to the Titanic dataset, a well-known dataset mostly used as an introduction to machine learning basics. The goal is to predict who survived the Titanic disaster based on a set of features described in the dataset.
In the context of this article, I used only two features to train the model: the fare paid by a passenger and their age, as this makes it easier to visualize the dataset and the model. Using two features means working in a 2D space, and it also means that our neural network architecture is composed of a single neuron (n=1) with two entries (m=2), or three if we include the bias node.
import seaborn as sns

titanic = sns.load_dataset('titanic')
features = ['fare', 'age']                 # the two features discussed above
titanic = titanic.dropna(subset=features)  # rows with missing ages would break the training (assumed preprocessing)

M = 2
N = 1

X = np.asarray(titanic[features]).transpose()
W = np.random.rand(M + 1, N)  # + 1 because of the bias node
t = titanic['survived'].to_numpy()
In the piece of code above we gathered X, W and t in order to compute y and compare the error rate before and after the training process. W was initialised randomly, following a uniform distribution between 0 and 1, which is one possible way to initialise the weights of a network. Bear in mind that this is a very important topic in the context of training, as the initialisation process can have a big impact on the training speed. However, it will be detailed later in the series and is out of the scope of this article.
Now that we have all the matrices in place with the right shapes, let’s apply the feedforward mechanism before and after training:
y = feed_forward(W, X)
error_rate(y.flatten().tolist(), t.tolist())  # 0.59383

W_trained = train(W, X, t, 20, 0.1)

y = feed_forward(W_trained, X)
error_rate(y.flatten().tolist(), t.tolist())  # 0.31512
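The error_rate helper is not shown here; a minimal sketch, assuming it simply returns the fraction of predictions that do not match the targets, could look like this:
def error_rate(predictions, targets):
    # Fraction of examples where the predicted class differs from the target
    mismatches = sum(1 for p, t in zip(predictions, targets) if p != t)
    return mismatches / len(targets)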
We managed to decrease the fraction of examples incorrectly predicted by the neural network from 0.59 to 0.31, which means that our model’s accuracy is roughly 69%. But how did the neural network’s ability to predict our samples change throughout the training process? To answer this, we can simply get the predictions at each epoch and draw the classifier’s decision boundary. In the 2D context, this boundary is the line for which there is an equal probability of a given passenger having survived or not.
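As a rough sketch of how such a boundary can be drawn from the trained weights (assuming matplotlib and the W_trained, X and t arrays defined above; the plotting code used for the animation may differ), the boundary is simply the line where the pre-activation value equals zero:
import matplotlib.pyplot as plt

w_fare, w_age, b = W_trained.flatten()  # weights for fare and age, plus the bias (last row)

fare = np.linspace(X[0].min(), X[0].max(), 100)
age = -(w_fare * fare + b) / w_age      # solve w_fare*fare + w_age*age + b = 0 for age

plt.scatter(X[0], X[1], c=t, s=10)      # passengers, coloured by survival
plt.plot(fare, age)                     # decision boundary
plt.xlabel('fare')
plt.ylabel('age')
plt.show()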
In the gif below you can see the network’s predictions together with the classifier’s decision boundary. Orange samples are passengers correctly predicted as survivors, green samples are passengers correctly predicted as non-survivors and gray samples are the ones that were wrongly predicted by the perceptron.
Final observations
Although we managed to improve the accuracy of the network, there are a few observations we can jot down from the training process shown in the animation above:
- The problem is not linearly separable
We have seen that the perceptron convergence theorem proves that it is possible to achieve an optimal solution if the classes are linearly separable; the XOR problem also illustrates this limitation of the perceptron (a short sketch of it follows these observations). In our problem we can see that the perceptron struggles to find a line that completely separates the negative from the positive class… because such a line simply cannot be drawn: there will always be gray dots in the plot, a.k.a. wrong predictions.
- Adding more features may help improve the performance
We picked two features to solve this problem. However, what would happen if we added more features? Perhaps linear separability would be achieved by a hyperplane that separates both classes, or at least we would get closer to that. I did that small experiment in my perceptron notebook.
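For reference, here is the XOR sketch mentioned above (a hypothetical toy example reusing the train and feed_forward functions defined earlier): no matter how many epochs we run, a single perceptron can never produce the correct outputs for all four XOR inputs, because no straight line separates the two classes.
X_xor = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]])
t_xor = np.array([0, 1, 1, 0])

W_xor = train(np.random.rand(2 + 1, 1), X_xor, t_xor, 100, 0.1)
print(feed_forward(W_xor, X_xor).flatten())  # never matches [0 1 1 0], regardless of the number of epochs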