Splitting the Data
Now that we’ve turned the data from machine gibberish into something machine-readable, we have to split it up along two axes: input/output and training/testing. The first step is to split it into input and output data, because once we’ve done that, Python has a nice module that will do the train-test split for us. Separating input and output data just means putting the predictors we’re using into one group and the thing we’re trying to predict into another. In the case of mushrooms, it means we need to create one list that holds the lists of their characteristics (recall that these are color, shape, etc.) and another that says whether or not they are poisonous. Here’s what that looks like in Python:
#Split data into inputs and outputs
input_data = [] #a list for inputs
output_data = [] #a list for outputs
for i in mushrooms: #loop through each individual mushroom in mushrooms
    input_data.append(i[:-1]) #add everything except the last item to inputs
    if i[-1] == 16: #edible mushrooms have 16 as the last item; we turn this into a one
        output_data.append(1)
    else: #otherwise the mushroom is not safe, and we represent this with a zero
        output_data.append(0)
This code creates empty lists for the characteristics of mushrooms and their edibleness (also a word). Since everything in our data file except for the last column is supposed to move us toward the goal of figuring out whether the last column says ‘edible’ or ‘poisonous’ (or, for our network, 16 or 44), all the data for each mushroom except the last column goes into the input list, and the last number goes into the output list. The code accomplishes this by looping through mushrooms and doling out the data into input_data and output_data according to that condition. Crucially, the first row of input_data matches up with the first row of output_data and so on; thus our network is able to use the two lists to see how well it’s doing at predicting the safety of a mushroom and gradually adjust itself.
You’ll notice that for the list of outputs, instead of 16 and 44 I use 1 and 0, respectively; this is to take advantage of TensorFlow’s ability to model binary functions whose output is specifically either one or zero.
The last piece of cake cutting we have to do before we can build our network is a train-test split. This simply means that we set aside a certain portion of our data to test our network on after we’ve used the other portion to train it. This split helps us catch and avoid overfitting, a pitfall where our network gets so good at one particular set of data that it no longer predicts the general case very well. The good news: this split isn’t just easy, it’s two-lines-of-code easy with sklearn’s train_test_split function. It works like this:
#Split the data
from sklearn.model_selection import train_test_split
input_training, input_testing, output_training, output_testing = train_test_split(input_data, output_data, test_size = 0.2)
The function returns four lists, which I’ve named input_training, input_testing, output_training, and output_testing. I’ve chosen to set the parameter test_size to 0.2, which means that the parts returned that concern training contain 80% of the data and the parts concerning testing contain 20%. Now that we’re finally done with all the boring data manipulation, we can get to creating a network.
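If you want to convince yourself the split worked, a quick sanity check (purely illustrative, not part of the pipeline) is to print the sizes of the four lists; the training lists should hold roughly four times as many mushrooms as the testing lists:
#Check the 80/20 split
print(len(input_training), len(output_training)) #about 80% of the mushrooms
print(len(input_testing), len(output_testing)) #the remaining 20%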
Building the Network
I’ve decided to use a very simple model for our network, called the sequential model. It can hold as many layers as we want, and in this model each layer gets its data from the previous one and feeds into the next one. I’ve also used dense layers, which means that every node is connected (via a weight) to every node in the layers before and after it, where those layers exist (the first layer has no layer before it, and the last has none after it).
With that said, we’ve got a couple of decisions about our layers that the data has already made for us and some that we need to make for ourselves. We know that our input layer has to have 22 nodes, as we have 22 different potential predictors of edibleness. We also know that our output layer only needs one node (which I will refer to as the output node) because our only possible outputs are one, meaning edible, and zero, meaning poisonous. Of course, the output node won’t always have the value of exactly one or zero; rather, it will be some value that can be interpreted as one or zero. For example, our network may figure out that if the value of the output node is above 0.5 the mushroom is edible and otherwise it is poisonous. We will help our network do this a little bit later with the sigmoid activation function for the output node.
What we can decide on is the number of hidden layers we have and how many nodes are in those layers. There’s an intrinsic tradeoff here: the more hidden layers and nodes we have, the more complex a function our network can model, but it will also be slower as it has to figure out more weights and consequently do more burdensome math. The correct balance will vary for each data set. For our mushrooms example, I found that two hidden layers, each with four nodes, was sufficiently efficient and accurate.
So, to sum up: we need a network with four layers. The first, the input layer, has 22 nodes; the second and third, both hidden layers, have four nodes each; and the final, output layer has one node. The last decision point here is what activation function to use for each layer. Activation functions tell our network how to evaluate each node beyond the input layer (nodes in the input layer are just the values we give our network, so they don’t need an activation function).
For this network, I used ReLU, or rectified linear unit, to activate the hidden layers. ReLU simply takes the value of each node in the hidden layer and sets it to zero if it’s negative; otherwise, it returns the value itself. ReLU is useful for two reasons: zeroing out negative values leaves fewer active nodes to process, increasing efficiency, and it makes the model nonlinear, allowing our neural network to better model curves. Moreover, the function ReLU computes, f(x) = max(0, x), is very easy for Python to evaluate and takes little processing power, making it very efficient.
I used the Sigmoid activation function for the output layer, which takes its value and normalizes it down to somewhere between zero and one. This allows our network to set a threshold for whether or not a mushroom is safe that it can be reasonably sure all the outputs will conform to. If that’s a bit abstract, think about it this way: without Sigmoid, we could have outputs all over the place and it would be impossible to make a statement like “all outputs below 0.5 (or any other number) constitute poisonous mushrooms,” because 0.5 would be such an arbitrary threshold when outputs could range from -789 to 1234. However, when outputs can only be between zero and one, no number will take our network by surprise, so a threshold is easy to create. Note that Sigmoid does its modeling with a fraction involving an exponential, 1/(1 + e^-x), so it is rather computationally taxing. Fortunately, we only have to use it for one node in our network.
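To make these two functions concrete, here is a minimal sketch of both in plain Python; TensorFlow applies its own optimized versions for us, so this is purely for illustration:
#Plain-Python sketches of the two activation functions
import math
def relu(x): #negative values become zero, everything else passes through
    return max(0, x)
def sigmoid(x): #squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-x))
print(relu(-3.2)) #0
print(relu(1.7)) #1.7
print(sigmoid(0)) #0.5, right at a natural threshold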
Let’s take a look at how all of this materializes as code with TensorFlow and Keras. Here I’ve imported Keras and renamed it tf_k for simplicity.
#Import keras, an API for tensorflow
from tensorflow import keras as tf_k

#Create a network with a sequential model
network = tf_k.Sequential()

#Add 2 hidden, dense layers to the network, each with four nodes and relu activation
network.add(tf_k.layers.Dense(4, activation = "relu", name = "hidden1"))
network.add(tf_k.layers.Dense(4, activation = "relu", name = "hidden2"))

#Add an output node with sigmoid activation
network.add(tf_k.layers.Dense(1, activation = "sigmoid", name = "output"))
With that, we’re able to construct the network described above. Using network.add, we add all the required dense layers to our sequential model, and we’re able to set the number of nodes with the first parameter (a hyper-parameter, because it’s a value that the coder sets rather than the network learning it). I’ve also set the activation function as a parameter of each layer, and named each layer for future reference, perhaps to visualize or debug.
You may have noticed that we did not add an input layer. We don’t need one, as we simply give the list input_training to TensorFlow and it feeds that directly into our first hidden layer.
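That said, if you would rather pin the input size down explicitly, Keras accepts an input_shape argument on the first layer you add; here is a hedged sketch of that alternative (the other parameters mirror the hidden1 layer above), which also lets network.summary() print the layers and their weight counts:
#Alternative: declare the 22-feature input shape on the first hidden layer
network.add(tf_k.layers.Dense(4, activation = "relu", input_shape = (22,), name = "hidden1"))
network.summary() #prints each layer and how many weights it holds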
Training the Network
The last step before our network becomes operational is training it. To do that, we have to put it together and fit it to our data with the following two lines:
#Put the network together
network.compile(optimizer = "adam", metrics = ["accuracy"], loss = tf_k.losses.binary_crossentropy)

#Train the network with our training data and 20 epochs through the data
network.fit(input_training, output_training, epochs = 20, verbose = 2)
The first of these lines brings all of our layers together into a unified model that will use TensorFlow’s built-in optimizer adam, a refinement of gradient descent, the procedure that updates our network’s weights. It also tells our network that we’re shooting for high accuracy by making that the metric we evaluate it under. Lastly, the network learns that it should use the binary_crossentropy loss function, which figures out how well our network is doing by measuring how far its outputs are from the correct binary labels. As indicated by its name, binary_crossentropy is used specifically for binary predictions like the difference between poisonous and edible mushrooms, perfectly suiting our purposes. As I explained previously, the network ‘learns’ by attempting to minimize this loss function.
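For the curious, here is a rough sketch of what binary_crossentropy computes for a single prediction; TensorFlow’s version is vectorized and numerically safer, so treat this as illustration only:
#A sketch of binary cross-entropy for one prediction
import math
def binary_crossentropy(y, p): #y is the true label (0 or 1), p is the network's sigmoid output
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))
print(binary_crossentropy(1, 0.99)) #small loss: confident and correct
print(binary_crossentropy(1, 0.01)) #large loss: confident and wrong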
The second line actually trains the network by fitting its weights to minimize the loss function when looking at the data sets input_training and output_training. The parameter epochs allows us to dictate how many times the network goes through the data and refines its weights. Again, we have to choose between efficiency and accuracy; I found that 20 passes over the data struck a good balance for this particular network. The verbose parameter offers a few different options for how the network shows you its learning. I like verbose = 2’s method of display, though another setting could work better for a different reader.
Adding a Testing Step
The final line of code I needed was one that used the testing data to make sure the network could work for the general case.
#Test the network with the testing data
network.evaluate(input_testing, output_testing, verbose = 2)
This is a cakewalk compared to everything else. All we have to do is tell the network which data to use to evaluate itself, hence the parameters input_testing and output_testing, and how we want to see the results. Here again, I use verbose = 2 as my preferred method of visualization.
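Once the network passes this test, you can put it to work on individual mushrooms. Here is a hypothetical example (the row index and the 0.5 threshold are my own illustration) that classifies a single mushroom from the test set:
#Classify one mushroom from the testing data
import numpy as np
sample = np.array([input_testing[0]]) #predict expects a batch, so we wrap one row in an array
probability = network.predict(sample)[0][0] #a sigmoid output between zero and one
print("edible" if probability > 0.5 else "poisonous")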
All right. We’ve put in the hard work and built our neural network. Time to take it out for a spin. After 20 epochs of training and a look at the testing data, we get the following results from our program. I’ve displayed results from just epochs 1, 10, and 20 and the testing phase to avoid a tedious amount of program output in this article.
Epoch 1: loss: 2.6525 - accuracy: 0.6885
Epoch 10: loss: 0.0562 - accuracy: 0.9813
Epoch 20: loss: 0.0188 - accuracy: 0.9949
Testing: loss: 0.0152 - accuracy: 0.9923
Clearly, with each epoch that goes by, the network gets better and better. At first, it does only a little better than a random guess and certainly worse than a skilled human, reaching the correct conclusion about a mushroom only 69% of the time. However, we must trust the process: as training wears on, the loss function returns smaller and smaller values and the network’s accuracy approaches 100%.
Now, we can’t get too excited yet, as such high accuracy may be indicative of overfitting. If the network never gets anything wrong on the training data, it may simply have mastered that data to the point where it only works on that data; we have no idea whether that mastery will translate to any other mushrooms. Fortunately, the network seems to be really good at the general case as well, because it also got over 99% accuracy on the testing data. Interestingly, there does seem to have been a slight bit of overfitting, as the network tested a little worse than it did on the final training epoch. That said, it actually had less loss on the testing set, so perhaps this was merely a coincidence and not a result of the common pitfall.
Now, if you think machine learning is really cool but figure you’ll never encounter wild mushrooms, fear not! The technology outlined in this article can be applied to almost any field where there is a cause-effect relationship. For example, one could input a movie’s reviews and how many tickets it sold to try to figure out whether it will break even. All that needs to happen is changing the data set that is fed in, and perhaps tweaking the hyper-parameters. More thought-provoking prospects range from predicting credit card fraud, to diagnosing medical diseases, to cracking the mystery of protein folding just last month; the cliché that the possibilities are limitless rings true when it comes to neural networks.
- Neural networks use inputs, outputs, and hidden layers to make predictions about the real world.
- They can tackle almost any problem with a cause-effect relationship.
- One such application is predicting whether or not a mushroom is edible based on some of its properties.
- To build a neural network, one must:
  - Gather and translate data so the network can understand it.
  - Split that data into inputs and outputs, and into training and testing sets.
  - Build a network and decide on the number of hidden layers, how many nodes go in those layers, and what activation functions to use.
  - Train the network with the training data.
  - Test the network with the testing data to make sure it is not overfitting.
The source code for this article is in this GitHub repo.
If you want to learn more, there are loads of resources online; I would specifically recommend edX’s CS50 Introduction to AI course. If you want to follow up with me, have a conversation, or ask further questions, here is my contact info:
And please subscribe to my monthly newsletter!