While a linear function approximates the right-hand dataset quite well, it performs miserably on the one at the left. Since nonlinearly distributed datasets dominate the realm of machine learning, and activation functions are the only suitable spot to inject nonlinearity into the network, there is no scope for the activation function to be linear. Some of the renowned functions that address this problem are:
Sigmoid function: This is a function that takes in a number and outputs a number in the range (0, 1). The smaller the input, the closer the output is to 0; the greater the input, the closer the output approaches 1, without ever touching either of the extremities.
tanh function: Also known as the hyperbolic tangent function, this is pretty similar to the sigmoid except that the output ranges from -1 to 1, again without ever touching either extreme.
The plot on the right is for the sigmoid and the other for tanh.
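For a concrete feel of both functions, here is a minimal NumPy sketch (the sample inputs are arbitrary, chosen only to show the squashing behaviour):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.round(sigmoid(xs), 4))  # [ 0.      0.2689  0.5     0.7311  1.    ]
print(np.round(np.tanh(xs), 4))  # [-1.     -0.7616  0.      0.7616  1.    ]
# The extreme entries only look like 0, 1 and -1 because of rounding;
# the functions never actually reach those values.
```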
Problem 3: Vanishing Gradient
Notice the graphs once again: for the sigmoid, the curve almost flattens while approaching either extremity. That is, for inputs of very large absolute value, even a big change in the input barely changes the output, which means the gradient there is nearly zero and the weight updates it drives become vanishingly small. This drastically slows down learning. It is known as the vanishing gradient problem, where the gradient (i.e. the learning signal) diminishes as the process progresses.
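A quick numerical check makes the flattening concrete. This sketch relies on the well-known derivative of the sigmoid, σ'(x) = σ(x)(1 − σ(x)); the sample inputs are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), largest at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"{x:5.1f} -> {sigmoid_grad(x):.7f}")
#   0.0 -> 0.2500000   <- healthy gradient near the origin
#   2.0 -> 0.1049936
#   5.0 -> 0.0066481
#  10.0 -> 0.0000454   <- almost no learning signal left
```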
This limitation is addressed by the ReLU activation function, the de facto default activation for modern neural networks.
ReLU function: Short for Rectified Linear Unit, this is probably the most interesting and counter-intuitive function of all. It is an almost linear function with a single knee at the origin. It is defined as the piece-wise function:
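ReLU(x) = max(0, x), that is, x for x > 0 and 0 otherwise.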
The question often asked is how a near-linear function can save the day. From the graph it's obvious that it can almost completely avoid the vanishing gradient problem: the slope is a constant 1 for every positive input, so the gradient never flattens out there. Single ReLUs are flat and boring, but an army of them can be unbeatable. Have a look at how it approaches the Annulus Ring problem:
Linear utterly fails to approximate, tanh attempts it and draws a smooth circle-like curve, and finally ReLU comes in with a hexagon. You might think of it like this: take two pieces of straight line joined at one end, like the hands of a clock. Now imagine having a million such lines. You can replicate almost any nonlinear shape with that arrangement. The beauty of ReLU is that it is near-linear, yet a bunch of them can be moulded into almost any shape by twisting each knee angle to perfection.
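To see this moulding in action, here is a toy NumPy sketch that imitates the smooth curve y = x² on [-1, 1] with a handful of shifted ReLUs. Everything here, from the knee placement to the weights, is hand-picked for illustration rather than learned by a network:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def target(x):
    # The smooth curve we want to imitate.
    return x ** 2

# Knee positions: each ReLU adds one new bend from its knee onward.
knots = np.linspace(-1.0, 1.0, 9)

# Chord slopes between consecutive knots; each ReLU's weight is the
# change in slope at its knee.
slopes = np.diff(target(knots)) / np.diff(knots)
bends = np.diff(slopes, prepend=0.0)

def relu_sum(x):
    # f(x) = target(t0) + sum_i bends[i] * relu(x - knots[i])
    x = np.asarray(x, dtype=float)[..., None]
    return target(knots[0]) + (bends * relu(x - knots[:-1])).sum(axis=-1)

xs = np.linspace(-1.0, 1.0, 5)
print(np.round(relu_sum(xs), 3))  # [1.   0.25 0.   0.25 1.  ]
print(np.round(target(xs), 3))    # identical at the knots
```

With just eight knees the jagged sum already matches the parabola at every knot; more knees would smooth out the segments in between.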
Finally, that brings us to the last problem to be addressed.
Problem 4: Probabilistic Sum
Whatever is done within the hidden layers, the ultimate goal at the end of the day is to generate some sort of prediction, and the style here is to output a likelihood vector containing the probability of the input object being each of the labelled ones. For example, the result of the neural network in the first picture may look like:
[dog, cow, cat, goat, lamb] = [0.2, 0.05, 0.7, 0.04, 0.01]
The result will most likely be cat since it’s got the highest score of 0.7.
To convert the final output of a fully connected layer into a normalized vector of probabilities, a function is required that takes in a vector and spits out a vector too, with the one constraint that the elements of the output vector sum to 1. After all, that's what makes it a probability vector.
The sigmoid function closely resembles a probability function since its outputs range from 0 to 1; however, it fails to satisfy the condition of the sum being equal to one.
The special function that achieves this is the Softmax function.
Softmax function: This function takes in a vector and, for each element, calculates the corresponding entry in the output vector using the following function:
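softmax(x)_i = exp(x_i) / Σ_j exp(x_j)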
The denominator, being the sum of all possible numerators, prevents the outputs from exceeding the value of 1 and guarantees that they sum to exactly 1.
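To wrap up, here is a minimal NumPy sketch of softmax, checked against the cat example from earlier. The logits are made-up numbers chosen so that the rounded output reproduces that vector, and the max-subtraction inside is a standard trick for numerical stability:

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves the result unchanged (the shift cancels
    # between numerator and denominator) but avoids overflow in exp().
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Made-up raw scores (logits) from the final fully connected layer.
logits = np.array([1.39, 0.0, 2.64, -0.22, -1.61])
probs = softmax(logits)
print(np.round(probs, 2))  # [0.2  0.05 0.7  0.04 0.01]
print(probs.sum())         # 1.0 (up to floating-point error)
```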