Whether you are just knocking on the door of deep learning or are a seasoned practitioner, ReLU is as commonplace as air. Air is essential for our survival, but are ReLUs that essential for deep learning networks?
If yes, then ‘Why?’ is the first question that comes to mind, because there is a plethora of activations out there to choose from. Ever since Nair and Hinton popularized it for deep networks in 2010, ReLU has consistently topped the usage charts.
Fasten your seatbelts as we explore the reasons for ReLU’s effectiveness, with ideas ranging from derivatives to topology. But before we start, a brief introduction to ReLU.
ReLU has the simplest-looking equation you will find in deep learning: f(x) = max(0, x).
And if you plot it, the graph is just as simple: flat at zero for negative inputs and a straight line of slope 1 for positive inputs.
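As a minimal sketch (assuming NumPy; the function name `relu` is our own choice for illustration), the definition f(x) = max(0, x) comes down to a single element-wise call:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: keeps positive values, replaces the rest with 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x).tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```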
The next time somebody asks you about ReLU in an interview or in discussion, keep any of these answers handy and you will glide through!
The Easy Answer
It is computationally simple, which makes training faster.
You have seen the equation above: it needs just a single comparison per input, so its calculation is trivial. Activations like tanh or sigmoid, on the other hand, require computing an exponential, which makes them considerably more expensive.
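If you want to check this on your own machine, here is a rough sketch using NumPy and the standard-library `timeit` module. The array size, the repeat count, and the helper `time_ms` are arbitrary choices made for this illustration, and the gap you observe will depend on your hardware and NumPy build.

```python
import timeit
import numpy as np

x = np.random.default_rng(0).standard_normal(1_000_000)

def relu(x):
    return np.maximum(0, x)            # one comparison per element

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # exponential, addition, and division

def time_ms(fn, repeats=10):
    """Average wall-clock time of fn(x) in milliseconds."""
    return 1000 * timeit.timeit(lambda: fn(x), number=repeats) / repeats

for fn in (relu, sigmoid, np.tanh):
    print(f"{fn.__name__:>8}: {time_ms(fn):.2f} ms")
```

Keep in mind that in a full network the matrix multiplications usually dominate the cost, so the activation is only one part of the picture.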
The other consideration for deep learning is the time it takes to train the networks. Fewer computations alleviate the anxiety and anticipation one feels while training them, phew! On a serious note, a cheaper activation helps speed up the search for the best model or the best set of hyperparameters for your network.
One can also observe that ReLU follows Occam’s Razor: it is far simpler to compute than other activations such as sigmoid or tanh.
The Better Answer
ReLU’s non-saturating gradients mitigate the vanishing-gradient problem.
The first question that comes to mind is: what are non-saturating gradients?
Above is the graph of a sigmoid activation. After a few epochs, the outputs of many sigmoid units end up near the edges of its range (close to 0 and 1; for tanh, close to -1 and 1). In these flat, saturated regions the sigmoid barely changes with its input, which leads to very small gradients.
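To put numbers on saturation, here is a small sketch (assuming NumPy; the sample inputs are arbitrary) that evaluates the sigmoid and its derivative, sigmoid(x) * (1 - sigmoid(x)), at a few points. The derivative peaks at 0.25 when x = 0 and has already dropped below 0.01 by x = 5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid with respect to x

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.5f}  gradient = {sigmoid_grad(x):.5f}")
```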
These small gradients are what cause the vanishing-gradient problem. During backpropagation, the error signal is multiplied by one such gradient at every layer it passes through, so it shrinks at each step until, in a deep network, it has effectively vanished by the time it reaches the early layers.
ReLU largely avoids this problem because its gradient is exactly 1 for every positive input (and 0 for negative inputs). Through the active units, the error is propagated backward as it is, so it does not shrink layer after layer.
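Here is a deliberately simplified sketch of that effect, assuming NumPy. It ignores the weights entirely and assumes a made-up pre-activation of 1.0 at each of 20 layers, so it only shows how the local derivatives multiply up during backpropagation.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for positive inputs, 0 otherwise

depth = 20
z = np.full(depth, 1.0)   # hypothetical pre-activation of 1.0 at every layer

# Backpropagation multiplies one local derivative per layer (weights ignored here).
print("sigmoid:", np.prod(sigmoid_grad(z)))
print("relu:   ", np.prod(relu_grad(z)))
```

The sigmoid product collapses to a vanishingly small number, while the ReLU product stays at 1. Note that a ReLU unit with a negative pre-activation would contribute a 0 instead, so the signal only survives along paths of active units.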
Bonus Point: ReLU leads to sparse representations of data
Due to the hard threshold at 0 in ReLU’s equation, many of the neurons output exactly zero for any given input, resulting in a sparse representation.
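You can see this sparsity with a quick experiment. The sketch below assumes NumPy and uses random, zero-centered pre-activations as a stand-in for a real layer (the shapes are made up, and a trained network will not follow this distribution exactly). Roughly half of the ReLU outputs come out as exact zeros, while the sigmoid produces none.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pre-activations: 1,000 inputs passing through a layer of 256 units.
z = rng.standard_normal((1000, 256))

relu_out = np.maximum(0, z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

def zero_fraction(a):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(a == 0))

print("ReLU zeros:   ", zero_fraction(relu_out))
print("sigmoid zeros:", zero_fraction(sigmoid_out))
```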
A sparse representation has several advantages over a dense one.
In a dense representation, a change in the input changes almost the whole representation. A sparse representation, on the other hand, is more robust to changes in the input: any given change affects only a part of the representation and not the whole.
A sparse representation captures the most important correlations in the data, while a dense representation can also absorb insignificant correlations, for example those due to noise. This can also be interpreted as the sparse representation having a higher signal-to-noise ratio.
A sparse representation is also computationally efficient, since multiplications by 0 can be skipped.
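To illustrate the last point, the sketch below (assuming NumPy; the layer sizes and the weight matrix `W` are made up for this example) computes the next layer's input twice: once with every activation, and once using only the non-zero ones. Both give the same result, but the second touches only the columns that matter.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.maximum(0, rng.standard_normal(512))   # sparse ReLU activations
W = rng.standard_normal((128, 512))           # hypothetical next-layer weights

dense = W @ h                                 # multiplies by every entry, zeros included

active = np.flatnonzero(h)                    # indices of the non-zero activations
sparse = W[:, active] @ h[active]             # uses only the columns that matter

print(len(active), "of", h.size, "activations are non-zero")
print("same result:", np.allclose(dense, sparse))
```

Whether this turns into a real speed-up depends on your hardware and libraries; standard dense kernels generally do not exploit the zeros on their own.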
Finally, the advantage of a sparse representation can also be justified by Occam’s Razor.
The Best Answer
ReLU can easily change the topology of data
The best answer involves topology, that mystical branch of mathematics.
Simply put, topology is the idea that data has shape.
There is a stark difference between geometry and topology. Geometric transformations never change the shape of data (think about reflection, rotation, translation, and dilation). A change in topology, on the other hand, does: pieces of the data can be merged or collapsed together.
This great paper reveals the inner workings of ReLU.
Above is an image depicting a dataset with two classes (red and green) and how its shape changes as a ReLU neural network acts upon it. As you can see, the red class sits inside the green class. No amount of geometric transformation can bring the red out of the green, because any geometric transformation acts globally.
What we want is for the different classes to be mapped to different, distinctly separate locations. That is exactly what ReLU is doing above.
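To get a feel for the idea, here is a toy one-dimensional sketch, assuming NumPy. It is not the experiment from the paper: the dataset and the helper `fold` are made up for illustration. Red points sit between two separate pieces of green. Two ReLU units, relu(x) and relu(-x), fold the line at zero so that the two green pieces land on top of each other and a single threshold separates the classes. Merging two disconnected pieces into one is something no reflection, rotation, translation, or dilation can do.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Toy 1-D dataset (made up for illustration): red sits between two green pieces.
red = np.linspace(-1.0, 1.0, 50)
green = np.concatenate([np.linspace(-3.0, -2.0, 25), np.linspace(2.0, 3.0, 25)])

def fold(x):
    """Two ReLU units, relu(x) and relu(-x), summed: this equals |x|."""
    return relu(x) + relu(-x)

# Before: green lies on both sides of red, so no single threshold on x separates them.
print("green on both sides of red:", green.min() < red.min() and green.max() > red.max())

# After: the two green pieces are folded onto one interval, clear of red.
print("red range after fold:  ", fold(red).min(), fold(red).max())
print("green range after fold:", fold(green).min(), fold(green).max())
print("separable by one threshold:", fold(red).max() < fold(green).min())
```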
The paper also provides a convincing reason for the unreasonable effectiveness of ReLU over smooth sigmoidal activations: ReLU is able to change the topology of data much more effectively than they can.
ReLU remains one of the most popular choices among activation functions. We covered its computationally inexpensive nature, its non-saturating gradients and how they mitigate the vanishing-gradient problem, how it leads to sparse representations, and finally, how its actual power lies in its ability to change the topology of the data.