OK, so what is an activation function? If you don’t know, let me give you a bookish definition.

An activation function is a mathematical function used in a neural network that activates the neurons and introduces non-linearity by transforming the inputs. Activation functions are also known as **Transfer Functions**.

Now, we all know such definitions and apply activation functions in our daily deep learning problems. In this article, though, I will try to answer most of the doubts we have about activation functions with very minimal math and lots of intuition.

## Prerequisite

- You have some understanding of how a deep neural network works, maybe a simple feed-forward neural network.
- A basic idea of numerical optimization methods such as gradient descent.
- A basic idea of how backpropagation works.

## Contents

- What is the actual purpose of the activation function?
- Different types of activation functions.

## Activate Neuron

Let’s start with the phrase that activation functions **activate neurons**. So, what does it mean to activate a neuron? Simply put, an activation function controls whether a neuron gives some output or not. For example, we can think of a step function as an activation function.

You can see that if the value of ‘x’ is greater than zero, the activation function (step function) gives an output of ‘1’, otherwise ‘0’. It works much like a switch (ON & OFF). Based on the input value (here ‘x’), the activation function (step function) is either ‘firing’ the neuron (output=1) or ‘not firing’ it (output=0).
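To make the switch behaviour concrete, here is a minimal sketch of a step activation in plain Python. The function name `step` and the sample inputs are illustrative assumptions, not taken from any library.

```python
def step(x: float) -> int:
    """Step activation: the neuron 'fires' (1) only when x is greater than zero."""
    return 1 if x > 0 else 0


# Hypothetical inputs chosen only to show both outcomes.
inputs = [-2.0, 0.0, 0.5, 3.0]
outputs = [step(x) for x in inputs]
print(outputs)  # [0, 0, 1, 1]
```

Note that an input of exactly zero does not fire here, which matches the "greater than zero" rule described above.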

Imagine you are designing a binary classification model using the ‘step function’ as your activation function. It’s quite easy, isn’t it? If you get output=1, the input belongs to the class; otherwise it does not. So, basically, a step function is nothing but some sort of ‘max()’ function, to be specific a ‘hard max’ function: every output is either fully on or fully off. This ‘hard’ part is where the main drawback arises.

If you try to solve a multi-class classification problem, you might end up using multiple step functions, one per class. But sometimes more than one step function will give you output=1, so how would you decide which class is the correct one? Don’t you think life would be easier if each output came with some weight (probability) attached to it? Doing this with a step function (hard max function) is quite difficult, and that’s why we have the “soft-max” activation function, which takes a weighted, probabilistic approach. Now you understand why we (mostly) use “softmax” at the last layer in a classification task, and why the function is called “soft”max. Don’t you?
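The multi-class ambiguity can be sketched in a few lines of Python. The scores below are made-up values for three hypothetical classes, and `step` and `softmax` are small helper functions written for this illustration rather than calls from a deep learning library.

```python
import math


def step(x: float) -> int:
    """Hard decision: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three classes.
scores = [2.0, 1.0, -1.0]

hard = [step(s) for s in scores]
print(hard)  # [1, 1, 0] -> two classes 'fire', so the winner is ambiguous

soft = softmax(scores)
print([round(p, 3) for p in soft])  # approximately [0.705, 0.259, 0.035]

# With probabilities, picking the most likely class is straightforward.
best = max(range(len(soft)), key=soft.__getitem__)
print(best)  # 0
```

The step outputs only say which classes are "on", while the softmax outputs rank the classes against each other, which is what makes the final decision easy.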

## Reduction of compute complexity

For simplicity, consider designing a fully connected feed-forward neural network. The equation for a hidden layer can be written as below.