Activation Functions

  • Activation functions introduce non-linearity into a neural network.
  • This non-linearity lets the network model complex relationships in data.

Why are activation functions needed?

[Figure: linearly separable data (left) vs. data that needs a non-linear decision boundary (right)]

  • The data on the left can be modelled with a linear function.
    • The data is linearly separable.
    • A linear activation (or no activation) is sufficient.
  • The data on the right can’t be modelled with a linear function.
    • A linear model fails here because it can only produce a straight-line decision boundary.
    • Therefore, a non-linearity is required to model the data.
  • Without activation functions, a deep network behaves like a single linear model, limiting its capability (see the sketch below).
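
As a minimal sketch (layer sizes and data are arbitrary, chosen only for illustration), two stacked nn.Linear layers with no activation in between collapse into a single affine map:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between.
stacked = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Collapse them analytically into one affine map: W = W2 @ W1, b = W2 @ b1 + b2.
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
W, b = W2 @ W1, W2 @ b1 + b2

x = torch.randn(5, 4)
print(torch.allclose(stacked(x), x @ W.T + b, atol=1e-5))  # True: stacking added no expressive power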

Types of Activation Functions

import torch
import torch.nn as nn

1. Linear Activation Function

\[f(x) = ax + b\]
  • Output is proportional to input
  • Doesn’t introduce non-linearity.
  • Rarely used.

PyTorch Implementation:

linear_activation = nn.Identity()
x = torch.tensor([1.0, 2.0, 3.0])
output = linear_activation(x)

2. Sigmoid Activation (σ)

\[f(x) = \frac{1}{1 + e^{-x}}\]
  • Output in range (0,1).
  • Used in binary classification problems.
  • Can be interpreted as probabilities (used in logistic regression).

  • Drawback
    • Vanishing gradient problem
      • For inputs with large magnitude, the gradient σ(x)(1 − σ(x)) approaches zero, which slows or stalls learning in earlier layers (see the gradient check below).

PyTorch Implementation:

sigmoid = nn.Sigmoid()
x = torch.tensor([-1.0, 0.0, 1.0])
output = sigmoid(x)
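
A small sketch of the vanishing-gradient behaviour (input values chosen only for illustration): the derivative σ(x)(1 − σ(x)) peaks at 0.25 and collapses towards zero as |x| grows.

# Gradients of sigmoid shrink rapidly for inputs with large magnitude.
x_vals = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x_vals).sum().backward()
print(x_vals.grad)  # roughly [0.25, 0.105, 0.0066, 0.000045]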

3. Hyperbolic Tangent (Tanh)

\[f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\]
  • Output in range (-1, 1).
  • Centered around 0, which helps with faster convergence (illustrated below).
  • Useful for hidden layers in deep networks.

  • Drawback
    • Vanishing gradient problem (less severe than with sigmoid)
    • Computationally more expensive (uses exponentials)

PyTorch Implementation:

tanh = nn.Tanh()
output = tanh(x)
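
A quick illustration of the zero-centered property (random zero-mean input, purely for demonstration): tanh outputs stay roughly zero-mean, while sigmoid shifts everything into (0, 1).

z = torch.randn(10_000)         # zero-mean input
print(torch.tanh(z).mean())     # close to 0
print(torch.sigmoid(z).mean())  # close to 0.5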

4. Rectified Linear Unit (ReLU)

\[f(x) = \max(0, x)\]
  • The most widely used activation function.
  • Output in range [0, ∞).
  • Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs, so it does not saturate).
  • Computationally efficient.
  • Sparse activation (many neurons output 0; see the sketch below).

  • Drawback
    • Dying ReLU problem: neurons whose inputs stay negative always output zero, receive zero gradient, and stop learning.
    • Not centered around zero.

PyTorch Implementation:

relu = nn.ReLU()
output = relu(x)
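
A small sketch of the sparsity and dying-ReLU behaviour (random pre-activations, chosen only for illustration):

# ReLU zeroes out every negative pre-activation, so many units are inactive.
# A unit whose pre-activation is negative for all inputs also gets zero
# gradient and stops learning ("dying ReLU").
pre_act = torch.randn(1000)
print((torch.relu(pre_act) == 0).float().mean())  # roughly 0.5 for zero-mean pre-activations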

5. Leaky ReLU

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases}\]
  • Output in range (-∞, ∞).
  • A modified ReLU that allows a small, nonzero slope for negative inputs.
  • Default α = 0.01.
  • Prevents the dying ReLU problem (see the gradient check below).

  • Drawback
    • The slope α is a hyperparameter that requires manual tuning.

PyTorch Implementation:

leakyrelu = nn.LeakyReLU(negative_slope=0.01)
output = leakyrelu(x)
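
A brief check (toy values) that the small negative slope keeps gradients nonzero where plain ReLU would return exactly 0:

# Gradient is alpha (0.01) for negative inputs and 1 for positive inputs.
neg = torch.tensor([-3.0, -1.0, 2.0], requires_grad=True)
nn.LeakyReLU(negative_slope=0.01)(neg).sum().backward()
print(neg.grad)  # tensor([0.0100, 0.0100, 1.0000])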

6. Parametric ReLU (PReLU)

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases}\]
  • Output in range (-∞, ∞).
  • Same equation as Leaky ReLU, but α is learned during training (see the check below).
  • The adaptive slope can improve performance.
  • Avoids the dying ReLU issue.

  • Drawback
    • Extra parameter α increases computation.

PyTorch Implementation:

prelu = nn.PReLU()
output = prelu(x)
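
A short check that α really is a learnable parameter (by default PReLU keeps a single shared α initialised to 0.25) and receives gradients like any other weight:

prelu = nn.PReLU()
print(list(prelu.parameters()))  # one learnable parameter: tensor([0.2500])
inp = torch.tensor([-2.0, 3.0])
prelu(inp).sum().backward()
print(prelu.weight.grad)         # d(output)/d(alpha) = sum of min(0, x): tensor([-2.])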

7. Exponential Linear Unit (ELU)

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha (e^x - 1), & \text{if } x < 0 \end{cases}\]
  • Output in range (-α, ∞).
  • Smooths out the output for negative values, decaying towards -α instead of cutting off at 0 (see the comparison below).
  • Avoids the dying ReLU problem.
  • Helps with vanishing gradients.

  • Drawback
    • More computationally expensive.

PyTorch Implementation:

elu = nn.ELU(alpha=1.0)
output = elu(x)
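
A short comparison (toy values) of ELU's smooth negative branch against ReLU's hard cutoff at zero:

# For negative inputs ELU decays smoothly towards -alpha instead of outputting 0.
vals = torch.tensor([-5.0, -1.0, 0.0, 2.0])
print(nn.ELU(alpha=1.0)(vals))  # roughly [-0.9933, -0.6321, 0.0000, 2.0000]
print(torch.relu(vals))         # [0., 0., 0., 2.]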

8. Softmax

\[\sigma (x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}\]
  • Output in range (0, 1).
  • Used in the final layer of multi-class classification networks.
  • Converts logits (unnormalized scores) into probabilities that sum to 1.

  • Drawback
    • Can be overconfident in its predictions (sensitive to large logit values; see the check below).

PyTorch Implementation:

softmax = nn.Softmax(dim=1)  # apply softmax along dim 1, i.e. across each row
output = softmax(torch.tensor([[1.0, 2.0, 3.0]]))
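
A quick check (using the same logits as above) that the outputs form a valid probability distribution, and that scaling the logits up makes it sharply peaked:

logits = torch.tensor([[1.0, 2.0, 3.0]])
probs = nn.Softmax(dim=1)(logits)
print(probs)                           # tensor([[0.0900, 0.2447, 0.6652]])
print(probs.sum())                     # sums to 1
print(nn.Softmax(dim=1)(10 * logits))  # collapses to roughly [0, 0, 1]: overconfident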

Summary

  • Activation functions introduce non-linearity to neural networks.
  • ReLU is widely used in deep learning due to its efficiency.
  • Tanh is preferred over Sigmoid due to its zero-centered outputs.
  • Softmax is essential for multi-class classification tasks.

Choosing the right activation function is crucial for model performance and convergence stability.

[Figure: plot of some of the activation functions]