Activation Functions

  • Activation functions introduce non-linearity into a neural network.
  • This non-linearity lets the network model complex relationships in data.

Why are activation functions needed?

[Figure: linearly separable data (left) vs. data that needs a non-linear decision boundary (right)]

  • The data on the left can be modelled with a linear function.
    • The data is linearly separable.
    • A linear activation (or no activation) is sufficient.
  • The data on the right can’t be modelled with a linear function.
    • A linear model fails here because it can only produce a straight-line decision boundary.
    • Therefore, a non-linearity is required to model the data.
  • Without activation functions, a deep network behaves like a single linear model, limiting its capability (see the sketch below).
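
As a minimal sketch (layer sizes and data are arbitrary, chosen only for illustration), two stacked nn.Linear layers with no activation in between collapse into a single affine map:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between.
stacked = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Collapse them analytically into one affine map: W = W2 @ W1, b = W2 @ b1 + b2.
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
W, b = W2 @ W1, W2 @ b1 + b2

x = torch.randn(5, 4)
print(torch.allclose(stacked(x), x @ W.T + b, atol=1e-5))  # True: stacking added no expressive power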

Types of Activation Functions

import torch
import torch.nn as nn

1. Linear Activation Function

\[f(x) = ax + b\]
  • Output is proportional to input
  • Doesn’t introduce non-linearity.
  • Rarely used.

PyTorch Implementation:

linear_activation = nn.Identity()
x = torch.tensor([1.0, 2.0, 3.0])
output = linear_activation(x)

2. Sigmoid Activation (σ)

\[f(x) = \frac{1}{1 + e^{-x}}\]
  • Output in range (0,1).
  • Used in binary classification problems.
  • Can be interpreted as probabilities (used in logistic regression).

  • Drawback
    • Vanishing gradient problem
      • For inputs with large magnitude, the gradient σ(x)(1 − σ(x)) approaches zero, which slows or stalls learning in earlier layers (see the gradient check below).

PyTorch Implementation:

sigmoid = nn.Sigmoid()
x = torch.tensor([-1.0, 0.0, 1.0])
output = sigmoid(x)
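
A small sketch of the vanishing-gradient behaviour (input values chosen only for illustration): the derivative σ(x)(1 − σ(x)) peaks at 0.25 and collapses towards zero as |x| grows.

# Gradients of sigmoid shrink rapidly for inputs with large magnitude.
x_vals = torch.tensor([0.0, 2.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x_vals).sum().backward()
print(x_vals.grad)  # roughly [0.25, 0.105, 0.0066, 0.000045]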

3. Hyperbolic Tangent (Tanh)

\[f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\]
  • Output in range (-1, 1).
  • Centered around 0, which helps with faster convergence (illustrated below).
  • Useful for hidden layers in deep networks.

  • Drawback
    • Vanishing gradient problem (less severe than with sigmoid)
    • Computationally more expensive (uses exponentials)

PyTorch Implementation:

tanh = nn.Tanh()
output = tanh(x)
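
A quick illustration of the zero-centered property (random zero-mean input, purely for demonstration): tanh outputs stay roughly zero-mean, while sigmoid shifts everything into (0, 1).

z = torch.randn(10_000)         # zero-mean input
print(torch.tanh(z).mean())     # close to 0
print(torch.sigmoid(z).mean())  # close to 0.5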

4. Rectified Linear Unit (ReLU)

\[f(x) = \max(0, x)\]
  • The most widely used activation function.
  • Output in range [0, ∞).
  • Mitigates the vanishing gradient problem (the gradient is 1 for all positive inputs, so it does not saturate).
  • Computationally efficient.
  • Sparse activation (many neurons output 0; see the sketch below).

  • Drawback
    • Dying ReLU problem: neurons whose inputs stay negative always output zero, receive zero gradient, and stop learning.
    • Not centered around zero.

PyTorch Implementation:

relu = nn.ReLU()
output = relu(x)
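
A small sketch of the sparsity and dying-ReLU behaviour (random pre-activations, chosen only for illustration):

# ReLU zeroes out every negative pre-activation, so many units are inactive.
# A unit whose pre-activation is negative for all inputs also gets zero
# gradient and stops learning ("dying ReLU").
pre_act = torch.randn(1000)
print((torch.relu(pre_act) == 0).float().mean())  # roughly 0.5 for zero-mean pre-activations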

5. Leaky ReLU

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases}\]
  • Output in range (-∞, ∞).
  • A modified ReLU that allows a small, nonzero slope for negative inputs.
  • Default α = 0.01.
  • Prevents the dying ReLU problem (see the gradient check below).

  • Drawback
    • The slope α is a hyperparameter that requires manual tuning.

PyTorch Implementation:

leakyrelu = nn.LeakyReLU(negative_slope=0.01)
output = leakyrelu(x)
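
A brief check (toy values) that the small negative slope keeps gradients nonzero where plain ReLU would return exactly 0:

# Gradient is alpha (0.01) for negative inputs and 1 for positive inputs.
neg = torch.tensor([-3.0, -1.0, 2.0], requires_grad=True)
nn.LeakyReLU(negative_slope=0.01)(neg).sum().backward()
print(neg.grad)  # tensor([0.0100, 0.0100, 1.0000])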

6. Parametric ReLU (PReLU)

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases}\]
  • Output in range (-∞, ∞).
  • Same equation as Leaky ReLU, but α is learned during training (see the check below).
  • The adaptive slope can improve performance.
  • Avoids the dying ReLU issue.

  • Drawback
    • Extra parameter α increases computation.

PyTorch Implementation:

prelu = nn.PReLU()
output = prelu(x)
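
A short check that α really is a learnable parameter (by default PReLU keeps a single shared α initialised to 0.25) and receives gradients like any other weight:

prelu = nn.PReLU()
print(list(prelu.parameters()))  # one learnable parameter: tensor([0.2500])
inp = torch.tensor([-2.0, 3.0])
prelu(inp).sum().backward()
print(prelu.weight.grad)         # d(output)/d(alpha) = sum of min(0, x): tensor([-2.])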

7. Exponential Linear Unit (ELU)

\[f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha (e^x - 1), & \text{if } x < 0 \end{cases}\]
  • Output in range (-α, ∞).
  • Smooths out the output for negative values, decaying towards -α instead of cutting off at 0 (see the comparison below).
  • Avoids the dying ReLU problem.
  • Helps with vanishing gradients.

  • Drawback
    • More computationally expensive.

PyTorch Implementation:

elu = nn.ELU(alpha=1.0)
output = elu(x)
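
A short comparison (toy values) of ELU's smooth negative branch against ReLU's hard cutoff at zero:

# For negative inputs ELU decays smoothly towards -alpha instead of outputting 0.
vals = torch.tensor([-5.0, -1.0, 0.0, 2.0])
print(nn.ELU(alpha=1.0)(vals))  # roughly [-0.9933, -0.6321, 0.0000, 2.0000]
print(torch.relu(vals))         # [0., 0., 0., 2.]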

8. Softmax

\[\sigma (x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}\]
  • Output in range (0, 1).
  • Used in the final layer of multi-class classification networks.
  • Converts logits (unnormalized scores) into probabilities that sum to 1.

  • Drawback
    • Can be overconfident in its predictions (sensitive to large logit values; see the check below).

PyTorch Implementation:

softmax = nn.Softmax(dim=1)  # apply softmax along dim 1, i.e. across each row
output = softmax(torch.tensor([[1.0, 2.0, 3.0]]))
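
A quick check (using the same logits as above) that the outputs form a valid probability distribution, and that scaling the logits up makes it sharply peaked:

logits = torch.tensor([[1.0, 2.0, 3.0]])
probs = nn.Softmax(dim=1)(logits)
print(probs)                           # tensor([[0.0900, 0.2447, 0.6652]])
print(probs.sum())                     # sums to 1
print(nn.Softmax(dim=1)(10 * logits))  # collapses to roughly [0, 0, 1]: overconfident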

Summary

  • Activation functions introduce non-linearity to neural networks.
  • ReLU is widely used in deep learning due to its efficiency.
  • Tanh is preferred over Sigmoid due to its zero-centered outputs.
  • Softmax is essential for multi-class classification tasks.

Choosing the right activation function is crucial for model performance and convergence stability.

[Figure: plot of some of the activation functions]