Kush Blogs

2. NLP: LSTM & GRU

LSTM

  • LSTMs were developed to address a key drawback of vanilla RNNs: their inability to learn long-term dependencies.

  • LSTMs are just a special type of RNN capable of learning long-term dependencies.

  • The repeating module has a different structure compared to that of a vanilla RNN (as shown in the figure below).

[Figure: the repeating module in an LSTM]
  • red circle : pointwise operation (e.g., vector addition)
  • yellow rectangle : neural network layer

  • A simple RNN cell has one main neural network layer (typically a single tanh layer).

  • An LSTM cell, by contrast, consists of 3 gates: forget, input, and output.

Cell State

  • It is a vector that acts as the long-term memory of the network.
  • It carries important information across many time steps in a sequence.
  • It is updated / modified only through gates.

Gates of LSTM

1. Forget Gate

  • Helps decide what information is to be thrown away from the cell state.
  • $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

2. Input Gate

  • Decides what new information is to be stored in the cell state.
  • There are 2 parts to it :
    • Input gate layer :

      • A sigmoid layer decides which values will be updated: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
    • New Memory / Candidate Layer :

      • Proposes a vector of new candidate values to be added to the cell state, using a tanh layer: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

    • These 2 are combined to produce the cell state update.

3. Update the Cell State

  • Updating the cell state from the old $C_{t-1}$ to the new $C_t$: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
  • Multiplying the old state by $f_t$ forgets the information we decided to discard earlier; adding $i_t * \tilde{C}_t$ writes in the new candidate values.

4. Output Gate

  • Decides which part of the cell state is going to be the output.
  • Done in 2 parts :
    • Deciding which parts go to the output

      • Using a sigmoid layer: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
    • Getting the output

      • Pass the cell state through a tanh layer (so values lie in [-1, 1]) and multiply by the gate output: $h_t = o_t * \tanh(C_t)$
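
Putting the four equations together, here is a minimal NumPy sketch of a single LSTM step (not from the original post; stacking all four layers into one weight matrix W is just an implementation convenience, and the toy dimensions are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_{t-1}, x_t]
    to the pre-activations of all four layers, stacked row-wise."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    n_h = h_prev.shape[0]
    pre = W @ z + b                                # all four layers at once

    f_t = sigmoid(pre[0 * n_h:1 * n_h])            # forget gate
    i_t = sigmoid(pre[1 * n_h:2 * n_h])            # input gate
    C_tilde = np.tanh(pre[2 * n_h:3 * n_h])        # candidate values
    o_t = sigmoid(pre[3 * n_h:4 * n_h])            # output gate

    C_t = f_t * C_prev + i_t * C_tilde             # additive cell state update
    h_t = o_t * np.tanh(C_t)                       # new hidden state / output
    return h_t, C_t

# Toy usage with random weights, unrolled over 5 time steps.
rng = np.random.default_rng(0)
n_in, n_h = 10, 8
W = 0.1 * rng.normal(size=(4 * n_h, n_h + n_in))
b = np.zeros(4 * n_h)
h, C = np.zeros(n_h), np.zeros(n_h)
for t in range(5):
    h, C = lstm_step(rng.normal(size=n_in), h, C, W, b)
print(h.shape, C.shape)                            # (8,) (8,)
```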

How LSTMs solve the Vanishing Gradient Problem

  • The Cell State is the key to the LSTM.

  • Additive updates instead of multiplicative

    • avoids the repeated multiplicative squashing that causes vanishing gradients.
  • Control Flow of gradients using Forget Gate

    • If $f_t = 1$ and $i_t = 0$, then $C_t = C_{t-1}$.
    • Thus, the gradient can flow unchanged across many time steps.
    • It provides a highway for gradients to flow.
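    • In equation form (the standard argument, treating the gate values as constants with respect to $C_{t-1}$): differentiating $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ gives $\frac{\partial C_t}{\partial C_{t-1}} \approx f_t$, so across many steps $\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} f_t$.
    • As long as the forget gates stay near 1, this product does not vanish, unlike the repeated Jacobian products of a vanilla RNN.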

Variations of LSTM (using peepholes)

[Figure: LSTM cell with peephole connections added to each gate]
  • All the gates get a direct connection to the previous cell state.
  • Since the cell state carries richer memory than the hidden state, this allows the gates to make more informed decisions.

Updated gate equations

  • $f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$
  • $i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$
  • $o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$

GRU

  • It is a simplified variant of the LSTM.
  • It has fewer parameters and a simpler architecture.
[Figure: GRU cell]
  • GRU removes the cell state and merges everything into a single hidden state.

  • Combines the Input gate and Forget gate into a single Update Gate $z_t$ (see the equations after this list)

    • It decides how much of the past to keep vs. how much new info to add.
  • Removes the output gate, so the hidden state is output directly.

  • Adds a new Reset Gate $r_t$ to control how much past information to use when computing the new candidate

  • GRUs are faster to compute and have fewer parameters.
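
For reference, the standard GRU equations (Cho et al., 2014; not spelled out in the text above) are:

  • Update gate : $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
  • Reset gate : $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
  • Candidate : $\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$
  • New hidden state : $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$

A quick way to check the parameter savings is to count parameters in PyTorch's built-in modules (a sketch; the sizes 100 and 128 are arbitrary choices, not from the original post):

```python
import torch.nn as nn

# Same input and hidden sizes for a fair comparison.
lstm = nn.LSTM(input_size=100, hidden_size=128)
gru = nn.GRU(input_size=100, hidden_size=128)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))  # 117760: four weight/bias sets, one per layer
print(count(gru))   # 88320:  three weight/bias sets
```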

An implementation of an LSTM for a sentiment analysis task can be found here