2. NLP: LSTM & GRU
LSTM
LSTMs were developed to address a key drawback of vanilla RNNs: their inability to capture long-term dependencies.
LSTMs are a special type of RNN designed to learn such long-term dependencies.
The repeating module has a different structure from that of a plain RNN (as shown in the figure below).

- red circle : pointwise operation (e.g. vector addition)
- yellow rectangle : neural network layer
A simple RNN cell has only one main neural network layer (a single tanh layer).
An LSTM cell consists of 3 gates (forget, input, and output) that interact with a cell state.
Cell State :
- It is a vector that acts as the long-term memory of the network.
- It carries important information across many time steps in a sequence.
- It is updated / modified only through gates.
Gates of LSTM
1. Forget Gate [ $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ ]
- Decides what information is to be thrown away from the cell state.
2. Input Gate [ $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ ]
- Decides what new information will be stored in the cell state.
- It has 2 parts :
Input gate layer :
- A sigmoid layer (the input gate layer) decides which values will be updated.
New Memory / Candidate Layer :
- A tanh layer proposes a vector of new candidate values, $\tilde{C}_t$, to be added to the cell state.
- These 2 are combined to produce the update to the cell state.
3. Update the cell state [ $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ ]
- Updates the cell state from the old $C_{t-1}$ to the new $C_t$.
- Multiplying the old state by $f_t$ forgets the information that was decided to be forgotten earlier.
4. Output Gate [ $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ ]
- Decides which parts of the cell state are going to be the output.
- Done in 2 parts :
Deciding the parts going to the output
- Using a sigmoid layer.
Get the output
- Pass the cell state through a tanh layer (so values lie in the range [-1, 1]) and multiply by the sigmoid gate's output: $h_t = o_t * \tanh(C_t)$.
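To make the gate interactions concrete, below is a minimal single-step sketch in NumPy following the equations above. It is illustrative only: the weight shapes, the `sigmoid` helper, and the concatenation of $h_{t-1}$ with $x_t$ are assumptions of this sketch, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W_* has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

    f_t   = sigmoid(W_f @ concat + b_f)      # forget gate: what to drop from the cell state
    i_t   = sigmoid(W_i @ concat + b_i)      # input gate: which values to update
    c_hat = np.tanh(W_c @ concat + b_c)      # candidate values to add
    c_t   = f_t * c_prev + i_t * c_hat       # additive cell-state update
    o_t   = sigmoid(W_o @ concat + b_o)      # output gate: which parts of the state to expose
    h_t   = o_t * np.tanh(c_t)               # new hidden state, values in [-1, 1]

    return h_t, c_t
```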
How LSTMs Solve the Vanishing Gradient Problem
The cell state is the key to how the LSTM does this.
Additive updates instead of multiplicative
- avoids the repeated multiplicative squashing that causes vanishing gradients.
Controlled flow of gradients using the forget gate
- If $f_t = 1$ and $i_t = 0$, then $C_t = C_{t-1}$.
- Thus, the gradient can flow unchanged across many time steps.
- It provides a highway for gradients to flow.
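As a quick sanity check (this sketch only follows the direct path through the cell state and ignores the gates' own dependence on $C_{t-1}$): differentiating $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ along that path gives $\frac{\partial C_t}{\partial C_{t-1}} = f_t$, so over $k$ steps the factor is just the product of the forget gates. As long as those gates stay close to 1, this product does not shrink towards 0 the way the repeated tanh-derivative factors of a vanilla RNN do.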
Variations of LSTM (using peepholes)

- All the gates get a direct connection to the previous cell state.
- Since the cell state contains richer memory than the hidden state, this allows for more informed gating decisions.
Updated gate equations
- $f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$
- $i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$
- $o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$

GRU
- It is a simplified variation of LSTM.
- It has fewer parameters and a simpler architecture.

GRU removes the cell state and merges everything into a single hidden state.
Combines Input gate and Forget gate into a single Update Gate
- It decides how much of the past to keep vs. how much new info to add.
Removes the output gate, so the hidden state is output directly.
Adds a Reset Gate to control how much past information to forget when computing the new candidate.
- GRUs are faster to compute and have fewer parameters than LSTMs (see the sketch below).
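Here is a minimal single-step sketch of a GRU cell in NumPy, parallel to the LSTM sketch above; again the weight shapes and the `sigmoid` helper are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU time step; each W_* has shape (hidden, hidden + input)."""
    concat = np.concatenate([h_prev, x_t])

    z_t = sigmoid(W_z @ concat + b_z)         # update gate: keep past vs. add new (forget + input combined)
    r_t = sigmoid(W_r @ concat + b_r)         # reset gate: how much past to forget for the candidate
    h_hat = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_hat  # single hidden state, no separate cell state or output gate

    return h_t
```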
An implementation of an LSTM for a sentiment analysis task can be found here.
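The linked implementation is not reproduced here; as a rough sketch of what such a model typically looks like in PyTorch (the vocabulary size, embedding/hidden dimensions, and two-class output below are placeholder assumptions):

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Token ids -> embedding -> LSTM -> linear classifier over the last hidden state."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)         # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                   # logits: (batch, num_classes)

# Example: a batch of 4 sequences of length 20 over a toy vocabulary of 5000 tokens
model = SentimentLSTM(vocab_size=5000)
logits = model(torch.randint(0, 5000, (4, 20)))
```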