Kush Blogs

2. NLP: LSTM & GRU

LSTM

  • LSTMs were developed to address a key drawback of vanilla RNNs: their inability to learn long-term dependencies.

  • LSTMs are just a special type of RNN capable of learning long-term dependencies.

  • The repeating module has a different structure compared to that of a vanilla RNN (as shown in the figure below).

[Figure: the repeating module in an LSTM]
  • red circle : pointwise operation (e.g., vector addition)
  • yellow rectangle : neural network layer

  • A simple RNN cell has one main neural network layer (typically a single tanh layer).

  • An LSTM cell, by contrast, consists of 3 gates: forget, input, and output.

Cell State

  • It is a vector that acts as the long-term memory of the network.
  • It carries important information across many time steps in a sequence.
  • It is updated / modified only through gates.

Gates of LSTM

1. Forget Gate

  • Helps decide what information is to be thrown away from the cell state.
  • $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

2. Input Gate

  • Decides what new information is to be stored in the cell state.
  • There are 2 parts to it :
    • Input gate layer :

      • A sigmoid layer decides which values will be updated: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
    • New Memory / Candidate Layer :

      • Proposes a vector of new candidate values to be added to the cell state, using a tanh layer: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

    • These 2 are combined to produce the cell state update.

3. Update the Cell State

  • Updating the cell state from the old $C_{t-1}$ to the new $C_t$: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
  • Multiplying the old state by $f_t$ forgets the information we decided to discard earlier; adding $i_t * \tilde{C}_t$ writes in the new candidate values.

4. Output Gate

  • Decides which part of the cell state is going to be the output.
  • Done in 2 parts :
    • Deciding which parts go to the output

      • Using a sigmoid layer: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
    • Getting the output

      • Pass the cell state through a tanh layer (so values lie in [-1, 1]) and multiply by the gate output: $h_t = o_t * \tanh(C_t)$
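
Putting the four equations together, here is a minimal NumPy sketch of a single LSTM step (not from the original post; stacking all four layers into one weight matrix W is just an implementation convenience, and the toy dimensions are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W maps the concatenated [h_{t-1}, x_t]
    to the pre-activations of all four layers, stacked row-wise."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    n_h = h_prev.shape[0]
    pre = W @ z + b                                # all four layers at once

    f_t = sigmoid(pre[0 * n_h:1 * n_h])            # forget gate
    i_t = sigmoid(pre[1 * n_h:2 * n_h])            # input gate
    C_tilde = np.tanh(pre[2 * n_h:3 * n_h])        # candidate values
    o_t = sigmoid(pre[3 * n_h:4 * n_h])            # output gate

    C_t = f_t * C_prev + i_t * C_tilde             # additive cell state update
    h_t = o_t * np.tanh(C_t)                       # new hidden state / output
    return h_t, C_t

# Toy usage with random weights, unrolled over 5 time steps.
rng = np.random.default_rng(0)
n_in, n_h = 10, 8
W = 0.1 * rng.normal(size=(4 * n_h, n_h + n_in))
b = np.zeros(4 * n_h)
h, C = np.zeros(n_h), np.zeros(n_h)
for t in range(5):
    h, C = lstm_step(rng.normal(size=n_in), h, C, W, b)
print(h.shape, C.shape)                            # (8,) (8,)
```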

How LSTMs solve the Vanishing Gradient Problem

  • The Cell State is the key to the LSTM.

  • Additive updates instead of multiplicative

    • avoids the repeated multiplicative squashing that causes vanishing gradients.
  • Control Flow of gradients using Forget Gate

    • If $f_t = 1$ and $i_t = 0$, then $C_t = C_{t-1}$.
    • Thus, the gradient can flow unchanged across many time steps.
    • It provides a highway for gradients to flow.
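    • In equation form (the standard argument, treating the gate values as constants with respect to $C_{t-1}$): differentiating $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ gives $\frac{\partial C_t}{\partial C_{t-1}} \approx f_t$, so across many steps $\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} f_t$.
    • As long as the forget gates stay near 1, this product does not vanish, unlike the repeated Jacobian products of a vanilla RNN.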

Variations of LSTM (using peepholes)

[Figure: LSTM cell with peephole connections added to each gate]
  • All the gates get a direct connection to the previous cell state.
  • Since the cell state carries richer memory than the hidden state, this allows the gates to make more informed decisions.

Updated gate equations

  • $f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$
  • $i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)$
  • $o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)$

GRU

  • It is a simplified variant of the LSTM.
  • It has fewer parameters and a simpler architecture.
[Figure: GRU cell]
  • GRU removes the cell state and merges everything into a single hidden state.

  • Combines the Input gate and Forget gate into a single Update Gate $z_t$ (see the equations after this list)

    • It decides how much of the past to keep vs. how much new info to add.
  • Removes the output gate, so the hidden state is output directly.

  • Adds a new Reset Gate $r_t$ to control how much past information to use when computing the new candidate

  • GRUs are faster to compute and have fewer parameters.
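
For reference, the standard GRU equations (Cho et al., 2014; not spelled out in the text above) are:

  • Update gate : $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
  • Reset gate : $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
  • Candidate : $\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$
  • New hidden state : $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$

A quick way to check the parameter savings is to count parameters in PyTorch's built-in modules (a sketch; the sizes 100 and 128 are arbitrary choices, not from the original post):

```python
import torch.nn as nn

# Same input and hidden sizes for a fair comparison.
lstm = nn.LSTM(input_size=100, hidden_size=128)
gru = nn.GRU(input_size=100, hidden_size=128)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm))  # 117760: four weight/bias sets, one per layer
print(count(gru))   # 88320:  three weight/bias sets
```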

An implementation of an LSTM for a sentiment analysis task can be found here