Kush Blogs

13. NLP: Positional Encodings

What is the need for Positional Encodings?

Positional Encodings are modifications added to the word embeddings, the inputs to the transformer. Transformers process all the words in a sentence simultaneously (in parallel), unlike RNNs or LSTMs. This parallelism makes them fast but also makes them permutation invariant.

Example : The dog chased the cat. Without Positional Encoding (PE), the word the at the start and the one in the middle are mathematically identical.

  • They have the same word embedding.
  • When processing the word dog, the model finds 2 identical the tokens, even though it should attend to the one at the start.
  • It cannot tell which the came before it; it just ends up with 2 identical copies of the concept the.

Languages depend heavily on word order to maintain the syntactic and semantic sense of a sentence.

  • Red House, Blue Car : We know that this describes a house that is red and a car that is blue.
  • But without order the meaning gets completely lost.
    • It may mean a red car and a blue house, or even blue red and house car (which makes no sense at all).

Thus, the ordering of words in a sentence is essential for it to be understandable.


How Self-Attention processes the text

Self-Attention calculates a relevance score between words using the dot product, which is commutative.

So take 2 sentences, A : Alice hits Bob and B : Bob hits Alice, with no positional encoding applied.

  • A : Alice (Query) looks at Bob (Key). The match score is calculated based purely on their semantic meaning.
  • B : Alice (Query) looks at Bob (Key).

Without positional encoding, the self-attention mechanism produces the exact same context vector for Alice in both sentences. The model thinks Alice is in the exact same situation in both cases. It cannot distinguish Alice as attacker from Alice as victim because that distinction is purely positional.
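This can be verified with a tiny numpy sketch. The random vectors here are toy stand-ins for real word embeddings, and the simplified attention omits the learned query/key/value projections:

```python
import numpy as np

def attention(X):
    """Simplified self-attention: dot-product scores, softmax, weighted sum."""
    scores = X @ X.T                                              # pairwise relevance
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ X                                            # context vectors

rng = np.random.default_rng(0)
alice, hits, bob = rng.normal(size=(3, 4))    # toy 4-dim embeddings

A = np.stack([alice, hits, bob])   # "Alice hits Bob"
B = np.stack([bob, hits, alice])   # "Bob hits Alice"

ctx_alice_A = attention(A)[0]      # Alice is token 0 in sentence A
ctx_alice_B = attention(B)[2]      # Alice is token 2 in sentence B
print(np.allclose(ctx_alice_A, ctx_alice_B))   # True: identical context vectors
```

Attacker or victim, Alice gets exactly the same context vector, because nothing in the computation depends on token order.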

This problem didn't exist in RNNs and LSTMs because they read a word, store it in their memory, and then read the next word. Transformers, in contrast, read all the words at the exact same time.

The problem is that the Transformer trades sequential processing (which is slow but naturally understands order) for parallel processing (which is fast but naturally blind to order).

Positional Encoding is the patch applied to fix this trade-off. It artificially forces the concept of "sequence" back into a system that was designed to ignore it.


Properties of Encoding Scheme

Now that it is established that some information about the position of a word in a sentence is necessary, how should this position be encoded into the word embeddings? There are a few desirable properties expected from the positional encoding scheme :

  1. Each position must have a unique encoding
  • It should remain consistent regardless of the sequence length.
  2. Linear relation between 2 encoded positions
  • The relationship between positions must be mathematically simple. If the encoding for position $t$ is known, then the encoding for position $t + k$ should be easy to compute.
  • This makes it easier for the model to learn positional patterns.
  • They must also be generated by a deterministic process, enabling the model to learn the encoding scheme efficiently.
  3. Able to generalize to sequences longer than the ones in the training data
  • The encoding scheme needs to be adaptable enough to handle unexpected input lengths.
  • It must also be extensible to multiple dimensions to handle multimodal data.

Reaching Sinusoidal Embeddings

Method-1 : Integer Labelling

Add the index number to the embeddings. But this fails as the length of the input sentence grows: large position values end up blowing up and distorting the weights and gradients.
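A quick sketch of why raw integer labels break down (toy embedding dimension and sequence length, chosen only for illustration):

```python
import numpy as np

emb_dim = 8
rng = np.random.default_rng(1)
word_emb = rng.normal(size=(500, emb_dim))   # typical embeddings are roughly N(0, 1)

positions = np.arange(500).reshape(-1, 1)    # integer position labels 0..499
encoded = word_emb + positions               # add the raw index to every dimension

print(encoded[499])   # ~499 everywhere: the position dwarfs the word signal
```

By position 499 the positional term is two orders of magnitude larger than the embedding values, so the word's meaning is effectively erased.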

Method-2 : Normalized Fractions

The next logical remedy is to divide the position by the total length of the sentence, so every value falls between 0 and 1. But this fails too, because the step size, the gap between adjacent words, changes with the length of the sentence. It may be 0.25 for a short sentence and 0.02 for a much longer one. This prevents the model from learning a consistent meaning for "next word".
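A minimal sketch of the inconsistency. The `normalized_positions` helper is hypothetical, just dividing each index by the last index:

```python
def normalized_positions(n_words):
    # each position index divided by the last index, so values span [0, 1]
    return [i / (n_words - 1) for i in range(n_words)]

short = normalized_positions(5)    # 5-word sentence
long_ = normalized_positions(51)   # 51-word sentence

# the same "one word forward" jump means different things in each sentence
print(short[1] - short[0])   # 0.25
print(long_[1] - long_[0])   # 0.02
```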

Method-3 : Binary Counter

The next method is to implement how a binary counter works. Lets encode the positions using 4-bit vectors :

  • Word 0 : 0 0 0 0
  • Word 1 : 0 0 0 1
  • Word 2 : 0 0 1 0
  • Word 3 : 0 0 1 1
  • Word 15 : 1 1 1 1

A pattern emerges when the columns are read from right to left.

  • LSB (bit 0) : toggles every step 0,1,0,1,...
  • High frequency
  • Second bit : toggles every 2 steps 0,0,1,1,0,0,1,1,...
  • Medium frequency
  • Third bit : toggles every 4 steps 0,0,0,0,1,1,1,1,...
  • Low frequency

Thus, it can be concluded that position isn't just a single number, it's a collection of frequencies. Lower dimensions oscillate rapidly, while higher dimensions oscillate slowly. This unique combination creates a fingerprint for every integer.

But this will fail too, as binary values are discrete and will not work with neural nets, which expect continuous floating-point numbers. Also, 3 (0011) and 4 (0100) are adjacent positions but far apart in vector space, whereas adjacent words should have similar encodings.
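The distance jump can be checked directly with a toy `binary_encoding` helper, assuming the 4-bit vectors from the table above:

```python
import numpy as np

def binary_encoding(pos, n_bits=4):
    # bits of `pos`, most significant first, as a float vector
    return np.array([(pos >> (n_bits - 1 - b)) & 1 for b in range(n_bits)], dtype=float)

p3, p4 = binary_encoding(3), binary_encoding(4)   # 0011 and 0100 differ in 3 bits

print(np.linalg.norm(p4 - p3))                                  # ~1.73 (sqrt(3))
print(np.linalg.norm(binary_encoding(1) - binary_encoding(0)))  # 1.0
```

Both pairs are one position apart, yet the 3→4 rollover is sqrt(3) times farther in vector space than 0→1, so distances in the encoding do not track distances in the sequence.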

Method-4 : Continuous Binary-Counter

To make it continuous, instead of bits flipping between 0 and 1, sine and cosine waves oscillating between -1 and 1 can be used. For an encoding vector of dimension $d$ :

  • Bit 0 : fastest, high-frequency wave
  • Bit $d-1$ : slowest, low-frequency wave

This solves all the problems :

  • Continuous : The values float smoothly between -1 and 1.
  • Differentiable : The network can backpropagate through it smoothly.
  • Distance : Adjacent positions have similar vectors.

Formulating Sinusoidal Embeddings

In the binary counter encoding scheme, it can be observed that for the $i$-th bit, the wavelength grows exponentially: bit $i$ completes a full cycle every $2^{i+1}$ steps. Here the wavelength means how many steps it takes for the bit pattern to repeat.

This is a geometric progression with base 2.

This geometric structure allows exponential growth of wavelengths, which covers a massive range of values efficiently. The dimension index directly corresponds to the wavelength $2^{i+1}$. For higher dimensions these wavelengths become astronomically large, so over any realistic sequence the wave is essentially just a straight line. Thus there is a serious need to compress this range, and this is why sinusoidal encodings are used: they preserve the geometric progression while keeping the wavelengths usable.

By now 2 facts are fixed :

  • We want the first dimension to wiggle fast. Let's stick with a base wavelength of roughly $2\pi$.
    • $2\pi \approx 6.28$. This means the value goes through a complete cycle in a span of about 6 positions.
    • This is the smallest smooth wave that clearly differentiates integer steps without fluctuating wildly.
  • We want the last dimension to wiggle slowly enough that it doesn't repeat within a sentence, so that each position has a unique encoding. For this a large constant, 10,000, is picked, giving a maximum wavelength of $10000 \cdot 2\pi$.
Aim

To give each position a unique encoding

Pair of sine and cosine waves

To identify a point/position uniquely, both sine and cosine waves are used. If a point moving around a circle is being tracked and only its sine value is known, the point could be on its way up or on its way down. This introduces ambiguity regarding the direction of movement.

Thus, the pair of sine and cosine helps to know exactly where the point is. This gives the model a complete description of the current state of that frequency.
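A small check of the ambiguity, using two hypothetical angles chosen to share a sine value:

```python
import math

theta_up, theta_down = math.radians(30), math.radians(150)   # two points on the circle

# sine alone cannot tell them apart...
print(math.isclose(math.sin(theta_up), math.sin(theta_down)))   # True

# ...but the (sin, cos) pair pins down a unique point
print(math.cos(theta_up), math.cos(theta_down))   # opposite signs: no ambiguity
```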

The model should easily be able to learn relative positions, and this relationship should be a linear transformation. If the positional encoding at position $t$ is defined as a vector $\vec{p}_t$, we want there to exist a matrix $M(k)$ that depends only on the jump distance $k$ such that :

$M(k) \cdot \vec{p}_t = \vec{p}_{t+k}$

Any wave can be written as $\sin(\omega t)$ or $\cos(\omega t)$, i.e., waves over positions $t$ with frequency $\omega$. So let,

$\vec{p}_t = \begin{bmatrix} \sin(\omega t) \\ \cos(\omega t) \end{bmatrix}$

Now the model must be able to calculate $\vec{p}_{t+k}$ from $\vec{p}_t$ using a linear transformation (a matrix multiplication).

Since, by the angle-addition identities,

$\sin(\omega(t + k)) = \sin(\omega t)\cos(\omega k) + \cos(\omega t)\sin(\omega k)$

$\cos(\omega(t + k)) = \cos(\omega t)\cos(\omega k) - \sin(\omega t)\sin(\omega k)$

the matrix $M(k)$ can be recreated perfectly as the rotation matrix :

$M(k) = \begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix}$

So it is established that for every frequency, a pair of cosine and sine is needed.
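The linear-transformation claim can be verified numerically: a rotation matrix built only from the jump distance maps the encoding at one position to the encoding at another. The frequency, position, and jump below are arbitrary choices for illustration:

```python
import numpy as np

omega, t, k = 0.3, 7.0, 5.0   # arbitrary frequency, position, and jump distance

p_t  = np.array([np.sin(omega * t),       np.cos(omega * t)])        # encoding at t
p_tk = np.array([np.sin(omega * (t + k)), np.cos(omega * (t + k))])  # encoding at t+k

# rotation matrix that depends only on k, not on t
M = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

print(np.allclose(M @ p_t, p_tk))   # True
```

Changing `t` leaves `M` untouched, which is exactly the "depends only on jump distance" property.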

To pack these pairs into a 512-dimensional vector, the Attention is All You Need paper used the formula :

$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Here $i$ is the index of the frequency, going from $0$ to $255$ (for a 512-dimensional vector).

Every frequency can be thought of as a separate clock hand spinning at a different speed. The embedding vector is essentially a collection of clocks, and for each clock both the sine and cosine values are needed.
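Putting this together, a minimal numpy implementation of the paper's scheme might look like this (the function name and the interleaved sin/cos layout are one common convention):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(100)
print(pe.shape)   # (100, 512)
# position 0: every sine term is 0, every cosine term is 1
print(pe[0, :4])  # [0. 1. 0. 1.]
```

In practice this matrix is simply added to the word embeddings before the first transformer layer.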


So now let's finalize the formulation of the positional encoding formula. The problem is to figure out the wavelengths of all the 254 slots between the start point ($2\pi$) and the end point ($10000 \cdot 2\pi$). And as discussed above, geometric growth/progression is used.

The standard formula for geometric growth, giving the value at step $i$ out of $N$ total steps, is :

$\lambda_i = \text{start} \cdot \left(\frac{\text{end}}{\text{start}}\right)^{i/N}$

Here, Total Steps is $N = d_{model}/2 = 256$, half the embedding size. Plugging in start $= 2\pi$ and end $= 10000 \cdot 2\pi$ gives :

$\lambda_i = 2\pi \cdot 10000^{i/256} = 2\pi \cdot 10000^{2i/d_{model}}$

which corresponds exactly to the frequency $\omega_i = \frac{1}{10000^{2i/d_{model}}}$ used in the paper's formula.
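The resulting wavelengths can be checked numerically to confirm they really form a geometric progression running from roughly $2\pi$ up to nearly $10000 \cdot 2\pi$ (assuming the standard 512-dimensional setup):

```python
import numpy as np

d_model = 512
i = np.arange(d_model // 2)                                  # frequency indices 0..255
wavelengths = 2 * np.pi * np.power(10000.0, 2 * i / d_model)

print(wavelengths[0])    # ~6.28: the fastest clock, wavelength 2*pi
print(wavelengths[-1])   # just under 10000 * 2*pi: the slowest clock

ratios = wavelengths[1:] / wavelengths[:-1]
print(np.allclose(ratios, ratios[0]))   # True: constant ratio, geometric progression
```

Each successive dimension's wavelength is a fixed multiple (10000^(1/256)) of the previous one, mirroring the base-2 progression of the binary counter but stretched smoothly across the chosen range.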