Why position even matters
A vanilla self-attention layer is permutation-equivariant: shuffle the input tokens and the output shuffles in exactly the same way. That's great for sets — but terrible for language, where "dog bites man" and "man bites dog" mean very different things.
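The equivariance claim can be checked numerically. The sketch below is a minimal single-head attention layer in NumPy with random, untrained weights; the sizes and the names `W_q`, `W_k`, `W_v` and `self_attention` are assumptions made only for illustration, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 4  # toy sizes, chosen only for illustration

# Random token embeddings and projection matrices (hypothetical, untrained)
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def self_attention(X):
    """Single-head scaled dot-product self-attention with no positional signal."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_model)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

perm = rng.permutation(n_tokens)
out_then_shuffle = self_attention(X)[perm]   # attend first, then shuffle rows
shuffle_then_out = self_attention(X[perm])   # shuffle rows first, then attend
equivariant = np.allclose(out_then_shuffle, shuffle_then_out)
print(equivariant)
```

Both orders give the same rows, so without extra positional information the layer has no way to tell one token ordering from another.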
Vaswani et al. (2017) solved this by adding a positional encoding vector to each token embedding before the first attention layer. Their choice — sinusoids — turns out to have a beautiful algebraic property that we'll derive from scratch.
Setup — the encoding formula
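For a model with hidden dimension $d_{\text{model}}$, the frequency index $k$ runs from $0$ to $d_{\text{model}}/2 - 1$. Define the angular frequency:
$$\omega_k = \frac{1}{10000^{2k/d_{\text{model}}}}$$
Then the positional encoding at position $p$ is:
$$\mathrm{PE}(p)_{2k} = \sin(\omega_k p), \qquad \mathrm{PE}(p)_{2k+1} = \cos(\omega_k p)$$
Each adjacent pair of dimensions $(2k, 2k+1)$ traces a unit circle in 2-D, rotating at angular speed $\omega_k$ as $p$ increases. Lower dimensions rotate fast (high frequency); higher dimensions rotate slowly (low frequency).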
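As a rough illustration of the unit-circle picture, here is a vectorized sketch of the encoding, assuming the standard frequencies 10000^(-2k/d_model) and the interleaved sin/cos layout. The function name `sinusoidal_pe` and the sizes are hypothetical choices for this example.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Return (PE, omega): the (seq_len, d_model) encoding and its frequencies."""
    k = np.arange(d_model // 2)                 # frequency index k
    omega = 10000.0 ** (-2 * k / d_model)       # angular frequency of pair k
    angles = np.arange(seq_len)[:, None] * omega[None, :]
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe, omega

pe, omega = sinusoidal_pe(seq_len=15, d_model=8)

# sin^2 + cos^2 for every (position, pair): each pair sits on a unit circle
pair_radii = pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2
print(np.allclose(pair_radii, 1.0))
print(omega)  # for d_model = 8 the frequencies are 1, 0.1, 0.01, 0.001
```

With `d_model = 8` the four frequencies are successive powers of ten, which makes the fast-to-slow ordering across dimension pairs easy to see.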
Computing the dot product
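Take two positions $p$ and $q$. Their dot product sums over all dimension pairs:
$$\mathrm{PE}(p) \cdot \mathrm{PE}(q) = \sum_{k=0}^{d_{\text{model}}/2 - 1} \big[ \sin(\omega_k p)\sin(\omega_k q) + \cos(\omega_k p)\cos(\omega_k q) \big]$$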
Applying the cosine subtraction identity
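Recall from trigonometry:
$$\cos(\alpha - \beta) = \cos\alpha\cos\beta + \sin\alpha\sin\beta$$
Set $\alpha = \omega_k p$ and $\beta = \omega_k q$. Then each summand collapses:
$$\sin(\omega_k p)\sin(\omega_k q) + \cos(\omega_k p)\cos(\omega_k q) = \cos\big(\omega_k (p - q)\big)$$
Substituting back into the sum gives the closed form:
$$\mathrm{PE}(p) \cdot \mathrm{PE}(q) = \sum_{k=0}^{d_{\text{model}}/2 - 1} \cos\big(\omega_k (p - q)\big)$$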
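A quick numerical sanity check of the closed form is sketched below: it compares the explicit dot product of two encodings against a sum of cosines of the position gap. It assumes the standard frequencies 10000^(-2k/d_model); the helper names `encode` and `closed_form` are made up for this example.

```python
import numpy as np

d_model = 8
k = np.arange(d_model // 2)
omega = 10000.0 ** (-2 * k / d_model)   # assumed standard frequencies

def encode(p):
    """Sinusoidal encoding of a single position p (interleaved sin/cos)."""
    v = np.empty(d_model)
    v[0::2] = np.sin(omega * p)
    v[1::2] = np.cos(omega * p)
    return v

def closed_form(p, q):
    """Sum over frequencies of cos(omega_k * (p - q))."""
    return np.cos(omega * (p - q)).sum()

# Largest disagreement between the two computations over a grid of positions
max_err = max(
    abs(encode(p) @ encode(q) - closed_form(p, q))
    for p in range(15)
    for q in range(15)
)
print(max_err < 1e-12)
```

The two computations agree to floating-point precision, and for `p == q` both equal `d_model / 2`, since every cosine term is then 1.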
Why this proves translation invariance
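The right-hand side depends only on the gap $p - q$, not on $p$ or $q$ individually. Shift both positions by any constant $c$ and the dot product is unchanged:
$$\mathrm{PE}(p + c) \cdot \mathrm{PE}(q + c) = \sum_{k=0}^{d_{\text{model}}/2 - 1} \cos\big(\omega_k ((p + c) - (q + c))\big) = \mathrm{PE}(p) \cdot \mathrm{PE}(q)$$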
Every pair at distance 3 gives exactly the same dot product, regardless of where in the sequence the pair sits.
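To make the distance-3 case concrete, here is a small worked example under a simplifying assumption: take the smallest possible model, $d_{\text{model}} = 2$, so there is a single frequency $\omega_0 = 1$ and the encoding of position $p$ is just $(\sin p, \cos p)$. The pairs $(0, 3)$ and $(5, 8)$ are arbitrary picks for illustration.

$$
\begin{aligned}
\mathrm{PE}(0) \cdot \mathrm{PE}(3) &= \sin 0 \sin 3 + \cos 0 \cos 3 = \cos(0 - 3) = \cos 3 \approx -0.990 \\
\mathrm{PE}(5) \cdot \mathrm{PE}(8) &= \sin 5 \sin 8 + \cos 5 \cos 8 = \cos(5 - 8) = \cos 3 \approx -0.990
\end{aligned}
$$

Both pairs sit three positions apart, so both reduce to the same cosine of the gap even though they start at different places in the sequence.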
Seeing it — the flat-line plot
The script below computes, for each distance $d$, the dot product of every pair $(i, i + d)$ and plots one line per distance against the starting index $i$. Every line should be flat, up to floating-point rounding.
```python
import numpy as np
import matplotlib.pyplot as plt

seq_len = 15
d_model = 8

def get_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) array of sinusoidal positional encodings."""
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i = 2k, so 1 / div_term is the angular frequency of pair k
            div_term = 10000 ** (i / d_model)
            PE[pos, i] = np.sin(pos / div_term)
            PE[pos, i + 1] = np.cos(pos / div_term)
    return PE

PE = get_positional_encoding(seq_len, d_model)

# For each selected distance d, collect dot products of ALL pairs (i, i+d)
selected_distances = list(range(1, seq_len, 2))

dot_products_dict = {}
for d in selected_distances:
    dot_products_dict[d] = [
        float(np.dot(PE[i], PE[i + d]))
        for i in range(seq_len - d)
    ]

for d, vals in dot_products_dict.items():
    print(f"\nDistance {d} ({len(vals)} pairs):")
    for i, v in enumerate(vals):
        print(f"  ({i}, {i + d}) -> {v:.4f}")

# Plot: one line per distance, x = starting index i, y = dot(PE[i], PE[i+d])
plt.figure(figsize=(9, 5))
for d, vals in dot_products_dict.items():
    xs = list(range(len(vals)))
    plt.plot(xs, vals, marker="o", label=f"distance {d}")

plt.xlabel("Starting index i (pair = (i, i+d))")
plt.ylabel("Dot product PE[i] · PE[i+d]")
plt.title(f"Pairwise dot products per distance (d_model={d_model}, seq_len={seq_len})")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
```
That flatness is not a numerical accident: it's the algebraic identity made visible. Each horizontal line corresponds to a fixed distance $d$.
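For readers who prefer a number to a picture, the same flatness can be checked without plotting. The sketch below assumes the same `seq_len = 15` and `d_model = 8` and rebuilds the encoding in vectorized form; it then looks at the Gram matrix of all pairwise dot products, whose `d`-th off-diagonal holds exactly the pairs at distance `d`. The variable names are illustrative only.

```python
import numpy as np

seq_len, d_model = 15, 8
pos = np.arange(seq_len)[:, None]
omega = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)

PE = np.empty((seq_len, d_model))
PE[:, 0::2] = np.sin(pos * omega)
PE[:, 1::2] = np.cos(pos * omega)

# gram[i, j] = PE[i] · PE[j]; the d-th off-diagonal collects all pairs (i, i+d)
gram = PE @ PE.T

# Peak-to-peak spread of each off-diagonal: ~0 means "the line is flat"
spreads = {
    d: float(np.ptp(np.diagonal(gram, offset=d)))
    for d in range(1, seq_len)
}
max_spread = max(spreads.values())
print(max_spread < 1e-12)
```

A matrix that is constant along each diagonal is a Toeplitz matrix, which is another way of stating that the dot product depends only on the gap between positions.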
Takeaway
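The sinusoidal positional encoding from Attention Is All You Need has an elegant closed-form inner product:
$$\mathrm{PE}(p) \cdot \mathrm{PE}(q) = \sum_{k=0}^{d_{\text{model}}/2 - 1} \cos\big(\omega_k (p - q)\big)$$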
This single equation explains three things at once:
- Translation invariance: relative distance, not absolute position, determines the inner product.
- Smoothness: each cosine term varies continuously with the gap, so nearby positions have similar encodings.
- No learned parameters: the frequencies are fixed by the closed form $\omega_k = 10000^{-2k/d_{\text{model}}}$.
In the next post we'll look at learned positional embeddings (as used in BERT) and see what you gain — and give up — when you replace the fixed sinusoids with a trainable lookup table.