nirmata.dev — Build. Think. Go Deep.

Why position even matters

A vanilla self-attention layer is permutation-equivariant: shuffle the input tokens and the output shuffles in exactly the same way. That's great for sets — but terrible for language, where "dog bites man" and "man bites dog" mean very different things.
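To make the equivariance claim concrete, here is a minimal NumPy sketch of a single attention head with no positional signal. The function name `self_attention`, the random weights, and the sizes are illustrative assumptions rather than anything from the original paper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention with no positional signal."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d = 5, 4                       # illustrative sizes (assumption)
X = rng.normal(size=(n_tokens, d))       # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_tokens)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the input rows should shuffle the output rows in the same way.
equivariant = np.allclose(out_perm, out[perm])
print(equivariant)
```

Nothing in the layer can tell which row came first, which is exactly the gap a positional encoding has to fill.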

Vaswani et al. (2017) solved this by adding a positional encoding vector to each token embedding before the first attention layer. Their choice — sinusoids — turns out to have a beautiful algebraic property that we'll derive from scratch.

Setup — the encoding formula

For a model with hidden dimension $d_{\text{model}}$, the frequency index $k$ runs from $0$ to $\lfloor d_{\text{model}}/2 \rfloor - 1$. Define the angular frequency:

$$\omega_k = \frac{1}{10000^{\,2k/d_{\text{model}}}}$$

Then the positional encoding at position $p$ is:

$$\begin{aligned} \mathrm{PE}[p,\;2k] &= \sin\!\left(p\,\omega_k\right) \\[4pt] \mathrm{PE}[p,\;2k+1] &= \cos\!\left(p\,\omega_k\right) \end{aligned}$$

Each adjacent pair of dimensions forms a unit circle in 2-D, rotating at angular speed $\omega_k$ as $p$ increases. Lower dimensions rotate fast (high frequency); higher dimensions rotate slowly (low frequency).
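As a quick sanity check on the unit-circle picture, the sketch below builds the encoding in vectorized form and confirms that every (sin, cos) pair has radius 1. The helper name `sinusoidal_pe` and the sizes are assumptions for illustration, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Vectorized form of the encoding formula; assumes d_model is even."""
    k = np.arange(d_model // 2)
    omega = 1.0 / 10000 ** (2 * k / d_model)                 # angular frequencies
    angles = np.arange(seq_len)[:, None] * omega[None, :]    # shape (seq_len, d_model // 2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)

# Each (sin, cos) pair lies on the unit circle ...
pair_radii = np.hypot(pe[:, 0::2], pe[:, 1::2])
# ... so every encoding vector has squared norm d_model / 2.
sq_norms = (pe ** 2).sum(axis=1)

print(np.allclose(pair_radii, 1.0), np.allclose(sq_norms, 8.0))
```

A side effect worth noticing: every position vector has the same length, so positions differ only in direction.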

Intuition check: At $k = 0$, $\omega_0 = 1$, so the encoding completes a full revolution every $2\pi \approx 6.28$ positions. At $k = d_{\text{model}}/2 - 1$ and for large $d_{\text{model}}$, $\omega_k$ is close to $1/10000$, so the period approaches $2\pi \cdot 10000 \approx 62{,}800$ positions, longer than almost any real sequence.
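The period of each dimension pair is the wavelength $2\pi/\omega_k$, and it is easy to tabulate. The sketch below uses `d_model = 512`, the base model size in Vaswani et al.; any other even value would do.

```python
import numpy as np

d_model = 512                     # illustrative choice
k = np.arange(d_model // 2)
omega = 1.0 / 10000 ** (2 * k / d_model)
wavelength = 2 * np.pi / omega    # positions per full revolution

print(f"fastest pair (k=0):   {wavelength[0]:.2f} positions")
print(f"slowest pair (k={k[-1]}): {wavelength[-1]:.0f} positions")
```

The wavelengths form a geometric progression that starts at $2\pi$ and stays just below $2\pi \cdot 10000$.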

Computing the dot product

Take two positions $i$ and $j$. Their dot product sums over all dimension pairs:

$$\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_{k}\Bigl[\sin(i\omega_k)\sin(j\omega_k) + \cos(i\omega_k)\cos(j\omega_k)\Bigr]$$

Applying the cosine subtraction identity

Recall from trigonometry:

$$\cos(A - B) = \cos A\cos B + \sin A\sin B$$

Set $A = j\,\omega_k$ and $B = i\,\omega_k$. Then each summand collapses:

$$\sin(i\omega_k)\sin(j\omega_k) + \cos(i\omega_k)\cos(j\omega_k) = \cos\!\bigl((j-i)\,\omega_k\bigr)$$

Substituting back into the sum gives the closed form:

$$\boxed{\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_{k}\cos\!\bigl((j-i)\,\omega_k\bigr)}$$
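The closed form is easy to check numerically. This sketch compares the direct dot product of two encoding vectors against the cosine sum; the helper `pe_vector` and the positions 4 and 11 are arbitrary choices made for illustration.

```python
import numpy as np

d_model = 8
k = np.arange(d_model // 2)
omega = 1.0 / 10000 ** (2 * k / d_model)

def pe_vector(p):
    """Encoding of a single position p, interleaving sin and cos."""
    v = np.empty(d_model)
    v[0::2] = np.sin(p * omega)
    v[1::2] = np.cos(p * omega)
    return v

i, j = 4, 11                                   # arbitrary positions
direct = pe_vector(i) @ pe_vector(j)           # left-hand side
closed_form = np.cos((j - i) * omega).sum()    # right-hand side

print(np.isclose(direct, closed_form))
```

The two numbers should agree up to floating-point rounding for any choice of `i` and `j`.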

Why this proves translation invariance

The right-hand side depends only on the gap $d = j - i$, not on $i$ or $j$ individually. Shift both positions by any constant $\Delta$ and the dot product is unchanged:

$$\mathrm{PE}[0]\cdot\mathrm{PE}[3] = \mathrm{PE}[1]\cdot\mathrm{PE}[4] = \mathrm{PE}[2]\cdot\mathrm{PE}[5] = \cdots = \sum_{k}\cos(3\,\omega_k)$$

Every pair at distance 3 gives exactly the same dot product, regardless of where in the sequence the pair sits.

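This is closely tied to why Vaswani et al. chose sinusoids: they hypothesized that, because $\mathrm{PE}[p+\Delta]$ is a linear function of $\mathrm{PE}[p]$ for any fixed offset $\Delta$, the model could easily learn to attend by relative position. The inner product between two positional encodings is a function of their relative distance only, which gives the attention mechanism a uniform, translation-invariant signal about "how far apart" two tokens are, without any learned parameters.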

Seeing it — the flat-line plot

The script below computes, for each distance $d$, the dot product of every pair $(i,\, i+d)$ and plots them. Every line should be flat, up to floating-point rounding.

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len = 15
d_model = 8

def get_positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            div_term = 10000 ** (i / d_model)
            PE[pos, i]     = np.sin(pos / div_term)
            PE[pos, i + 1] = np.cos(pos / div_term)
    return PE

PE = get_positional_encoding(seq_len, d_model)

# For each selected distance d, collect dot products of ALL pairs (i, i+d)
selected_distances = list(range(1, seq_len, 2))

dot_products_dict = {}
for d in selected_distances:
    dot_products_dict[d] = [
        float(np.dot(PE[i], PE[i + d]))
        for i in range(seq_len - d)
    ]

for d, vals in dot_products_dict.items():
    print(f"\nDistance {d} ({len(vals)} pairs):")
    for i, v in enumerate(vals):
        print(f"  ({i}, {i + d}) -> {v:.4f}")

# Plot: one line per distance, x = starting index i, y = dot(PE[i], PE[i+d])
plt.figure(figsize=(9, 5))
for d, vals in dot_products_dict.items():
    xs = list(range(len(vals)))
    plt.plot(xs, vals, marker="o", label=f"distance {d}")

plt.xlabel("Starting index i  (pair = (i, i+d))")
plt.ylabel("Dot product  PE[i] · PE[i+d]")
plt.title(f"Pairwise dot products per distance (d_model={d_model}, seq_len={seq_len})")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
```

That flatness is not a numerical accident; it is the algebraic identity made visible. Each horizontal line sits at the fixed value $\sum_k \cos(d\,\omega_k)$.
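If you would rather have a number than a picture, one option is to measure how much the dot product varies along each diagonal of the Gram matrix $\mathrm{PE}\,\mathrm{PE}^\top$. The sketch below reuses the sizes from the script above; the variable names `gram` and `max_spread` are illustrative.

```python
import numpy as np

seq_len, d_model = 15, 8   # same sizes as the plotting script
pos = np.arange(seq_len)[:, None]
omega = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
PE = np.empty((seq_len, d_model))
PE[:, 0::2] = np.sin(pos * omega)
PE[:, 1::2] = np.cos(pos * omega)

gram = PE @ PE.T           # gram[i, j] = PE[i] . PE[j]

# Entries on the d-th diagonal are exactly the pairs (i, i + d).
max_spread = 0.0
for d in range(1, seq_len - 1):
    vals = np.diagonal(gram, offset=d)
    max_spread = max(max_spread, float(vals.max() - vals.min()))

print(f"largest max-min spread along any diagonal: {max_spread:.2e}")
```

A spread at the level of floating-point rounding means the Gram matrix is constant along each diagonal, which is another way of saying the dot product depends only on $j - i$.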

Takeaway

The sinusoidal positional encoding from Attention Is All You Need has an elegant closed-form inner product:

$$\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_k \cos\!\bigl((j-i)\,\omega_k\bigr)$$

This single equation explains three things at once:

Translation invariance: relative distance, not absolute position, drives the inner product.
Smoothness: cosine varies continuously, so nearby positions have similar encodings.
No learned parameters: the frequencies are fixed by the closed form $\omega_k = 1/10000^{2k/d_{\text{model}}}$.
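These points can be read directly off the closed form by treating it as a function of the gap alone. In the sketch below, `pe_dot` is a hypothetical helper and the default `d_model=8` is an arbitrary choice.

```python
import numpy as np

def pe_dot(gap, d_model=8):
    """Closed-form PE[i] . PE[i + gap] = sum_k cos(gap * omega_k)."""
    k = np.arange(d_model // 2)
    omega = 1.0 / 10000 ** (2 * k / d_model)
    return float(np.cos(gap * omega).sum())

gaps = np.arange(-10, 11)
values = np.array([pe_dot(g) for g in gaps])

peak = values[gaps == 0][0]                    # gap 0: every cosine equals 1
symmetric = np.allclose(values, values[::-1])  # cos is even in the gap

for g, v in zip(gaps, values):
    print(f"gap {g:+3d}: {v:.4f}")
```

The value peaks at `d_model / 2` for a gap of zero and is symmetric in the sign of the gap. It is a sum of cosines, so it need not fall off monotonically as the gap grows.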

In the next post we'll look at learned positional embeddings (as used in BERT) and see what you gain — and give up — when you replace the fixed sinusoids with a trainable lookup table.
