Series · Positional Embeddings · Post 1 of N

Dot Products & Translation Invariance in Sinusoidal Positional Encodings

Why $\mathrm{PE}[i] \cdot \mathrm{PE}[j]$ depends only on $j - i$, not on absolute positions: fully derived, then verified numerically.

Apr 20, 2026

Why position even matters

A vanilla self-attention layer is permutation-equivariant: shuffle the input tokens and the output shuffles in exactly the same way. That's great for sets — but terrible for language, where "dog bites man" and "man bites dog" mean very different things.

Vaswani et al. (2017) solved this by adding a positional encoding vector to each token embedding before the first attention layer. Their choice — sinusoids — turns out to have a beautiful algebraic property that we'll derive from scratch.


Setup — the encoding formula

For a model with hidden dimension $d_{\text{model}}$, frequency index $k$ runs from $0$ to $\lfloor d_{\text{model}}/2 \rfloor - 1$. Define the angular frequency:

$$\omega_k = \frac{1}{10000^{\,2k/d_{\text{model}}}}$$

Then the positional encoding at position $p$ is:

$$\begin{aligned} \mathrm{PE}[p,\;2k] &= \sin\!\left(p\,\omega_k\right) \\[4pt] \mathrm{PE}[p,\;2k+1] &= \cos\!\left(p\,\omega_k\right) \end{aligned}$$

Each adjacent pair of dimensions $(2k,\, 2k+1)$ traces a point on a unit circle in 2-D, rotating at angular speed $\omega_k$ as $p$ increases. Lower dimensions rotate fast (high frequency); higher dimensions rotate slowly (low frequency).
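The unit-circle picture can be checked in a few lines. The sketch below is a vectorized version of the encoding formula; the function name `sinusoidal_pe` and the sizes used are illustrative choices, and it assumes `d_model` is even.

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding, vectorized. Assumes d_model is even."""
    k = np.arange(d_model // 2)                        # frequency index
    omega = 1.0 / 10000 ** (2 * k / d_model)           # angular frequencies
    angles = np.arange(seq_len)[:, None] * omega[None, :]  # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)             # illustrative sizes

# Each (sin, cos) pair has squared norm 1, so it sits on a unit circle.
pair_norms = pe[:, 0::2] ** 2 + pe[:, 1::2] ** 2
print(np.allclose(pair_norms, 1.0))

# Consequently every encoding vector has squared norm d_model / 2.
print(np.allclose((pe ** 2).sum(axis=1), 16 / 2))
```

Because every pair sits on a unit circle, all encoding vectors have the same length, so differences in dot products come only from the angles between them.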

Intuition check: At $k = 0$, $\omega_0 = 1$, so the encoding completes a full revolution every $2\pi \approx 6.28$ positions. At $k = d_{\text{model}}/2 - 1$ the angular frequency is close to $1/10000$, so for large $d_{\text{model}}$ the period approaches $2\pi \cdot 10000 \approx 62{,}800$ positions, longer than almost any real sequence.
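The range of periods is easy to compute directly. This is a minimal sketch; `d_model = 512` is an assumed value chosen for illustration, not something fixed by the derivation.

```python
import numpy as np

d_model = 512                                   # assumed size, for illustration
k = np.arange(d_model // 2)
omega = 1.0 / 10000 ** (2 * k / d_model)        # angular frequency per pair
wavelength = 2 * np.pi / omega                  # positions per full revolution

print(f"fastest pair (k=0): {wavelength[0]:.2f} positions")
print(f"slowest pair (k={k[-1]}): {wavelength[-1]:.0f} positions")
print(f"upper limit 2*pi*10000: {2 * np.pi * 10000:.0f} positions")
```

The wavelengths grow geometrically with $k$, starting at $2\pi$ and staying just below $2\pi \cdot 10000$.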

Computing the dot product

Take two positions $i$ and $j$. Their dot product sums over all dimension pairs:

$$\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_{k}\Bigl[\sin(i\omega_k)\sin(j\omega_k) + \cos(i\omega_k)\cos(j\omega_k)\Bigr]$$

Applying the cosine subtraction identity

Recall from trigonometry:

$$\cos(A - B) = \cos A\cos B + \sin A\sin B$$

Set $A = j\,\omega_k$ and $B = i\,\omega_k$. Then each summand collapses:

$$\sin(i\omega_k)\sin(j\omega_k) + \cos(i\omega_k)\cos(j\omega_k) = \cos\!\bigl((j-i)\,\omega_k\bigr)$$

Substituting back into the sum gives the closed form:

$$\boxed{\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_{k}\cos\!\bigl((j-i)\,\omega_k\bigr)}$$
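A quick numerical spot check of the closed form, as a sketch: the sizes and the pair $(i, j) = (4, 11)$ are arbitrary choices for illustration.

```python
import numpy as np

d_model, seq_len = 8, 15                        # illustrative sizes
omega = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)

# Build the encoding matrix from the definition.
angles = np.arange(seq_len)[:, None] * omega[None, :]
pe = np.empty((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

i, j = 4, 11                                    # arbitrary pair of positions
lhs = pe[i] @ pe[j]                             # direct dot product
rhs = np.cos((j - i) * omega).sum()             # closed form: sum_k cos((j-i) w_k)

print(lhs, rhs, np.isclose(lhs, rhs))
```

The two numbers should agree up to floating-point rounding for any choice of `i` and `j` within range.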

Why this proves translation invariance

The right-hand side depends only on the gap $d = j - i$, not on $i$ or $j$ individually. Shift both positions by any constant $\Delta$ and the dot product is unchanged:

$$\mathrm{PE}[0]\cdot\mathrm{PE}[3] = \mathrm{PE}[1]\cdot\mathrm{PE}[4] = \mathrm{PE}[2]\cdot\mathrm{PE}[5] = \cdots = \sum_{k}\cos(3\,\omega_k)$$

Every pair at distance 3 gives exactly the same dot product, regardless of where in the sequence the pair sits.
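Another way to state this: the matrix of all pairwise dot products is constant along each diagonal (a Toeplitz matrix). The sketch below checks that for small, illustrative sizes; the variable names are assumptions of this example.

```python
import numpy as np

d_model, seq_len = 8, 15                        # illustrative sizes
omega = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)
angles = np.arange(seq_len)[:, None] * omega[None, :]
pe = np.empty((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

gram = pe @ pe.T                                # gram[i, j] = PE[i] . PE[j]

# Largest variation along any diagonal; diagonal d holds all pairs (i, i+d).
max_spread = max(
    np.diagonal(gram, offset=d).max() - np.diagonal(gram, offset=d).min()
    for d in range(seq_len)
)
print(f"largest spread along any diagonal: {max_spread:.2e}")

# Compare the whole matrix against the closed form in one shot.
pos = np.arange(seq_len)
offsets = pos[None, :] - pos[:, None]           # offsets[i, j] = j - i
expected = np.cos(offsets[..., None] * omega).sum(axis=-1)
print(np.allclose(gram, expected))
```

The spread should be at the level of floating-point rounding, not exactly zero.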

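This property is closely tied to why Vaswani et al. chose sinusoids. Their stated motivation was that, for any fixed offset, the encoding at the shifted position is a linear function of the encoding at the original position, which they hypothesized would make it easy for the model to attend by relative position. One consequence is that the inner product between two positional encodings is a function of their relative distance only. That gives the model a uniform, translation-invariant signal about "how far apart" two tokens are, without any learned parameters in the encoding itself.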
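One way to see where the invariance comes from: shifting a position by $\Delta$ rotates each (sin, cos) pair by the angle $\Delta\,\omega_k$, and rotations preserve dot products. The sketch below builds that block-diagonal rotation explicitly; the helper `encode`, the matrix name `R`, and the chosen sizes are assumptions of this example.

```python
import numpy as np

d_model, delta = 8, 5                           # illustrative size and shift
omega = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)

def encode(p: float) -> np.ndarray:
    """Encoding of a single position p (hypothetical helper)."""
    out = np.empty(d_model)
    out[0::2] = np.sin(p * omega)
    out[1::2] = np.cos(p * omega)
    return out

# One 2x2 block per frequency, from the angle-addition formulas:
#   sin((p+D)w) =  cos(Dw) sin(pw) + sin(Dw) cos(pw)
#   cos((p+D)w) = -sin(Dw) sin(pw) + cos(Dw) cos(pw)
R = np.zeros((d_model, d_model))
for k, w in enumerate(omega):
    c, s = np.cos(delta * w), np.sin(delta * w)
    R[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, s], [-s, c]]

print(np.allclose(R @ encode(3), encode(3 + delta)))   # shift is linear
print(np.allclose(R.T @ R, np.eye(d_model)))           # R is orthogonal
print(np.isclose(encode(3) @ encode(7),
                 (R @ encode(3)) @ (R @ encode(7))))    # dot product preserved
```

Note that `R` depends only on `delta`, not on the position it is applied to, which is exactly why shifting both positions leaves the dot product unchanged.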

Seeing it — the flat-line plot

The script below computes, for each distance $d$, the dot product of every pair $(i,\, i+d)$ and plots them. Every line should be flat, up to floating-point rounding.

```python
import numpy as np
import matplotlib.pyplot as plt

seq_len = 15
d_model = 8

def get_positional_encoding(seq_len, d_model):
    PE = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            div_term = 10000 ** (i / d_model)
            PE[pos, i]     = np.sin(pos / div_term)
            PE[pos, i + 1] = np.cos(pos / div_term)
    return PE

PE = get_positional_encoding(seq_len, d_model)

# For each selected distance d, collect dot products of ALL pairs (i, i+d)
selected_distances = list(range(1, seq_len, 2))

dot_products_dict = {}
for d in selected_distances:
    dot_products_dict[d] = [
        float(np.dot(PE[i], PE[i + d]))
        for i in range(seq_len - d)
    ]

for d, vals in dot_products_dict.items():
    print(f"\nDistance {d} ({len(vals)} pairs):")
    for i, v in enumerate(vals):
        print(f"  ({i}, {i + d}) -> {v:.4f}")

# Plot: one line per distance, x = starting index i, y = dot(PE[i], PE[i+d])
plt.figure(figsize=(9, 5))
for d, vals in dot_products_dict.items():
    xs = list(range(len(vals)))
    plt.plot(xs, vals, marker="o", label=f"distance {d}")

plt.xlabel("Starting index i  (pair = (i, i+d))")
plt.ylabel("Dot product  PE[i] · PE[i+d]")
plt.title(f"Pairwise dot products per distance (d_model={d_model}, seq_len={seq_len})")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
```

That flatness is not a numerical accident: it is the algebraic identity made visible. Each horizontal line corresponds to a fixed $\sum_k \cos(d\,\omega_k)$.


Takeaway

The sinusoidal positional encoding from Attention Is All You Need has an elegant closed-form inner product:

$$\mathrm{PE}[i]\cdot\mathrm{PE}[j] = \sum_k \cos\!\bigl((j-i)\,\omega_k\bigr)$$

This single equation explains three things at once:

  • Translation invariance: relative distance, not absolute position, drives the inner product.
  • Smoothness: cosine varies continuously, so nearby positions have similar encodings.
  • No learned parameters: the frequencies are fixed by the closed form $\omega_k = 1/10000^{2k/d_{\text{model}}}$.
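The three points above can be read off the closed form by treating it as a function of the gap alone. In the sketch below, `pe_kernel` is a hypothetical helper name and `d_model = 64` is an assumed size; nothing is trained or fitted.

```python
import numpy as np

d_model = 64                                    # assumed size, for illustration
omega = 1.0 / 10000 ** (2 * np.arange(d_model // 2) / d_model)

def pe_kernel(d):
    """Closed-form dot product sum_k cos(d * omega_k) for gap(s) d."""
    d = np.asarray(d, dtype=float)
    return np.cos(d[..., None] * omega).sum(axis=-1)

gaps = np.arange(-20, 21)
values = pe_kernel(gaps)

print(values[gaps == 0])                        # value at gap 0 is d_model / 2
print(np.allclose(values, values[::-1]))        # symmetric: f(d) == f(-d)
print(np.all(values <= d_model / 2 + 1e-12))    # never exceeds the gap-0 value
```

The function takes only the gap as input, is even in $d$, and is largest at $d = 0$, where every cosine equals 1. It is not guaranteed to decrease monotonically as the gap grows, since it is a sum of cosines with different periods.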

In the next post we'll look at learned positional embeddings (as used in BERT) and see what you gain — and give up — when you replace the fixed sinusoids with a trainable lookup table.


© 2026 nirmata.dev — Writing about things worth understanding.