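Positional Encoding in Transformers

The cover image is the official artwork for the 4th anniversary of Genshin Impact.

This post discusses several positional encodings used in Transformers, with the aim of building an intuitive feel for their properties. It accompanies this Jupyter Notebook.

Environment setup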

import numpy as np

First, generate a mock list of tokens. To highlight the effect of the positional encoding, every token value is initialized to 1.

dim = 2
len = 4
tokens = np.ones([len,dim])
print(tokens)

Output:

[[1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]]

Positional encodings

Integer encoding

Use the array index directly as the position marker.

int_pe = np.arange(0,len).reshape([len,1])
int_pe = np.broadcast_to(int_pe, [len,dim])
int_pe

Output:

array([[0, 0],
       [1, 1],
       [2, 2],
       [3, 3]])

int_pe_embed = np.add(tokens, int_pe)
int_pe_embed

Output:

array([[1., 1.],
       [2., 2.],
       [3., 3.],
       [4., 4.]])

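This is a very intuitive scheme. Its drawback is that the positional values grow without bound as the token sequence gets longer.

[0,1] floating-point encoding

Compress the array indices into the range [0,1] to mark positions, which avoids the problem of overly large positional values.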

zeroone_pe = np.arange(0, 1, 1.0/len).reshape([len,1])
zeroone_pe = np.broadcast_to(zeroone_pe, [len,dim])
zeroone_pe

Output:

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [0.75, 0.75]])

zeroone_pe_embed = np.add(tokens, zeroone_pe)
zeroone_pe_embed

Output:

array([[1.  , 1.  ],
       [1.25, 1.25],
       [1.5 , 1.5 ],
       [1.75, 1.75]])

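This does solve the problem of overly large values in the integer encoding. However, the distance between neighboring tokens now changes with the sequence length: the step is 0.25 for 4 tokens but would be 0.125 for 8 tokens. That is not what we want.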
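The dependence on sequence length can be checked directly. The following is a minimal sketch; the helper name `zeroone_positions` is an assumption introduced for illustration and is not part of the notebook.

```python
import numpy as np

def zeroone_positions(length):
    """Map indices 0..length-1 into [0, 1) with a uniform step of 1/length."""
    return np.arange(length) / length

# Spacing between neighbouring positions for two different sequence lengths.
step_short = zeroone_positions(4)[1] - zeroone_positions(4)[0]
step_long = zeroone_positions(8)[1] - zeroone_positions(8)[0]
print(step_short, step_long)
```

The same pair of adjacent tokens ends up a different distance apart depending on how long the whole sequence is, so the encoding does not give a stable notion of "one step".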

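Binary encoding

Since each token is a multi-dimensional vector, adding a different positional value to each dimension can carry more information than adding the same integer to every dimension. This suggests a binary encoding: convert the index into a binary vector and use that as the positional encoding.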

def binary_vector_array(dim, length):
    if length > 2 ** dim:
        raise ValueError("length must not exceed 2^dim")
    result = np.empty((length, dim), dtype=int)
    for i in range(length):
        # Format as a binary string, zero-padded on the left to dim digits
        bin_str = format(i, f'0{dim}b')
        result[i] = [int(bit) for bit in bin_str]
    return result

binary_PE = binary_vector_array(dim,len)
binary_PE

Output:

array([[0, 0],
       [0, 1],
       [1, 0],
       [1, 1]])

binary_PE_embed = np.add(tokens, binary_PE)
binary_PE_embed

Output:

array([[1., 1.],
       [1., 2.],
       [2., 1.],
       [2., 2.]])

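Compared with the previous approaches, this one makes fuller use of the vector space. However, the position vectors it produces are discrete points in that space. The binary position vectors are plotted below, each drawn as a point in the 2-D plane (since dim = 2 here).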

import matplotlib.pyplot as plt

def PE_plt(points):
    fig = plt.figure()
    ax = fig.add_subplot(111)

    x = points[:, 0]
    y = points[:, 1]

    ax.scatter(x, y, c='r', marker='o')

    for idx, (xi,yi) in enumerate(points):
        ax.text(x=xi,y=yi, s=f'{idx}')

    ax.set_xlabel('X')
    ax.set_ylabel('Y')

    plt.show()

PE_plt(binary_PE)

Output:

<Figure size 640x480 with 1 Axes>

[Figure: scatter plot of the binary position vectors, each point labeled with its position index]

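As the plot shows, positions 1 and 2 are adjacent, yet their position vectors are far apart (a distance of $\sqrt{2}$, versus 1 between positions 0 and 1). This makes it harder for the model to learn relationships between neighboring positions. Moreover, connecting the points in order does not yield a continuous, differentiable function, which makes the encoding hard to generalize, for example to fractional positions.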
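The jump between neighboring positions can also be measured numerically. This is a small sketch; `binary_positions` is a hypothetical compact helper written for this example, equivalent in intent to the notebook's binary encoder.

```python
import numpy as np

def binary_positions(dim, length):
    """Hypothetical helper: row i holds the dim-bit binary representation of i."""
    return np.array([[int(b) for b in format(i, f"0{dim}b")] for i in range(length)])

pe = binary_positions(2, 4)
# Euclidean distance between each pair of neighbouring positions (0-1, 1-2, 2-3).
neighbour_dist = np.linalg.norm(np.diff(pe, axis=0), axis=1)
print(neighbour_dist)
```

The distances are not uniform: the step from position 1 to position 2 flips both bits, so it is longer than the steps on either side of it.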

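sin encoding

The discreteness of the binary position vectors suggests using a continuous function in a high-dimensional space as the encoding instead. One option is a $\sin$ encoding. For the $t$-th token it takes the form $$ PE_t=[\sin(\frac{t}{2^0}),\sin(\frac{t}{2^1}),\dots,\sin(\frac{t}{2^i}),\dots,\sin(\frac{t}{2^{dim-1}})] $$ where $t$ is the position of the token and $i$ indexes the dimension.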

import math

sin_PE = np.empty([len,dim])
for t in range(len):
    sin_PE[t] = [math.sin(t/2**i) for i in range(dim)]

sin_PE

Output:

array([[0.        , 0.        ],
       [0.84147098, 0.47942554],
       [0.90929743, 0.84147098],
       [0.14112001, 0.99749499]])

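In the $\sin$ encoding, the lower dimensions change faster and the higher dimensions change more slowly. This is the reverse of the binary encoding above, where the last dimension flips fastest, but the idea is the same: different dimensions change at different rates. This contrasts with the integer encoding, where every dimension changes at the same rate.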

sin_PE_embed = np.add(tokens, sin_PE)
sin_PE_embed

Output:

array([[1.        , 1.        ],
       [1.84147098, 1.47942554],
       [1.90929743, 1.84147098],
       [1.14112001, 1.99749499]])

The $\sin$ encoding is plotted below.

PE_plt(sin_PE)

Output:

<Figure size 640x480 with 1 Axes>

[Figure: scatter plot of the sin position vectors, each point labeled with its position index]

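Here there are no sudden jumps in distance, and connecting the points gives a continuous, differentiable function. However, the start and end positions are fairly close together, and as more tokens are added some positions may nearly coincide. The fix is simply to lengthen the period of the $\sin$ function, for example by using $\sin(\frac{t}{10000^{\frac{i}{dim-1}}})$.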
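As a rough illustration of the longer-period variant, the sketch below evaluates $\sin(t / 10000^{i/(dim-1)})$ for every position and dimension. The function name `scaled_sin_pe` and the `base` parameter are assumptions introduced here, and the formula needs `dim >= 2` to avoid dividing by zero.

```python
import numpy as np

def scaled_sin_pe(length, dim, base=10000.0):
    """sin(t / base**(i / (dim - 1))) for t in [0, length) and i in [0, dim)."""
    if dim < 2:
        raise ValueError("dim must be at least 2")
    t = np.arange(length).reshape(-1, 1)   # positions as a column
    i = np.arange(dim).reshape(1, -1)      # dimension indices as a row
    return np.sin(t / base ** (i / (dim - 1)))

pe = scaled_sin_pe(4, 2)
print(pe)
```

With `dim = 2` the first dimension is still $\sin(t)$, while the second becomes $\sin(t/10000)$, whose period is so long that it grows almost linearly over short sequences.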

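Sinusoidal encoding

This is the encoding used in the original Transformer paper. On top of the $\sin$ encoding, the Sinusoidal encoding tries to solve one more problem: given the encoding $PE_t$ of position $t$ and an offset $\Delta t$, can $PE_{t+\Delta t}$ be computed by a linear transformation alone? That is, it should satisfy $$ PE_{t+\Delta t} = T_{\Delta t}PE_{t} $$ where $T_{\Delta t}$ is a linear transformation matrix.

This encodes the relative position between tokens more explicitly and also helps with computational optimization.

The equation above brings rotation matrices to mind. Let

$$ T_{\Delta t} = \begin{pmatrix} \cos{\Delta t}&\sin{\Delta t}\\ -\sin{\Delta t}&\cos{\Delta t} \end{pmatrix},\quad PE_t = \begin{pmatrix} \sin{t}\\ \cos{t} \end{pmatrix} $$

Then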

$$ \begin{pmatrix} \sin(t+\Delta t)\\ \cos(t+\Delta t) \end{pmatrix}= \begin{pmatrix} \cos{\Delta t}&\sin{\Delta t}\\ -\sin{\Delta t}&\cos{\Delta t} \end{pmatrix} \begin{pmatrix} \sin{t}\\ \cos{t} \end{pmatrix} $$
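The identity follows from the angle-addition formulas, and it can be spot-checked numerically. In this sketch the names `pe2` and `shift_matrix` and the sample values of `t` and `delta` are chosen only for illustration.

```python
import numpy as np

def pe2(t):
    """Two-dimensional encoding (sin t, cos t)."""
    return np.array([np.sin(t), np.cos(t)])

def shift_matrix(delta):
    """Matrix T_delta from the text; it depends on the offset only, not on t."""
    return np.array([[np.cos(delta), np.sin(delta)],
                     [-np.sin(delta), np.cos(delta)]])

t, delta = 3.0, 1.5
lhs = pe2(t + delta)                 # encoding of the shifted position
rhs = shift_matrix(delta) @ pe2(t)   # linear map applied to the original encoding
print(np.allclose(lhs, rhs))
```

Because `shift_matrix` never looks at `t`, the same matrix moves any position forward by `delta`.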

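Extending this to multiple dimensions in the same way as the $\sin$ encoding gives $$ PE_t = [\sin(w_0t), \cos(w_0t), \sin(w_1t), \cos(w_1t), \dots] $$ The Sinusoidal encoding sets the $i$-th component to $$ PE_{t,i} = \begin{cases} \sin(\frac{t}{10000^{\frac{i}{dim}}}) & i=2k, k\in N \\ \cos(\frac{t}{10000^{\frac{i-1}{dim}}}) & i=2k+1, k\in N \end{cases} $$ so that each $\sin$/$\cos$ pair shares one frequency. The Sinusoidal encoding requires the vector dimension to be even.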

dim = 2
def gen_Sinusoidal_PE(len,dim):
    Sinusoidal_PE = np.empty([len, dim])
    for t in range(len):
        Sinusoidal_PE[t] = [math.sin(t/10000**(i/dim)) if i%2==0 else math.cos(t/10000**((i-1)/dim)) for i in range(dim)]
    return Sinusoidal_PE
Sinusoidal_PE = gen_Sinusoidal_PE(len,dim)
Sinusoidal_PE

Output:

array([[ 0.        ,  1.        ],
       [ 0.84147098,  0.54030231],
       [ 0.90929743, -0.41614684],
       [ 0.14112001, -0.9899925 ]])

Sinusoidal_PE_embed = np.add(tokens, Sinusoidal_PE)
Sinusoidal_PE_embed

Output:

array([[1.        , 2.        ],
       [1.84147098, 1.54030231],
       [1.90929743, 0.58385316],
       [1.14112001, 0.0100075 ]])

The plot is shown below:

PE_plt(Sinusoidal_PE)

Output:

<Figure size 640x480 with 1 Axes>

[Figure: scatter plot of the Sinusoidal position vectors, each point labeled with its position index]

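It looks somewhat like the plot of the $\sin$ encoding, which is no surprise since both are built from trigonometric functions.

The Sinusoidal encoding has other nice properties as well. For example, the dot product of two position vectors depends only on $\Delta t$, because each $\sin$/$\cos$ pair contributes $\sin a\sin b+\cos a\cos b=\cos(a-b)$, so the dot product reflects the distance between the two positions. The dot product is also undirected: it is the same for $+\Delta t$ and $-\Delta t$. The plot below shows the dot product between the encoding of the middle position and the encodings of all other positions.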

orig_len = len
len = 9
Sinusoidal_PE = gen_Sinusoidal_PE(len, dim)
Sinusoidal_PE_dot_product = [np.dot(Sinusoidal_PE[i], Sinusoidal_PE[len//2]) for i in range(len)]
plt.plot(Sinusoidal_PE_dot_product, marker='o')
plt.xlabel("pos")
plt.ylabel("dot product")
plt.show()
len = orig_len

Output:

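<Figure size 640x480 with 1 Axes>

[Figure: line plot of the dot product between the middle position's encoding and every position's encoding; x-axis "pos", y-axis "dot product"]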
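The claim that the dot product depends only on the offset can be checked for a higher dimension as well. The sketch below uses a vectorized helper, `sinusoidal_pe`, which is a hypothetical rewrite of the same formula made for this example; the sequence length of 32 and dimension of 4 are arbitrary choices.

```python
import numpy as np

def sinusoidal_pe(length, dim, base=10000.0):
    """Interleaved (sin, cos) pairs, one shared frequency per pair; dim must be even."""
    t = np.arange(length).reshape(-1, 1)
    k = np.arange(dim // 2).reshape(1, -1)
    angles = t / base ** (2 * k / dim)
    pe = np.empty((length, dim))
    pe[:, 0::2] = np.sin(angles)   # even components
    pe[:, 1::2] = np.cos(angles)   # odd components
    return pe

pe = sinusoidal_pe(32, 4)
a = pe[2] @ pe[5]      # offset +3 near the start
b = pe[20] @ pe[23]    # offset +3 further along
c = pe[23] @ pe[20]    # offset -3
print(np.isclose(a, b), np.isclose(b, c))
```

Both comparisons agree, since each pair contributes the cosine of its frequency times the offset, regardless of where in the sequence the two positions sit.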

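RoPE encoding

RoPE is one of the most popular positional encodings today.

Although the position vectors produced by the Sinusoidal encoding implicitly carry relative position information and have the nice property that their dot product depends on distance, that property is lost once the position vector is added to the tokens. So what if the rotation matrix is applied to the tokens directly? RoPE does not generate an explicit position vector; instead it multiplies the tokens by a rotation matrix: $$ x'_t = \begin{pmatrix} \cos{t\theta}&-\sin{t\theta}\\ \sin{t\theta}&\cos{t\theta} \end{pmatrix}x_t $$ To extend this to multiple dimensions, replace the single $\theta$ with one angle $\theta_k$ per pair of dimensions. RoPE sets $$ \theta_k = \frac{1}{10000^{\frac{2k}{dim}}},\quad k = 0, 1, \dots, \frac{dim}{2}-1 $$ so that dimensions $2k$ and $2k+1$ share the same angle, and combines the $2\times 2$ blocks into the block-diagonal matrix $$ \begin{pmatrix} \cos{t\theta_0}&-\sin{t\theta_0}&0&0&\cdots\\ \sin{t\theta_0}&\cos{t\theta_0}&0&0&\cdots\\ 0&0&\cos{t\theta_1}&-\sin{t\theta_1}&\cdots\\ 0&0&\sin{t\theta_1}&\cos{t\theta_1}&\cdots\\ \cdots&\cdots&\cdots&\cdots&\ddots \end{pmatrix} $$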

RoPE_embed = np.empty([len, dim])
for t in range(len):
    RoPE = np.array([[math.cos(t/(10000**(0/dim))), -math.sin(t/(10000**(0/dim)))],[math.sin(t/(10000**(0/dim))), math.cos(t/(10000**(0/dim)))]])
    print(RoPE)
    RoPE_embed[t] = np.matmul(RoPE, tokens[t].T).T
RoPE_embed

Output:

[[ 1. -0.]
 [ 0.  1.]]
[[ 0.54030231 -0.84147098]
 [ 0.84147098  0.54030231]]
[[-0.41614684 -0.90929743]
 [ 0.90929743 -0.41614684]]
[[-0.9899925  -0.14112001]
 [ 0.14112001 -0.9899925 ]]

array([[ 1.        ,  1.        ],
       [-0.30116868,  1.38177329],
       [-1.32544426,  0.49315059],
       [-1.1311125 , -0.84887249]])

PE_plt(RoPE_embed)

Output:

<Figure size 640x480 with 1 Axes>

[Figure: scatter plot of the RoPE-rotated tokens, each point labeled with its position index]

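Of course, a real RoPE implementation would not be written like the code above. Also, in a Transformer the positional embedding is usually applied to the q and k vectors rather than directly to the raw tokens. For details, see the Llama implementation: https://github.com/meta-llama/llama-models/blob/a9c89c471f793423afd4cc3ca8671d6e56fe64cb/models/llama4/model.py#L89

Why apply the positional embedding to the $q$ and $k$ vectors? As mentioned, RoPE aims to make the inner product of the encoded vectors depend on their distance. In a Transformer, the q and k vectors are exactly the ones whose inner product is taken: $$ \mathrm{Attention}(q,k,v) = \mathrm{softmax}(\frac{qk^T}{\sqrt{d_k}})v $$ Therefore, applying RoPE to $q$ and $k$ before the attention operation makes $qk^T$ carry relative position information, and so the attention output carries it as well.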
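To make the relative-position property concrete, here is a minimal sketch that rotates each consecutive pair of dimensions and compares attention scores at two different absolute positions with the same offset. It is not how Llama implements RoPE; the function name `apply_rope`, the `base` parameter, the random test vectors and the chosen positions are all assumptions made for this example.

```python
import numpy as np

def apply_rope(x, t, base=10000.0):
    """Rotate each consecutive pair (x[2k], x[2k+1]) of a 1-D vector by t * theta_k."""
    dim = x.shape[-1]                                     # must be even
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)    # one angle per pair
    cos, sin = np.cos(t * theta), np.sin(t * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x, dtype=float)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=4), rng.normal(size=4)
score_a = apply_rope(q, 5) @ apply_rope(k, 2)        # positions 5 and 2
score_b = apply_rope(q, 105) @ apply_rope(k, 102)    # same offset, shifted by 100
print(np.isclose(score_a, score_b))
```

The two scores match because the product of the two rotations reduces to a single rotation by the position difference, so only the offset between q and k survives in the inner product.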