[Artificial Intelligence] Transformer | Position encoding in DETR object detection

This post focuses on position_encoding as used in the DETR paper. A detailed analysis of the DETR paper itself is covered separately.

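Unlike an RNN, a Transformer does not receive and process words one after another in positional order. To give the model each word's position, position information is added to the embedding vector of every word; this is called positional encoding. DETR provides two variants: a sine encoding (PositionEmbeddingSine) and a learnable encoding (PositionEmbeddingLearned). The sine encoding is the default.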
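To make the learnable variant concrete, here is a minimal PyTorch sketch of a learned 2D encoding. The class name LearnedPositionEncoding2D, the max_size limit and the feature sizes are illustrative assumptions, not DETR's exact code. The idea is that each row index and each column index gets its own trainable vector, and the two are concatenated per pixel.

```python
import torch
from torch import nn

class LearnedPositionEncoding2D(nn.Module):  # hypothetical name, for illustration
    def __init__(self, num_pos_feats=128, max_size=50):
        super().__init__()
        # One trainable vector per row index and per column index.
        self.row_embed = nn.Embedding(max_size, num_pos_feats)
        self.col_embed = nn.Embedding(max_size, num_pos_feats)

    def forward(self, features):
        # features: (N, C, H, W); only its shape and device are used.
        n, _, h, w = features.shape
        x_emb = self.col_embed(torch.arange(w, device=features.device))  # (W, F)
        y_emb = self.row_embed(torch.arange(h, device=features.device))  # (H, F)
        pos = torch.cat([
            x_emb.unsqueeze(0).expand(h, w, -1),   # same column vector in every row
            y_emb.unsqueeze(1).expand(h, w, -1),   # same row vector in every column
        ], dim=-1)                                  # (H, W, 2F)
        # (H, W, 2F) -> (N, 2F, H, W)
        return pos.permute(2, 0, 1).unsqueeze(0).expand(n, -1, -1, -1)

encoder = LearnedPositionEncoding2D()
pos = encoder(torch.zeros(2, 256, 5, 7))
print(pos.shape)
```

Unlike the sine encoding, these values start out random and are updated by backpropagation along with the rest of the network.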

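As shown in the figure, the positional encoding values are added to the embedding vectors before they are used as the transformer's input. The process of adding the positional encoding before the embeddings enter the encoder is as follows: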

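The transformer adds sine and cosine values to the embedding vectors to inject word-order information. The addition is carried out as a matrix sum of the sentence matrix and the positional encoding matrix, where each matrix is formed by stacking the per-word vectors.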
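For reference, the sine and cosine encoding defined in "Attention Is All You Need" is written as follows, where pos is the position in the sequence and i indexes pairs of embedding dimensions. DETR's code uses the same form, with num_pos_feats playing the role of d_model for each spatial axis.

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$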

- d_model: a transformer hyperparameter, the output dimension of every layer. It is 4 in the figure above and 512 in the paper "Attention Is All You Need".

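According to the expression above, a dimension of the embedding vector with an even index uses the sine value, and a dimension with an odd index uses the cosine value.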
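The even/odd rule can be checked with a few lines of plain Python. This is a minimal sketch using only the standard library; the function name sinusoidal_encoding and the tiny sizes (3 positions, d_model = 4) are chosen for illustration.

```python
import math

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Indices 2i and 2i+1 share the same frequency 1 / base^(2i / d_model).
            angle = pos / base ** (2 * (k // 2) / d_model)
            # Even index -> sine, odd index -> cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=3, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1. Later positions differ because each pair of dimensions rotates at its own rate.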

1. Sine encoding

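Take the mask and invert it. Because the encoding is two-dimensional, we take cumulative sums along the rows and along the columns separately, which gives a position value for each axis, then normalize these values and convert them to angles. We also assume that each axis is encoded by a 128-dimensional vector. We then apply the sine encoding scheme, taking the cosine at odd indices and the sine at even indices. After encoding, the x and y embeddings each have shape batch × h × w × 128.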
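The first half of that recipe (invert, accumulate, normalize, convert to angles) can be sketched in PyTorch as below. The tiny 2×3 mask is a made-up example, and the eps and 2π scale are assumptions modeled on how DETR's normalized sine encoding is usually configured.

```python
import math
import torch

# Hypothetical batch: one 2x3 feature map whose last column is padding.
# True in the mask marks padded positions.
mask = torch.tensor([[[False, False, True],
                      [False, False, True]]])      # (batch, h, w)

not_mask = ~mask                                    # True on real pixels
y_embed = not_mask.cumsum(1, dtype=torch.float32)   # running row index: 1, 2, ...
x_embed = not_mask.cumsum(2, dtype=torch.float32)   # running column index: 1, 2, ...

# Normalize by the last value along each axis, then scale into [0, 2*pi].
eps = 1e-6
scale = 2 * math.pi
y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale

print(x_embed)
print(y_embed)
```

Because padded pixels contribute 0 to the cumulative sum, the padded column repeats the last valid x value and gets a y value of 0.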

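The positional encoding PE must have the same dimension as the word vector so that the two can be added afterwards. In the formula, pos is the position of the word in the input and i is the dimension index.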

- torch.pow(): element-wise exponentiation between a tensor and a scalar, or between tensors that can be broadcast against each other.
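A short sketch of both uses of torch.pow, including how it can produce the per-channel wavelength term of the sine encoding. The values temperature = 10000 and num_pos_feats = 4 are illustrative assumptions.

```python
import torch

# Tensor raised to a scalar power, element by element.
print(torch.pow(torch.tensor([1.0, 2.0, 3.0]), 2))   # tensor([1., 4., 9.])

# Scalar base with a tensor exponent: one wavelength term per channel.
temperature = 10000.0
num_pos_feats = 4                      # small illustrative value
channels = torch.arange(num_pos_feats, dtype=torch.float32)
# Channels 2i and 2i+1 share the exponent 2i / num_pos_feats.
exponent = 2 * torch.div(channels, 2, rounding_mode="floor") / num_pos_feats
dim_t = torch.pow(temperature, exponent)
print(dim_t)                           # roughly 1, 1, 100, 100
```

Dividing a position by dim_t gives each channel pair a different wavelength, which is what keeps distinct positions from collapsing to the same code.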

2. position_encoding.py in DETR object detection

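By default, the positional embedding in DETR is a fixed value rather than a learned one. To suit two-dimensional feature maps, DETR implements its own 2D positional encoding in position_encoding.py.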

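To make the network aware of where each input sits, the most direct approach is to assign 1 to the first feature, 2 to the second, and so on. This does not work well for large inputs, because the values grow without bound. Using a sine function keeps the values between -1 and 1, but sine is periodic, so different positions may end up with the same value. The authors therefore extend the sine function to a d-dimensional vector in which different channels have different wavelengths, as in the formula above.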

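In other words, pos is the position of the word vector in the sequence and i is the channel index. Comparing with the code, DETR computes a separate positional encoding for the x and y directions of the 2D feature map. The encoding for each direction has length num_pos_feats (in practice half of hidden_dim). For both x and y, it takes the sine at even channel indices and the cosine at odd channel indices, then concatenates pos_x and pos_y into an array of shape NHWD. After permute(0,3,1,2) the shape becomes NDHW, where D equals hidden_dim. hidden_dim is the dimension of the Transformer's input vectors, so in the implementation it must match the channel dimension of the CNN backbone feature map that is fed to the Transformer (DETR projects the backbone output to hidden_dim with a 1×1 convolution). The position code therefore has exactly the same shape as the CNN feature map it is added to.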
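The shape bookkeeping above can be traced with a compact sketch of a 2D sine encoding. It is modeled on the approach described here and is not a copy of DETR's position_encoding.py: the function name sine_position_encoding_2d and the default arguments are assumptions for illustration.

```python
import math
import torch

def sine_position_encoding_2d(mask, num_pos_feats=128, temperature=10000.0):
    """mask: (N, H, W) bool, True on padded pixels. Returns (N, 2*num_pos_feats, H, W)."""
    not_mask = ~mask
    y_embed = not_mask.cumsum(1, dtype=torch.float32)
    x_embed = not_mask.cumsum(2, dtype=torch.float32)

    # Normalize each axis to [0, 2*pi].
    eps, scale = 1e-6, 2 * math.pi
    y_embed = y_embed / (y_embed[:, -1:, :] + eps) * scale
    x_embed = x_embed / (x_embed[:, :, -1:] + eps) * scale

    dim_t = torch.arange(num_pos_feats, dtype=torch.float32, device=mask.device)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / num_pos_feats)

    pos_x = x_embed[:, :, :, None] / dim_t     # (N, H, W, num_pos_feats)
    pos_y = y_embed[:, :, :, None] / dim_t
    # Sine on even channels, cosine on odd channels, interleaved back together.
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=4).flatten(3)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=4).flatten(3)
    # (N, H, W, D) -> (N, D, H, W), with D = 2 * num_pos_feats
    return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)

mask = torch.zeros(2, 5, 7, dtype=torch.bool)   # no padding anywhere
pos = sine_position_encoding_2d(mask)
print(pos.shape)   # torch.Size([2, 256, 5, 7])
```

With num_pos_feats = 128 per axis, the concatenated result has 256 channels, which matches a hidden_dim of 256, so it can be added element-wise to a projected feature map of shape (N, 256, H, W).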

Code for some of the details:

- math.pi: a constant in Python's math module that holds the value of π (it is a constant, not a function).

# Multi-head attention component used in Transformers
import math
import torch
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
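# Initialize a nested tensor (the constructor lives in the torch.nested namespace)
nt = torch.nested.nested_tensor([torch.randn((2, 6)), torch.randn((3, 6))], device=device)
print(nt)
# Padding every underlying tensor to the same shape converts a nested tensor into a regular tensor.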
pt = torch.nested.to_padded_tensor(nt, padding=0.0)
print(pt)


def mha_nested(query, key, value, nheads,
               W_q, W_k, W_v, W_out,
               b_q=None, b_k=None, b_v=None, b_out=None,
               dropout_p=0.0):
    """
    Args:
        query: query of shape (N, L_t, E_q)
        key: key of shape (N, L_s, E_k)
        value: value of shape (N, L_s, E_v)
        nheads: number of heads in multi-head attention
        W_q: Weight for query input projection of shape (E_total, E_q)
        W_k: Weight for key input projection of shape (E_total, E_k)
        W_v: Weight for value input projection of shape (E_total, E_v)
        W_out: Weight for output projection of shape (E_out, E_total)
        b_q (optional): Bias for query input projection of shape E_total. Default: None
        b_k (optional): Bias for key input projection of shape E_total. Default: None
        b_v (optional): Bias for value input projection of shape E_total. Default: None
        b_out (optional): Bias for output projection of shape E_out. Default: None
        dropout_p: dropout probability. Default: 0.0
        where:
            N is the batch size
            L_t is the target sequence length (jagged)
            L_s is the source sequence length (jagged)
            E_q is the embedding size for query
            E_k is the embedding size for key
            E_v is the embedding size for value
            E_total is the embedding size for all heads combined
            E_out is the output embedding size
    Returns:
        attn_output: Output of shape (N, L_t, E_out)
    """
    N = query.size(0)
    E_total = W_q.size(0)
    # The combined embedding dim must be divisible by the number of heads.
    assert E_total % nheads == 0, "Embedding dim is not divisible by nheads"
    E_head = E_total // nheads

    # apply input projection
    # (N, L_t, E_q) -> (N, L_t, E_total)
    query = F.linear(query, W_q, b_q)
    # (N, L_s, E_k) -> (N, L_s, E_total)
    key = F.linear(key, W_k, b_k)
    # (N, L_s, E_v) -> (N, L_s, E_total)
    value = F.linear(value, W_v, b_v)

    # reshape query, key, value to separate by head
    # (only one dimension may be inferred with -1, so the batch size N is passed explicitly)
    # (N, L_t, E_total) -> (N, L_t, nheads, E_head) -> (N, nheads, L_t, E_head)
    query = query.reshape(N, -1, nheads, E_head).transpose(1, 2)
    # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
    key = key.reshape(N, -1, nheads, E_head).transpose(1, 2)
    # (N, L_s, E_total) -> (N, L_s, nheads, E_head) -> (N, nheads, L_s, E_head)
    value = value.reshape(N, -1, nheads, E_head).transpose(1, 2)

    # query matmul key^T
    # (N, nheads, L_t, E_head) x (N, nheads, L_s, E_head)^T -> (N, nheads, L_t, L_s)
    keyT = key.transpose(-1, -2)
    attn_weights = torch.matmul(query, keyT)

    # scale down
    attn_weights = attn_weights * (1.0 / math.sqrt(E_head))

    # softmax
    attn_weights = F.softmax(attn_weights, dim=-1)

    # dropout
    if dropout_p > 0.0:
        attn_weights = F.dropout(attn_weights, p=dropout_p)

    # attention_weights matmul value
    # (N, nheads, L_t, L_s) x (N, nheads, L_s, E_head) -> (N, nheads, L_t, E_head)
    attn_output = torch.matmul(attn_weights, value)

    # merge heads
    # (N, nheads, L_t, E_head) -> (N, L_t, nheads, E_head) -> (N, L_t, E_total)
    attn_output = attn_output.transpose(1, 2).reshape(N, -1, E_total)

    # apply output projection
    # (N, L_t, E_total) -> (N, L_t, E_out)
    attn_output = F.linear(attn_output, W_out, b_out)

    return attn_output

- mask: a positional mask array. For an image that has not been zero-padded, its mask is an array of all zeros.
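To illustrate where non-zero mask entries come from, here is a minimal sketch that batches two images of different sizes by zero-padding them to a common shape. The image sizes are made up, and the convention that True marks padding is an assumption consistent with the mask being all zeros for an unpadded image.

```python
import torch

# Two hypothetical images of different sizes, as (C, H, W) tensors.
images = [torch.rand(3, 4, 6), torch.rand(3, 2, 5)]
max_h = max(img.shape[1] for img in images)
max_w = max(img.shape[2] for img in images)

batch = torch.zeros(len(images), 3, max_h, max_w)
mask = torch.ones(len(images), max_h, max_w, dtype=torch.bool)  # start as all padding
for i, img in enumerate(images):
    _, h, w = img.shape
    batch[i, :, :h, :w] = img    # copy pixels into the top-left corner
    mask[i, :h, :w] = False      # mark real pixels

print(mask[0].any())   # tensor(False): the largest image needs no padding
print(mask[1].sum())   # tensor(14): 24 positions in total, 10 of them real
```

Inverting this mask and taking cumulative sums then yields position values only over the real pixels.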