Attention and Transformer Structures in Neural Networks

Working through a few core questions, we explore the structure of the key building blocks used in NLP.

Question 1: write the code for an Attention block

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(input_dim, num_heads)
        self.fc = nn.Linear(input_dim, hidden_dim)

    def forward(self, input, mask=None):
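        # nn.MultiheadAttention takes `attn_mask` (it has no `mask` keyword) and, by default,
        # expects input shaped (sequence_length, batch_size, input_dim).
        attention_output, attention_weights = self.attention(input, input, input, attn_mask=mask)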
        output = self.fc(attention_output)
        return output, attention_weights

This answer relies on an existing PyTorch module, 'torch.nn.MultiheadAttention', which makes it hard to explore the internal structure of attention, so we pose a second question.
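
Before moving on, a quick shape check shows what the built-in module returns. This is only a sketch under assumed sizes (2 sequences of 5 tokens with 16 features, 4 heads). It passes `batch_first=True`, which the answer above does not, so that the tensor layout matches the (batch_size, sequence_length, dim) convention used in the next answer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# batch_first=True makes the module accept (batch_size, sequence_length, embed_dim) tensors.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

# Hypothetical batch: 2 sequences, 5 tokens each, 16 features per token.
x = torch.randn(2, 5, 16)

# Self-attention: query, key, and value are all the same tensor.
out, weights = mha(x, x, x)

print(out.shape)      # torch.Size([2, 5, 16])
print(weights.shape)  # torch.Size([2, 5, 5]), averaged over the heads by default
```

The module hides the Q/K/V projections and the softmax, which is exactly what the next question opens up.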

Question 2: write an implementation that does not rely on that module

The code is somewhat longer:

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, input, mask=None):
        batch_size, sequence_length, _ = input.shape
        query = self.query(input)  # (batch_size, sequence_length, hidden_dim)
        key = self.key(input)  # (batch_size, sequence_length, hidden_dim)
        value = self.value(input)  # (batch_size, sequence_length, hidden_dim)

        # Calculate attention weights
        attention_weights = torch.matmul(query, key.transpose(1, 2))  # (batch_size, sequence_length, sequence_length)
        attention_weights = attention_weights / (query.size(-1) ** 0.5)  # Scale by sqrt(hidden_dim); hidden_dim is not in scope here
        if mask is not None:
            attention_weights = attention_weights.masked_fill(mask == 0, -1e9)  # Mask padded positions
        attention_weights = torch.softmax(attention_weights, dim=-1)  # Normalize attention weights

        # Calculate attention-weighted output
        attention_output = torch.matmul(attention_weights, value)  # (batch_size, sequence_length, hidden_dim)
        attention_output = self.fc(attention_output)  # (batch_size, sequence_length, hidden_dim)
        return attention_output, attention_weights

As the code shows, this attention block is built from four fully connected (FC) layers.

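The input is fed in parallel into three of the FC layers, which output the Query, Key, and Value, abbreviated Q, K, and V. These are combined by matrix operations, roughly softmax(QK^T / sqrt(d)) V. The product QK^T is divided by sqrt(d), where d is the feature dimension (hidden_dim here), so that the scores do not grow with the dimension and saturate the softmax, which would make the gradients vanish. The scaled scores are normalized with softmax and then multiplied by V. Finally, the result goes through the last FC layer to produce the output.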
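
The Q/K/V computation described above can be sketched as a standalone function. This is a minimal illustration, not the block's full forward pass: the helper name `scaled_dot_product_attention` and the tensor sizes are assumptions made for the example, and the three FC projections and the mask are left out.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V for tensors shaped (batch, seq_len, d)."""
    d = q.size(-1)
    # Similarity of every query position with every key position.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)
    # Normalize each row so the weights over the key positions sum to 1.
    weights = torch.softmax(scores, dim=-1)
    # Each output position is a weighted average of the value vectors.
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = torch.randn(1, 4, 8)  # hypothetical sizes: 1 sequence, 4 tokens, 8 features
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)      # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4])
```

Without the division by sqrt(d), the dot products tend to grow in magnitude as d increases, which pushes the softmax toward a near one-hot output where gradients are very small.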

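The other key point, and the essence of attention, is a dynamic quantity: attention_weights. Unlike the weights of the FC layers, it is not a learned parameter but is recomputed from every input. It was originally designed to handle context dependencies across long sentences in machine translation, and its role is to let the model focus on different parts of the input at different time steps.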
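
To make "dynamic" concrete, the sketch below keeps the projection layers fixed and feeds in two different inputs; the resulting weights differ. It also shows the effect of the padding mask. The helper `attention_weights_for`, the sizes, and the mask are assumptions for illustration, following the `mask == 0` convention of the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8  # hypothetical feature dimension

# Fixed projection layers, standing in for the query and key FC layers.
query_proj = nn.Linear(d, d)
key_proj = nn.Linear(d, d)

def attention_weights_for(x, mask=None):
    """Return the (batch, seq_len, seq_len) attention weights for input x."""
    scores = query_proj(x) @ key_proj(x).transpose(-2, -1) / d ** 0.5
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    return torch.softmax(scores, dim=-1)

x1 = torch.randn(1, 3, d)
x2 = torch.randn(1, 3, d)
w1 = attention_weights_for(x1)
w2 = attention_weights_for(x2)
# Same layers, different inputs: the weights are not the same.
print(torch.allclose(w1, w2))

# Hypothetical padding mask: the last of the 3 positions is padding.
mask = torch.tensor([[[1, 1, 0]]])  # shape (1, 1, 3), broadcast over query positions
w_masked = attention_weights_for(x1, mask)
print(w_masked[0, :, 2])  # weight given to the padded position by each query
```

The learned parameters live in `query_proj` and `key_proj`; the attention weights themselves are an intermediate result that changes with every input.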

Question 3: write a Transformer implementation

The chatbot's answer is as follows:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, dropout_p=0.1):
        super().__init__()
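        # Assumes the multi-head AttentionBlock from Question 1 (the one that accepts num_heads).
        # Its output size is set to input_dim so that the residual additions in forward() line up.
        self.attention = AttentionBlock(input_dim, input_dim, num_heads)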
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input, mask=None):
        residual = input
        input, attention_weights = self.attention(input, mask=mask)
        input = self.dropout(input)
        input = self.norm1(input + residual)
        residual = input
        input = self.feed_forward(input)
        input = self.dropout(input)
        input = self.norm2(input + residual)
        return input, attention_weights
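
As a usage sketch, the block below stacks two such layers and checks the shapes. To keep it self-contained it defines a compact variant, `MiniTransformerBlock` (a hypothetical name), that uses `nn.MultiheadAttention` with `batch_first=True` in place of AttentionBlock; the sizes are likewise assumptions. It follows the same order as the answer above: attention, dropout, residual plus LayerNorm, then feed-forward, dropout, residual plus LayerNorm.

```python
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    """Compact stand-in for TransformerBlock, for illustration only."""

    def __init__(self, input_dim, hidden_dim, num_heads, dropout_p=0.1):
        super().__init__()
        # Attention output keeps input_dim so the residual addition is shape-compatible.
        self.attention = nn.MultiheadAttention(input_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: self-attention, then residual connection and LayerNorm.
        attn_out, attn_weights = self.attention(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then residual connection and LayerNorm.
        x = self.norm2(x + self.dropout(self.feed_forward(x)))
        return x, attn_weights

torch.manual_seed(0)
# Because input and output shapes match, blocks can be stacked directly.
blocks = nn.ModuleList([MiniTransformerBlock(16, 64, 4) for _ in range(2)])

x = torch.randn(2, 5, 16)  # hypothetical batch: 2 sequences, 5 tokens, 16 features
for block in blocks:
    x, attn_weights = block(x)

print(x.shape)             # torch.Size([2, 5, 16])
print(attn_weights.shape)  # torch.Size([2, 5, 5])
```

Keeping the output shape equal to the input shape is the design choice that makes both the residual connections and the stacking of layers possible.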