Sophisticated Input
What is the output?
- Every vector has a label. (Sequence Labeling) (e.g. predicting the part of speech of each input word)
- The whole sequence has a label. (e.g. Sentiment analysis: given a review, decide whether its sentiment is positive or negative)
- The model decides the number of labels itself. (seq2seq) (e.g. translation)
Sequence Labeling
- The following uses POS tagging as an example. We could use a fully-connected network (FC) to predict each word's part of speech, but there is an obvious problem: for the sentence "I saw a saw", the same FC can never output different POS tags for the two occurrences of "saw".
Is it possible to consider the context?
- The FC can consider the neighbors (use a window to take some context into account).
How to consider the whole sequence?
- Use a window that covers the whole sequence?
- Input sequences have variable length; sizing the window to the longest possible sequence means the FC needs far more parameters, which increases computation and makes overfitting more likely.
- Solution: Self-attention!
Self-attention
- As shown in the figure below, self-attention first produces one output vector per input (the vectors with black borders in the figure), and each of these vectors takes the whole sequence into account; these sequence-aware vectors are then fed into an FC to obtain the final output.
- Self-attention and FC layers can also be applied alternately.
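A rough sketch of this interleaving (assuming PyTorch; the dimensions, the single attention head, and the ReLU nonlinearity are illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# A toy block that alternates self-attention and a fully-connected layer.
# All sizes here are arbitrary placeholders.
class AttentionFCBlock(nn.Module):
    def __init__(self, dim=16, num_heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)           # self-attention: query = key = value = x
        return torch.relu(self.fc(out))       # FC applied to every position

x = torch.randn(1, 4, 16)                     # a sequence of 4 vectors
y = AttentionFCBlock()(x)
print(y.shape)                                # torch.Size([1, 4, 16])
```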
- The best-known use of self-attention is the Transformer: "Attention Is All You Need".
Self-attention
- The inputs to self-attention are vectors $a^i$, which can be the raw inputs or the outputs of a hidden layer; the outputs are vectors $b^i$, and every $b^i$ takes all of the inputs $a^j\ (1 \leq j \leq 4)$ into account.
How is the output $b^1$ computed?
- Find the relevant vectors in the sequence: to produce the first output $b^1$, we first want to find all the input vectors that are relevant to $a^1$; the relevance between each pair of vectors is represented by an attention score $\alpha$.
How is the attention score $\alpha$ computed?
- Method 1 (assumed by default in the rest of these notes, and also the most common): Dot-product
- Method 2: Additive
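A minimal NumPy sketch of the two scoring functions; the vector dimension, the weight matrices, and the particular additive form shown (one common variant) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # dimension of a^i (placeholder)
a1, a2 = rng.standard_normal(d), rng.standard_normal(d)
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Method 1: dot-product -- alpha = (W^q a^1) . (W^k a^2)
q, k = Wq @ a1, Wk @ a2
alpha_dot = q @ k

# Method 2: additive -- one common form: alpha = w^T tanh(W^q a^1 + W^k a^2)
w = rng.standard_normal(d)
alpha_add = w @ np.tanh(Wq @ a1 + Wk @ a2)
```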
Computing the attention score with the dot product
- In the dot-product method, each input vector is multiplied by $W^q$ and $W^k$ to get a query and a key, e.g. $q^1 = W^q a^1$ and $k^j = W^k a^j$, and the attention score is their dot product, $\alpha_{1,j} = q^1 \cdot k^j$.
- The relevance of $a^1$ with itself is usually computed as well: compute $k^1$ and take its dot product with $q^1$ to obtain $\alpha_{1,1}$.
- After computing the relevance between $a^1$ and every vector, apply a softmax:
$\alpha_{1,i}' = \exp(\alpha_{1,i}) / \sum_j \exp(\alpha_{1,j})$
- Softmax is not mandatory here; other activation functions, e.g. ReLU, also work.
Extract information based on attention scores
- Take a weighted sum based on the attention scores (the pairwise relevance) to extract information from the whole sequence:
$b^1 = \sum_i \alpha_{1,i}' v^i$, where $v^i = W^v a^i$
- Note that $b^1 \sim b^4$ can be computed in parallel.
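Putting the steps for one query together, a minimal NumPy sketch (the dimensions, the random weights, and the row-stacking of the inputs are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                                         # vector dim, sequence length (placeholders)
a = rng.standard_normal((n, d))                     # a^1 ... a^4 as rows
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

q1 = Wq @ a[0]                                      # query from a^1
k = a @ Wk.T                                        # row j is k^j = W^k a^j
v = a @ Wv.T                                        # row j is v^j = W^v a^j

alpha = k @ q1                                      # alpha_{1,j} = q^1 . k^j (includes j = 1)
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()   # softmax over j
b1 = alpha_prime @ v                                # b^1 = sum_j alpha'_{1,j} v^j
```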
Implementing self-attention with matrix operations
Computing $q$, $k$, $v$ (matrix form)
- Stack the input vectors into a single input matrix; multiplying it by $W^q$ / $W^k$ / $W^v$ directly gives all of the corresponding $q$, $k$, $v$ at once.
Computing the attention scores
Computing the output $b$
Summary
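A compact NumPy sketch of the whole pipeline in matrix form; here the input vectors are stacked as rows (so row $i$ of the output is $b^i$), which may differ from the column convention used in the figures:

```python
import numpy as np

def self_attention(A, Wq, Wk, Wv):
    """A: (n, d) input matrix with one vector a^i per row; returns one b^i per row."""
    Q, K, V = A @ Wq, A @ Wk, A @ Wv                # all queries, keys, values at once
    scores = Q @ K.T                                # (n, n) matrix of scores alpha_{i,j}
    # (The Transformer additionally divides the scores by sqrt(d); omitted here.)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax -> alpha'
    return weights @ V                              # row i is b^i = sum_j alpha'_{i,j} v^j

rng = np.random.default_rng(0)
n, d = 4, 8                                         # sequence length and dimension (placeholders)
A = rng.standard_normal((n, d))
B = self_attention(A, *(rng.standard_normal((d, d)) for _ in range(3)))
print(B.shape)                                      # (4, 8): b^1 ... b^4 in parallel
```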
Multi-head Self-attention (Different types of relevance)
2 heads as an example
- Different sets of $q$, $k$, $v$ are responsible for capturing different kinds of relevance.
$q^i = W^q a^i, \qquad q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i$
- $k$ and $v$ are computed in the same way as $q$.
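A rough two-head sketch following the formulas above; the per-head dimensions and the final projection $W^o$ that merges the heads are assumptions not spelled out in these notes:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, heads = 4, 8, 2
A = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = A @ Wq, A @ Wk, A @ Wv                  # rows are q^i, k^i, v^i

outputs = []
for h in range(heads):
    # per-head projections: q^{i,h} = W^{q,h} q^i (similarly for k, v);
    # head dimension d // heads is a common choice
    Wqh, Wkh, Wvh = (rng.standard_normal((d, d // heads)) for _ in range(3))
    Qh, Kh, Vh = Q @ Wqh, K @ Wkh, V @ Wvh
    outputs.append(softmax_rows(Qh @ Kh.T) @ Vh)  # each head attends independently

b = np.concatenate(outputs, axis=1)               # concatenate the heads: rows are b^i
Wo = rng.standard_normal((d, d))                  # extra output projection (assumption)
b = b @ Wo
```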
Positional Encoding
- No position information in self-attention (the self-attention described so far ignores positions: vectors far apart are treated just like immediate neighbors).
- $\Rightarrow$ Each position has a unique positional vector $e^i$ (hand-crafted or learned from data).
Visualizing the hand-crafted $e^i$
- Each column represents a positional vector $e^i$ (these are the positional vectors used in the paper "Attention Is All You Need").
There are many different ways to generate positional encodings.
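One concrete hand-crafted choice is the sinusoidal encoding from "Attention Is All You Need"; a short NumPy sketch (the sequence length and model dimension are placeholders):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """e^i[2k] = sin(i / 10000^(2k/d)), e^i[2k+1] = cos(i / 10000^(2k/d))."""
    pos = np.arange(seq_len)[:, None]            # positions i
    k = np.arange(0, d_model, 2)[None, :]        # even dimensions
    angle = pos / (10000 ** (k / d_model))
    e = np.zeros((seq_len, d_model))
    e[:, 0::2] = np.sin(angle)
    e[:, 1::2] = np.cos(angle)
    return e                                     # row i is the positional vector e^i

# The positional vector is simply added to the input: a^i + e^i
e = sinusoidal_positional_encoding(seq_len=4, d_model=8)
```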
Applications
Widely used in Natural Language Processing (NLP)!
Bert is a yellow Muppet character on the long-running PBS and HBO children’s television show Sesame Street.
Self-attention for Speech
Self-attention for Image
Performance
Self-attention vs. CNN
- CNN: self-attention that can only attend within a receptive field.
- Self-attention: CNN with a learnable receptive field (self-attention considers information from the whole image rather than being limited to a receptive field, effectively learning a very flexible receptive field on its own).
- CNN is a simplified version of self-attention; self-attention is a more complex version of CNN.
Self-attention vs. RNN
RNN can largely be replaced by self-attention nowadays.
(Figure: comparison of a Recurrent Neural Network (RNN) and self-attention)
Self-attention for Graph
- Consider the edges: only attend to connected nodes (when computing attention scores, only pairs of nodes joined by an edge are considered, i.e. only the blue entries of the attention matrix in the figure are computed; a code sketch follows after this list).
- This is one type of Graph Neural Network (GNN)
Each node is a vector; the graph could, for example, be a social network.
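A NumPy sketch of this masking idea, with a made-up adjacency matrix: scores between unconnected nodes are set to $-\infty$ before the softmax, so their attention weights become zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))                        # one vector per node
adj = np.array([[1, 1, 0, 0],                          # example adjacency matrix
                [1, 1, 1, 0],                          # (1 = edge, incl. self-loops)
                [0, 1, 1, 1],
                [0, 0, 1, 1]])

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T
scores = np.where(adj == 1, scores, -np.inf)           # only attend to connected nodes
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True) # masked entries get weight 0
B = weights @ V
```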
To learn more
Self-attention was first used in the Transformer, so "Transformer" is often used to refer to self-attention itself; the many later variants of self-attention are likewise named xx-former.