Sophisticated Input
What is the output?
- Every vector has a label. (Sequence Labeling) (e.g. predicting the part of speech of each input word)
- The whole sequence has a label. (e.g. Sentiment analysis: given a review, decide whether its sentiment is positive or negative)
- The model decides the number of labels itself. (seq2seq) (e.g. translation)
Sequence Labeling
- The following uses POS tagging as an example. We could use a fully-connected network (FC) to predict each word's part of speech, but there is an obvious problem: for the sentence "I saw a saw", the same FC can never output different POS tags for the two occurrences of "saw".
Is it possible to consider the context?
- The FC can consider the neighbors (use a window to take some context into account).
How to consider the whole sequence?
- Use a window that covers the whole sequence?
- Input sequences have variable length; sizing the window to the longest possible sequence means the FC needs far more parameters, which increases computation and makes overfitting more likely.
- Solution: Self-attention!
Self-attention
- As shown in the figure below, self-attention first produces one output vector per input (the vectors with black borders in the figure), and each of these vectors takes the whole sequence into account; these sequence-aware vectors are then fed into an FC to obtain the final output.
- Self-attention and FC layers can also be applied alternately.
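A rough sketch of this interleaving (assuming PyTorch; the dimensions, the single attention head, and the ReLU nonlinearity are illustrative choices, not from the notes):

```python
import torch
import torch.nn as nn

# A toy block that alternates self-attention and a fully-connected layer.
# All sizes here are arbitrary placeholders.
class AttentionFCBlock(nn.Module):
    def __init__(self, dim=16, num_heads=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                     # x: (batch, seq_len, dim)
        out, _ = self.attn(x, x, x)           # self-attention: query = key = value = x
        return torch.relu(self.fc(out))       # FC applied to every position

x = torch.randn(1, 4, 16)                     # a sequence of 4 vectors
y = AttentionFCBlock()(x)
print(y.shape)                                # torch.Size([1, 4, 16])
```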
- The best-known use of self-attention is the Transformer: "Attention Is All You Need".
Self-attention
- The inputs to self-attention are vectors $a^i$, which can be the raw inputs or the outputs of a hidden layer; the outputs are vectors $b^i$, and every $b^i$ takes all of the inputs $a^j\ (1 \leq j \leq 4)$ into account.
How is the output $b^1$ computed?
- Find the relevant vectors in the sequence: to produce the first output $b^1$, we first want to find all the input vectors that are relevant to $a^1$; the relevance between each pair of vectors is represented by an attention score $\alpha$.
How is the attention score $\alpha$ computed?
- Method 1 (assumed by default in the rest of these notes, and also the most common): Dot-product
- Method 2: Additive
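A minimal NumPy sketch of the two scoring functions; the vector dimension, the weight matrices, and the particular additive form shown (one common variant) are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # dimension of a^i (placeholder)
a1, a2 = rng.standard_normal(d), rng.standard_normal(d)
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Method 1: dot-product -- alpha = (W^q a^1) . (W^k a^2)
q, k = Wq @ a1, Wk @ a2
alpha_dot = q @ k

# Method 2: additive -- one common form: alpha = w^T tanh(W^q a^1 + W^k a^2)
w = rng.standard_normal(d)
alpha_add = w @ np.tanh(Wq @ a1 + Wk @ a2)
```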
Computing the attention score with the dot product
- In the dot-product method, each input vector is multiplied by $W^q$ and $W^k$ to get a query and a key, e.g. $q^1 = W^q a^1$ and $k^j = W^k a^j$, and the attention score is their dot product, $\alpha_{1,j} = q^1 \cdot k^j$.
- The relevance of $a^1$ with itself is usually computed as well: compute $k^1$ and take its dot product with $q^1$ to obtain $\alpha_{1,1}$.
- After computing the relevance between $a^1$ and every vector, apply a softmax:
$\alpha_{1,i}' = \exp(\alpha_{1,i}) / \sum_j \exp(\alpha_{1,j})$
- Softmax is not mandatory here; other activation functions, e.g. ReLU, also work.
Extract information based on attention scores
- Take a weighted sum based on the attention scores (the pairwise relevance) to extract information from the whole sequence:
$b^1 = \sum_i \alpha_{1,i}' v^i$, where $v^i = W^v a^i$
- Note that $b^1 \sim b^4$ can be computed in parallel.
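Putting the steps for one query together, a minimal NumPy sketch (the dimensions, the random weights, and the row-stacking of the inputs are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 4                                         # vector dim, sequence length (placeholders)
a = rng.standard_normal((n, d))                     # a^1 ... a^4 as rows
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

q1 = Wq @ a[0]                                      # query from a^1
k = a @ Wk.T                                        # row j is k^j = W^k a^j
v = a @ Wv.T                                        # row j is v^j = W^v a^j

alpha = k @ q1                                      # alpha_{1,j} = q^1 . k^j (includes j = 1)
alpha_prime = np.exp(alpha) / np.exp(alpha).sum()   # softmax over j
b1 = alpha_prime @ v                                # b^1 = sum_j alpha'_{1,j} v^j
```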
Implementing self-attention with matrix operations
Computing $q$, $k$, $v$ (matrix form)
- Stack the input vectors into a single input matrix; multiplying it by $W^q$ / $W^k$ / $W^v$ directly gives all of the corresponding $q$, $k$, $v$ at once.
Computing the attention scores
Computing the output $b$
Summary
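A compact NumPy sketch of the whole pipeline in matrix form; here the input vectors are stacked as rows (so row $i$ of the output is $b^i$), which may differ from the column convention used in the figures:

```python
import numpy as np

def self_attention(A, Wq, Wk, Wv):
    """A: (n, d) input matrix with one vector a^i per row; returns one b^i per row."""
    Q, K, V = A @ Wq, A @ Wk, A @ Wv                # all queries, keys, values at once
    scores = Q @ K.T                                # (n, n) matrix of scores alpha_{i,j}
    # (The Transformer additionally divides the scores by sqrt(d); omitted here.)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax -> alpha'
    return weights @ V                              # row i is b^i = sum_j alpha'_{i,j} v^j

rng = np.random.default_rng(0)
n, d = 4, 8                                         # sequence length and dimension (placeholders)
A = rng.standard_normal((n, d))
B = self_attention(A, *(rng.standard_normal((d, d)) for _ in range(3)))
print(B.shape)                                      # (4, 8): b^1 ... b^4 in parallel
```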
Multi-head Self-attention (Different types of relevance)
2 heads as an example
- Different sets of $q$, $k$, $v$ are responsible for capturing different kinds of relevance.
$q^i = W^q a^i, \qquad q^{i,1} = W^{q,1} q^i, \qquad q^{i,2} = W^{q,2} q^i$
- $k$ and $v$ are computed in the same way as $q$.
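A rough two-head sketch following the formulas above; the per-head dimensions and the final projection $W^o$ that merges the heads are assumptions not spelled out in these notes:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, heads = 4, 8, 2
A = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = A @ Wq, A @ Wk, A @ Wv                  # rows are q^i, k^i, v^i

outputs = []
for h in range(heads):
    # per-head projections: q^{i,h} = W^{q,h} q^i (similarly for k, v);
    # head dimension d // heads is a common choice
    Wqh, Wkh, Wvh = (rng.standard_normal((d, d // heads)) for _ in range(3))
    Qh, Kh, Vh = Q @ Wqh, K @ Wkh, V @ Wvh
    outputs.append(softmax_rows(Qh @ Kh.T) @ Vh)  # each head attends independently

b = np.concatenate(outputs, axis=1)               # concatenate the heads: rows are b^i
Wo = rng.standard_normal((d, d))                  # extra output projection (assumption)
b = b @ Wo
```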
Positional Encoding
- No position information in self-attention (the self-attention described so far ignores positions: vectors far apart are treated just like immediate neighbors).
- $\Rightarrow$ Each position has a unique positional vector $e^i$ (hand-crafted or learned from data).
Visualizing the hand-crafted $e^i$
- Each column represents a positional vector $e^i$ (these are the positional vectors used in the paper "Attention Is All You Need").
There are many different ways to generate positional encodings.
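One concrete hand-crafted choice is the sinusoidal encoding from "Attention Is All You Need"; a short NumPy sketch (the sequence length and model dimension are placeholders):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """e^i[2k] = sin(i / 10000^(2k/d)), e^i[2k+1] = cos(i / 10000^(2k/d))."""
    pos = np.arange(seq_len)[:, None]            # positions i
    k = np.arange(0, d_model, 2)[None, :]        # even dimensions
    angle = pos / (10000 ** (k / d_model))
    e = np.zeros((seq_len, d_model))
    e[:, 0::2] = np.sin(angle)
    e[:, 1::2] = np.cos(angle)
    return e                                     # row i is the positional vector e^i

# The positional vector is simply added to the input: a^i + e^i
e = sinusoidal_positional_encoding(seq_len=4, d_model=8)
```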
Applications
Widely used in Natural Language Processing (NLP)!
Bert is a yellow Muppet character on the long-running PBS and HBO children’s television show Sesame Street.
Self-attention for Speech
Self-attention for Image
Performance
Self-attention vs. CNN
- CNN: self-attention that can only attend within a receptive field.
- Self-attention: CNN with a learnable receptive field (self-attention considers information from the whole image rather than being limited to a receptive field, effectively learning a very flexible receptive field on its own).
- CNN is a simplified version of self-attention; self-attention is a more complex version of CNN.
Self-attention vs. RNN
RNN can largely be replaced by self-attention nowadays.
(Figure: comparison of a Recurrent Neural Network (RNN) and self-attention)
Self-attention for Graph
- Consider the edges: only attend to connected nodes (when computing attention scores, only pairs of nodes joined by an edge are considered, i.e. only the blue entries of the attention matrix in the figure are computed; a code sketch follows after this list).
- This is one type of Graph Neural Network (GNN)
Each node is a vector; the graph could, for example, be a social network.
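A NumPy sketch of this masking idea, with a made-up adjacency matrix: scores between unconnected nodes are set to $-\infty$ before the softmax, so their attention weights become zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))                        # one vector per node
adj = np.array([[1, 1, 0, 0],                          # example adjacency matrix
                [1, 1, 1, 0],                          # (1 = edge, incl. self-loops)
                [0, 1, 1, 1],
                [0, 0, 1, 1]])

Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T
scores = np.where(adj == 1, scores, -np.inf)           # only attend to connected nodes
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True) # masked entries get weight 0
B = weights @ V
```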
To learn more
Self-attention was first used in the Transformer, so "Transformer" is often used to refer to self-attention itself; the many later variants of self-attention are likewise named xx-former.