Abstract

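Weakly supervised methods usually generate attention maps as the basis for the segmentation task, but an attention map tends to cover only the most discriminative region. We observe that the network actually attends to other parts of the object at different stages of training; in other words, the attention map keeps changing. Would accumulating the attention maps from different training stages give a better result? This motivates the online attention accumulation (OAA) strategy proposed in the paper.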

1. Introduction

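I will skip the usual boilerplate about the WSSS (weakly supervised semantic segmentation) task; see my earlier posts if you are interested. Here is the authors' analysis of why the discriminative regions of a classification network keep shifting across training stages: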

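A strong classification network tends to look for a common pattern for each category so that it can recognize every image of that category. Hard training samples therefore push the network to revise the pattern it is looking for, so the attention regions keep changing until the network converges.
During training, the attention maps produced by the attention model are influenced by previously seen images. Both the content of the images and the order in which training images are fed can change the discriminative regions in the intermediate attention maps.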

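Although OAA accumulates the discriminative regions from different training stages and makes the mask more complete, the attention on regions that used to be highly discriminative is no longer as strong as with CAM. The paper therefore designs a hybrid loss function (the combination of an enhanced loss and a constraint loss) that uses the cumulative attention maps as soft labels to train an integral attention model.
In summary, the contributions of the paper are: 1. the OAA strategy; 2. the hybrid loss function; 3. experiments.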

2. Related Work

3. Methodology

3.1. Attention Generation

We generate attention maps with CAM, that is, from the class-specific feature maps output by the last convolutional layer. The basic structure is shown in Fig. 2.
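We use VGG-16 as the backbone. The predicted probability of target category $c$ is computed as:
$p^c=\sigma(\mathrm{GAP}(F^c))$
where $F$ is the output of the class-aware convolutional layer, $\mathrm{GAP}$ is global average pooling, and $\sigma(\cdot)$ is the sigmoid function. The whole network is trained with a cross-entropy loss. Given an image $I$, to obtain the attention maps we first pass the feature map $F$ through a ReLU layer and then normalize it so that the attention values lie in [0,1]:
$A^c=\frac{\mathrm{ReLU}(F^c)}{\max(F^c)}$
The attention maps generated at different stages can then be fed into the OAA process.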
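The two formulas above can be sketched in a few lines of plain Python. This is only an illustration on a toy feature map, not the authors' implementation: the helper names (`class_score`, `attention_map`), the toy values, and the handling of a map with no positive response are all assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def class_score(feature_map):
    """p^c = sigmoid(GAP(F^c)) for one class-specific 2-D feature map."""
    values = [v for row in feature_map for v in row]
    return sigmoid(sum(values) / len(values))

def attention_map(feature_map):
    """A^c = ReLU(F^c) / max(F^c), so every value lies in [0, 1]."""
    peak = max(v for row in feature_map for v in row)
    if peak <= 0:
        # No positive response: ReLU zeroes everything.
        # (Edge-case handling is our assumption, not from the paper.)
        return [[0.0 for _ in row] for row in feature_map]
    return [[max(v, 0.0) / peak for v in row] for row in feature_map]

# Toy 2x3 feature map for a single class (hypothetical values).
F_c = [[-1.0, 2.0, 4.0],
       [0.5, 0.0, 1.0]]
A_c = attention_map(F_c)   # negative responses become 0, the peak becomes 1
p_c = class_score(F_c)     # image-level probability of class c
```

Dividing by the maximum after the ReLU is what guarantees the [0,1] range, which matters later because the accumulated maps are reused as per-pixel soft labels.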

3.2. Online Attention Accumulation

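Each time a training image is fed to the network in a different epoch, OAA combines the attention maps that the classification model generates for it. As shown in Fig. 2, for each target category $c$ of a given training image $I$ we maintain a cumulative attention map $M^c$ that stores the discriminative regions discovered so far. OAA first initializes the cumulative attention map $M_1$ with the attention map $A_1$ of category $c$ from the first epoch. Then, when the image is fed to the network a second time, OAA updates the cumulative attention map by combining $M_1$ with the newly generated attention map $A_2$:
$M_2=AF(M_1,A_2)$
where $AF(\cdot)$ denotes the attention fusion strategy. Likewise, in the $t$-th epoch:
$M_t=AF(M_{t-1},A_t)$
OAA keeps repeating this process. The attention fusion strategy $AF(\cdot)$ itself is simple: it is an element-wise maximum, which keeps the larger attention value of $A_t$ and $M_{t-1}$ at each pixel:
$M_t=AF(M_{t-1},A_t)=\max(M_{t-1},A_t)$
As Fig. 5 shows, OAA yields more complete regions.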
We also tried replacing the maximum fusion strategy with an averaging fusion strategy, but mIoU dropped by 1.6%.
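The accumulation with maximum fusion can be sketched as follows. The function names and the three toy attention maps are hypothetical and exist only to illustrate the update $M_t=\max(M_{t-1},A_t)$; a real pipeline would keep one such map per image and per class.

```python
def fuse_max(cumulative, current):
    """M_t = max(M_{t-1}, A_t), taken element-wise."""
    return [[max(m, a) for m, a in zip(m_row, a_row)]
            for m_row, a_row in zip(cumulative, current)]

def accumulate(attention_maps):
    """Run OAA over the attention maps of one image and class, one map per epoch."""
    cumulative = None
    for current in attention_maps:
        if cumulative is None:
            cumulative = [row[:] for row in current]   # M_1 = A_1
        else:
            cumulative = fuse_max(cumulative, current)
    return cumulative

# Hypothetical attention maps of the same image at three epochs:
# the peak moves from the left part of the object to the right part.
epochs = [
    [[1.0, 0.2, 0.0]],
    [[0.3, 1.0, 0.1]],
    [[0.0, 0.4, 1.0]],
]
M = accumulate(epochs)   # every location that was ever strongly attended stays high
```

Because each $A_t$ lies in [0,1], the cumulative map also stays in [0,1], and its values can only grow from one epoch to the next.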

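There is still one problem: early in training the classification model may be weak, so the attention maps it generates may be noisy. The authors therefore use the predicted probability of the target category to decide whether to use the current attention map. That is, the attention map is accumulated only if the score of the target category is clearly higher than the scores of the non-target categories; otherwise it is discarded to avoid noise.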
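A minimal sketch of this filtering step is shown below. The exact criterion is not given here, so the rule used in `is_reliable` (the target score must be strictly greater than every non-target score) is an assumption, as are the function names and the toy scores.

```python
def is_reliable(scores, target, labels):
    """True if the target class outscores every class absent from the image.

    `labels` is the set of classes present in the image (image-level labels).
    The strict "greater than every non-target score" rule is an assumption.
    """
    non_target = [s for c, s in enumerate(scores) if c not in labels]
    return all(scores[target] > s for s in non_target)

def update(cumulative, current, scores, target, labels):
    """Fuse `current` into `cumulative` only when the prediction looks reliable."""
    if not is_reliable(scores, target, labels):
        return cumulative                      # discard the possibly noisy map
    if cumulative is None:
        return [row[:] for row in current]     # first accepted map initializes M
    return [[max(m, a) for m, a in zip(m_row, a_row)]
            for m_row, a_row in zip(cumulative, current)]

labels = {0}          # the image contains only class 0
M = None
# Early epoch: a non-target class scores higher, so the map is skipped.
M = update(M, [[0.9, 0.1]], scores=[0.2, 0.6, 0.7], target=0, labels=labels)
# Later epoch: the target class dominates, so the map is accepted.
M = update(M, [[0.2, 0.8]], scores=[0.9, 0.1, 0.3], target=0, labels=labels)
```

A stricter variant could require a margin between the target score and the best non-target score; which threshold works best would have to be checked experimentally.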

3.3. Towards Integral Attention Learning

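The drawback of OAA is that some object regions have low attention values and are not sufficiently strengthened by the classification model. To address this, we propose a new loss function that uses the cumulative attention maps as supervision to train an attention module and further improve OAA; we call the result $OAA^+$.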

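Specifically, we treat the cumulative attention maps as soft labels: each attention value is regarded as the probability that the location belongs to the target category. The integral attention model is the classification network of Fig. 2 with the GAP layer and the classification loss removed. Given the score map $\hat F$ produced by the class-aware convolutional layer, the probability that location $j$ belongs to category $c$ is $q_j^c=\sigma(\hat F_j^c)$, where $\sigma$ is the sigmoid function. The multi-label cross-entropy loss for category $c$ is then:
$-\frac{1}{|N|}\sum_{j\in N}(p_j^c\log(q_j^c)+(1-p_j^c)\log(1-q_j^c))$
where $N$ is the set of locations and $p_j^c$ is the value of the normalized cumulative attention map at location $j$. This gives us strengthened cumulative attention maps. However, with this cross-entropy loss the attention maps tend to cover only part of the object, because the loss prefers to push pixels with low class-specific attention values ($p_j^c<1-p_j^c$) toward the background of category $c$.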
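Why the plain cross-entropy suppresses weakly attended pixels can be checked numerically. For a single pixel, setting the derivative of $-(p\log q+(1-p)\log(1-q))$ with respect to $q$ to zero gives $q=p$, so a pixel with a low soft label is pulled toward a low prediction. The sketch below, with hypothetical helper names and toy values, shows this for $p=0.2$.

```python
import math

def soft_bce(p, q, eps=1e-12):
    """Per-pixel cross-entropy -(p log q + (1 - p) log(1 - q)) with soft label p."""
    q = min(max(q, eps), 1.0 - eps)   # clamp so that log() stays finite
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))

def mean_soft_bce(P, Q):
    """Average of soft_bce over all positions of two same-shaped 2-D maps."""
    pairs = [(p, q) for p_row, q_row in zip(P, Q) for p, q in zip(p_row, q_row)]
    return sum(soft_bce(p, q) for p, q in pairs) / len(pairs)

p = 0.2                       # weak accumulated attention on an object pixel
low = soft_bce(p, 0.2)        # prediction stays low
high = soft_bce(p, 0.8)       # prediction tries to highlight the pixel
# low < high: the loss penalizes highlighting a pixel whose soft label is small
```

This is the behavior the hybrid loss is designed to avoid: object pixels that only received weak attention during accumulation should not be actively pushed toward the background.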

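We therefore propose a hybrid loss. Given a cumulative attention map, we first split it into soft enhance regions $N_+^c$ and soft constraint regions $N_-^c$: $N_-^c$ contains the pixels with $p_j^c=0$ and $N_+^c$ contains all other pixels. For $N_+^c$, we drop the last term of the cross-entropy loss above, $(1-p_j^c)\log(1-q_j^c)$, so that attention regions are promoted further while regions with low attention values are not suppressed. The loss becomes:
$\mathcal L_+^c=-\frac{1}{|N_+^c|}\sum_{j\in N_+^c}p_j^c\log(q_j^c)$
Because we only have image-level annotations, the attention regions in the cumulative attention map often contain non-target pixels, which is why the equation above uses $p_j^c$ rather than 1 as the label. For $N_-^c$, where $p_j^c=0$, the first term of the cross-entropy vanishes and the loss is:
$\mathcal L_-^c=-\frac{1}{|N_-^c|}\sum_{j\in N_-^c}\log(1-q_j^c)$
The total loss is the sum of the two:
$\mathcal L=\sum_{c\in C}(\mathcal L_+^c+\mathcal L_-^c)$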
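A minimal sketch of the hybrid loss for one class is given below. It takes $\mathcal L_-^c$ as the mean of $-\log(1-q_j^c)$ over the pixels with $p_j^c=0$, which is what the cross-entropy reduces to when the soft label is zero. The function name, the flattened-list inputs, and the `eps` clamp are assumptions made for illustration.

```python
import math

def hybrid_loss(P, Q, eps=1e-12):
    """L_+ + L_- for one class.

    P: soft labels (normalized cumulative attention), flat list in [0, 1].
    Q: predicted probabilities sigmoid(F_hat), flat list in (0, 1).
    """
    enhance = [(p, q) for p, q in zip(P, Q) if p > 0]     # N_+: pixels with p > 0
    constraint = [q for p, q in zip(P, Q) if p == 0]      # N_-: pixels with p = 0

    loss_pos = 0.0
    if enhance:
        # Only the p * log(q) term: low-p pixels are never pushed down.
        loss_pos = -sum(p * math.log(max(q, eps)) for p, q in enhance) / len(enhance)

    loss_neg = 0.0
    if constraint:
        # Pixels never attended during accumulation are pushed toward 0.
        loss_neg = -sum(math.log(max(1.0 - q, eps)) for q in constraint) / len(constraint)

    return loss_pos + loss_neg

# Total loss: sum over the classes present in the image (toy values).
P = {0: [0.5, 0.0], 1: [0.2, 1.0]}
Q = {0: [0.5, 0.5], 1: [0.9, 0.9]}
total = sum(hybrid_loss(P[c], Q[c]) for c in P)
```

Unlike the plain cross-entropy, raising the prediction on a pixel in $N_+^c$ never increases this loss, even when its soft label is small, so weakly attended object regions can be strengthened rather than suppressed.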
Fig. 5 shows the effect.