Rigging the Lottery: Making All Tickets Winners
authors: Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen
1. References
2. Abstract
There is a large body of work on training dense networks to yield sparse networks for inference, but this limits the size of the largest trainable sparse model to that of the largest trainable dense model.
→ Much prior work trains a dense network and then sparsifies it for inference, but this caps the largest trainable sparse model at the size of the largest trainable dense model.
In this paper we introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods.
→ This paper trains sparse networks with a fixed parameter count and a fixed computational cost throughout training, without losing accuracy relative to existing dense-to-sparse training methods.
We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques.
→ Fewer FLOPs are needed to reach the same accuracy.
3. 正文
[1] Introduction
Multiple works have shown inference time speedups are possible using sparsity for both Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (ConvNets). Currently, the most accurate sparse models are obtained with techniques that require, at a minimum, the cost of training a dense model in terms of memory and FLOPs, and sometimes significantly more.
→ Sparsity has been shown to speed up inference for both RNNs and ConvNets, but today's most accurate sparse models are obtained by first training a dense model, which sometimes requires even more memory and compute than ordinary dense training.
This paradigm has two main limitations.
First, the maximum size of sparse models is limited to the largest dense model that can be trained; even if sparse models are more parameter efficient, we can’t use pruning to train models that are larger and more accurate than the largest possible dense models.
Second, it is inefficient; large amounts of computation must be performed for parameters that are zero valued or that will be zero during inference.
Additionally, it remains unknown if the performance of the current best pruning algorithms is an upper bound on the quality of sparse models. Gale et al. (2019) found that three different dense-to-sparse training algorithms all achieve about the same sparsity / accuracy trade-off. However, this is far from conclusive proof that no better performance is possible.
→ Limitations of the existing paradigm:
→ 1. Since sparse models are derived from dense ones, the sparse model size is bounded by the largest trainable dense model; pruning cannot yield models larger and more accurate than the largest possible dense model.
→ 2. Training is inefficient: computation is still spent on parameters that are, or will become, zero.
→ 3. It is unknown whether current pruning algorithms give an upper bound on the quality of sparse models.
We introduce a new method for training sparse models without the need of a “lucky” initialization; for this reason, we call our method “The Rigged Lottery” or RigL.
→ Compared with the Lottery Ticket Hypothesis, RigL does not rely on luck: the lottery-ticket approach must happen to pick a winning subnetwork at initialization, whereas RigL does not, hence the name "The Rigged Lottery".
We make the following specific contributions:
We introduce RigL - an algorithm for training sparse neural networks while maintaining memory and computational cost proportional to density of the network.
We perform an extensive empirical evaluation of RigL on computer vision and natural language tasks. We show that RigL achieves higher quality than all previous techniques for a given computational cost.
We show the surprising result that RigL can find more accurate models than the current best dense-to-sparse training algorithms.
We study the loss landscape of sparse neural networks and provide insight into why allowing the topology of nonzero weights to change over the course of training aids optimization.
→ Main contributions:
- RigL's memory and computational cost are proportional to the density of the network.
- RigL is validated extensively on computer-vision and natural-language tasks.
- RigL finds more accurate models than existing sparse-training (and dense-to-sparse) algorithms.
- The paper provides insight into why allowing the topology of the nonzero weights to change during training aids optimization.
[2] Related Work
- Thimm & Fiesler (1995): concluded that pruning weights based on magnitude was a simple and powerful technique.
- Strom (1997): introduced the idea of retraining the previously pruned network to increase accuracy.
- Han et al. (2016b): introduced multiple rounds of magnitude pruning and retraining (**relatively inefficient**).
- Narang et al. (2017): introduced gradual pruning, where connections are slowly removed over the course of a single round of training.
A diversity of approaches not based on magnitude pruning have also been proposed.
Pruning not based on magnitude: Hessian-based criteria, $L_0$ regularization.
- Structured pruning methods: remove entire neurons or channels; the result is a smaller dense network with reduced computation, but RigL needs even less compute.
- Sparsity throughout the entire training, first introduced in DeepR: connections are randomly assigned a sign at initialization; when the optimizer would flip that sign, the weight is set to zero and a new connection is randomly activated.
- Sparse Evolutionary Training (SET): prunes weights by standard magnitude pruning and adds new connections back at random; simple, and works well in practice.
- Dynamic Sparse Reparameterization (DSR): allows the parameter budget to be shifted between layers, permitting non-uniform sparsity. Parameters can be allocated where they are most useful, but they tend to concentrate in the convolutional layers, which increases the FLOP count.
- Sparse Networks from Scratch (SNFS): uses the momentum of each parameter as the criterion for growing connections, which improves test accuracy; it shares DSR's problem and additionally must compute momentum and gradients for zero-valued parameters, greatly increasing overall compute, while storing the momentum consumes a large amount of memory.
- Single-Shot Network Pruning (SNIP): finds an initial mask with one-shot pruning, using a per-parameter saliency score to decide which parameters to keep; after pruning, training continues with this static sparse network.
[3] Rigging the Lottery
RigL starts with a random sparse network, and at regularly spaced intervals it removes a fraction of connections based on their magnitudes and activates new ones using instantaneous gradient information. After updating the connectivity, training continues with the updated network until the next update. The main parts of our algorithm are the Sparsity Distribution, the Update Schedule, the Drop Criterion, and the Grow Criterion, together with the various options considered for each.
→ RigL starts from a random sparse network; at regular intervals it removes a fraction of connections by magnitude and activates new ones using instantaneous gradient information, then training continues with the updated connectivity until the next update.
→ The main components are the sparsity distribution, the update schedule, the drop criterion, and the grow criterion.
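Before the individual components are described, here is a minimal sketch of the outer loop this implies. The hyperparameter names (`delta_t`, `t_end`) follow the notation introduced later in this section, the values are placeholders, and the drop and grow steps are only indicated by comments; they are sketched in NumPy under their own headings below.

```python
# Minimal sketch of RigL's outer training loop (not the authors' released code).
delta_t, t_end, total_steps = 100, 75_000, 100_000   # placeholder values

for t in range(total_steps):
    # ... one SGD step on the currently active (nonzero) weights ...
    if t % delta_t == 0 and t < t_end:
        # drop a decaying fraction of active weights by magnitude, then grow
        # the same number of inactive weights where the gradient magnitude is largest
        pass
```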
(0) Notation
| Symbol | Meaning |
|---|---|
| $D$ | dataset |
| $x_i$ | samples |
| $y_i$ | targets |
| $\sum_i L(f_\Theta(x_i), y_i)$ | loss function |
| $f_\Theta(\cdot)$ | neural network with parameters $\Theta$ |
| $\Theta \in \mathbb{R}^N$ | dense parameters; the parameters of the $l$-th layer are denoted $\Theta^l$ |
| $s^l \in (0,1)$ | sparsity of the $l$-th layer |
| $\theta^l \in \mathbb{R}^{(1-s^l)N^l}$ | parameters of the sparse $l$-th layer |
| $S = \frac{\sum_l s^l N^l}{N}$ | overall sparsity |
- Attention: note the difference between $\theta$ and $\Theta$: the uppercase symbol denotes the dense parameters, the lowercase one the sparse parameters.
(1) Sparsity Distribution
→ Parameters are not redistributed across layers during training, to keep FLOP accounting simple. Three distributions are considered:
Uniform: The sparsity $s^l$ of each individual layer is equal to the total sparsity $S$.
→ The first layer is kept dense (sparsifying it neither improves performance nor meaningfully shrinks the network); every other layer gets the overall sparsity.
Erdos-Renyi: $s^l$ scales with $1-\frac{n^{l-1}+n^l}{n^{l-1}\cdot n^l}$, where $n^l$ denotes the number of neurons at layer $l$.
→ This lets the number of connections in a sparse layer scale with its number of input and output channels.
Erdos-Renyi-Kernel (ERK): This method modifies the original Erdos-Renyi formulation by including the kernel dimensions in the scaling factors. In other words, the number of parameters of the sparse convolutional layers is scaled proportional to……
→ The full formula is more involved; in essence it is the Erdos-Renyi rule with the convolution kernel dimensions included, and it coincides with Erdos-Renyi for fully-connected layers. ERK assigns higher sparsity to layers with more parameters and lower sparsity to small layers (a simplified allocation sketch follows after this list).
→ In all of the above, biases and batch-norm parameters are kept dense, since their count scales with the number of neurons and their contribution to the total network size is negligible.
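As a concrete illustration of the Erdos-Renyi-Kernel idea, the simplified NumPy sketch below allocates per-layer sparsities for a few made-up layer shapes. The shapes, the target sparsity, and the one-step normalization are all assumptions for illustration; the real implementation also redistributes the budget when a layer's density gets capped at 1.

```python
import numpy as np

# Toy layer shapes: (in_ch, out_ch, kh, kw) for conv layers, (in, out) for the last dense layer.
shapes = [(3, 64, 3, 3), (64, 128, 3, 3), (128, 10)]
n_params = np.array([np.prod(s) for s in shapes], dtype=float)

# ER(K) scale: density proportional to sum(dims) / prod(dims), so small layers stay denser.
raw_density = np.array([sum(s) / np.prod(s) for s in shapes])

target_sparsity = 0.9
# Choose eps so the total number of kept parameters matches the target overall sparsity.
eps = (1.0 - target_sparsity) * n_params.sum() / (raw_density * n_params).sum()
density = np.minimum(1.0, eps * raw_density)   # real ERK re-normalizes when a layer hits density 1
print(1.0 - density)                           # per-layer sparsities s^l
```

In this toy example the large middle convolution becomes very sparse while the small output layer hits the density cap and stays dense, matching the note above that ERK gives small layers lower sparsity.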
(2) Update Schedule
The schedule is determined by the following parameters:
- $\Delta T$: the number of iterations between two connectivity updates.
- $T_{end}$: the iteration at which connectivity updates stop.
- $\alpha$: the initial fraction of connections updated.
- $f_{decay}$: the function invoked at every update to decay the fraction of updated connections over time; cosine annealing works better than the alternatives:
$$f_{decay}(t;\alpha,T_{end})=\frac{\alpha}{2}\left(1+\cos\left(\frac{t\pi}{T_{end}}\right)\right)$$
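A direct transcription of this decay function in Python; the hyperparameter values in the usage lines are illustrative:

```python
import math

def f_decay(t, alpha, t_end):
    """Cosine annealing of the fraction of connections updated at iteration t."""
    return (alpha / 2.0) * (1.0 + math.cos(t * math.pi / t_end))

print(f_decay(0, 0.3, 75_000))       # 0.3: the initial update fraction alpha
print(f_decay(75_000, 0.3, 75_000))  # 0.0: no connections are updated by T_end
```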
(3) Drop Criterion
At each update, layer $l$ drops the connections given by $ArgTopK(-|\theta^l|, f_{decay}(t;\alpha,T_{end})(1-s^l)N^l)$, i.e., the active connections with the smallest magnitude are pruned.
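A small NumPy sketch of this drop step on a single layer; `theta` is the layer's weight tensor, `mask` its boolean connectivity mask, and `n_drop` the number of connections to remove (the function name and signature are mine, not the paper's code):

```python
import numpy as np

def drop_smallest_magnitude(theta, mask, n_drop):
    """ArgTopK(-|theta|, n_drop) over the active weights: remove the n_drop
    active connections with the smallest magnitude from the mask."""
    scores = np.where(mask, -np.abs(theta), -np.inf)          # inactive weights are never candidates
    drop_idx = np.argsort(scores, axis=None)[::-1][:n_drop]   # top-k of -|theta| == smallest |theta|
    new_mask = mask.copy().ravel()
    new_mask[drop_idx] = False
    return new_mask.reshape(mask.shape)
```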
(4) Grow Criterion
We grow the connections with the highest magnitude gradients. Newly activated connections are initialized to zero and therefore don't affect the output of the network. However, they are expected to receive gradients with high magnitudes in the next iteration and therefore reduce the loss fastest. We attempted other initializations, such as random values or small values along the gradient direction, for the activated connections; however, zero initialization brought the best results.
→ New connections are grown where the gradient magnitude is largest; they are initialized to zero, so the network output is unchanged, and since they have the largest gradients they are expected to reduce the loss fastest at the next iteration. Zero initialization works better than random or small-value initialization.
This procedure can be applied to each layer in sequence and the dense gradients can be discarded immediately after selecting the top connections.
→ This procedure can be applied layer by layer, and the dense gradients can be discarded immediately after the top connections are selected. For very large layers the gradients can be computed online or only the top-k stored, and as long as $\Delta T > \frac{1}{1-s}$ the extra gradient computation is amortized, unlike methods that must compute and store all gradients at every optimization step.
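And a matching NumPy sketch of the grow step; `grad` is the dense gradient of the loss with respect to this layer's weights at the update step (again, the helper name and signature are assumptions):

```python
import numpy as np

def grow_highest_gradient(theta, grad, mask, n_grow):
    """Activate the n_grow inactive connections with the largest |grad|,
    initializing the newly grown weights to zero."""
    scores = np.where(mask, -np.inf, np.abs(grad))             # only inactive connections can be grown
    grow_idx = np.argsort(scores, axis=None)[::-1][:n_grow]    # highest gradient magnitude first
    new_theta, new_mask = theta.copy().ravel(), mask.copy().ravel()
    new_theta[grow_idx] = 0.0                                  # zero init: output unchanged at growth time
    new_mask[grow_idx] = True
    return new_theta.reshape(theta.shape), new_mask.reshape(mask.shape)
```

Calling the drop sketch above followed by this grow step with the same count keeps each layer's sparsity fixed, which is what keeps RigL's memory and FLOP cost constant during training.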
Summary of the RigL algorithm
[4] Empirical Evaluation
We observed that stopping the mask updates prior to the end of training yields slightly better performance; therefore, we set $T_{end}$ to 25k for ImageNet-2012 and 75k for CIFAR-10 training, which corresponds to roughly 3/4 of the full training.
→ Stopping mask updates early gives slightly better results, so updates stop at roughly 3/4 of training for both datasets.
The default number of training steps used for training dense networks might not be optimal for sparse training with dynamic connectivity. In our experiments we observe that sparse training methods benefit significantly from increased training steps. When increasing the training steps by a factor $M$, the anchor epochs of the learning rate schedule and the end iteration of the mask update schedule are also scaled by the same factor; we indicate this scaling with a subscript (e.g. $RigL_{M\times}$).
→ The default dense training budget is not ideal for sparse training with dynamic connectivity; increasing the number of training steps (and scaling the schedules accordingly) helps, denoted with the subscript $M\times$.
(1) ImageNet-2012
ResNet-50
MobileNet
Analysis
- Effect of the sparsity distribution: ERK automatically lowers the sparsity of layers with fewer parameters, which is important for preserving the capacity of highly sparse networks. ERK performs better than the other distributions but needs roughly twice the compute, a trade-off between accuracy and computational efficiency that also demonstrates the importance of non-uniform sparsity.
- Effect of the update schedule and frequency: accuracy is highest when the mask is updated every 100 iterations with an initial drop fraction of 0.3 or 0.5. Even with very infrequent updates (every 1000 iterations), RigL still reaches above 73.5% accuracy.
- Effect of dynamic connectivity: static sparse training gets stuck in local minima that are isolated from better solutions; allowing new connections to grow gives extra flexibility, and RigL can escape such local minima.
[5] Discussion and Conclusion
For a given computational budget, RigL is more accurate than existing dense-to-sparse and sparse-to-sparse training algorithms, and it is useful in three scenarios:
- improving the accuracy of sparse models intended for deployment;
- improving the accuracy of large sparse models that can only be trained for a limited number of steps;
- combining with sparse primitives to train sparse models that would be too large to train densely.
Appendix
Lottery Ticket Hypothesis: a randomly initialized dense neural network contains a subnetwork which, with its original initialization, can be trained in isolation to match the test accuracy of the original network after at most the same number of iterations.
The main idea is to show that sparse training is possible. It is like buying lottery tickets: only a few tickets win, the many substructures of a dense network form the ticket pool, and finding the winning tickets corresponds to finding good sparse structures.
In the original work, the winning tickets found were only 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR-10, while learning faster than the original network and reaching higher test accuracy.
Pruning: neural networks have many parameters, some of which contribute little to the final output and are therefore redundant; removing these redundant parameters is called pruning. Pruning reduces model size, speeds up inference, and can also help prevent overfitting.
Pruning can be one-shot or iterative (a small sketch of the iterative loop follows this list):
- One-shot pruning: train the model → score the importance of neurons (or kernels, or layers) → remove the least important ones → fine-tune → stop.
- Iterative pruning: train the model → score the importance of neurons (or kernels, or layers) → remove the least important ones → fine-tune → decide whether to keep pruning; if so, return to the scoring step, otherwise stop.
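A minimal NumPy sketch of the iterative magnitude-pruning loop described above; the weight matrix is random stand-in data, the schedule is a simple linear ramp, and the train / fine-tune stages are only indicated by a comment:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))                       # stand-in for a trained layer's weights
target_sparsity, rounds = 0.9, 5

for r in range(1, rounds + 1):
    s = target_sparsity * r / rounds                  # raise the sparsity level a bit each round
    k = int(s * w.size)                               # total number of weights that should be zero
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    w[np.abs(w) <= threshold] = 0.0                   # magnitude pruning (already-pruned weights stay zero)
    # ... fine-tune the remaining nonzero weights here, then decide whether to continue ...

print((w == 0.0).mean())                              # ~0.9 final sparsity
```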
Pruning can also be divided into structured and unstructured pruning:
- Structured pruning: removes entire kernels (or channels, or layers) as structural units;
- Unstructured pruning: considers every element of every kernel and deletes unimportant individual parameters; also called sparse pruning.