paper code
Abstract
demerits of self-attention (SA)
- treat images as 1D sequence and neglect 2D structure
- quadratic complexity is expensive for HR images
- achieve spatial adaptability but ignore channel adaptability
demerits of multilayer perceptron (MLP)
- sensitive to input size and only process fixed-size images
- consider global information but ignore local structure
contributions
- propose large kernel attention (LKA)
  local structure information, long-range dependence, adaptability in channel dimension
- present visual attention network (VAN) as backbone based on LKA
  SOTA performance with fewer parameters and FLOPs
Results of different models on ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar number of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.
Method
large kernel attention (LKA)
a large-kernel convolution brings a huge amount of computational overhead and parameters; solution: decompose the large-kernel convolution
Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid marks the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the figure.
a $K\times K$ large kernel convolution divided into 3 components (a minimal sketch follows below)
- a spatial local convolution: $(2d-1)\times(2d-1)$ depth-wise conv $\implies$ local contextual information
- a spatial long-range convolution: $\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil$ depth-wise dilated conv $\implies$ large receptive field
- a channel convolution: $1\times1$ conv $\implies$ adaptability in channel dimension
where $K$ is kernel size, $d$ is dilation
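A minimal PyTorch sketch of this decomposition for the default $K=21$, $d=3$ (giving a $5\times5$ DW-Conv, a $7\times7$ DW-D-Conv with dilation 3, and a $1\times1$ Conv); the layer choices and paddings are my own illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def decomposed_large_kernel(dim: int) -> nn.Sequential:
    """Approximate a 21x21 conv (K=21, d=3) with the three components above."""
    return nn.Sequential(
        # spatial local: (2d-1) x (2d-1) = 5x5 depth-wise conv
        nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim),
        # spatial long-range: ceil(K/d) x ceil(K/d) = 7x7 depth-wise conv with dilation 3
        nn.Conv2d(dim, dim, kernel_size=7, padding=9, dilation=3, groups=dim),
        # channel: 1x1 conv for adaptability in the channel dimension
        nn.Conv2d(dim, dim, kernel_size=1),
    )

x = torch.randn(1, 64, 56, 56)
assert decomposed_large_kernel(64)(x).shape == x.shape  # spatial size preserved
```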
Desirable properties belonging to convolution, self-attention and LKA.
write LKA module as
$$\begin{aligned} \mathrm{Attention}&=\mathrm{Conv}_{1\times1}(\mathrm{DW\text{-}D\text{-}Conv}(\mathrm{DW\text{-}Conv}(F))) \\ \mathrm{Output}&=\mathrm{Attention}\otimes F \end{aligned}$$
where $F\in\mathbb{R}^{C\times H\times W}$ is the input feature, $\mathrm{Attention}\in\mathbb{R}^{C\times H\times W}$ is the attention map
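A minimal sketch of the LKA module implied by the two equations above, assuming the same default $K=21$, $d=3$; layer names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Attention = Conv_1x1(DW-D-Conv(DW-Conv(F))); Output = Attention ⊗ F."""
    def __init__(self, dim: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                # local
        self.dw_d_conv = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)  # long-range
        self.pw_conv = nn.Conv2d(dim, dim, 1)                                       # channel

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(f)))
        return attn * f  # element-wise multiply: adaptive in both space and channel, no softmax

out = LKA(64)(torch.randn(2, 64, 28, 28))  # same shape as the input
```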
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). “CFF” means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.
computational complexity
assuming input and output have the same size $\mathbb{R}^{C\times H\times W}$
$$\begin{aligned} \mathrm{Param}&=\left\lceil\frac{K}{d}\right\rceil\times\left\lceil\frac{K}{d}\right\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C \\ \mathrm{FLOPs}&=\left(\left\lceil\frac{K}{d}\right\rceil\times\left\lceil\frac{K}{d}\right\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C\right)\times H\times W \end{aligned}$$
where $K$ is kernel size with default $K=21$, $d$ is dilation with default $d=3$
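A quick worked example of the Param formula, with a standard $K\times K$ convolution and a MobileNet-style depth-wise + point-wise decomposition for comparison; the channel widths below are illustrative, not the ones used in the paper's table.

```python
import math

def lka_decomposition_params(K: int = 21, d: int = 3, C: int = 64) -> int:
    """Param formula above (bias omitted)."""
    long_range = math.ceil(K / d) ** 2 * C  # depth-wise dilated conv
    local = (2 * d - 1) ** 2 * C            # depth-wise conv
    channel = C * C                         # 1x1 conv
    return long_range + local + channel

def standard_conv_params(K: int = 21, C: int = 64) -> int:
    return K * K * C * C                    # ordinary KxK conv, C -> C channels

def mobilenet_decomposition_params(K: int = 21, C: int = 64) -> int:
    return K * K * C + C * C                # depth-wise KxK conv + point-wise 1x1 conv

for C in (32, 64, 128):                     # example widths only
    print(C, standard_conv_params(C=C), mobilenet_decomposition_params(C=C),
          lka_decomposition_params(C=C))
```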
Comparison of parameters of different manners of implementing a $21\times21$ convolution: standard convolution, MobileNet decomposition, and our decomposition. The input and output features have the same size $H\times W\times C$. Note: bias is omitted to simplify the format.
architecture variants
The detailed setting for different versions of the VAN. “e.r.” represents expansion ratio in the feed-forward network.
Experiment
image classification
dataset: ImageNet-1K, with augmentation; optimizer: AdamW, batch size = 1024, 310 epochs, momentum = 0.9, weight decay = 5e-2, initial lr = 5e-4, warm-up, cosine decay
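A hedged sketch of this recipe with PyTorch's built-in optimizer and schedulers (AdamW's $\beta_1=0.9$ stands in for the "momentum = 0.9" above); the warm-up length and the model are placeholders, since the notes do not specify them.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Conv2d(3, 64, 7)   # placeholder for a VAN backbone
epochs, warmup_epochs = 310, 20     # warm-up length is an assumption

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,                        # init lr
    betas=(0.9, 0.999),             # beta1 = 0.9 plays the role of momentum
    weight_decay=5e-2,
)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs),  # warm-up
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs),         # cosine decay
    ],
    milestones=[warmup_epochs],
)
```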
Comparison with the state-of-the-art methods on ImageNet validation set. Params means parameters. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.
object detection and instance segmentation
frameworks: RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN; dataset: COCO 2017
Object detection on COCO 2017 dataset. #P means parameters. RetinaNet $1\times$ denotes models based on RetinaNet and trained for 12 epochs.
Object detection and instance segmentation on COCO 2017 dataset. #P means parameters. Mask R-CNN $1\times$ denotes models based on Mask R-CNN and trained for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP, respectively.
Comparison with the state-of-the-art vision backbones on COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size of $1280\times800$.
semantic segmentation
frameworks: Semantic FPN, UperNet; dataset: ADE20K
Results of semantic segmentation on ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.
ablation studies
architecture components
Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
- local structural information, long-range dependence, and adaptability in the channel dimension are all critical
- the attention mechanism helps the network achieve adaptive properties
kernel size and dilation
Ablation study of different kernel sizes in LKA. Acc(%) means Top-1 accuracy on ImageNet validation set.
key findings
- decomposing a $21\times21$ convolution works better than decomposing a $7\times7$ convolution $\implies$ a large kernel is critical for visual tasks (the component kernels for each ablated $K$ are worked out below)
- when decomposing a larger $35\times35$ convolution, the gain is not obvious compared with decomposing a $21\times21$ convolution
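For reference, the decomposition formulas give the following component kernel sizes for each ablated $K$ (assuming $d=3$ throughout, the stated default; the paper may tune $d$ per kernel size).

```python
import math

def decomposed_kernels(K: int, d: int = 3):
    """Component kernel sizes when a KxK conv is decomposed with dilation d."""
    local = 2 * d - 1               # depth-wise conv
    long_range = math.ceil(K / d)   # depth-wise dilated conv (dilation d)
    return local, long_range

for K in (7, 21, 35):
    local, long_range = decomposed_kernels(K)
    print(f"K={K}: {local}x{local} DW-Conv + {long_range}x{long_range} DW-D-Conv + 1x1 Conv")
```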
visualization
Visualization results. All images come from different categories in the ImageNet validation set. CAMs are produced using Grad-CAM. We compare the CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.
key findings
- the activation area is more accurate
- VAN shows obvious advantages when the object is dominant in the image $\implies$ ability to capture long-range dependence