Abstract
- process image with local attention mechanism
- capture long-range dependency with shifted window MSA
- outperforms state-of-the-art methods with fewer parameters
PSNR results vs. the total number of parameters of different methods for image SR ($\times4$) on Set5
Method
model architecture
 The architecture of the proposed SwinIR for image restoration.
shallow feature extraction
given LQ input $I_{LQ}\in\mathbb{R}^{H\times W\times C_{in}}$, extract shallow features $F_0\in\mathbb{R}^{H\times W\times C}$
$$F_0=H_{SF}(I_{LQ})$$
where $C$ is the feature channel number and $H_{SF}(\cdot)$ is a $3\times3$ conv layer
deep feature extraction
extract deep features $F_D\in\mathbb{R}^{H\times W\times C}$ from $F_0$
$$F_D=H_{DF}(F_0)$$
where $H_{DF}$ consists of $K$ RSTBs and a conv layer
specifically, intermediate features $F_1, F_2, \dots, F_K$ and output features $F_D$ are extracted block by block as
$$\begin{aligned} F_i&=H_{RSTB_i}(F_{i-1}),\quad i=1, 2, \dots, K \\ F_D&=H_{conv}(F_K) \end{aligned}$$
where $H_{RSTB_i}(\cdot)$ is the $i$-th RSTB and $H_{conv}$ is a $3\times3$ conv layer
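The block-by-block recursion above is plain function composition. The following is a minimal sketch of that control flow only, assuming each RSTB and the final conv are given as callables on a toy flat feature vector; the names `deep_feature_extraction`, `add_one`, and `double` are illustrative and not from the paper.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # toy stand-in for an H x W x C feature map
Block = Callable[[Feature], Feature]


def deep_feature_extraction(f0: Feature, rstbs: Sequence[Block], h_conv: Block) -> Feature:
    """Apply K blocks in sequence, then a final conv: F_D = H_conv(F_K)."""
    f = f0
    for rstb in rstbs:  # F_i = H_RSTB_i(F_{i-1}), i = 1..K
        f = rstb(f)
    return h_conv(f)


# Toy usage (hypothetical blocks): each "RSTB" adds 1, the "conv" doubles every value.
add_one = lambda f: [v + 1.0 for v in f]
double = lambda f: [2.0 * v for v in f]
f_d = deep_feature_extraction([0.0, 1.0], [add_one, add_one, add_one], double)
```

With $K=3$ toy blocks, the input passes through all three before the final conv is applied once.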
reconstruction
aggregate shallow and deep features to reconstruct HQ image $I_{RHQ}$
$$I_{RHQ}=H_{REC}(F_0+F_D)$$
where $H_{REC}(\cdot)$ is a reconstruction module
- for super-resolution, a sub-pixel conv for up-sampling
- for artifact reduction and denoising, a single conv
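For the super-resolution case, the rearrangement half of a sub-pixel convolution can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function name `pixel_shuffle`, the nested-list layout `[channels][rows][cols]`, and the channel ordering (the one commonly used by deep learning frameworks) are assumptions, and the convolution that would produce the `C * r * r` input channels is omitted.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [channels][rows][cols]


def pixel_shuffle(x: Tensor3, r: int) -> Tensor3:
    """Rearrange [C*r*r][H][W] into [C][H*r][W*r] (up-sampling by factor r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical offset inside each r x r output cell
            for j in range(r):      # horizontal offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Toy usage: four 1x1 channels become one 2x2 channel.
up = pixel_shuffle([[[1.0]], [[2.0]], [[3.0]], [[4.0]]], r=2)
```

Each group of $r^2$ channels fills one $r\times r$ cell of the output, so spatial resolution grows while the channel count shrinks by $r^2$.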
loss function
for super-resolution, use $L_1$ pixel loss
$$\mathcal{L}=\Vert I_{RHQ}-I_{HQ}\Vert_1$$
where $I_{RHQ}$ is obtained by the network from $I_{LQ}$ and $I_{HQ}$ is the ground-truth HQ image
for artifact reduction and denoising, use Charbonnier loss
$$\mathcal{L}=\sqrt{\Vert I_{RHQ}-I_{HQ}\Vert^2+\epsilon^2}$$
where $I_{RHQ}$ is obtained by the network from $I_{LQ}$, $I_{HQ}$ is the ground-truth HQ image, and $\epsilon$ is a constant set to $10^{-3}$
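The two losses can be compared on flattened pixel lists. The sketch below is illustrative only: the function names are made up, images are plain lists of floats, and no averaging over pixels or batches is applied (real training code often does one or the other). It uses the standard Charbonnier form, in which $\epsilon^2$ is added under the square root so the loss stays smooth at zero residual.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """L1 norm of the residual: sum of absolute pixel differences."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def charbonnier_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-3) -> float:
    """sqrt(||pred - target||^2 + eps^2), a smooth approximation of a norm."""
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared + eps ** 2)


# Toy usage on a two-pixel "image".
l1_value = l1_loss([1.0, 2.0], [0.0, 4.0])
charb_value = charbonnier_loss([3.0, 0.0], [0.0, 4.0], eps=0.0)
charb_at_zero = charbonnier_loss([1.0, 2.0], [1.0, 2.0])
```

When the prediction equals the target, the Charbonnier loss bottoms out at $\epsilon$ rather than 0.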
residual Swin transformer block (RSTB)
an RSTB consists of $L$ Swin transformer layers (STL) and a convolutional layer
given input features $F_{i,0}$ of the $i$-th RSTB, extract intermediate features $F_{i,1}, F_{i,2}, \dots, F_{i,L}$ by $L$ STLs
$$F_{i,j}=H_{STL_{i,j}}(F_{i,j-1}),\quad j=1, 2, \dots, L$$
where $H_{STL_{i,j}}(\cdot)$ is the $j$-th STL in the $i$-th RSTB
add a conv layer before residual connection
$$F_{i,out}=H_{conv_i}(F_{i,L})+F_{i,0}$$
where $H_{conv_i}(\cdot)$ is the conv layer in the $i$-th RSTB
2 benefits of the design above
- convolution with spatially invariant filters enhances translational equivariance (note that a transformer can be viewed as spatially varying convolution)
- residual connection aggregates different levels of features
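The RSTB data flow (a chain of STLs, one conv, then the skip connection back to the block input) can be sketched as follows. This is a structural sketch under stated assumptions: features are toy flat lists, the STLs and the conv are arbitrary callables, and the names `rstb_forward`, `double`, `identity`, and `zero` are hypothetical.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # toy stand-in for an H x W x C feature map
Block = Callable[[Feature], Feature]


def rstb_forward(f_in: Feature, stls: Sequence[Block], conv: Block) -> Feature:
    """F_out = H_conv(F_L) + F_0, where F_j = H_STL_j(F_{j-1})."""
    f = f_in
    for stl in stls:            # L Swin transformer layers in sequence
        f = stl(f)
    f = conv(f)                 # conv layer placed before the residual connection
    return [a + b for a, b in zip(f, f_in)]  # skip connection from the block input


# Toy usage (hypothetical layers): two "STLs" that double, an identity "conv".
double = lambda f: [2.0 * v for v in f]
identity = lambda f: list(f)
zero = lambda f: [0.0] * len(f)
out = rstb_forward([1.0, -2.0], [double, double], identity)
skip_only = rstb_forward([1.0, 2.0], [], zero)
```

If the conv branch outputs zeros, the block reduces to the identity, which is what lets the skip path carry features from earlier levels forward unchanged.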
Swin transformer layer (STL)
given an input $F\in\mathbb{R}^{H\times W\times C}$, partition it into non-overlapping $M\times M$ windows, reshaping it to $F\in\mathbb{R}^{\frac{HW}{M^2}\times M^2\times C}$, where $\frac{HW}{M^2}$ is the number of windows
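Window partitioning is easiest to see on a small grid. The sketch below is a simplified illustration, assuming a single channel (the $C$ axis is dropped), nested Python lists, and row-major ordering of both the windows and the pixels inside each window; the name `window_partition` is illustrative.

```python
from typing import List


def window_partition(x: List[List[float]], m: int) -> List[List[float]]:
    """Split an H x W map into HW/m^2 non-overlapping windows of m^2 values each."""
    h, w = len(x), len(x[0])
    if h % m != 0 or w % m != 0:
        raise ValueError("H and W must be multiples of the window size")
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            # Flatten one m x m window in row-major order.
            windows.append([x[top + i][left + j] for i in range(m) for j in range(m)])
    return windows


# Toy usage: a 4 x 4 map holding 0..15, split into 2 x 2 windows.
grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
wins = window_partition(grid, 2)
```

Here $H=W=4$ and $M=2$, so there are $\frac{HW}{M^2}=4$ windows with $M^2=4$ tokens each.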
compute standard self-attention separately for each window
produce query, key, value matrices $Q, K, V$ for a local window feature $X\in\mathbb{R}^{M^2\times C}$
$$Q=XP_Q,\quad K=XP_K,\quad V=XP_V$$
where $P_Q, P_K, P_V$ are projection matrices shared across windows
compute attention matrix by self-attention in a local window
$$\mathrm{Attention}(Q, K, V)=\mathrm{SoftMax}\left(\frac{QK^T}{\sqrt{d}}+B\right)V$$
where $B$ is the learnable relative positional encoding and $d$ is the dimension of the queries and keys
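The attention formula for one window can be worked through with small matrices. The following is a single-head sketch in plain Python, assuming nested lists as matrices and a bias $B$ passed in directly as an $M^2\times M^2$ matrix (how $B$ is parameterized and learned is not modeled); the helper names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        mx = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def window_attention(x: Matrix, p_q: Matrix, p_k: Matrix, p_v: Matrix, bias: Matrix) -> Matrix:
    """SoftMax(Q K^T / sqrt(d) + B) V for one window of M^2 tokens."""
    q, k, v = matmul(x, p_q), matmul(x, p_k), matmul(x, p_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    scores = [[s / math.sqrt(d) + b for s, b in zip(s_row, b_row)]
              for s_row, b_row in zip(scores, bias)]
    return matmul(softmax_rows(scores), v)


# Toy usage: 2 tokens, 2 channels; zero Q/K projections give uniform attention.
x = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
uniform = window_attention(x, zeros, zeros, eye, zeros)
# A strongly negative off-diagonal bias makes each token attend only to itself.
local = window_attention(x, zeros, zeros, eye, [[0.0, -1e9], [-1e9, 0.0]])
```

The second call shows the role of $B$: with identical content scores, the bias alone decides which positions in the window each token attends to.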
$\mathrm{MLP}$ consists of 2 FC layers with GELU between them
$\mathrm{LN}$ layer is added before both $\mathrm{MSA}$ and $\mathrm{MLP}$
residual connection is employed for both modules
to sum up, the whole STL is formulated as
$$\begin{aligned} X&=\mathrm{MSA}(\mathrm{LN}(X))+X \\ X&=\mathrm{MLP}(\mathrm{LN}(X))+X \end{aligned}$$
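The two STL equations describe a pre-norm residual layout, which can be sketched independently of what MSA and MLP compute. The sketch below makes several simplifying assumptions: tokens are lists of floats, `layer_norm` has no learnable scale or shift, and `msa` and `mlp` are arbitrary callables supplied by the caller; all names are illustrative.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]  # [num_tokens][channels]
Module = Callable[[Tokens], Tokens]


def layer_norm(tokens: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token over its channels (no learnable gain or bias here)."""
    out = []
    for t in tokens:
        mean = sum(t) / len(t)
        var = sum((v - mean) ** 2 for v in t) / len(t)
        out.append([(v - mean) / math.sqrt(var + eps) for v in t])
    return out


def add(a: Tokens, b: Tokens) -> Tokens:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stl_forward(x: Tokens, msa: Module, mlp: Module) -> Tokens:
    x = add(msa(layer_norm(x)), x)  # X = MSA(LN(X)) + X
    x = add(mlp(layer_norm(x)), x)  # X = MLP(LN(X)) + X
    return x


# Toy usage (hypothetical modules): zero branches leave the input untouched.
zero = lambda t: [[0.0] * len(r) for r in t]
identity = lambda t: [list(r) for r in t]
unchanged = stl_forward([[1.0, 3.0]], zero, zero)
with_msa = stl_forward([[1.0, 3.0]], identity, zero)
```

Because normalization sits inside each branch, the skip path always carries the unnormalized features forward.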
shifted window partitioning is used alternately with regular partitioning for cross-window connections
shift feature by $(\lfloor\frac{M}{2}\rfloor, \lfloor\frac{M}{2}\rfloor)$ pixels before window partitioning
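The effect of the shift can be seen on a small grid. The sketch below assumes a cyclic (wrap-around) shift toward the top-left, which is a common way to implement shifted windows but is an assumption here, since the notes do not state the direction or the boundary handling; the attention masking usually paired with a cyclic shift is not shown, and `cyclic_shift` is an illustrative name.

```python
from typing import List


def cyclic_shift(x: List[List[float]], s: int) -> List[List[float]]:
    """Roll an H x W map by s pixels up and s pixels left, wrapping around."""
    h, w = len(x), len(x[0])
    return [[x[(i + s) % h][(j + s) % w] for j in range(w)] for i in range(h)]


# Toy usage: 4 x 4 map holding 0..15, window size M = 2, shift = M // 2 = 1.
m = 2
grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
shifted = cyclic_shift(grid, m // 2)
# The top-left 2 x 2 window of the shifted map:
first_window = [shifted[i][j] for i in range(m) for j in range(m)]
restored = cyclic_shift(shifted, -(m // 2))
```

Pixels 5, 6, 9, and 10 sit in four different windows of the unshifted grid but share one window after the shift, which is how alternating layers exchange information across window borders.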
Experiment
datasets DIV2K and Flickr2K
super-resolution
 Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR ($\times8$) on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Visual comparison of bicubic image SR ($\times4$) methods. Best viewed by zooming.
 Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for lightweight image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Visual comparison of real-world image SR ($\times4$) methods on real-world images.
artifact reduction
 Quantitative comparison (average PSNR/SSIM/PSNR-B) with state-of-the-art methods for JPEG compression artifact reduction on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
denoising
 Quantitative comparison (average PSNR) with state-of-the-art methods for grayscale image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
 Visual comparison of grayscale image denoising (noise level 50) methods on image “Monarch” from Set12.
 Quantitative comparison (average PSNR) with state-of-the-art methods for color image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
 Visual comparison of color image denoising (noise level 50) methods on image “163085” from CBSD68.
ablation studies
 Ablation study on RSTB design.
Ablation study on different settings of SwinIR. Results are tested on Manga109 for image SR ($\times2$).
key findings
- from (e), training data scale: unlike IPT, which relies heavily on large training datasets, SwinIR achieves better results than RCAN using the same training data, even when the dataset is small
- from (f), model convergence: SwinIR converges faster and to a better result than RCAN, contrary to previous observations that transformers often suffer from slow convergence