[Artificial Intelligence] [2108] [ICCV 2021 Workshop] SwinIR: Image Restoration Using Swin Transformer




paper
supp
code

Content

Abstract

process images with a local (window-based) attention mechanism
capture long-range dependencies with shifted-window MSA
better performance than SOTA methods, with fewer parameters

PSNR results vs the total number of parameters of different methods for image SR ( $\times4$ ) on Set5

Method

model architecture

The architecture of the proposed SwinIR for image restoration.

shallow feature extraction
given LQ input $I_{LQ}\in\Reals^{H\times W\times C_{in}}$ , extract shallow features $F_0\in\Reals^{H\times W\times C}$
$F_0=H_{SF}(I_{LQ})$

where, $C$ is the number of feature channels, $H_{SF}(\cdot)$ is a $3\times3$ conv layer

deep feature extraction
extract deep features $F_D\in\Reals^{H\times W\times C}$ from $F_0$
$F_D=H_{DF}(F_0)$

where, $H_{DF}$ consists of $K$ residual Swin transformer blocks (RSTBs) and a conv layer
specifically, intermediate features $F_1, F_2, ..., F_K$ and output deep features $F_D$ are extracted block by block as
$\begin{aligned} F_i&=H_{RSTB_i}(F_{i-1}), i=1, 2, ..., K \\ F_D&=H_{conv}(F_K) \end{aligned}$

where, $H_{RSTB_i}(\cdot)$ is the $i$ -th RSTB, $H_{conv}$ is a $3\times3$ conv layer

reconstruction
aggregate shallow and deep features to reconstruct HQ image $I_{RHQ}$
$I_{RHQ}=H_{REC}(F_0+F_D)$

where, $H_{REC}(\cdot)$ is a reconstruction module

for super-resolution, a sub-pixel conv for up-sampling
for artifact reduction and denoising, a single conv
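
The three stages above compose roughly as in the NumPy sketch below. It is a simplified illustration under stated assumptions, not the released SwinIR code: `conv3x3`, `pixel_shuffle`, `swinir_forward` and the weight arguments are hypothetical names, `deep_fn` stands in for $H_{DF}$, and the reconstruction module is reduced to a single conv followed by a sub-pixel rearrangement (the SR case only).

```python
import numpy as np

def conv3x3(x, w):
    """3x3 convolution with zero padding.
    x: (H, W, C_in), w: (3, 3, C_in, C_out) -> (H, W, C_out)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # each kernel tap is a per-pixel channel mixing of a shifted view
            out += xp[dy:dy + H, dx:dx + W, :] @ w[dy, dx]
    return out

def pixel_shuffle(x, r):
    """Sub-pixel rearrangement: (H, W, C*r*r) -> (H*r, W*r, C)."""
    H, W, Crr = x.shape
    C = Crr // (r * r)
    x = x.reshape(H, W, C, r, r)        # [h, w, c, i, j]
    x = x.transpose(0, 3, 1, 4, 2)      # [h, i, w, j, c]
    return x.reshape(H * r, W * r, C)

def swinir_forward(i_lq, w_sf, deep_fn, w_up, scale):
    """Hypothetical, simplified SR pipeline (assumption, not the official model)."""
    f0 = conv3x3(i_lq, w_sf)            # F_0 = H_SF(I_LQ)
    fd = deep_fn(f0)                    # F_D = H_DF(F_0)
    # I_RHQ = H_REC(F_0 + F_D): here one conv + sub-pixel up-sampling
    return pixel_shuffle(conv3x3(f0 + fd, w_up), scale)
```

With a $8\times8\times3$ input, $C=4$ and scale 2, `w_up` needs $3\cdot2^2=12$ output channels and the result has shape $16\times16\times3$.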

loss function
for super-resolution, use $L_1$ pixel loss
$\mathcal{L}=\Vert I_{RHQ}-I_{HQ}\Vert_1$

where, $I_{RHQ}$ is obtained by network from $I_{LQ}$ , $I_{HQ}$ is ground-truth HQ image

for artifact reduction and denoising, use Charbonnier loss
$\mathcal{L}=\sqrt{\Vert I_{RHQ}-I_{HQ}\Vert^2+{\epsilon}^2}$

where, $I_{RHQ}$ is obtained by network from $I_{LQ}$ , $I_{HQ}$ is ground-truth HQ image, $\epsilon$ is a constant set to $10^{-3}$
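
A minimal NumPy sketch of the two losses, for illustration only. The Charbonnier penalty is written in its usual form, with $\epsilon^2$ added under the square root so the loss stays smooth at zero error. Applying it per pixel and averaging over the image (and likewise taking the mean for $L_1$) is an assumption made here; the function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """L1 pixel loss, averaged over all elements (mean reduction is an assumption)."""
    return np.mean(np.abs(pred - target))

def charbonnier_loss(pred, target, eps=1e-3):
    """Per-pixel Charbonnier penalty sqrt(diff^2 + eps^2), averaged (assumption)."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))
```

For large errors the Charbonnier loss behaves like $L_1$; for errors much smaller than $\epsilon$ it behaves like a scaled $L_2$, and it equals $\epsilon$ when prediction and ground truth coincide.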

residual Swin transformer block (RSTB)

residual Swin transformer block (RSTB): $L$ Swin transformer layers (STLs) followed by a convolutional layer, wrapped in a residual connection

given input features $F_{i, 0}$ of $i$ -th RSTB
extract intermediate features $F_{i, 1}, F_{i, 2}, ..., F_{i, L}$ by $L$ STL
$F_{i, j}=H_{STL_{i, j}}(F_{i, j-1}), j=1, 2, ..., L$

where, $H_{STL_{i, j}}(\cdot)$ is $j$ -th STL in $i$ -th RSTB

add a conv layer before residual connection
$F_{i, out}=H_{conv_i}(F_{i, L})+F_{i, 0}$

where, $H_{conv_i}(\cdot)$ is a conv layer in $i$ -th RSTB
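
The block structure can be sketched with plain callables. This is an illustration, assuming each STL and each conv is passed in as a function; `rstb` and `deep_feature_extraction` are hypothetical names, not the authors' API.

```python
def rstb(f_in, stl_layers, conv):
    """One RSTB: F_{i,out} = H_conv_i(F_{i,L}) + F_{i,0}."""
    f = f_in
    for stl in stl_layers:          # F_{i,j} = H_STL_{i,j}(F_{i,j-1}), j = 1..L
        f = stl(f)
    return conv(f) + f_in           # conv before the residual connection

def deep_feature_extraction(f0, rstb_blocks, last_conv):
    """H_DF: K RSTBs followed by a conv; rstb_blocks is a list of (stl_layers, conv)."""
    f = f0
    for stl_layers, conv in rstb_blocks:   # F_i = H_RSTB_i(F_{i-1}), i = 1..K
        f = rstb(f, stl_layers, conv)
    return last_conv(f)                    # F_D = H_conv(F_K)
```

The sum $F_0+F_D$ is not formed here; it belongs to the reconstruction step.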

2 benefits of the design above

convolution with spatially invariant filters enhances translational equivariance
note that the transformer can be viewed as a spatially varying convolution
residual connection aggregates different levels of features

Swin transformer layer (STL)

given an input $F\in\Reals^{H\times W\times C}$
partition the input into non-overlapping $M\times M$ local windows, i.e. reshape it to $\Reals^{\frac{HW}{M^2}\times M^2\times C}$
where, $\frac{HW}{M^2}$ is the number of windows
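
The partition is a pure reshape/transpose. A NumPy sketch, assuming $H$ and $W$ are multiples of $M$ (padding is not handled) and channels-last layout; the function names are illustrative.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (H*W / M^2, M^2, C) with non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)   # split rows and columns into blocks
    x = x.transpose(0, 2, 1, 3, 4)           # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)

def window_reverse(windows, M, H, W):
    """Inverse of window_partition: (H*W / M^2, M^2, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // M, W // M, M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/M, M, W/M, M, C)
    return x.reshape(H, W, C)
```

For example, an $8\times8\times3$ feature map with $M=4$ gives 4 windows of 16 tokens each.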

compute standard self-attention separately for each window
produce query, key, value matrices $Q, K, V$ , for a local window feature $X\in\Reals^{M^2\times C}$
$Q=XP_Q, K=XP_K, V=XP_V$

where, $P_Q, P_K, P_V$ are projection matrices shared across windows
compute attention matrix by self-attention in a local window
$\mathrm{Attention}(Q, K, V)=\mathrm{SoftMax}(\frac{QK^T}{\sqrt{d}}+B)V$

where, $d$ is the query/key dimension ( $Q, K, V\in\Reals^{M^2\times d}$ ), $B$ is the learnable relative positional encoding
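
A single-head NumPy sketch of the window attention formula, batched over windows. It assumes $B$ is passed in as a dense $M^2\times M^2$ matrix (how it is parameterized is left out) and that the projections map $C$ to $d$; `window_attention` and `softmax` are illustrative helpers.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(X, P_Q, P_K, P_V, B):
    """X: (num_windows, M^2, C); P_*: (C, d); B: (M^2, M^2) -> (num_windows, M^2, d)."""
    Q, K, V = X @ P_Q, X @ P_K, X @ P_V       # same projections for every window
    d = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d) + B   # (num_windows, M^2, M^2)
    return softmax(scores, axis=-1) @ V
```

Because attention is restricted to $M^2$ tokens per window, the score matrix is $M^2\times M^2$ per window rather than $HW\times HW$ for the whole image.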

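$\mathrm{MSA}$ runs the attention above for several heads in parallel and concatenates the results
$\mathrm{MLP}$ consists of 2 FC layers with GELU between them
$\mathrm{LN}$ layer is added before both $\mathrm{MSA}$ and $\mathrm{MLP}$
residual connection is employed for both modules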

to sum up, whole STL formulated as
$\begin{aligned} X&=\mathrm{MSA}(\mathrm{LN}(X))+X \\ X&=\mathrm{MLP}(\mathrm{LN}(X))+X \end{aligned}$
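
The two-step update can be sketched in NumPy as follows. Assumptions for brevity: `layer_norm` omits the learnable scale and shift, `gelu` uses the common tanh approximation, and the attention module is passed in as a callable `msa`; all names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis (learnable affine omitted, an assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, W1, b1, W2, b2):
    """2 FC layers with GELU between them."""
    return gelu(x @ W1 + b1) @ W2 + b2

def stl(x, msa, mlp_fn):
    """Pre-norm Swin transformer layer with two residual connections."""
    x = msa(layer_norm(x)) + x       # X = MSA(LN(X)) + X
    x = mlp_fn(layer_norm(x)) + x    # X = MLP(LN(X)) + X
    return x
```

If both sub-modules output zeros, the layer reduces to the identity, which is what the residual connections are meant to guarantee.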

regular and shifted window partitioning are used alternately to enable cross-window connections
shifted window partitioning: shift the feature by $(\lfloor\frac{M}2\rfloor, \lfloor\frac{M}2\rfloor)$ pixels before window partitioning
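
One way to realize the shift is a cyclic roll of the feature map before partitioning. The sketch below assumes that cyclic variant (`np.roll`) and omits the attention masking that would be needed to keep wrapped-around pixels from attending to each other; function names are illustrative.

```python
import numpy as np

def window_partition(x, M):
    """(H, W, C) -> (H*W / M^2, M^2, C); H and W assumed divisible by M."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def shifted_window_partition(x, M):
    """Cyclically shift by (floor(M/2), floor(M/2)) pixels, then partition."""
    s = M // 2
    shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))   # assumption: cyclic shift
    return window_partition(shifted, M)

def unshift(x, M):
    """Undo the cyclic shift on an (H, W, C) feature map."""
    s = M // 2
    return np.roll(x, shift=(s, s), axis=(0, 1))
```

With $M=4$ the first shifted window covers rows and columns 2 to 5 of the original map, so it straddles four regular windows; alternating the two partitions lets information cross window borders.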

Experiment

datasets DIV2K and Flickr2K

super-resolution

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR ( $\times8$ ) on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of bicubic image SR ( $\times4$ ) methods. Best viewed by zooming.

Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for lightweight image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of real-world image SR ( $\times4$ ) methods on real-world images.

artifact reduction

Quantitative comparison (average PSNR/SSIM/PSNR-B) with state-of-the-art methods for JPEG compression artifact reduction on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

denoising

Quantitative comparison (average PSNR) with state-of-the-art methods for grayscale image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of grayscale image denoising (noise level 50) methods on image “Monarch” from Set12.

Quantitative comparison (average PSNR) with state-of-the-art methods for color image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.

Visual comparison of color image denoising (noise level 50) methods on image “163085” from CBSD68.

ablation studies

Ablation study on RSTB design.

Ablation study on different settings of SwinIR. Results are tested on Manga109 for image SR ( $\times2$ ).

key findings

from (e) training data scale
- unlike IPT, which relies heavily on large training datasets, SwinIR achieves better results than RCAN with the same training data, even when the dataset is small
from (f) model convergence
- SwinIR converges faster and better than RCAN, contrary to previous observations that transformers often suffer from slow model convergence