[AI][2106] Video Super-Resolution Transformer

paper
code
the mathematical reasoning below mainly follows this paper

Abstract

components of the traditional Transformer design and their limitations

  1. the fully-connected self-attention layer (FCSA) neglects local information in videos
    ViTs split an image into several patches or tokens, which damages local spatial information since content (e.g. lines, edges, shapes, objects) is divided across different tokens
  2. the token-wise feed-forward layer misaligns features between video frames and ignores feature propagation across frames
    this layer processes each input token embedding independently, without any interaction across frames

main contributions of the VSR-Transformer

  1. spatial-temporal convolutional self-attention (STCSA) layer: exploits locality and spatial-temporal information of the data through different layers
  2. bidirectional optical flow-based feed-forward (BOFF) layer: uses interactions across all frame embeddings for feature propagation and alignment

Preliminary

notation
a calligraphic letter $\mathcal{X}$: a data sequence
a calligraphic letter $\mathcal{D}$: a distribution
a bold upper-case letter $\mathbf{X}$: a matrix
a bold lower-case letter $\mathbf{x}$: a vector
a lower-case letter $x$: an element of a matrix
$[T]$: the set $\{1, ..., T\}$
$\mathbf{1}\{\cdot\}$: an indicator function, where $\mathbf{1}\{A\}=1$ if $A$ is true and $\mathbf{1}\{A\}=0$ if $A$ is false
$\mathbb{E}_{\mathcal{D}}$: an empirical expectation with respect to distribution $\mathcal{D}$

definition 1 (function distance) given a function $f: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$ and a target function $f^{\ast}: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$, we define a distance between these two functions as
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(f):=\mathbb{E}_{\mathbf{X}\sim\mathcal{D}}[\ell(f(\mathbf{X}), f^{\ast}(\mathbf{X}))]$$

for the ground truth $\mathcal{Y}=f^{\ast}(\mathcal{D})$, the loss is denoted by $\mathcal{L}_\mathcal{D}(f)$

definition 2 (k-pattern function) a function $f: \mathcal{X}\rightarrow\mathcal{Y}$ is a k-pattern if for some $g: \{\pm1\}^k\rightarrow\mathcal{Y}$ and index $j^{\ast}$: $f(\mathbf{x})=g(x_{j^{\ast}, ..., j^{\ast}+k})$. we say a function $h_{\mathbf{u}, \mathbf{W}}(\mathbf{x})=\sum_{j}\langle\mathbf{u}^{(j)}, \mathbf{v}_{\mathbf{W}}^{(j)}\rangle$ can learn a k-pattern function from a feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ of the data $\mathbf{x}$ with a layer $\mathbf{u}^{(j)}\in\mathbb{R}^q$ if for $\epsilon>0$ we have
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}})\leq\epsilon$$

the feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ is learned by a convolutional attention network or a fully connected attention network parameterized by $\mathbf{W}$

$\implies$ any function that captures the locality of the data should be able to learn a k-pattern function
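
as a concrete illustration (my own toy example, not from the paper): for $k=2$, the parity of two adjacent coordinates
$$f(\mathbf{x})=g(x_{j^{\ast}}, x_{j^{\ast}+1})=x_{j^{\ast}}\cdot x_{j^{\ast}+1},\qquad \mathbf{x}\in\{\pm1\}^n$$
is a 2-pattern: learning it requires locating the window starting at $j^{\ast}$ and combining only those two coordinates, which is exactly the locality property discussed above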

video super-resolution (VSR)

given an LR video sequence $\{V_1, ..., V_T\}\sim\mathcal{D}$, where $V_t\in\mathbb{R}^{3\times H\times W}$ is the t-th LR frame and $\mathcal{D}$ is a distribution of videos
extract features $\mathcal{X}=\{X_1, ..., X_T\}$ from the LR video frames, where $X_t\in\mathbb{R}^{C\times H\times W}$ is the t-th feature
learn a non-linear mapping $F$ to reconstruct HR frames $\widehat{\mathcal{Y}}$ by utilizing spatial-temporal information across the sequence
$$\widehat{\mathcal{Y}}\triangleq(\widehat{Y}_1, ..., \widehat{Y}_T)=F(X_1, ..., X_T)$$

given ground-truth HR frames $\mathcal{Y}=\{Y_1, ..., Y_T\}$, where $Y_t$ is the t-th HR frame
minimize a loss function between each generated HR frame $\widehat{Y}_t$ and its ground-truth HR frame $Y_t$
$$\widehat{F}=\underset{F}{\arg\min}\,\mathcal{L}_\mathcal{D}(F)\triangleq\widehat{\mathbb{E}}_{\mathcal{D}, t\in[T]}[d(\widehat{Y}_t, Y_t)]$$

where $d(\cdot, \cdot)$ is a distance metric, such as the L1 loss, L2 loss, or Charbonnier loss (a sketch of the latter follows below)

for VSR, any sequence model can be used, such as an RNN, LSTM, or Transformer
note that the Transformer has gained particular interest since it avoids recursion and thus allows parallel computation in practice
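
as a small illustration of the distance metric $d(\cdot,\cdot)$, here is a minimal PyTorch sketch of the Charbonnier loss (the function name and the value of $\epsilon$ are my own choices, not taken from the paper or its code)

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Charbonnier loss: a smooth, robust variant of the L1 loss.

    d(x, y) = sqrt((x - y)^2 + eps^2), averaged over all elements.
    """
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

# usage: compare a generated HR frame with its ground truth
# y_hat, y = torch.rand(1, 3, 256, 448), torch.rand(1, 3, 256, 448)
# loss = charbonnier_loss(y_hat, y)
```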

transformer block

given an input feature $X\in\mathbb{R}^{d\times n}$ ($d$-dimensional embeddings of $n$ tokens)
a Transformer block is a sequence-to-sequence function mapping a sequence in $\mathbb{R}^{d\times n}$ to another sequence in $\mathbb{R}^{d\times n}$
it consists of 2 parts; one is a self-attention layer with a skip connection
$$f_1(X)=LN\Big(X+\sum_{i=1}^h W_o^i(W_v^iX)\,\mathrm{SoftMax}\big((W_k^iX)^T(W_q^iX)\big)\Big)$$

where $W_o^i\in\mathbb{R}^{d\times m}$ is a linear layer, $W_v^i, W_k^i, W_q^i\in\mathbb{R}^{m\times d}$ are linear layers mapping the feature to value, key, and query, $h$ is the number of heads, and $m$ is the head size
the other is a token-wise feed-forward layer with a skip connection
$$f_2(X)=LN\big(f_1(X)+W_2\,\mathrm{ReLU}(W_1f_1(X)+b_1\mathbf{1}_n^T)+b_2\mathbf{1}_n^T\big)$$

where $W_1\in\mathbb{R}^{r\times d}, W_2\in\mathbb{R}^{d\times r}$ are linear layers, $b_1\in\mathbb{R}^r, b_2\in\mathbb{R}^d$ are biases, and $r$ is the hidden size of the feed-forward layer
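
a minimal PyTorch sketch of the two layers above, written single-head in the usual (batch, n tokens, d) row convention rather than the paper's $d\times n$ column convention; it is an illustration of the standard block, not the paper's implementation

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Standard Transformer block: self-attention + skip + LN, then token-wise FFN + skip + LN."""

    def __init__(self, d: int, m: int, r: int):
        super().__init__()
        self.to_q = nn.Linear(d, m, bias=False)    # W_q
        self.to_k = nn.Linear(d, m, bias=False)    # W_k
        self.to_v = nn.Linear(d, m, bias=False)    # W_v
        self.to_out = nn.Linear(m, d, bias=False)  # W_o
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, r), nn.ReLU(), nn.Linear(r, d))  # W_1, b_1, W_2, b_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # token-to-token similarity
        x = self.norm1(x + self.to_out(attn @ v))              # f_1: self-attention layer
        return self.norm2(x + self.ffn(x))                     # f_2: token-wise feed-forward layer

# x = torch.rand(2, 16, 64)                # (batch, n tokens, d=64)
# y = TransformerBlock(d=64, m=64, r=256)(x)
```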

Method

model architecture

2106_vsrt_f1
The framework of video super-resolution Transformer. Given a low-resolution (LR) video, we first use an extractor to capture features of the LR videos. Then, a spatial-temporal convolutional self-attention and an optical flow-based feed-forward network model a sequence of continuous representations. Note that these two layers both have skip connections. Last, the reconstruction network restores a high-resolution video from the representations and the up-sampling frames.

the feature extractor captures features from the LR input
the Transformer maps the features to a sequence of continuous representations
the reconstruction network restores HR videos from the representations

loss function Charbonnier loss

2106_vsrt_t4
Network architecture of the VSR-Transformer.

2106_vsrt_t5
Network architecture of the feature extractor and reconstruction network.

T: number of frames, C: number of channels, H: image height, W: image width
I: number of input channels, O: number of output channels
CONV: convolution with kernel size K, stride S, padding P, groups G
PixelShuffle: pixel shuffle with an upscale factor of 2
LeakyReLU: Leaky ReLU activation with a negative slope of 0.01

spatial-temporal convolutional self-attention (STCSA)

drawbacks of FCSA

Q: can an FCSA layer learn k-patterns with gradient descent?
theorem 1 assume $m=1$ and $\vert u_i\vert\leq1$, the weights are initialized from some permutation-invariant distribution $\mathcal{W}$ over $\mathbb{R}^n$, and for all $\mathbf{x}$ we have $h_{\mathbf{u}, \mathbf{W}}^{FCSA}\in[-1, 1]$ satisfying definition 2. then the following holds
$$\mathbb{E}_{\mathbf{W}\sim\mathcal{W}}\Big\Vert\frac{\partial}{\partial\mathbf{W}}\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}}^{FCSA})\Big\Vert_2^2\leq qn\min\Big\{\dbinom{n-1}{k}^{-1}, \dbinom{n-1}{k-1}^{-1}\Big\}$$

from theorem 1:

  • the initial gradient is small if $k=\Omega(\log n)$ and the fully connected attention layer is initialized with a permutation-invariant distribution
  • the fully connected attention layer suffers from vanishing gradients if $q$ is not large enough
  • gradient descent will be "stuck" at initialization, and is thus unable to learn a k-pattern function

$\implies$ the FCSA layer cannot use the spatial information of each frame, since local information is not encoded in the embeddings of all tokens

detailed structure in STCSA

2106_vsrt_f2
Illustration of the spatial-temporal convolutional self-attention. The unfold operation is to extract sliding local patches from a batched input feature map, while the fold operation is to combine an array of sliding local patches into a large feature map.

given feature maps of the input video frames $X\in\mathbb{R}^{T\times C\times H\times W}$
step 1: capture the spatial information of each frame in $X$
$$X\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_q, W_k, W_v} Q, K, V\in\mathbb{R}^{T\times C\times H\times W}$$

where $W_q, W_k, W_v$ are 3 independent convolutions with kernel_size=3, stride=1, padding=1
step 2: unfold the features into sliding local $H_p\times W_p$ patches in each frame, and reshape them into query, key, and value matrices
$$Q, K, V\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{\mathrm{unfold}}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{\mathrm{reshape}}\mathbb{R}^{n\_heads\times\frac{CH_pW_p}{n\_heads}\times T\frac{HW}{H_pW_p}}$$

where $n\_patches=\frac{HW}{H_pW_p}$ is the number of patches in each frame, $dim=CH_pW_p$ is the dimension of each patch, and $n\_heads$ is the number of heads
step 3: calculate the similarity matrix and aggregate it with the values to obtain the attention matrix
$$\mathrm{Attention}(Q, K, V)=\mathrm{softmax}\Big(\frac{Q^TK}{\sqrt{d}}\Big)V^T\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}$$

where $d=\frac{CH_pW_p}{n\_heads}$ is the hidden dimension
note that the similarity matrix $Q^TK$ relates all embedding tokens of the whole video sequence to each other
step 4: reshape the attention matrix, and fold the tensors of updated sliding local patches back into feature maps
$$\mathrm{Attention}\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}\xrightarrow{\mathrm{reshape}}\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\xrightarrow{\mathrm{fold}}\mathbb{R}^{T\times C\times H\times W}$$

step 5: obtain the final features, and produce the output with a skip connection and a normalization
$$\mathrm{Attention}\in\mathbb{R}^{T\times C\times H\times W}\xrightarrow{W_o} F\in\mathbb{R}^{T\times C\times H\times W}$$

$$f_1(X)=LN(X+F)\in\mathbb{R}^{T\times C\times H\times W}$$

where $W_o$ is a convolution with kernel_size=3, stride=1, padding=1

steps 2 to 4 are inspired by COLA-Net
summarizing the steps above, the STCSA layer is formulated as
$$f_1(X)=LN\Big(X+\sum_{i=1}^h W_o^i\,\kappa_2\Big(\underbrace{\kappa_1(W_v^iX)}_{v}\,\mathrm{softmax}\big({\underbrace{\kappa_1(W_k^iX)}_{k}}^T\underbrace{\kappa_1(W_q^iX)}_{q}\big)\Big)\Big)$$

where $\kappa_1(\cdot), \kappa_2(\cdot)$ are the unfold and fold operations, and $h$ is the number of heads, which is set to $h=1$ for good performance (a sketch follows below)
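
a minimal single-head PyTorch sketch of the STCSA layer above, using non-overlapping $H_p\times W_p$ patches and treating the $T$ frames of one video as the batch dimension; class and argument names are mine, and the official implementation may differ

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STCSA(nn.Module):
    """Spatial-temporal convolutional self-attention (sketch), steps 1-5 above."""

    def __init__(self, channels: int, patch: int = 8):
        super().__init__()
        self.patch = patch
        self.to_q = nn.Conv2d(channels, channels, 3, 1, 1)    # W_q
        self.to_k = nn.Conv2d(channels, channels, 3, 1, 1)    # W_k
        self.to_v = nn.Conv2d(channels, channels, 3, 1, 1)    # W_v
        self.to_out = nn.Conv2d(channels, channels, 3, 1, 1)  # W_o
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W); assumes H and W are divisible by the patch size
        T, C, H, W = x.shape
        p = self.patch
        # step 1: three independent 3x3 convolutions
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # step 2: unfold into patches -> (T, C*p*p, n_patches)
        q = F.unfold(q, kernel_size=p, stride=p)
        k = F.unfold(k, kernel_size=p, stride=p)
        v = F.unfold(v, kernel_size=p, stride=p)
        # merge the temporal axis into the token axis -> (C*p*p, T*n_patches)
        q, k, v = [t.permute(1, 0, 2).reshape(C * p * p, -1) for t in (q, k, v)]
        # step 3: similarity over all patches of all frames, aggregated with the values
        d = q.shape[0]
        attn = torch.softmax(q.t() @ k / d ** 0.5, dim=-1) @ v.t()  # (T*n_patches, C*p*p)
        # step 4: split the temporal axis back out and fold patches into feature maps
        attn = attn.reshape(T, -1, C * p * p).permute(0, 2, 1)      # (T, C*p*p, n_patches)
        out = F.fold(attn, output_size=(H, W), kernel_size=p, stride=p)
        # step 5: output projection, skip connection and layer norm over channels
        out = self.to_out(out)
        return self.norm((x + out).permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

# x = torch.rand(5, 16, 64, 64)           # T=5 frames, C=16, H=W=64
# y = STCSA(channels=16, patch=8)(x)
```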

why STCSA is suitable

Q: can an STCSA layer learn k-patterns with gradient descent?
theorem 2 assume each element of the weights is initialized uniformly from $\{\pm\frac1k\}$. fix some $\delta>0$, some k-pattern $f$, and some distribution $\mathcal{D}$. then if $q>2^{k+3}\log(\frac{2^k}\delta)$, and $h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA}$ is a function satisfying definition 2, with probability at least $1-\delta$ over the initialization, when training a spatial-temporal convolutional self-attention layer using gradient descent with learning rate $\eta$, we have
$$\frac{1}{S}\sum_{s=1}^S\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})\leq\eta^2S^2nk^{\frac52}2^{k+1}+\frac{k^22^{2k+1}}{q\eta S}+\eta nqk$$

from theorem 2:

  • the loss $\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})$ becomes small within a finite number of optimization steps $S$, so the layer is able to learn a k-pattern function

$\implies$ the STCSA layer trained with gradient descent can capture the locality of each frame

spatial-temporal position encoding

the VSR-Transformer is permutation-invariant, so it requires precise spatial-temporal position information
3D fixed position encoding: 2 spatial positional components (horizontal and vertical) and 1 temporal positional component
$$PE(pos, i)=\begin{cases} \sin(pos\cdot\alpha_k) &\text{if } i=2k \\ \cos(pos\cdot\alpha_k) &\text{if } i=2k+1 \end{cases}$$

where $\alpha_k=1/1000^{2k/\frac{d}3}$, $k$ is an integer in $[0, \frac{d}6)$, $pos$ is the position in the corresponding dimension, and $d$ is the channel dimension size
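
a minimal PyTorch sketch of such a 3D fixed encoding, assuming $d$ is divisible by 6 and concatenating the sin and cos halves rather than interleaving even/odd channels; it illustrates the formula above, not the official implementation

```python
import torch

def spatial_temporal_pe(T: int, H: int, W: int, d: int) -> torch.Tensor:
    """3D fixed sin/cos position encoding: d/3 channels each for the temporal,
    vertical and horizontal positions. Returns (d, T, H, W) to be added to the features."""
    def encode(pos: torch.Tensor, dim: int) -> torch.Tensor:
        k = torch.arange(dim // 2, dtype=torch.float32)
        alpha = 1.0 / (1000 ** (2 * k / dim))        # alpha_k, base follows the formula above
        angles = pos[..., None].float() * alpha      # (T, H, W, dim/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    t = torch.arange(T)[:, None, None].expand(T, H, W)
    y = torch.arange(H)[None, :, None].expand(T, H, W)
    x = torch.arange(W)[None, None, :].expand(T, H, W)
    pe = torch.cat([encode(t, d // 3), encode(y, d // 3), encode(x, d // 3)], dim=-1)
    return pe.permute(3, 0, 1, 2)

# pe = spatial_temporal_pe(T=5, H=64, W=64, d=48)  # added to the features before attention
```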

bidirectional optical flow-based feed-forward (BOFF)

2106_vsrt_f3
Illustration of the bidirectional optical flow-based feed-forward layer. Given a video sequence, we first bidirectionally estimate the forward and backward optical flows and warp the feature maps with the corresponding optical flows. Then we learn a forward and a backward propagation network to produce two sequences of features from the concatenated warped features and LR frames. Last, we fuse these two feature sequences into one feature sequence.

given features $X\in\mathbb{R}^{T\times C\times H\times W}$ output by the STCSA layer
step 1: learn bidirectional optical flows between neighboring frames
$$\overleftarrow{O}_t=\begin{cases} spy(V_1, V_1) &\text{if } t=1 \\ spy(V_{t-1}, V_t) &\text{if } t\in(1, T] \end{cases},\qquad \overrightarrow{O}_t=\begin{cases} spy(V_{t+1}, V_t) &\text{if } t\in[1, T) \\ spy(V_T, V_T) &\text{if } t=T \end{cases}$$

where $\overleftarrow{O}, \overrightarrow{O}\in\mathbb{R}^{T\times2\times H\times W}$ are the backward and forward optical flows; $spy(\cdot, \cdot)$ is SPyNet, which is pre-trained and further updated during training
step 2: obtain bidirectional features along the backward and forward propagation directions
$$\overleftarrow{X}=warp(X, \overleftarrow{O}),\quad \overrightarrow{X}=warp(X, \overrightarrow{O})$$

where $\overleftarrow{X}, \overrightarrow{X}\in\mathbb{R}^{T\times C\times H\times W}$ are the backward and forward features
step 3: aggregate the frames and warped features, and feed them into 2-layer CNNs for backward and forward propagation
$$f_2(X)=LN\Big(f_1(X)+fusion\big(\overleftarrow{W_1}\,ReLU(\overleftarrow{W_2}[V, \overleftarrow{X}])+\overrightarrow{W_1}\,ReLU(\overrightarrow{W_2}[V, \overrightarrow{X}])\big)\Big)$$

where $[\cdot, \cdot]$ is the aggregation operator, and $\overleftarrow{W_1}, \overleftarrow{W_2}, \overrightarrow{W_1}, \overrightarrow{W_2}$ are the weights of the backward and forward networks
the 2-layer networks can be extended to multi-layer networks
$$f_2(X)=LN\big(f_1(X)+fusion(R_1(V, \overleftarrow{X})+R_2(V, \overrightarrow{X}))\big)$$

where $R_1, R_2$ are flexible networks (a sketch of this layer follows below)
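
a minimal PyTorch sketch of this layer: bilinear flow warping via grid_sample, two small 2-layer propagation CNNs on the concatenated [LR frame, warped feature], and a convolutional fusion with a skip connection; flow estimation with SPyNet is assumed to happen outside, and all names and layer sizes are my own rather than the official ones

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp features (T, C, H, W) with optical flow (T, 2, H, W) by bilinear sampling."""
    _, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(feat)   # (H, W, 2) pixel grid
    grid = base[None] + flow.permute(0, 2, 3, 1)            # displaced sampling positions
    gx = 2.0 * grid[..., 0] / max(W - 1, 1) - 1.0           # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1),
                         mode="bilinear", padding_mode="border", align_corners=True)

class BOFF(nn.Module):
    """Bidirectional optical flow-based feed-forward layer (sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        def propagation():
            return nn.Sequential(nn.Conv2d(channels + 3, channels, 3, 1, 1), nn.ReLU(),
                                 nn.Conv2d(channels, channels, 3, 1, 1))
        self.backward_net = propagation()   # backward branch (R_1)
        self.forward_net = propagation()    # forward branch (R_2)
        self.fusion = nn.Conv2d(channels, channels, 3, 1, 1)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat, frames, flow_bwd, flow_fwd):
        # feat: (T, C, H, W) STCSA output; frames: (T, 3, H, W) LR frames;
        # flow_bwd / flow_fwd: (T, 2, H, W) backward / forward optical flows
        x_bwd = flow_warp(feat, flow_bwd)
        x_fwd = flow_warp(feat, flow_fwd)
        prop = self.backward_net(torch.cat([frames, x_bwd], dim=1)) \
             + self.forward_net(torch.cat([frames, x_fwd], dim=1))
        out = feat + self.fusion(prop)      # fusion + skip connection
        return self.norm(out.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
```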

Experiment

training dataset

  • REDS: 240 training, 30 validation, and 30 testing clips (each with 100 consecutive frames)
  • Vimeo-90K: 89800 high-quality clips (720p or higher) drawn from 4278 videos

testing dataset

  • Vid4
  • REDS4: clip 000, 011, 015, 020 from REDS
  • Vimeo-90K-T: from Vimeo-90K

experiment details

  • degradation: bicubic down-sampling
  • input: randomly cropped into 64x64 patches
  • data augmentation: random horizontal flip, $90^{\circ}$ rotation
  • frames normalized to 448x256
  • loss: Charbonnier loss
  • optimizer: Adam with $\beta_1=0.9, \beta_2=0.99$, batch size 2 per GPU, 600K iterations
  • learning rate: initialized to 2e-4, cosine decay to 1e-7 (see the sketch after this list)
  • restart periods [300K, 300K, 300K, 300K] with restart weights [1, 0.5, 0.5, 0.5] on REDS; restart periods [200K, 200K, 200K, 200K, 200K, 200K] with restart weights [1, 0.5, 0.5, 0.5, 0.5, 0.5] on Vimeo-90K
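
a minimal sketch of the cosine-restart schedule described above, written as a plain helper function for illustration rather than any particular library's scheduler

```python
import math

def cosine_restart_lr(step, periods, restart_weights, base_lr=2e-4, min_lr=1e-7):
    """Within each period the LR decays from restart_weight * base_lr to min_lr
    along a cosine curve, then restarts at the next period."""
    start = 0
    for period, weight in zip(periods, restart_weights):
        if step < start + period:
            progress = (step - start) / period
            return min_lr + 0.5 * (weight * base_lr - min_lr) * (1 + math.cos(math.pi * progress))
        start += period
    return min_lr

# REDS setting: periods of 300K iterations with restart weights [1, 0.5, 0.5, 0.5]
# lr = cosine_restart_lr(310_000, [300_000] * 4, [1, 0.5, 0.5, 0.5])
```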

result on REDS

2106_vsrt_t1
Quantitative comparison (PSNR/SSIM) on REDS4 for 4x VSR. The results are tested on RGB channels. Red and blue indicate the best and the second best performance, respectively. "$\dag$" means a method trained on 5 frames for a fair comparison.

2106_vsrt_f4
2106_vsrt_f8
Qualitative comparison on the REDS4 dataset for 4x VSR. Zoom in for the best view.

key findings

  • the highest PSNR and comparable SSIM
  • when trained with 5 frames, BasicVSR and IconVSR are worse than EDVR
    $\implies$ BasicVSR and IconVSR rely heavily on aggregating long-term sequence information
  • the 64-channel VSR-Transformer performs better than the 128-channel EDVR-L
  • the VSR-Transformer is able to recover finer details and sharper edges

result on Vimeo-90K

2106_vsrt_t2
Quantitative comparison (PSNR/SSIM) on Vimeo-90K-T for 4x VSR. Red and blue indicate the best and the second best performance, respectively.

2106_vsrt_f5
2106_vsrt_f9
Qualitative comparison on Vimeo-90K-T for 4x VSR. Zoom in for the best view.

key findings

  • the highest PSNR and SSIM
  • the generalization ability of the VSR-Transformer on Vid4 is better than EDVR but worse than BasicVSR and IconVSR
    $\impliedby$ BasicVSR and IconVSR are tested on all frames, while the VSR-Transformer and EDVR are tested on 7 frames
    $\impliedby$ a distribution bias exists between Vimeo-90K-T and Vid4
  • the VSR-Transformer is able to generate sharp and realistic HR frames

result on Vid4

2106_vsrt_t3
Quantitative comparison (PSNR/SSIM) on Vid4 for 4x VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels.

2106_vsrt_t7
Quantitative comparison (PSNR/SSIM) on Vid4 for 4x VSR. Red and blue indicate the best and the second best performance, respectively. "Y" denotes the evaluation on Y channels. "$\dag$" means a method trained and tested on 7 frames for a fair comparison.

2106_vsrt_f10
Qualitative comparison on Vid4 for 4x VSR. Zoom in for the best view.

ablation study

optical flow

spatial-temporal convolutional self-attention

bidirectional optical flow-based feed-forward

number of frames
