[Artificial Intelligence][2106] Video Super-Resolution Transformer

paper
code
mathematical reasoning mainly from this paper

Abstract

components in traditional Transformer design and their limitations

  1. the fully-connected self-attention (FCSA) layer neglects local information in video
    ViTs split an image into several patches (tokens), which damages local spatial information because contents (e.g. lines, edges, shapes, objects) are divided across different tokens
  2. the token-wise feed-forward layer misaligns features between video frames and ignores feature propagation across frames
    this layer processes each input token embedding independently, without any interaction across frames

main contributions of VSR-Transformer

  1. spatial-temporal convolutional self-attention (STCSA) layer: exploits locality and spatial-temporal information through different layers
  2. bidirectional optical flow-based feed-forward (BOFF) layer: uses interaction across all frame embeddings for feature propagation and alignment

Preliminary

notation
a calligraphic letter $\mathcal{X}$: a data sequence
a calligraphic letter $\mathcal{D}$: a distribution
a bold upper case letter $\mathbf{X}$: a matrix
a bold lower case letter $\mathbf{x}$: a vector
a lower case letter $x$: an element of a matrix
$[T]$: the set $\{1, ..., T\}$
$\mathbf{1}\{\cdot\}$: an indicator function, where $\mathbf{1}\{A\}=1$ if $A$ is true and $\mathbf{1}\{A\}=0$ if $A$ is false
$\mathbb{E}_{\mathcal{D}}$: an empirical expectation with respect to distribution $\mathcal{D}$

definition 1 (function distance) given a function $f: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$ and a target function $f^{\ast}: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$, we define the distance between these 2 functions as
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(f):=\mathbb{E}_{\mathbf{X}\sim\mathcal{D}}[l(f(\mathbf{X}), f^{\ast}(\mathbf{X}))]$$

for ground truth $Y=f^{\ast}(\mathcal{D})$, the loss is denoted by $\mathcal{L}_\mathcal{D}(f)$

definition 2 (k-pattern function) a function $f: \mathcal{X}\rightarrow\mathcal{Y}$ is a k-pattern if for some $g: \{\pm1\}^k\rightarrow\mathcal{Y}$ and index $j^{\ast}$: $f(\mathbf{x})=g(x_{j^{\ast}, ..., j^{\ast}+k})$. we say a function $h_{\mathbf{u}, \mathbf{W}}(\mathbf{x})=\sum_{j}\langle \mathbf{u}^{(j)}, \mathbf{v}_{\mathbf{W}}^{(j)}\rangle$ can learn a k-pattern function from a feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ of data $\mathbf{x}$ with a layer $\mathbf{u}^{(j)}\in\mathbb{R}^q$ if for $\epsilon>0$, we have
$$\mathcal{L}_{f^{\ast}, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}})\leq\epsilon$$

the feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ is learned by a convolutional attention network or a fully connected attention network parameterized by $\mathbf{W}$

$\implies$ any function that can capture the locality of data should be able to learn a k-pattern function
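As a concrete illustration of definition 2, here is a minimal sketch of a k-pattern: a function whose output depends only on a short window of consecutive entries of the input. The helper names (`make_k_pattern`, `parity`), the 0-based indexing, and the choice of a window of exactly `k` entries are assumptions made for this example, not notation from the paper.

```python
def make_k_pattern(g, j_star, k):
    """Build f(x) = g(x[j_star : j_star + k]); f ignores every entry outside the window."""
    def f(x):
        return g(tuple(x[j_star:j_star + k]))
    return f


def parity(window):
    """Example g: {+1, -1}^k -> {+1, -1}, the product of the window entries."""
    out = 1
    for v in window:
        out *= v
    return out


f = make_k_pattern(parity, j_star=2, k=3)
x = [1, -1, -1, 1, -1, 1, 1, -1]

# Flipping an entry outside the window x[2:5] leaves f unchanged.
x_outside = list(x)
x_outside[0] = -x_outside[0]

# Flipping an entry inside the window changes the parity.
x_inside = list(x)
x_inside[3] = -x_inside[3]

print(f(x), f(x_outside), f(x_inside))
```

A layer that "captures locality" in the sense of the notes must be able to find such a window, whose position `j_star` is unknown in advance.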

video super-resolution (VSR)

given an LR video sequence $\{V_1, ..., V_T\}\sim\mathcal{D}$, where $V_t\in\mathbb{R}^{3\times H\times W}$ is the t-th LR frame and $\mathcal{D}$ is a distribution of videos
extract features $\mathcal{X}=\{X_1, ..., X_T\}$ from the LR video frames, where $X_t\in\mathbb{R}^{C\times H\times W}$ is the t-th feature
learn a non-linear mapping $F$ to reconstruct HR frames $\widehat{\mathcal{Y}}$ by utilizing spatial-temporal information across the sequence
$$\widehat{\mathcal{Y}}\triangleq(\widehat{Y}_1, ..., \widehat{Y}_T)=F(X_1, ..., X_T)$$

given ground-truth HR frames $\mathcal{Y}=\{Y_1, ..., Y_T\}$, where $Y_t$ is the t-th HR frame
minimize a loss function between the generated HR frame $\widehat{Y}_t$ and the ground-truth HR frame $Y_t$
$$\widehat{F}=\underset{F}{\arg\min}\,\mathcal{L}_\mathcal{D}(F)\triangleq\widehat{\mathbb{E}}_{\mathcal{D}, t\in[T]}[d(\widehat{Y}_t, Y_t)]$$

where $d(\cdot, \cdot)$ is a distance metric, such as L1 loss, L2 loss, or Charbonnier loss
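The Charbonnier loss mentioned as a choice for $d(\cdot, \cdot)$ is a smooth variant of the L1 loss. A minimal NumPy sketch follows; the function name and the value of `eps` are assumptions, since the notes do not state the constant that was used.

```python
import numpy as np


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((pred - target)^2 + eps^2); eps is an assumed smoothing constant."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.mean(np.sqrt(diff * diff + eps * eps)))


# For large errors the loss behaves like L1; for zero error it equals eps.
print(charbonnier_loss(np.full(4, 3.0), np.zeros(4), eps=4.0))
print(charbonnier_loss(np.zeros((2, 3)), np.zeros((2, 3))))
```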

for VSR tasks, a sequence model can be used, such as an RNN, LSTM, or Transformer
note that the Transformer has gained particular interest since it avoids recursion and thus allows parallel computing in practice

transformer block

given an input feature $X\in\mathbb{R}^{d\times n}$ ($d$-dimensional embeddings of $n$ tokens)
a transformer block is a sequence-to-sequence function, mapping a sequence in $\mathbb{R}^{d\times n}$ to another sequence in $\mathbb{R}^{d\times n}$
it consists of 2 parts; one is a self-attention layer with a skip connection
$$f_1(X)=LN(X+\sum_{i=1}^hW_o^i(W_v^iX)SoftMax((W_k^iX)^T(W_q^iX)))$$

where $W_o^i\in\mathbb{R}^{d\times m}$ is a linear layer, $W_v^i, W_k^i, W_q^i\in\mathbb{R}^{m\times d}$ are linear layers mapping the feature to value, key, and query, $h$ is the number of heads, and $m$ is the head size
the other is a token-wise feed-forward layer with a skip connection
$$f_2(X)=LN(f_1(X)+W_2ReLU(W_1f_1(X)+b_1\mathbf{1}_n^T)+b_2\mathbf{1}_n^T)$$

where $W_1\in\mathbb{R}^{r\times d}, W_2\in\mathbb{R}^{d\times r}$ are linear layers, $b_1\in\mathbb{R}^r, b_2\in\mathbb{R}^d$ are biases, and $r$ is the hidden layer size of the feed-forward layer
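The two equations for $f_1$ and $f_2$ can be sketched directly in NumPy using the same $d\times n$ layout (tokens are columns). This is only an illustration under stated assumptions: the layer norm has no learnable scale or shift, the softmax is taken column-wise so that each output token is a convex combination of value vectors, and all sizes and random weights are made up for the demo.

```python
import numpy as np


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each token (column) over its d features; no learnable parameters (assumption).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def transformer_block(X, heads, W1, b1, W2, b2):
    """X: (d, n). heads: list of (W_q, W_k, W_v, W_o) with shapes (m, d), (m, d), (m, d), (d, m)."""
    attn = np.zeros_like(X)
    for W_q, W_k, W_v, W_o in heads:
        scores = (W_k @ X).T @ (W_q @ X)                     # (n, n), entry (a, b) = <k_a, q_b>
        attn += W_o @ ((W_v @ X) @ softmax(scores, axis=0))  # (d, n)
    f1 = layer_norm(X + attn)                                # self-attention + skip connection
    hidden = np.maximum(W1 @ f1 + b1[:, None], 0.0)          # (r, n), ReLU applied per token
    return layer_norm(f1 + W2 @ hidden + b2[:, None])        # feed-forward + skip connection


rng = np.random.default_rng(0)
d, n, m, r, h = 8, 5, 4, 16, 2
X = rng.standard_normal((d, n))
heads = [
    tuple(0.1 * rng.standard_normal(shape) for shape in [(m, d), (m, d), (m, d), (d, m)])
    for _ in range(h)
]
W1, b1 = 0.1 * rng.standard_normal((r, d)), np.zeros(r)
W2, b2 = 0.1 * rng.standard_normal((d, r)), np.zeros(d)
Y = transformer_block(X, heads, W1, b1, W2, b2)
print(Y.shape)
```

Note that the feed-forward part touches each column separately, which is exactly why it cannot exchange information between tokens or frames.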

Method

model architecture

2106_vsrt_f1
The framework of video super-resolution Transformer. Given a low-resolution (LR) video, we first use an extractor to capture features of the LR videos. Then, a spatial-temporal convolutional self-attention and an optical flow-based feed-forward network model a sequence of continuous representations. Note that these two layers both have skip connections. Last, the reconstruction network restores a high-resolution video from the representations and the up-sampling frames.

feature extractor: captures features from the LR input
transformer: maps features to a sequence of continuous representations
reconstruction: restores HR videos from the representations

loss function: Charbonnier loss

2106_vsrt_t4
Network architecture of the VSR-Transformer.

2106_vsrt_t5
Network architecture of the feature extractor and reconstruction network.

T: number of frames, C: number of channels, H: image height, W: image width
I: number of input channels, O: number of output channels
CONV: convolution, with kernel size K, stride S, padding P, groups G
PixelShuffle: pixel shuffle with an upscale factor of 2
LeakyReLU: Leaky ReLU activation function with a negative slope of 0.01
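For readers unfamiliar with the two non-convolutional operations in the legend, here is a small NumPy sketch. The channel ordering in `pixel_shuffle` follows the common convention in which input channel `c * r * r + i * r + j` lands at spatial offset `(i, j)`; that ordering is an assumption about the implementation, as the notes only give the upscale factor.

```python
import numpy as np


def pixel_shuffle(x, r=2):
    """Rearrange (C * r^2, H, W) into (C, H * r, W * r)."""
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)


def leaky_relu(x, negative_slope=0.01):
    return np.where(x >= 0, x, negative_slope * x)


# Four 1x1 channels become one 2x2 image.
demo = pixel_shuffle(np.arange(4, dtype=np.float64).reshape(4, 1, 1), r=2)
print(demo[0])
print(leaky_relu(np.array([-1.0, 2.0])))
```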

spatial-temporal convolutional self-attention (STCSA)

drawbacks of FCSA

Q: can the FCSA layer learn k-patterns with gradient descent?
theorem 1 assume $m=1$ and $\vert u_i\vert\leq1$, that the weights are initialized from some permutation-invariant distribution over $\mathbb{R}^n$, and that for all $\mathbf{x}$ we have $h_{\mathbf{u}, \mathbf{W}}^{FCSA}\in[-1, 1]$, which satisfies definition 2. then the following holds
$$\mathbb{E}_{\mathbf{W}\sim\mathcal{W}}\left\Vert\frac{\partial}{\partial\mathbf{W}}\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}, \mathbf{W}}^{FCSA})\right\Vert_2^2\leq qn\min\left\{\binom{n-1}{k}^{-1}, \binom{n-1}{k-1}^{-1}\right\}$$

from theorem 1:

  • the initial gradient is small if $k=\Omega(\log{n})$ and the fully connected attention layer is initialized from a permutation-invariant distribution
  • the fully connected attention layer results in vanishing gradients if $q$ is not large enough
  • gradient descent gets "stuck" at initialization and is thus unable to learn a k-pattern function

$\implies$ the FCSA layer cannot use the spatial information of each frame, since local information is not encoded in the embeddings of all tokens
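To get a feeling for how quickly the right-hand side of theorem 1 shrinks, the bound can be evaluated numerically. The function name and the sample values of `n`, `k`, and `q` are arbitrary choices for illustration.

```python
from math import comb


def fcsa_gradient_bound(n, k, q):
    """Right-hand side of theorem 1: q * n * min{ C(n-1, k)^-1, C(n-1, k-1)^-1 }."""
    return q * n * min(1 / comb(n - 1, k), 1 / comb(n - 1, k - 1))


# The bound decays combinatorially in k for a fixed number of tokens n.
for k in (2, 4, 8):
    print(k, fcsa_gradient_bound(n=64, k=k, q=16))
```

Unless $q$ grows at a comparable combinatorial rate, the expected squared gradient norm at initialization is tiny, which is the "stuck" behaviour described in the list above.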

detailed structure in STCSA

2106_vsrt_f2
Illustration of the spatial-temporal convolutional self-attention. The unfold operation is to extract sliding local patches from a batched input feature map, while the fold operation is to combine an array of sliding local patches into a large feature map.

given feature maps of input video frames $X\in\mathbb{R}^{T\times C\times H\times W}$
step 1: capture spatial information of each frame in $X$
$$X\in\mathbb{R}^{T\times C\times H\times W}\stackrel{W_q, W_k, W_v}\longrightarrow Q, K, V\in\mathbb{R}^{T\times C\times H\times W}$$

where $W_q, W_k, W_v$ are 3 independent convolutions with kernel_size=3, stride=1, padding=1
step 2: unfold features into sliding local patches of size $H_p\times W_p$ in each frame, and reshape them into query, key, and value matrices
$$Q, K, V\in\mathbb{R}^{T\times C\times H\times W}\stackrel{unfold}\longrightarrow\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\stackrel{reshape}\longrightarrow\mathbb{R}^{n\_heads\times\frac{CH_pW_p}{n\_heads}\times T\frac{HW}{H_pW_p}}$$

where $n\_patches=\frac{HW}{H_pW_p}$ is the number of patches in each frame, $dim=CH_pW_p$ is the dimension of each patch, and $n\_heads$ is the number of heads
step 3: calculate the similarity matrix and aggregate it with the values to obtain the attention output
$$Attention(Q, K, V)=softmax(\frac{Q^TK}{\sqrt{d}})V^T\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}$$

where $d=\frac{CH_pW_p}{n\_heads}$ is the hidden dimension
note that the similarity matrix $Q^TK\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times T\frac{HW}{H_pW_p}}$ relates all embedding tokens across all video frames
step 4: reshape the attention output, and fold the updated sliding local patches back into features
$$Attention\in\mathbb{R}^{n\_heads\times T\frac{HW}{H_pW_p}\times\frac{CH_pW_p}{n\_heads}}\stackrel{reshape}\longrightarrow\mathbb{R}^{T\times CH_pW_p\times\frac{HW}{H_pW_p}}\stackrel{fold}\longrightarrow\mathbb{R}^{T\times C\times H\times W}$$

step 5: obtain the final features, and produce the output with a skip connection and a normalization
$$Attention\in\mathbb{R}^{T\times C\times H\times W}\stackrel{W_o}\longrightarrow F\in\mathbb{R}^{T\times C\times H\times W}$$

$$f_1(X)=LN(X+F)\in\mathbb{R}^{T\times C\times H\times W}$$

where $W_o$ is a convolution with kernel_size=3, stride=1, padding=1

steps 2 to 4 are inspired by COLA-Net
summarizing the steps above, STCSA is formulated as
$$f_1(X)=LN(X+\sum_{i=1}^hW_o^i\kappa_2(\underbrace{\kappa_1(W_v^iX)}_\text{v}softmax({\underbrace{\kappa_1(W_k^iX)}_\text{k}}^T\underbrace{\kappa_1(W_q^iX)}_\text{q})))$$

where $\kappa_1(\cdot), \kappa_2(\cdot)$ are the unfold and fold operations, and $h$ is the number of heads, which is set to $h=1$ for good performance
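The five STCSA steps can be traced end to end in a short NumPy sketch with a single head. Several simplifications are assumptions made to keep it short: the 3x3 convolutions are replaced by 1x1 channel-mixing matrices (`conv_stand_in`), patches do not overlap (stride equal to patch size, matching the patch count $\frac{HW}{H_pW_p}$), and the normalization is a parameter-free layer norm over each frame. It is meant to show the tensor shapes, not to reproduce the released implementation.

```python
import numpy as np


def unfold(x, hp, wp):
    """(T, C, H, W) -> (T, C*hp*wp, L) with L = (H/hp) * (W/wp) non-overlapping patches."""
    t, c, h, w = x.shape
    x = x.reshape(t, c, h // hp, hp, w // wp, wp)   # (t, c, nh, hp, nw, wp)
    x = x.transpose(0, 1, 3, 5, 2, 4)               # (t, c, hp, wp, nh, nw)
    return x.reshape(t, c * hp * wp, (h // hp) * (w // wp))


def fold(p, c, h, w, hp, wp):
    """Inverse of unfold: (T, C*hp*wp, L) -> (T, C, H, W)."""
    t = p.shape[0]
    x = p.reshape(t, c, hp, wp, h // hp, w // wp)   # (t, c, hp, wp, nh, nw)
    x = x.transpose(0, 1, 4, 2, 5, 3)               # (t, c, nh, hp, nw, wp)
    return x.reshape(t, c, h, w)


def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Parameter-free normalization over (C, H, W) of each frame (assumption).
    mu = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def conv_stand_in(weight, x):
    # 1x1 channel mixing used in place of the 3x3 convolutions (simplification).
    return np.einsum("oc,tchw->tohw", weight, x)


def stcsa(x, w_q, w_k, w_v, w_o, hp, wp):
    t, c, h, w = x.shape
    d = c * hp * wp

    def tokens(weight):
        patches = unfold(conv_stand_in(weight, x), hp, wp)  # steps 1-2: (T, d, L)
        return patches.transpose(1, 0, 2).reshape(d, -1)    # (d, T*L): tokens from all frames

    q, k, v = tokens(w_q), tokens(w_k), tokens(w_v)
    attn = softmax_rows(q.T @ k / np.sqrt(d))               # step 3: (T*L, T*L) similarity
    out = (attn @ v.T).T                                    # (d, T*L)
    out = out.reshape(d, t, -1).transpose(1, 0, 2)          # step 4: back to (T, d, L)
    feat = fold(out, c, h, w, hp, wp)                       # (T, C, H, W)
    return layer_norm(x + conv_stand_in(w_o, feat))         # step 5: skip connection + norm


rng = np.random.default_rng(0)
T, C, H, W, Hp, Wp = 3, 4, 8, 8, 4, 4
x = rng.standard_normal((T, C, H, W))
w_q, w_k, w_v, w_o = (0.1 * rng.standard_normal((C, C)) for _ in range(4))
y = stcsa(x, w_q, w_k, w_v, w_o, Hp, Wp)
print(unfold(x, Hp, Wp).shape, y.shape)
```

Because the tokens of all `T` frames share one similarity matrix, every patch can attend to patches in other frames, while the unfold step keeps each patch's pixels together.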

why STCSA is suitable

Q: how does the STCSA layer learn k-patterns with gradient descent?
theorem 2 assume we initialize each element of the weights uniformly drawn from $\{\pm\frac1k\}$. fix some $\delta>0$, some k-pattern $f$ and some distribution $\mathcal{D}$. then if $q>2^{k+3}\log(\frac{2^k}\delta)$, and letting $h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA}$ be a function satisfying definition 2, with probability at least $1-\delta$ over the initialization, when training a spatial-temporal convolutional self-attention layer using gradient descent with learning rate $\eta$, we have
$$\frac{1}{S}\sum_{s=1}^S\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})\leq{\eta}^2S^2nk^{\frac52}2^{k+1}+\frac{k^22^{2k+1}}{q\eta S}+\eta nqk$$

from theorem 2:

  • the loss $\mathcal{L}_{f, \mathcal{D}}(h_{\mathbf{u}^{(s)}, \mathbf{W}^{(s)}}^{STCSA})$ becomes small within a finite number of optimization steps $S$, so the layer is able to learn a k-pattern function

$\implies$ the STCSA layer trained with gradient descent can capture the locality of each frame
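The two quantities in theorem 2 are easy to tabulate. The sketch below simply evaluates the width threshold and the three-term bound; the function names are made up, and treating $\log$ as the natural logarithm is an assumption.

```python
import math


def min_width(k, delta):
    """Threshold on q from theorem 2: 2^(k+3) * log(2^k / delta), natural log assumed."""
    return 2 ** (k + 3) * math.log(2 ** k / delta)


def stcsa_loss_bound(eta, S, n, k, q):
    """Right-hand side of theorem 2 for learning rate eta and S gradient steps."""
    return (eta ** 2 * S ** 2 * n * k ** 2.5 * 2 ** (k + 1)
            + k ** 2 * 2 ** (2 * k + 1) / (q * eta * S)
            + eta * n * q * k)


print(min_width(k=3, delta=0.05))
print(stcsa_loss_bound(eta=1e-6, S=1000, n=64, k=3, q=4096))
```

The first and third terms grow with $\eta$ while the second shrinks with $\eta S$, so the bound is only small when the learning rate and the number of steps are balanced against each other.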

spatial-temporal position encoding

VSRT is permutation-invariant, and thus requires precise spatial-temporal position information
3D fixed position encoding: 2 spatial positional components (horizontal and vertical) and 1 temporal positional component
$$PE(pos, i)=\begin{cases} \sin(pos\cdot{\alpha}_k) &\text{if } i=2k \\ \cos(pos\cdot{\alpha}_k) &\text{if } i=2k+1 \end{cases}$$

where ${\alpha}_k=1/1000^{2k/\frac{d}3}$, $k$ is an integer in $[0, \frac{d}6)$, $pos$ is the position in the corresponding dimension, and $d$ is the channel dimension size
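A minimal NumPy sketch of the 3D fixed position encoding: the `d` channels are split into three groups of `d / 3`, one each for the temporal, vertical, and horizontal position, and every group uses the sine/cosine formula above. The group order, the channel-last output layout, and the function names are assumptions; the base of 1000 is taken from the formula as written in these notes.

```python
import numpy as np


def pe_1d(length, dim, base=1000.0):
    """Sinusoidal encoding for one axis: (length, dim), sin on even and cos on odd channels."""
    assert dim % 2 == 0
    pos = np.arange(length)[:, None]          # (length, 1)
    k = np.arange(dim // 2)[None, :]          # (1, dim / 2)
    alpha = 1.0 / base ** (2 * k / dim)       # alpha_k with dim = d / 3
    pe = np.empty((length, dim))
    pe[:, 0::2] = np.sin(pos * alpha)
    pe[:, 1::2] = np.cos(pos * alpha)
    return pe


def pe_3d(T, H, W, d, base=1000.0):
    """Concatenate temporal, vertical and horizontal encodings into (T, H, W, d)."""
    assert d % 6 == 0, "d must split into three even-sized groups"
    g = d // 3
    pe = np.empty((T, H, W, d))
    pe[..., 0:g] = pe_1d(T, g, base)[:, None, None, :]
    pe[..., g:2 * g] = pe_1d(H, g, base)[None, :, None, :]
    pe[..., 2 * g:] = pe_1d(W, g, base)[None, None, :, :]
    return pe


pe = pe_3d(T=2, H=3, W=4, d=12)
print(pe.shape)
```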

bidirectional optical flow-based feed-forward (BOFF)

2106_vsrt_f3
Illustration of the bidirectional optical flow-based feed-forward layer. Given a video sequence, we first bidirectionally estimate the forward and backward optical flows and warp the feature maps with the corresponding optical flows. Then we learn a forward and backward propagation network to produce two sequences of features from concatenated warped features and LR frames. Last, we fuse these two feature sequences into one feature sequence.

given features $X\in\mathbb{R}^{T\times C\times H\times W}$ output by the STCSA layer
step 1: learn bidirectional optical flows between neighboring frames
$$\overleftarrow{O}_t=\begin{cases} spy(V_1, V_1) &\text{if } t=1 \\ spy(V_{t-1}, V_t) &\text{if } t\in(1, T] \end{cases}, \quad \overrightarrow{O}_t=\begin{cases} spy(V_{t+1}, V_t) &\text{if } t\in[1, T) \\ spy(V_T, V_T) &\text{if } t=T \end{cases}$$

where $\overleftarrow{O}, \overrightarrow{O}\in\mathbb{R}^{T\times2\times H\times W}$ are the backward and forward optical flows, and $spy(\cdot, \cdot)$ is a flow estimator (SPyNet) that is pre-trained and further updated during training
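The case analysis in step 1 only decides which pair of frames is passed to the flow estimator. The sketch below lists those pairs with 1-based frame indices; `flow_pairs` is a hypothetical helper, and the flow estimator itself is not modelled.

```python
def flow_pairs(T):
    """Return (backward, forward) lists of 1-based pairs (a, b), each meaning spy(V_a, V_b)."""
    backward = [(1, 1) if t == 1 else (t - 1, t) for t in range(1, T + 1)]
    forward = [(T, T) if t == T else (t + 1, t) for t in range(1, T + 1)]
    return backward, forward


backward, forward = flow_pairs(4)
print(backward)
print(forward)
```

The boundary frames are paired with themselves, so both flow tensors keep `T` entries and line up with the `T` feature maps.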
step 2: obtain bidirectional features for backward and forward propagation
$$\overleftarrow{X}=warp(X, \overleftarrow{O}), \quad \overrightarrow{X}=warp(X, \overrightarrow{O})$$

where $\overleftarrow{X}, \overrightarrow{X}\in\mathbb{R}^{T\times C\times H\times W}$ are the backward and forward features
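The notes do not spell out $warp$; a common choice is backward warping with bilinear sampling, sketched below for a single frame in NumPy. The flow layout (`flow[0]` is the horizontal and `flow[1]` the vertical displacement in pixels) and the clamping at image borders are assumptions.

```python
import numpy as np


def warp(feat, flow):
    """Sample feat (C, H, W) at positions (x + flow[0], y + flow[1]) with bilinear interpolation."""
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[0], 0, w - 1)       # clamp sampling positions to the image
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0                               # fractional offsets used as blend weights
    wy = y - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


feat = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
zero_flow = np.zeros((2, 3, 4))
shift_flow = np.zeros((2, 3, 4))
shift_flow[0] = 1.0                           # every pixel samples its right-hand neighbour
print(warp(feat, shift_flow)[0])
```

Applying this per frame to $X$ with the backward and forward flows gives the two warped feature sequences of step 2.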
step 3: aggregate frames and warped features, and feed them into a 2-layer CNN for backward and forward propagation
$$f_2(X)=LN(f_1(X)+fusion(\overleftarrow{W_1}ReLU(\overleftarrow{W_2}[V, \overleftarrow{X}])+\overrightarrow{W_1}ReLU(\overrightarrow{W_2}[V, \overrightarrow{X}])))$$

where $[\cdot, \cdot]$ is the aggregation operator, and $\overleftarrow{W_1}, \overleftarrow{W_2}, \overrightarrow{W_1}, \overrightarrow{W_2}$ are the weights of the backward and forward networks
extend the 2-layer networks to multi-layer networks
$$f_2(X)=LN(f_1(X)+fusion(R_1(V, \overleftarrow{X})+R_2(V, \overrightarrow{X})))$$

where $R_1, R_2$ are flexible networks

Experiment

training dataset

  • REDS: 240 training, 30 validation, 30 testing (each with 100 consecutive frames)
  • Vimeo-90K: 4278 videos with 89800 high-quality clips (720p or higher)

testing dataset

  • Vid4
  • REDS4: clip 000, 011, 015, 020 from REDS
  • Vimeo-90K-T: from Vimeo-90K

experiment detail

  • degradation: bicubic down-sampling
  • input: randomly cropped into 64x64 patches
  • data augmentation: random horizontal flip, $90^{\circ}$ rotation
  • frame: normalized to 448x256
  • loss: Charbonnier loss
  • optimizer: Adam with $\beta_1=0.9, \beta_2=0.99$, batch size 2 per GPU, 600K iterations
  • learning rate: initialized to 2e-4, cosine decay to 1e-7
  • restart schedule: periods [300K, 300K, 300K, 300K] with restart weights [1, 0.5, 0.5, 0.5] on REDS; periods [200K, 200K, 200K, 200K, 200K, 200K] with restart weights [1, 0.5, 0.5, 0.5, 0.5, 0.5] on Vimeo-90K
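The learning-rate settings above describe cosine annealing with warm restarts. One common way to combine periods and restart weights is sketched here; the exact formula is an assumption, since the notes only list the initial rate, the final rate, the periods, and the weights.

```python
import math


def cosine_restart_lr(it, periods, restart_weights, base_lr=2e-4, eta_min=1e-7):
    """Learning rate at iteration `it` for cosine annealing with weighted restarts (assumed form)."""
    start = 0
    for period, weight in zip(periods, restart_weights):
        if it < start + period:
            progress = (it - start) / period   # position inside the current cycle, in [0, 1)
            cosine = 0.5 * (1 + math.cos(math.pi * progress))
            return eta_min + weight * (base_lr - eta_min) * cosine
        start += period
    return eta_min                             # past the last cycle


reds_periods = [300_000] * 4
reds_weights = [1, 0.5, 0.5, 0.5]
for it in (0, 150_000, 299_999, 300_000):
    print(it, cosine_restart_lr(it, reds_periods, reds_weights))
```

With these settings each restart after the first begins at roughly half the initial learning rate and decays towards 1e-7 again.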

result on REDS

2106_vsrt_t1
Quantitative comparison (PSNR/SSIM) on REDS4 for 4x VSR. The results are tested on RGB channels. Red and blue indicate the best and the second best performance, respectively. “$\dag$” means a method trained on 5 frames for a fair comparison.

2106_vsrt_f4
2106_vsrt_f8
Qualitative comparison on the REDS4 dataset for 4x VSR. Zoom in for the best view.

key findings

  • VSRT achieves the highest PSNR and a comparable SSIM
  • when trained with 5 frames, BasicVSR and IconVSR perform worse than EDVR
    $\implies$ BasicVSR and IconVSR rely heavily on aggregating long-term sequence information
  • the 64-channel VSRT performs better than the 128-channel EDVR-L
  • VSRT is able to recover finer details and sharper edges
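For reference, PSNR, the main metric in these comparisons, can be computed as follows. This is a generic sketch that assumes images scaled to `[0, max_val]`; it does not reproduce any cropping or colour-space conversion the evaluation scripts may apply.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of the same shape."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# A uniform error of 0.1 on a [0, 1] image corresponds to 20 dB.
print(psnr(np.full((3, 8, 8), 0.1), np.zeros((3, 8, 8))))
```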

result on Vimeo-90K

2106_vsrt_t2
Quantitative comparison (PSNR/SSIM) on Vimeo-90K-T for 4x VSR. Red and blue indicate the best and the second best performance, respectively.

2106_vsrt_f5
2106_vsrt_f9
Qualitative comparison on Vimeo-90K-T for 4x VSR. Zoom in for the best view.

key findings

  • the highest PSNR and SSIM
  • the generalization ability of VSRT on Vid4 is better than EDVR but worse than BasicVSR and IconVSR
    $\impliedby$ BasicVSR and IconVSR are tested on all frames, while VSRT and EDVR are tested on 7 frames
    $\impliedby$ there is a distribution bias between Vimeo-90K-T and Vid4
  • VSRT is able to generate sharp and realistic HR frames

result on Vid4

2106_vsrt_t3
Quantitative comparison (PSNR/SSIM) on Vid4 for 4x VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels.

2106_vsrt_t7
Quantitative comparison (PSNR/SSIM) on Vid4 for 4x VSR. Red and blue indicate the best and the second best performance, respectively. “Y” denotes the evaluation on Y channels. “$\dag$” means a method trained and tested on 7 frames for a fair comparison.

2106_vsrt_f10
Qualitative comparison on Vid4 for 4x VSR. Zoom in for the best view.

ablation study

optical flow

spatial-temporal convolutional self-attention

bidirectional optical flow-based feed-forward

number of frames
