[Artificial Intelligence][2106] Video Super-Resolution Transformer
paper
Content
Abstract
components in traditional Transformer design and their limitations
main contributions of VSR-Transformer
Preliminary
notation
definition 1 (function distance): given a function $f: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$ and a target function $f^{\ast}: \mathbb{R}^{d\times n}\rightarrow\mathbb{R}^{d\times n}$, we define a distance between these two functions; for ground truth $Y=f^{\ast}(\mathcal{D})$, the loss is denoted by $\mathcal{L}_\mathcal{D}(f)$
definition 2 (k-pattern function): a function $f: \mathcal{X}\rightarrow\mathcal{Y}$ is a k-pattern if, for some $g: \{\pm 1\}^k\rightarrow\mathcal{Y}$ and index $j^{\ast}$, $f(\mathbf{x})=g(x_{j^{\ast}, \ldots, j^{\ast}+k})$
we say that a function $h_{\mathbf{u}, \mathbf{W}}(\mathbf{x})=\sum_{j}\langle \mathbf{u}^{(j)}, \mathbf{v}_{\mathbf{W}}^{(j)}\rangle$ can learn a k-pattern function from a feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ of data $\mathbf{x}$ with a layer $\mathbf{u}^{(j)}\in\mathbb{R}^q$ if, for $\epsilon>0$, the required condition holds, where the feature $\mathbf{v}_{\mathbf{W}}^{(j)}$ is learned by a convolutional attention network or a fully connected attention network parameterized by $\mathbf{W}$
$\implies$ any function that can capture the locality of data should be able to learn a k-pattern function
video super-resolution (VSR)
given an LR video sequence $\{V_1, \ldots, V_T\}\sim\mathcal{D}$, where $V_t\in\mathbb{R}^{3\times H\times W}$ is the t-th LR frame and $\mathcal{D}$ is a distribution of videos
given ground-truth HR frames $\mathcal{Y}=\{Y_1, \ldots, Y_T\}$, where $Y_t$ is the t-th HR frame
$d(\cdot, \cdot)$ is a distance metric, such as L1 loss, L2 loss, or Charbonnier loss
for VSR tasks, a sequence method can be used, such as RNN, LSTM, or Transformer
transformer block
given an input feature $X\in\mathbb{R}^{d\times n}$ ($d$-dimensional embeddings of $n$ tokens)
$W_o^i\in\mathbb{R}^{d\times m}$ is a linear layer; $W_v^i, W_k^i, W_q^i\in\mathbb{R}^{m\times d}$ are linear layers mapping the feature to value, key, and query; $h$ is the number of heads; $m$ is the head size
$W_1\in\mathbb{R}^{r\times d}, W_2\in\mathbb{R}^{d\times r}$ are linear layers; $b_1\in\mathbb{R}^r, b_2\in\mathbb{R}^d$ are biases; $r$ is the hidden layer size of the feed-forward layer
Method
model architecture
feature extractor: captures features from the LR input
loss function: Charbonnier loss
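The Charbonnier loss named above is a smooth variant of the L1 loss. The notes do not give its exact form, so the following is a minimal sketch under assumptions: the common per-element form $\sqrt{(p-t)^2+\epsilon^2}$, a mean reduction, and a default `eps=1e-3` are all illustrative choices, and `charbonnier_loss` is a hypothetical helper name.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((p - t)^2 + eps^2) over paired values.

    Assumptions (not taken from the notes): per-element form,
    mean reduction, and the default eps value.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)
```

For large errors the value approaches the absolute difference, while near zero it stays differentiable, which is the usual reason for preferring it over a plain L1 loss.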
$T$: number of frames, $C$: number of channels, $H$: image height, $W$: image width
spatial-temporal convolutional self-attention (STCSA)
drawbacks of FCSA
Q: can the FCSA layer learn k-patterns with gradient descent?
from theorem 1: $\implies$ the FCSA layer cannot use the spatial information of each frame, since local information is not encoded in the embeddings of all tokens
detailed structure in STCSA
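The core of the STCSA layer in this subsection (unfold each frame into patches, attend across the patches of all frames, fold back) might be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes non-overlapping patches (consistent with $n\_patches=\frac{HW}{H_pW_p}$), a single head, and scaling by the square root of the patch dimension. It uses the identity in place of the convolutions $W_q, W_k, W_v, W_o$ and omits the skip connection and normalization. All function names are hypothetical, and plain nested lists are used to stay dependency-free.

```python
import math


def unfold(frames, hp, wp):
    """kappa_1: T x C x H x W nested lists -> tokens of length C*hp*wp."""
    tokens = []
    for frame in frames:
        h, w = len(frame[0]), len(frame[0][0])
        for top in range(0, h, hp):
            for left in range(0, w, wp):
                tokens.append([ch[top + i][left + j]
                               for ch in frame
                               for i in range(hp) for j in range(wp)])
    return tokens


def fold(tokens, t, c, h, w, hp, wp):
    """kappa_2: inverse of unfold for non-overlapping patches."""
    frames = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(t)]
    it = iter(tokens)
    for f in range(t):
        for top in range(0, h, hp):
            for left in range(0, w, wp):
                tok, k = next(it), 0
                for ch in range(c):
                    for i in range(hp):
                        for j in range(wp):
                            frames[f][ch][top + i][left + j] = tok[k]
                            k += 1
    return frames


def attend(q, k, v):
    """Single-head softmax attention over the tokens of all frames."""
    scale = math.sqrt(len(q[0]))  # assumed scaling, not stated in the notes
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(wt * vj[d] for wt, vj in zip(weights, v)) / z
                    for d in range(len(v[0]))])
    return out


def stcsa_core(x, hp, wp):
    """Unfold, attend, fold; identity stands in for the four convolutions."""
    t, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    tokens = unfold(x, hp, wp)
    return fold(attend(tokens, tokens, tokens), t, c, h, w, hp, wp)
```

Because every token is a whole patch from some frame, the attention weights mix spatial neighbourhoods across time rather than isolated pixels, which is the property the locality argument relies on.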
given feature maps of input video frames $X\in\mathbb{R}^{T\times C\times H\times W}$
$W_q, W_k, W_v$ are 3 independent convolutions with kernel_size=3, stride=1, padding=1
$n\_patches=\frac{HW}{H_pW_p}$ is the number of patches in each frame, $dim=CH_pW_p$ is the dimension of each patch, and $n\_heads$ is the number of heads
$d=\frac{CH_pW_p}{n\_heads}$ is the hidden dimension
step 5: obtain the final features, and produce the output with a skip connection and a normalization: $f_1(X)=LN(X+F)\in\mathbb{R}^{T\times C\times H\times W}$
$W_o$ is a convolution with kernel_size=3, stride=1, padding=1
step 2 to step 4 are inspired by COLA-Net
$\kappa_1(\cdot), \kappa_2(\cdot)$ are the unfold and fold operations; $h$ is the number of heads, which is set to $h=1$ for good performance
why STCSA is suitable
Q: how does the STCSA layer learn k-patterns with gradient descent?
from theorem 2:
$\implies$ the STCSA layer with gradient descent can capture the locality of each frame
spatial-temporal position encoding
VSRT is permutation-invariant, thus requiring precise spatial-temporal position information
$\alpha_k=1/1000^{2k/\frac{d}{3}}$, $k$ is an integer in $[0, \frac{d}{6})$, $pos$ is the position in the corresponding dimension, and $d$ is the channel dimension size
bidirectional optical flow-based feed-forward (BOFF)
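The BOFF layer works with backward and forward optical flows and the backward and forward features derived from them. The sketch below illustrates one plausible way to obtain those features. It assumes that each frame's backward feature is its next frame's feature warped by the backward flow, and its forward feature is the previous frame's feature warped by the forward flow. The notes do not spell out the warping step, so this is an assumption, as are nearest-neighbour sampling, single-channel maps, and the handling of the first and last frames. The flows are taken as inputs rather than estimated, and `warp` and `bidirectional_features` are hypothetical names.

```python
def warp(feature, flow):
    """Nearest-neighbour warp of an H x W map.

    flow is [dx, dy], each an H x W map of displacements, mirroring
    the 2 x H x W layout of one frame's optical flow.
    """
    dx, dy = flow
    h, w = len(feature), len(feature[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx, sy = x + round(dx[y][x]), y + round(dy[y][x])
            if 0 <= sx < w and 0 <= sy < h:  # samples outside the frame stay 0
                out[y][x] = feature[sy][sx]
    return out


def bidirectional_features(x, flow_bwd, flow_fwd):
    """Return (backward, forward) feature lists for T single-channel maps.

    Boundary frames reuse their own features (an assumption).
    """
    t = len(x)
    backward = [warp(x[min(i + 1, t - 1)], flow_bwd[i]) for i in range(t)]
    forward = [warp(x[max(i - 1, 0)], flow_fwd[i]) for i in range(t)]
    return backward, forward
```

The aggregation, the backward and forward networks, and the fusion networks would then operate on these aligned features; they are left out here because the notes give only their names.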
given features $X\in\mathbb{R}^{T\times C\times H\times W}$ output by the STCSA layer
$\overleftarrow{O}, \overrightarrow{O}\in\mathbb{R}^{T\times 2\times H\times W}$ are the backward and forward optical flows; $spy(\cdot, \cdot)$ is an optical flow estimator such as SPyNet, which is pre-trained and updated during training
$\overleftarrow{X}, \overrightarrow{X}\in\mathbb{R}^{T\times C\times H\times W}$ are the backward and forward features
$[\cdot, \cdot]$ is the aggregation operator; $\overleftarrow{W_1}, \overleftarrow{W_2}, \overrightarrow{W_1}, \overrightarrow{W_2}$ are the weights of the backward and forward networks
$R_1, R_2$ are flexible networks
Experiment
training dataset
testing dataset
experiment detail
result on REDS
key findings
result on Vimeo-90K
key findings
result on Vid4
ablation study
optical flow
spatial-temporal convolutional self-attention
bidirectional optical flow-based feed-forward
number of frames