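Learning objectives
- Theory: the characteristics of the C3D network
- Hands-on: use C3D to extract sliding-window features from the target datasets (Charades-STA, ActivityNet Captions, TVR), with windows of 64, 128, 256, and 512 frames and 80% overlap between adjacent windows
Answers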
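- Feature 1): it uses 3D convolutions, which take temporal information into account during the convolution itself and can therefore model motion. This works because the frame axis is treated as an additional depth dimension that the kernel slides over: with a kernel of size $3 \times 3 \times 3$, the temporal receptive field of a single layer is $3$, so each convolution relates 3 adjacent frames. Feature 2): it uses homogeneous $3 \times 3 \times 3$ kernels in every convolutional layer, a kernel size that was found empirically to be the most effective.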
- See the project implementation section.
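To make the "3 frames per convolution" point concrete, here is a minimal sketch of how the temporal receptive field grows as layers are stacked. The layer list follows the temporal kernel sizes and strides in the architecture table of this document (first three stages only); the function name `temporal_receptive_fields` is made up for illustration, and zero padding at clip boundaries is ignored.

```python
# (layer name, temporal kernel size, temporal stride), first three C3D stages.
# pool1 does not pool over time (kernel 1); later pools halve the time axis.
LAYERS = [
    ("conv1", 3, 1), ("pool1", 1, 1),
    ("conv2", 3, 1), ("pool2", 2, 2),
    ("conv3a", 3, 1), ("conv3b", 3, 1), ("pool3", 2, 2),
]


def temporal_receptive_fields(layers):
    """Return [(layer name, receptive field in input frames), ...]."""
    rf = 1    # one output position initially covers one input frame
    jump = 1  # distance in input frames between neighbouring outputs
    result = []
    for name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra kernel tap adds `jump` frames
        jump *= stride             # striding spreads later taps further apart
        result.append((name, rf))
    return result


for name, rf in temporal_receptive_fields(LAYERS):
    print(f"{name}: covers {rf} input frame(s)")
```

A single convolution covers 3 frames, and by pool3 one activation already depends on the full 16-frame clip.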
Network architecture
Layer | Kernel | Padding (conv) / stride (pool) | Input | Output | Activation |
---|---|---|---|---|---|
conv1 | $64 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $3, 16, 112, 112$ | $64, 16, 112, 112$ | ReLU |
pool1 | $1 \times 2 \times 2$ | $1 \times 2 \times 2$ | $64, 16, 112, 112$ | $64, 16, 56, 56$ | |
conv2 | $128 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $64, 16, 56, 56$ | $128, 16, 56, 56$ | ReLU |
pool2 | $2 \times 2 \times 2$ | $2 \times 2 \times 2$ | $128, 16, 56, 56$ | $128, 8, 28, 28$ | |
conv3a | $256 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $128, 8, 28, 28$ | $256, 8, 28, 28$ | ReLU |
conv3b | $256 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $256, 8, 28, 28$ | $256, 8, 28, 28$ | ReLU |
pool3 | $2 \times 2 \times 2$ | $2 \times 2 \times 2$ | $256, 8, 28, 28$ | $256, 4, 14, 14$ | |
conv4a | $512 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $256, 4, 14, 14$ | $512, 4, 14, 14$ | ReLU |
conv4b | $512 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $512, 4, 14, 14$ | $512, 4, 14, 14$ | ReLU |
pool4 | $2 \times 2 \times 2$ | $2 \times 2 \times 2$ | $512, 4, 14, 14$ | $512, 2, 7, 7$ | |
conv5a | $512 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $512, 2, 7, 7$ | $512, 2, 7, 7$ | ReLU |
conv5b | $512 \times 3 \times 3 \times 3$ | $1 \times 1 \times 1$ | $512, 2, 7, 7$ | $512, 2, 7, 7$ | ReLU |
pool5 | $2 \times 2 \times 2$ | $2 \times 2 \times 2$, $0 \times 1 \times 1$ (padding) | $512, 2, 7, 7$ | $512, 1, 4, 4$ | |
view | | | $512, 1, 4, 4$ | $1, 8192$ | |
fc6 | | | $1, 8192$ | $1, 4096$ | ReLU + dropout $0.5$ |
fc7 | | | $1, 4096$ | $1, 4096$ | ReLU + dropout $0.5$ |
fc8 | | | $1, 4096$ | $1, 487$ | softmax |
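
The shapes in the table can be checked with a few lines of arithmetic. The sketch below assumes the usual output-size rule for convolution and pooling, floor((n + 2p - k) / s) + 1, and that every pooling layer uses a stride equal to its kernel, as the table indicates. The helper names (`out_size`, `conv`, `pool`) are made up for illustration and are not part of any library.

```python
def out_size(n, kernel, stride, padding=0):
    """Output length along one axis of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def apply_layer(shape, kernel, stride, padding, out_channels=None):
    """shape is (channels, frames, height, width); the rest are 3-tuples."""
    channels, *dims = shape
    new_dims = [out_size(n, k, s, p)
                for n, k, s, p in zip(dims, kernel, stride, padding)]
    return (out_channels or channels, *new_dims)


def conv(shape, out_channels):
    # 3x3x3 kernel, stride 1, padding 1: frames, height and width are kept.
    return apply_layer(shape, (3, 3, 3), (1, 1, 1), (1, 1, 1), out_channels)


def pool(shape, kernel, padding=(0, 0, 0)):
    # Stride equals the kernel size, so each pooled axis is roughly halved.
    return apply_layer(shape, kernel, kernel, padding)


shape = (3, 16, 112, 112)                                  # network input
shape = pool(conv(shape, 64), (1, 2, 2))                   # conv1, pool1
shape = pool(conv(shape, 128), (2, 2, 2))                  # conv2, pool2
shape = pool(conv(conv(shape, 256), 256), (2, 2, 2))       # conv3a/b, pool3
shape = pool(conv(conv(shape, 512), 512), (2, 2, 2))       # conv4a/b, pool4
shape = pool(conv(conv(shape, 512), 512), (2, 2, 2),
             padding=(0, 1, 1))                            # conv5a/b, pool5
flat = shape[0] * shape[1] * shape[2] * shape[3]           # size after view
print(shape, flat)
```

The spatial padding on pool5 is what turns $7 \times 7$ into $4 \times 4$ instead of $3 \times 3$, which is why the flattened vector fed to fc6 has 8192 entries.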
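Project preparation
- Download an existing implementation: c3d-pytorch
- Download the pretrained weights (also available in the project)
- Modify the code in `predict.py`
Annotated code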
""" How to use C3D network. """
import numpy as np
import torch
from torch.autograd import Variable
from os.path import join
from glob import glob
import skimage.io as io
from skimage.transform import resize
from C3D_model import C3D


def get_sport_clip(clip_name, verbose=True):
    """
    Loads a clip to be fed to C3D for classification.
    TODO: should I remove mean here?

    Parameters
    ----------
    clip_name: str
        the name of the clip (subfolder in 'data'); here it is 'roger'.
    verbose: bool
        if True, shows the unrolled clip (default is True).

    Returns
    -------
    Tensor
        a pytorch batch (n, ch, fr, h, w), i.e.
        (batch_size, channels, frames, height, width).
    """
    # Collect the frame images of the clip in temporal (filename) order.
    clip = sorted(glob(join('data', clip_name, '*.png')))
    # Resize every frame to 112 x 200 (height x width), keeping the 0-255 range.
    clip = np.array([resize(io.imread(frame), output_shape=(112, 200), preserve_range=True) for frame in clip])
    # Center-crop the width from 200 to 112: (frames, 112, 112, 3).
    clip = clip[:, :, 44:44+112, :]

    if verbose:
        # Lay the 16 frames side by side in a single image for a visual check.
        clip_img = np.reshape(clip.transpose(1, 0, 2, 3), (112, 16 * 112, 3))
        io.imshow(clip_img.astype(np.uint8))
        io.show()

    # (frames, h, w, ch) -> (ch, frames, h, w)
    clip = clip.transpose(3, 0, 1, 2)
    # Add the batch dimension: (1, ch, frames, h, w).
    clip = np.expand_dims(clip, axis=0)
    clip = np.float32(clip)

    return torch.from_numpy(clip)


def read_labels_from_file(filepath):
    """
    Reads Sport1M labels from file

    Parameters
    ----------
    filepath: str
        the file.

    Returns
    -------
    list
        list of sport names.
    """
    with open(filepath, 'r') as f:
        labels = [line.strip() for line in f.readlines()]
    return labels


def main():
    """
    Main function.
    """
    # Load the example clip. Variable is a no-op wrapper in PyTorch >= 0.4
    # and is kept only to match the original script.
    X = get_sport_clip('roger')
    X = Variable(X)
    X = X.cuda()  # requires a CUDA-capable GPU

    # Build the network and load the pretrained weights.
    net = C3D()
    net.load_state_dict(torch.load('c3d.pickle'))
    net.cuda()
    net.eval()  # inference mode: disables dropout

    # Forward pass, then move the class scores back to the CPU.
    prediction = net(X)
    prediction = prediction.data.cpu().numpy()

    labels = read_labels_from_file('labels.txt')

    # Print the five highest-scoring classes, best first.
    top_inds = prediction[0].argsort()[::-1][:5]
    print('\nTop 5:')
    for i in top_inds:
        print('{:.5f} {}'.format(prediction[0][i], labels[i]))


if __name__ == '__main__':
    main()
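The slice `44:44+112` in `get_sport_clip` is a center crop: frames are resized to 112 x 200, and the middle 112 columns are kept. A small sketch of that arithmetic, with a hypothetical helper name `center_crop_bounds` that is not part of the project:

```python
def center_crop_bounds(size, crop):
    """Return (start, stop) indices selecting the middle `crop` of `size`."""
    if crop > size:
        raise ValueError("crop must not exceed size")
    start = (size - crop) // 2
    return start, start + crop


start, stop = center_crop_bounds(200, 112)
print(start, stop)  # 44 156
```

Computing the offset this way avoids hard-coding 44 if the resize width is ever changed.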
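Project implementation
- Given the video name, the window sizes $[64, 128, 256, 512]$, and the strides $[13, 26, 51, 102]$ (about 20% of each window, which gives the 80% overlap), prepare the input frames of each video
- Modify the network output so that only the fc6 output is used as the video feature
- Save the output as `.npy` files that store the features and other attributes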
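The window preparation step can be sketched as follows. This is a minimal illustration, not the author's script: the function names `stride_for` and `sliding_windows` are made up, the stride is assumed to be the window size times 0.2 rounded to the nearest frame, and videos shorter than a window are assumed to yield no window at all (trailing frames that do not fill a window are dropped).

```python
WINDOW_SIZES = [64, 128, 256, 512]
OVERLAP = 0.8  # fraction of frames shared by two consecutive windows


def stride_for(window, overlap=OVERLAP):
    """Stride in frames so that consecutive windows overlap by `overlap`."""
    return max(1, round(window * (1 - overlap)))


def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame indices, end exclusive, of each full window."""
    return [(start, start + window)
            for start in range(0, num_frames - window + 1, stride)]


if __name__ == "__main__":
    num_frames = 300  # hypothetical video length, for illustration only
    for window in WINDOW_SIZES:
        stride = stride_for(window)
        windows = sliding_windows(num_frames, window, stride)
        print(f"window={window} stride={stride} -> {len(windows)} windows")
```

With these assumptions, `stride_for` reproduces the strides listed above, and each `(start, end)` pair identifies the frames to load for one window.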