
[Artificial Intelligence] Transformer Series: Pyramid Vision Transformer (ICCV 2021)

1. Motivation

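ViT outputs a single-scale, low-resolution feature map, which makes it hard to use for pixel-level dense prediction tasks such as object detection and segmentation. At commonly used input sizes (e.g., a shorter edge of 800 pixels on COCO), ViT's computation and memory costs are also very high.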

2. Contribution

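The paper proposes PVT, which can replace CNN backbones in many downstream tasks, covering both image-level and pixel-level prediction. The authors present it as the first pure Transformer backbone designed for pixel-level dense prediction. 1) PVT can take very small image patches (4 × 4) as input in order to learn high-resolution features. 2) To reduce computation, it introduces a progressively shrinking pyramid, i.e., the sequence length decreases as the network gets deeper. 3) It adopts spatial-reduction attention (SRA) to further reduce resource consumption.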

3. Methods

3.1 Overall Architecture

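Patch embedding: the image is divided into patches of 4 × 4 pixels, each patch is flattened into a vector, and a linear projection turns it into a patch embedding. (Question: the paper claims to have no convolution layers, but isn't this exactly a convolution with a 4 × 4 kernel and stride 4?)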
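On the question raised above, a small numerical check can help. The sketch below uses NumPy to compare flatten-and-project patch embedding with an explicitly written convolution whose kernel size and stride are both 4. The function names, the 32 × 32 input, and the embedding dimension of 64 are illustrative assumptions, not values taken from the paper's code.

```python
import numpy as np

def patch_embed_linear(img, weight, bias, patch=4):
    """Split an (H, W, C) image into patch x patch tiles, flatten each, and project it.

    Assumes H and W are divisible by `patch`.
    """
    H, W, C = img.shape
    h, w = H // patch, W // patch
    # (h, patch, w, patch, C) -> (h, w, patch, patch, C) -> (h*w, patch*patch*C)
    tiles = img.reshape(h, patch, w, patch, C).transpose(0, 2, 1, 3, 4)
    tokens = tiles.reshape(h * w, patch * patch * C)
    return tokens @ weight + bias  # (h*w, embed_dim)

def patch_embed_conv(img, weight, bias, patch=4):
    """The same projection written as a convolution with kernel = stride = patch."""
    H, W, C = img.shape
    embed_dim = weight.shape[1]
    # Reinterpret the projection matrix as a (patch, patch, C, embed_dim) kernel.
    kernel = weight.reshape(patch, patch, C, embed_dim)
    out = np.empty((H // patch, W // patch, embed_dim))
    for i in range(H // patch):
        for j in range(W // patch):
            window = img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch, :]
            out[i, j] = np.tensordot(window, kernel, axes=([0, 1, 2], [0, 1, 2])) + bias
    return out.reshape(-1, embed_dim)

# Illustrative sizes (assumptions): 32 x 32 RGB input, 64-dimensional embedding.
rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
weight = rng.standard_normal((4 * 4 * 3, 64))
bias = rng.standard_normal(64)

tokens = patch_embed_linear(img, weight, bias)
print(tokens.shape)                                               # (64, 64)
print(np.allclose(tokens, patch_embed_conv(img, weight, bias)))   # True
```

Under these assumptions the two paths produce the same tokens, which supports reading the linear patch embedding as a non-overlapping strided convolution.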

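After the position embedding is added, the tokens are fed into the Transformer encoder blocks. There are four blocks (stages) in total, which progressively downsample the feature map by factors of 4, 8, 16, and 32 relative to the input image.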

3.2 Feature Pyramid for Transformer

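CNNs obtain multi-scale feature maps through strided convolutions. PVT instead takes the H_{i-1} × W_{i-1} feature map entering stage i, divides it into patches of size P_i × P_i, flattens each patch into a vector, and projects it to a C_i-dimensional embedding, which yields a feature map of size (H_{i-1}/P_i) × (W_{i-1}/P_i) × C_i. (In effect, this is just a strided convolution.)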
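The shape rule above can be traced stage by stage with a few lines of Python. This is only a sketch: the patch sizes (4, 2, 2, 2) are chosen to match the 4/8/16/32 downsampling factors mentioned earlier, and the channel widths are placeholder assumptions rather than values stated in these notes.

```python
def pyramid_shapes(height, width, patch_sizes=(4, 2, 2, 2), channels=(64, 128, 320, 512)):
    """Return the (H_i, W_i, C_i) feature-map shape after each stage.

    `channels` holds illustrative widths (an assumption for this example).
    """
    shapes = []
    h, w = height, width
    for p, c in zip(patch_sizes, channels):
        h, w = h // p, w // p  # H_i = H_{i-1} / P_i, W_i = W_{i-1} / P_i
        shapes.append((h, w, c))
    return shapes

for stage, (h, w, c) in enumerate(pyramid_shapes(224, 224), start=1):
    print(f"stage {stage}: {h} x {w} x {c}, stride {224 // h}, {h * w} tokens")
# stage 1: 56 x 56 x 64, stride 4, 3136 tokens
# stage 2: 28 x 28 x 128, stride 8, 784 tokens
# stage 3: 14 x 14 x 320, stride 16, 196 tokens
# stage 4: 7 x 7 x 512, stride 32, 49 tokens
```

The token count drops by a factor of P_i squared at each stage, which is what keeps the deeper stages cheap.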

3.3 Transformer Encoder

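To reduce computation, spatial-reduction attention (SRA) replaces multi-head attention (MHA): before attention is computed, the key and value first go through a spatial reduction (SR), which is essentially a convolution whose kernel size and stride both equal R_i, followed by a layer normalization.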

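The SRA process is formulated as:

SRA(Q, K, V) = Concat(head_0, ..., head_{N_i}) W^O

head_j = Attention(Q W_j^Q, SR(K) W_j^K, SR(V) W_j^V)

SR(x) = Norm(Reshape(x, R_i) W^S)

Here Reshape(x, R_i) turns the (H_i W_i) × C_i input into a (H_i W_i / R_i^2) × (R_i^2 C_i) sequence, W^S projects it back to C_i dimensions, and Norm is layer normalization.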
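The spatial reduction described above can be sketched in NumPy to show where the savings come from. This is a simplified illustration, not the authors' implementation: it uses a single head, a layer normalization without learnable scale and shift, no output projection, and hypothetical sizes (an 8 × 8 token grid, 32 channels, reduction ratio 4).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def spatial_reduction(x, h, w, r, w_s):
    """Merge each r x r neighborhood of tokens, project back to C channels, normalize."""
    c = x.shape[1]
    grid = x.reshape(h // r, r, w // r, r, c).transpose(0, 2, 1, 3, 4)
    merged = grid.reshape((h // r) * (w // r), r * r * c)  # (h*w / r^2, r^2 * C)
    return layer_norm(merged @ w_s)                         # (h*w / r^2, C)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sra_single_head(x, h, w, r, w_q, w_k, w_v, w_s):
    """Single-head SRA: queries keep full length, keys and values are reduced."""
    reduced = spatial_reduction(x, h, w, r, w_s)
    q = x @ w_q          # (h*w, d)
    k = reduced @ w_k    # (h*w / r^2, d)
    v = reduced @ w_v    # (h*w / r^2, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (h*w, h*w / r^2)
    return attn @ v, attn.shape

# Hypothetical sizes for illustration only.
h = w = 8
c, r = 32, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((h * w, c))
w_s = rng.standard_normal((r * r * c, c)) * 0.05
w_q, w_k, w_v = (rng.standard_normal((c, c)) * 0.1 for _ in range(3))

out, attn_shape = sra_single_head(x, h, w, r, w_q, w_k, w_v, w_s)
print(out.shape)    # (64, 32)
print(attn_shape)   # (64, 4)
```

With these sizes the attention map is 64 × 4 instead of 64 × 64, i.e. the key/value length, and with it the cost of the attention matrix, shrinks by a factor of R_i squared.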

3.4 Model Details

4. Experiments

4.1 Image Classification

Dataset: ImageNet-1K

Optimizer: AdamW

Epochs: 300

Batch size: 128

Learning rate: 1 × 10^-3, cosine learning rate decay

Data augmentation: follows DeiT, including random cropping, random flipping, label smoothing, Mixup, CutMix, and random erasing (224 × 224 center crop on the validation set)

Hardware: 8 V100 GPUs
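For reference, a cosine learning-rate decay over the classification schedule above could look like the sketch below. The function name, the per-epoch granularity, the final value `min_lr=0.0`, and the absence of warmup are assumptions of this example; the notes only state the initial rate of 1 × 10^-3, the 300 epochs, and the cosine schedule.

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=1e-3, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at total_epochs (no warmup)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 75, 150, 225, 300):
    print(epoch, f"{cosine_lr(epoch):.6f}")
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at the last epoch.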

4.2 Object Detection

Dataset: COCO2017

Setting: pretraining on ImageNet-1K and fine-tuning on COCO

Framework: RetinaNet and Mask R-CNN

Optimizer: AdamW

Epochs: 1× schedule (12 epochs) and 3× schedule (36 epochs)

Batch size: 16

Learning rate: 1 × 10^-4

Hardware: 8 V100 GPUs

4.3 Semantic Segmentation

Dataset: ADE20K

Optimizer: AdamW

Setting: pretraining on ImageNet-1K

Framework: Semantic FPN

Iterations: 80K

Batch size: 16

Learning rate: 1 × 10^-4, polynomial decay schedule with a power of 0.9

Data augmentation: randomly resize and crop the image to 512 × 512 for training, and rescale to a shorter side of 512 pixels during testing

Hardware: 4 V100 GPUs
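The polynomial decay used for segmentation can be written in one line. The sketch below assumes the common form base_lr × (1 − iter / total)^power with a final rate of zero; the function name and the zero floor are assumptions, while the base rate, power, and 80K iterations come from the settings above.

```python
def poly_lr(iteration, total_iters=80_000, base_lr=1e-4, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / total_iters) ** power."""
    return base_lr * (1.0 - iteration / total_iters) ** power

for it in (0, 20_000, 40_000, 60_000, 80_000):
    print(it, f"{poly_lr(it):.3e}")
```

With a power of 0.9 the curve stays close to linear decay, falling slightly more slowly in the early iterations.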

4.4 Pure Transformer Detection & Segmentation

4.4.1 PVT+DETR

Dataset: COCO2017

Epochs: 50

Learning rate: 1 × 10^-4, divided by 10 at the 33rd epoch

Data augmentation: random flipping and multi-scale training

Other: same as in the object detection section
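The step schedule for PVT+DETR is simple enough to state as code. The sketch below assumes epochs are counted from 0 and that the drop takes effect from epoch index 33 onward; the notes only say the rate is divided by 10 at the 33rd epoch, so the exact indexing is an assumption, as is the function name.

```python
def step_lr(epoch, base_lr=1e-4, drop_epoch=33, factor=0.1):
    """Return base_lr before drop_epoch and base_lr * factor from drop_epoch onward."""
    return base_lr * factor if epoch >= drop_epoch else base_lr

for epoch in (0, 32, 33, 49):
    print(epoch, f"{step_lr(epoch):.1e}")
```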

4.4.2 PVT+Trans2Seg

Dataset: ADE20K

Iterations: 40K

4.5 Ablation Study

Dataset: ImageNet and COCO

Setting: settings on ImageNet are the same as in the image classification section. For COCO, all models are trained with a 1× schedule (i.e., 12 epochs) and without multi-scale training; other settings follow the object detection section.