
[Artificial Intelligence] Visual Attention Network

paper
code

Abstract

demerits of self-attention (SA)

  • treats images as 1D sequences, neglecting their 2D structure
  • quadratic complexity is expensive for high-resolution images
  • achieves spatial adaptability but ignores channel adaptability

demerits of multi-layer perceptron (MLP)

  • sensitive to input size and can only process fixed-size images
  • considers global information but ignores local structure

contributions

  • propose large kernel attention (LKA)
    local structural information, long-range dependence, adaptability in the channel dimension
  • present visual attention network (VAN) as a backbone based on LKA
    SOTA performance with fewer parameters and FLOPs

2202_van_f1
Results of different models on ImageNet-1K validation set. Left: Comparing the performance of recent models DeiT, PVT, Swin Transformer, ConvNeXt, Focal Transformer and our VAN. All these models have a similar amount of parameters. Right: Comparing the performance of recent models and our VAN while keeping the computational cost similar.

Method

large kernel attention (LKA)

a large-kernel convolution brings a huge amount of computational overhead and parameters
solution: decompose the large-kernel convolution

2202_van_f2
Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv), and a pointwise convolution ($1\times1$ Conv). The colored grids represent the location of the convolution kernel and the yellow grid means the center point. The diagram shows that a $13\times13$ convolution is decomposed into a $5\times5$ depth-wise convolution, a $5\times5$ depth-wise dilation convolution with dilation rate 3, and a pointwise convolution. Note: zero paddings are omitted in the above figure.

a $K\times K$ large-kernel convolution is divided into 3 components

  • a spatial local convolution: a $(2d-1)\times(2d-1)$ depth-wise conv $\implies$ local contextual information
  • a spatial long-range convolution: a $\lceil\frac{K}{d}\rceil\times\lceil\frac{K}{d}\rceil$ depth-wise dilated conv $\implies$ large receptive field
  • a channel convolution: a $1\times1$ conv $\implies$ adaptability in the channel dimension

where $K$ is the kernel size and $d$ is the dilation rate
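As a quick sanity check on the decomposition sizes, a small helper (hypothetical, not from the authors' code) reproduces the figure's example and the paper's default setting:

```python
import math

def decompose_kernel(K, d):
    """Return (local_k, dilated_k, dilation) for decomposing a KxK conv
    into a (2d-1)x(2d-1) depth-wise conv plus a ceil(K/d)xceil(K/d)
    depth-wise dilated conv with dilation d, followed by a 1x1 conv."""
    local_k = 2 * d - 1
    dilated_k = math.ceil(K / d)
    return local_k, dilated_k, d

# figure example: 13x13 -> 5x5 DW-Conv + 5x5 DW-D-Conv (dilation 3)
print(decompose_kernel(13, 3))  # (5, 5, 3)
# default setting: 21x21 -> 5x5 DW-Conv + 7x7 DW-D-Conv (dilation 3)
print(decompose_kernel(21, 3))  # (5, 7, 3)
```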

2202_van_t1
Desirable properties belonging to convolution, self-attention and LKA.

write LKA module as
$$\begin{aligned}\mathrm{Attention}&=\mathrm{Conv}_{1\times1}(\mathrm{DW\text{-}D\text{-}Conv}(\mathrm{DW\text{-}Conv}(F)))\\\mathrm{Output}&=\mathrm{Attention}\otimes F\end{aligned}$$

where $F\in\mathbb{R}^{C\times H\times W}$ is the input feature, $\mathrm{Attention}\in\mathbb{R}^{C\times H\times W}$ is the attention map, and $\otimes$ denotes element-wise multiplication
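The LKA forward pass can be sketched in pure Python for a single channel (a simplification: in the full module the two depth-wise convs act per channel and the $1\times1$ conv mixes the $C$ channels). All names and weights below are illustrative, not from the authors' code:

```python
def conv2d_same(x, kernel, dilation=1):
    """2D convolution with zero padding ('same' output size) and optional
    dilation; x and kernel are lists of lists (single channel)."""
    H, W = len(x), len(x[0])
    k = len(kernel)
    pad = ((k - 1) * dilation) // 2  # half the dilated kernel's extent
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for u in range(k):
                for v in range(k):
                    ii = i + u * dilation - pad
                    jj = j + v * dilation - pad
                    if 0 <= ii < H and 0 <= jj < W:  # zero padding
                        s += kernel[u][v] * x[ii][jj]
            out[i][j] = s
    return out

def lka(F, w_local, w_dilated, dilation, w_point):
    """Attention = Conv1x1(DW-D-Conv(DW-Conv(F))); Output = Attention * F."""
    a = conv2d_same(F, w_local)                    # spatial local conv
    a = conv2d_same(a, w_dilated, dilation)        # spatial long-range conv
    a = [[w_point * v for v in row] for row in a]  # 1x1 (channel) conv
    # element-wise product with the input feature
    return [[a[i][j] * F[i][j] for j in range(len(F[0]))]
            for i in range(len(F))]
```

With identity spatial kernels this reduces to $\mathrm{Output}=w_{point}\cdot F\otimes F$, which makes the element-wise gating easy to see.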

2202_van_f3
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). "CFF" means convolutional feed-forward network. Residual connection is omitted in (d). The difference between (a) and (b) is the element-wise multiply. It is worth noting that (c) is designed for 1D sequences.

computational complexity

assume the input and output have the same size $\mathbb{R}^{C\times H\times W}$
$$\begin{aligned}\mathrm{Param}&=\lceil\tfrac{K}{d}\rceil\times\lceil\tfrac{K}{d}\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C\\\mathrm{FLOPs}&=\left(\lceil\tfrac{K}{d}\rceil\times\lceil\tfrac{K}{d}\rceil\times C+(2d-1)\times(2d-1)\times C+C\times C\right)\times H\times W\end{aligned}$$

where $K$ is the kernel size (default $K=21$) and $d$ is the dilation rate (default $d=3$)
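Plugging in the defaults, a quick calculation shows how cheap the decomposition is next to a standard $K\times K$ convolution, whose parameter count is $K^2C^2$ (the channel count $C$ below is an arbitrary illustration, not a setting from the paper):

```python
import math

def lka_params(K, d, C):
    """Parameters of the decomposed large-kernel conv (bias omitted)."""
    dw = (2 * d - 1) ** 2 * C          # depth-wise conv
    dwd = math.ceil(K / d) ** 2 * C    # depth-wise dilated conv
    pw = C * C                         # 1x1 conv
    return dw + dwd + pw

def standard_params(K, C):
    """Parameters of a standard KxK convolution with C in/out channels."""
    return K ** 2 * C * C

C = 256  # example channel count
print(lka_params(21, 3, C))    # 74*C + C^2 = 84480
print(standard_params(21, C))  # 441*C^2 = 28901376
```

At this width the decomposition uses roughly 340x fewer parameters than the standard $21\times21$ convolution.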

2202_van_t3
Comparison of parameters of different manners for a $21\times21$ convolution. X, Y and Ours denote standard convolution, MobileNet decomposition and our decomposition, respectively. The input and output features have the same size $H\times W\times C$. Note: bias is omitted for simplifying format.

architecture variants

2202_van_t2
The detailed setting for different versions of the VAN. “e.r.” represents expansion ratio in the feed-forward network.

Experiment

image classification

dataset ImageNet-1K, with augmentation
optimizer AdamW: batch size = 1024, 310 epochs, momentum = 0.9, weight decay = 5e-2, initial lr = 5e-4, warm-up, cosine decay
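The schedule above (warm-up followed by cosine decay) can be sketched as a pure function of the step index; the warm-up length and total-step values here are illustrative assumptions, not values given in the notes:

```python
import math

def lr_at(step, total_steps, base_lr=5e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# peaks at the end of warm-up, decays to min_lr by the end of training
print(lr_at(999, 100_000))      # 5e-4 (end of warm-up)
print(lr_at(100_000, 100_000))  # 0.0 (end of training)
```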

2202_van_t6
Comparison with the state-of-the-art methods on ImageNet validation set. Params means parameters. GFLOPs denotes floating point operations. Top-1 Acc represents Top-1 accuracy.

object detection and instance segmentation

framework RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Sparse R-CNN
dataset COCO 2017

2202_van_t7
Object detection on COCO 2017 dataset. #P means parameters. RetinaNet $1\times$ denotes models based on RetinaNet and trained for 12 epochs.

2202_van_t8
Object detection and instance segmentation on COCO 2017 dataset. #P means parameters. Mask R-CNN $1\times$ denotes models based on Mask R-CNN and trained for 12 epochs. $AP^b$ and $AP^m$ refer to bounding box AP and mask AP, respectively.

2202_van_t9
Comparison with the state-of-the-art vision backbones on COCO 2017 benchmark. All models are trained for 36 epochs. We calculate FLOPs with input size of $1280\times800$.

semantic segmentation

framework Semantic FPN, UperNet
dataset ADE20K

2202_van_t10
Results of semantic segmentation on ADE20K validation set. The upper and lower parts are obtained under two different training/validation schemes. We calculate FLOPs with input size $512\times512$ for Semantic FPN and $2048\times512$ for UperNet.

ablation studies

architecture components

2202_van_t4
Ablation study of different modules in LKA. Results show that each part is critical. Acc(%) means Top-1 accuracy on ImageNet validation set.

key findings

  • local structural information, long-range dependence, and adaptability in the channel dimension are all critical
  • the attention mechanism helps the network achieve adaptive behavior

kernel size and dilation

2202_van_t5
Ablation study of different kernel size in LKA. Acc(%) means Top-1 accuracy on ImageNet validation set.

key findings

  • decomposing a $21\times21$ convolution works better than decomposing a $7\times7$ convolution
    $\implies$ a large kernel is critical for visual tasks
  • decomposing a larger $35\times35$ convolution brings no obvious gain over decomposing a $21\times21$ convolution
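The cost side of this ablation is easy to quantify with the parameter formula from the Method section; at an illustrative channel count (not a value from the paper), growing the decomposed kernel from $7\times7$ to $35\times35$ increases parameters only modestly, since the $C\times C$ pointwise term dominates:

```python
import math

def lka_params(K, d, C):
    """Parameters of the decomposed KxK conv (bias omitted)."""
    return math.ceil(K / d) ** 2 * C + (2 * d - 1) ** 2 * C + C * C

C = 256  # illustrative channel count
for K in (7, 21, 35):
    print(K, lka_params(K, 3, C))
# 7  -> 74240
# 21 -> 84480
# 35 -> 108800
```

So a 25x larger kernel area (7x7 to 35x35) costs less than 1.5x the parameters here, which is consistent with the large default kernel being affordable.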

visualization

2202_van_f4
Visualization results. All images come from different categories in ImageNet validation set. CAM is produced by using Grad-CAM. We compare different CAMs produced by Swin-T, ConvNeXt-T and VAN-Base.

key findings

  • activation areas are more accurate
  • shows obvious advantages when the object is dominant in the image $\implies$ ability to capture long-range dependence
Posted: 2022-03-30 18:23:58  Updated: 2022-03-30 18:24:56
 