Paper Reading Notes on PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION
ABSTRACT
We present Pix2Seq, a simple and generic framework for object detection. Unlike existing approaches that explicitly integrate prior knowledge about the task, we cast object detection as a language modeling task conditioned on the observed pixel inputs. Object descriptions (e.g., bounding boxes and class labels) are expressed as sequences of discrete tokens, and we train a neural network to perceive the image and generate the desired sequence. Our approach is based mainly on the intuition that if a neural network knows about where and what the objects are, we just need to teach it how to read them out. Beyond the use of task-specific data augmentations, our approach makes minimal assumptions about the task, yet it achieves competitive results on the challenging COCO dataset, compared to highly specialized and well optimized detection algorithms.
1 INTRODUCTION
Visual object detection systems aim to recognize and localize all objects of pre-defined categories in an image. The detected objects are typically described by a set of bounding boxes and associated class labels. Given the difficulty of the task, most existing methods, such as (Girshick, 2015; Ren et al., 2015; He et al., 2017; Lin et al., 2017b; Carion et al., 2020), are carefully designed and highly customized, with a significant amount of prior knowledge in the choice of architecture and loss function. For example, many architectures are tailored to the use of bounding boxes (e.g., with region proposals (Girshick, 2015; Ren et al., 2015) and RoI pooling (Girshick et al., 2014; He et al., 2017)). Others are tied to the use of object queries for object binding (Carion et al., 2020). Loss functions are often similarly tailored to the use of bounding boxes, such as box regression (Szegedy et al., 2013; Lin et al., 2017b), set-based matching (Erhan et al., 2014; Carion et al., 2020), or by incorporating specific performance metrics, like intersection-over-union on bounding boxes (Rezatofighi et al., 2019). Although existing systems find applications in myriad domains, from self-driving cars (Sun et al., 2020), to medical image analysis (Jaeger et al., 2020), to agriculture (Sa et al., 2016), the specialization and complexity make them difficult to integrate into a larger system, or generalize to a much broader array of tasks associated with general intelligence.

This paper advocates a new approach, based on the intuition that if a neural net knows about where and what the objects are, we just need to teach it to read them out. And by learning to “describe” objects the model can learn to ground the “language” on pixel observations, leading to useful object representations. This is realized with our Pix2Seq framework (see Figure 1). Given an image, our model produces a sequence of discrete tokens that correspond to object descriptions (e.g., object bounding boxes and class labels), reminiscent of an image captioning system (Vinyals et al., 2015b; Karpathy & Fei-Fei, 2015; Xu et al., 2015). In essence, we cast object detection as a language modeling task conditioned on pixel inputs, for which the model architecture and loss function are generic and relatively simple, without being engineered specifically for the detection task. As such, one can readily extend the framework to different domains or applications, or incorporate it into a perceptual system supporting general intelligence, for which it provides a language interface to a wide range of vision tasks.

To tackle the detection task with Pix2Seq, we first propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens. We then leverage an encoder-decoder architecture for perceiving pixel inputs and generating the target sequence. The objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens. While both the architecture and loss function are task-agnostic (without assuming prior knowledge about object detection, e.g., bounding boxes), we can still incorporate task-specific prior knowledge with a sequence augmentation technique, proposed below, that alters both input and target sequences during training.
Through extensive experimentation, we demonstrate that this simple Pix2Seq framework can achieve competitive results on the COCO dataset compared to highly customized, well established approaches, including Faster R-CNN (Ren et al., 2015) and DETR (Carion et al., 2020). By pretraining our model on a larger object detection dataset, its performance can be further improved.
2 THE PIX2SEQ FRAMEWORK
In the proposed Pix2Seq framework we cast object detection as a language modeling task, conditioned on pixel inputs (Figure 1). The system consists of four main components (Figure 2):

- Image augmentation: As is common in training computer vision models, we use image augmentations to enrich a fixed set of training examples (e.g., with random scaling and crops).
- Sequence construction & augmentation: As object annotations for an image are usually represented as a set of bounding boxes and class labels, we convert them into a sequence of discrete tokens.
- Architecture: We use an encoder-decoder model, where the encoder perceives pixel inputs, and the decoder generates the target sequence (one token at a time).
- Objective/loss function: The model is trained to maximize the log likelihood of tokens conditioned on the image and the preceding tokens (with a softmax cross-entropy loss).

Figure 2: Major components of the Pix2Seq learning framework.
2.1 SEQUENCE CONSTRUCTION FROM OBJECT DESCRIPTIONS
In common object detection datasets, such as Pascal VOC (Everingham et al., 2010), COCO (Lin et al., 2014), and Open Images (Kuznetsova et al., 2020), images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq we express them as sequences of discrete tokens.

While class labels are naturally expressed as discrete tokens, bounding boxes are not. A bounding box is determined by two of its corner points (i.e., top-left and bottom-right), or by its center point plus height and width. We propose to discretize the continuous numbers used to specify the x, y coordinates of its corner points (similarly for height and width if the other box format is used). Specifically, an object is represented as a sequence of five discrete tokens, i.e. [ymin, xmin, ymax, xmax, c], where each of the continuous corner coordinates is uniformly discretized into an integer in [1, n_bins], and c is the class index. We use a shared vocabulary for all tokens, so the vocabulary size equals the number of bins plus the number of classes. This quantization scheme for the bounding boxes allows us to use a small vocabulary while achieving high precision. For example, a 600×600 image requires only 600 bins to achieve zero quantization error. This is much smaller than modern language models with vocabulary sizes of 32K or higher (Radford et al., 2018; Devlin et al., 2018). The effect of different levels of quantization on the placement of bounding boxes is illustrated in Figure 3.

With each object description expressed as a short discrete sequence, we next need to serialize multiple object descriptions to form a single sequence for a given image. Since the order of objects does not matter for the detection task per se, we use a random ordering strategy (randomizing the order of objects each time an image is shown). We also explore other deterministic ordering strategies, but we hypothesize that random ordering will work just as well as any deterministic ordering, given a capable neural net and autoregressive modeling (where the net can learn to model the distribution of remaining objects conditioned on those observed).

Finally, because different images often have different numbers of objects, the generated sequences will have different lengths. To indicate the end of a sequence, we therefore incorporate an EOS token. The sequence construction process with different ordering strategies is illustrated in Figure 4.
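As a concrete illustration of this construction, the following is a minimal Python sketch of coordinate quantization and random-order serialization. The bin count, class list, and token-id layout here are illustrative assumptions, not the settings used in the paper.

```python
import random

num_bins = 1000                        # illustrative bin count
classes = ["person", "car", "dog"]     # illustrative class list
BASE = num_bins + 1                    # class tokens follow the coordinate bins in the shared vocabulary
EOS = BASE + len(classes)              # end-of-sequence token id

def quantize(v, img_size):
    """Uniformly discretize a coordinate in [0, img_size] into an integer bin in [1, num_bins]."""
    return max(1, min(round(v / img_size * num_bins), num_bins))

def object_to_tokens(box, label, img_size):
    """One object as five discrete tokens: [ymin, xmin, ymax, xmax, c]."""
    ymin, xmin, ymax, xmax = box
    return [quantize(ymin, img_size), quantize(xmin, img_size),
            quantize(ymax, img_size), quantize(xmax, img_size),
            BASE + classes.index(label)]

def build_sequence(objects, img_size):
    """Serialize all objects of an image in random order and terminate with EOS."""
    objects = list(objects)
    random.shuffle(objects)            # random ordering strategy
    seq = []
    for box, label in objects:
        seq += object_to_tokens(box, label, img_size)
    return seq + [EOS]

# Example: two objects in a 600x600 image.
seq = build_sequence([((20, 35, 410, 300), "person"),
                      ((100, 90, 250, 260), "dog")], img_size=600)
```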
2.2 ARCHITECTURE, OBJECTIVE AND INFERENCE
Treating the sequences that we construct from object descriptions as a “dialect”, we turn to generic architectures and objective functions that have been effective in language modeling.

Architecture: We use an encoder-decoder architecture. The encoder can be a general image encoder that perceives pixels and encodes them into hidden representations, such as a ConvNet (LeCun et al., 1989; Krizhevsky et al., 2012; He et al., 2016), a Transformer (Vaswani et al., 2017; Dosovitskiy et al., 2020), or their combination (Carion et al., 2020). For generation we use a Transformer decoder, widely used in modern language modeling (Radford et al., 2018; Raffel et al., 2019). It generates one token at a time, conditioned on the preceding tokens and the encoded image representation. This removes the complexity and customization in architectures of modern object detectors, e.g., bounding box proposal and regression, since tokens are generated from a single vocabulary with a softmax.

Objective: Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss, i.e.,

maximize \sum_{j=1}^{L} w_j \log P(\tilde{y}_j \mid x, y_{1:j-1}),
where x is a given image, y and \tilde{y} are input and target sequences associated with x, and L is the target sequence length. y and \tilde{y} are identical in the standard language modeling setup, but they can also be different (as in our later augmented sequence construction). Also, w_j is a pre-assigned weight for the j-th token in the sequence. We set w_j = 1, ∀j; however, it would be possible to weight tokens by their types (e.g., coordinate vs. class tokens), or by the size of the corresponding object.

Inference: At inference time, we sample tokens from the model likelihood, i.e., P(y_j | x, y_{1:j-1}). This can be done either by taking the token with the largest likelihood (arg max sampling), or by using other stochastic sampling techniques. We find that using nucleus sampling (Holtzman et al., 2019) leads to higher recall than arg max sampling (Appendix C). The sequence ends when the EOS token is generated. Once the sequence is generated, it is straightforward to extract and de-quantize the object descriptions (i.e., obtaining the predicted bounding boxes and class labels).
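As a rough sketch of this objective under illustrative assumptions (PyTorch tensors, a [batch, length] layout, and the default w_j = 1 unless overridden), the weighted maximum-likelihood loss could be computed as follows:

```python
import torch
import torch.nn.functional as F

def pix2seq_loss(logits, target_tokens, token_weights):
    """
    Maximum-likelihood objective as a weighted softmax cross-entropy.
      logits:        [batch, seq_len, vocab_size] decoder outputs
      target_tokens: [batch, seq_len] target sequence (LongTensor)
      token_weights: [batch, seq_len] per-token weights w_j (all ones by default)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # log P(y~_j | x, y_1:j-1) for each position j
    token_ll = log_probs.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)
    # maximizing the weighted log-likelihood == minimizing its negation
    return -(token_weights * token_ll).sum() / token_weights.sum().clamp(min=1.0)
```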
Object detection is cast as a language modeling task: the input is the pixels, and the output boxes and labels are represented with discrete tokens. A sequence is first built according to the desired precision and the image resolution, with each object serialized as [ymin, xmin, ymax, xmax, label], using a shared vocabulary of size num_bins + num_classes. A 600×600 image needs only 600 bins; the number of bins can roughly be thought of as max(width, height), although it does not have to match exactly. For example, with a 640-pixel image and 500 bins, one bin corresponds to about 1.3 pixels. In the experiments, the objects are serialized in random order. The image is first augmented and then fed to the encoder; a pretrained DeiT is used with its classification layer replaced by nn.Identity(), so that it acts purely as a feature extractor, followed by pooling, and the extracted features are passed to the decoder (a rough sketch follows below). For the generator part, the Transformer decoder commonly used in language models is adopted: it generates one token at a time, conditioned on the preceding tokens and the encoded image representation. This removes the complexity and customization of existing detector architectures, such as bounding-box proposals and regression.
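A minimal sketch of the encoder described in this note, assuming the timm library; the checkpoint name and pooling behavior are illustrative assumptions:

```python
import torch
from torch import nn
import timm  # assumed dependency providing pretrained DeiT models

class Pix2SeqEncoder(nn.Module):
    """Pretrained DeiT used purely as a feature extractor (classification head -> nn.Identity)."""
    def __init__(self, model_name="deit_base_patch16_224"):  # illustrative checkpoint name
        super().__init__()
        self.backbone = timm.create_model(model_name, pretrained=True)
        self.backbone.head = nn.Identity()   # drop the classification layer: keep features only

    def forward(self, images):
        # Returns a pooled image representation for the Transformer decoder to condition on.
        return self.backbone(images)

# Usage sketch:
# encoder = Pix2SeqEncoder()
# features = encoder(torch.randn(2, 3, 224, 224))   # [2, embed_dim]
```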
Sequence augmentation: Because the proposed scheme is task-agnostic, it is hard to tune the precision-recall trade-off directly. To mitigate this, a sequence augmentation technique is introduced that injects prior knowledge about the task. In a conventional autoregressive language model the target sequence is identical to the input sequence; with sequence augmentation, the input sequence is augmented and the target sequence is also modified so that the model learns to identify noise tokens, which effectively improves robustness. After synthesizing and discretizing noise objects, they are appended to the end of the original input sequence; in the target sequence, the class token of each noise object is set to a "noise" class and its coordinate tokens to "n/a", with corresponding loss weights of 0 (sketched below). With this augmentation, generation of the EOS token can be delayed substantially, improving recall without increasing noisy or duplicate predictions. The model is made to predict a sequence of maximum length, yielding a fixed-size object list; boxes and class information are then extracted from the generated sequence, and each "noise" class label is replaced by the most likely real class label.
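A hedged sketch of this augmentation step, continuing the illustrative token layout of the earlier snippet; the NOISE_CLASS and NA token ids, and the class token placed in the input for a noise object, are assumptions for illustration:

```python
NOISE_CLASS = EOS + 1   # extra "noise" class token (illustrative id)
NA = 0                  # placeholder "n/a" coordinate target (illustrative id)

def augment_sequence(input_seq, target_seq, weights, noise_boxes, img_size):
    """Append synthesized noise objects to the input sequence; in the target sequence their
    class token is the noise class and their coordinate tokens are "n/a" with zero loss weight."""
    for box in noise_boxes:
        coord_tokens = [quantize(v, img_size) for v in box]
        # input: the discretized noise box plus a class token (illustrative choice: a random real class)
        input_seq += coord_tokens + [BASE + random.randrange(len(classes))]
        # target: class token = noise class, coordinate tokens = "n/a"
        target_seq += [NA] * 4 + [NOISE_CLASS]
        # zero loss weight on the "n/a" coordinate tokens
        weights += [0.0] * 4 + [1.0]
    return input_seq, target_seq, weights
```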
At inference time, a prompt is given at the beginning of the sequence, and tokens are then sampled from the model's likelihood distribution. The authors currently use nucleus sampling, although other techniques such as beam search could also be used. Once the tokens are generated, they can be decoded for each task. Just as different tasks require task-specific tokenization schemes to produce token sequences, the decoding process is also task-specific. A more detailed description of the inference-time decoding procedure for each task is given below.
- For bounding boxes, following the Pix2Seq approach, the predicted sequence is split into tuples of five tokens to obtain the coordinate tokens and the class token, and the coordinate tokens are de-quantized to recover the bounding box (see the sketch after this list).
- For instance segmentation, the coordinate tokens of each predicted polygon are de-quantized and then converted into a dense mask. The model itself is not trained with any geometry-specific regularizer, so the output polygon masks can be noisy. To reduce this noise, the authors find it helpful to sample multiple sequences, average the resulting masks, and then apply a simple threshold to obtain a binary mask.
- For keypoint detection, the image-coordinate tokens of the keypoints are de-quantized directly.
- For image captioning, the predicted discrete tokens are mapped directly to text.
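Continuing the illustrative token layout from the earlier snippets, here is a sketch of the box de-quantization step mentioned in the first bullet above:

```python
def dequantize(token, img_size):
    """Map an integer bin in [1, num_bins] back to a continuous coordinate."""
    return token / num_bins * img_size

def tokens_to_objects(pred_tokens, img_size):
    """Decode a generated sequence into (box, class) pairs, stopping at EOS."""
    objects = []
    for i in range(0, len(pred_tokens) - 4, 5):
        chunk = pred_tokens[i:i + 5]
        if EOS in chunk:
            break
        ymin, xmin, ymax, xmax, cls_token = chunk
        box = tuple(dequantize(t, img_size) for t in (ymin, xmin, ymax, xmax))
        objects.append((box, classes[cls_token - BASE]))
    return objects

# Example: decode the sequence constructed earlier (round-trips up to quantization error).
print(tokens_to_objects(seq, img_size=600))
```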