Segmentaion标签的三种表示：poly、mask、rle

不同于图像分类这样比较简单直接的计算机视觉任务，图像分割任务（又分为语义分割、实例分割、全景分割）的标签形式稍为复杂。在分割任务中，我们需要在像素级上表达的是一张图的哪些区域是哪个类别。

多边形坐标Polygon

第一感下，要表达图像中某个区域是什么类别，只要这个区域“圈起来”，并给它一个标签就好了。的确，用多边形来将目标圈出来确实是最符合我们视觉上对图像的感知的方法。并且在很多数据集的标注过程中，来自人类的手工标注也是通过给出一个一个点的坐标，从而形成一个闭合的多边形区域，从而实现对图像中目标物体的分割。

我们通过 OpenCV 的 polylines 函数来将这种做法画出来看一下：

import numpy as np
import cv2
cat_poly = [[390.56410256410254, 1134.179487179487], 
            # ...
            [407.2307692307692, 1158.5384615384614]]

dog_poly = [[794.4102564102564, 635.4615384615385], 
            # ...
            [780.3076923076923, 531.6153846153846]]

img = cv2.imread("cat-dog.jpeg")

cat_points = np.array(cat_poly, dtype=np.int32)
cv2.polylines(img, [cat_points], True, (255, 0, 0), 3)
dog_points  = np.array(dog_poly, dtype=np.int32)
cv2.polylines(img, [dog_points], True, (0, 0, 255), 3)

cv2.imshow("window", img)
cv2.waitKey(0)

这里的数据 cat_poly 是一个 $n\times 2$ 的二维数组，表示多边形框的 $n$ 个坐标，即 $x_1,y_1],[x_2,y_2],...[x_n,y_n]]$ 。画出来大概就是下面这样子：

在这里插入图片描述

这样的确可以划分出我们想要的区域，但是没有体现出“区域”的概念，即在整个多边形框内，都是猫/狗区域。

掩膜区域Mask

为了体现出区域的概念，我们可以将整个区域展示出来，这里用到 fillPoly 函数，就是下面这样大家常常见到的样子：

img = cv2.imread("cat-dog.jpeg")

dog_poly = [
  # ...
]
cat_poly = [
  # ...
]

cat_points = np.array(cat_poly, dtype=np.int32)
dog_points = np.array(dog_poly, dtype=np.int32)

zeros = np.zeros((img.shape), dtype=np.uint8)
mask = cv2.fillPoly(zeros, [cat_points], color=(255, 0, 0))
mask = cv2.fillPoly(zeros, [dog_points], color=(0, 0, 255))
mask_img = 0.5 * mask + img

cv2.imshow("window", mask_img)
cv2.waitKey(0)

在这里插入图片描述

在模型的设计与训练中，我们有时最后输出的就是与原图尺寸相同二值的 mask 图，其中 1 的地方表示该位置有某一类物体，0 表示没有该类物体。因此我们通常要将上面的多边形标注转为二值的 mask 图来作为直接用来计算损失的标签。由多边形标签转为掩膜标签的代码如下：

def poly2mask(points, width, height):
    mask = np.zeros((width, height), dtype=np.int32)
    obj = np.array([points], dtype=np.int32)
    cv2.fillPoly(mask, obj, 1)
    return mask

这里的 points 就是上面我们的 cat_poly 这样的二维数组的多边形数据。

就是将有该类物体的地方置为1，其他为0，有些区别会在语义分割和实例分割中有所不同，可能是某一类有一个mask，也可能是每一个实例一个 mask。大家按需调整即可。

将上述猫狗的例子转换后可视化如下：

width, height = img.shape[: 2]
cat_mask = poly2mask(cat_poly, width, height)
dog_mask = poly2mask(dog_poly, width, height)

注意，在做可视化时建议将上面的 poly2mask 函数中的 1 改为 255。因为灰度值为 1 也基本是黑的，但是在训练中为 1 即可。

在这里插入图片描述

从掩膜 mask 转换回多边形 poly 的函数会比较复杂，在这个过程中可能会有标签精度的损失。我们用越多的坐标点来表示掩膜自然也就越精确，极端情况下，将掩膜边缘处的每一个像素都连接起来，这时不会有精度的损失。但我们通常不会这样做。

这里给出转换的函数，该函数会返回一个数组，数组的长度就是 mask 中闭合区域的个数，数组的每个元素是一组坐标： $x_1,y_1,x_2,y_2,...,x_n,y_n]$ ，注意这里的坐标并不是成对的，与我们上面的数据输入略有不同，因此在下面的实验中，笔者用 get_paired_coord 函数统一了一下接口规范。

其中 tolerance 参数（中文意为容忍度）表示的就是输出的多边形的每个坐标点之间的最大距离，可想而知，该值越大，可能的精度损失越大。

from skimage import measure

def close_contour(contour):
    if not np.array_equal(contour[0], contour[-1]):
        contour = np.vstack((contour, contour[0]))
    return contour

def binary_mask_to_polygon(binary_mask, tolerance=0):
    """Converts a binary mask to COCO polygon representation
    Args:
    binary_mask: a 2D binary numpy array where '1's represent the object
    tolerance: Maximum distance from original points of polygon to approximated
    polygonal chain. If tolerance is 0, the original coordinate array is returned.
    """
    polygons = []
    # pad mask to close contours of shapes which start and end at an edge
    padded_binary_mask = np.pad(binary_mask, pad_width=1, mode='constant', constant_values=0)
    contours = measure.find_contours(padded_binary_mask, 0.5)
    contours = np.subtract(contours, 1)
    for contour in contours:
        contour = close_contour(contour)
        contour = measure.approximate_polygon(contour, tolerance)
        if len(contour) < 3:
            continue
        contour = np.flip(contour, axis=1)
        segmentation = contour.ravel().tolist()
        # after padding and subtracting 1 we may get -0.5 points in our segmentation
        segmentation = [0 if i < 0 else i for i in segmentation]
        polygons.append(segmentation)
    return polygons

下面看一下本例中的小狗在 tolerance 为 0 和 100 下的区别。

def get_paired_coord(coord):
    points = None
    for i in range(0, len(coord), 2):
        point = np.array(coord[i: i+2], dtype=np.int32).reshape(1, 2)
        if (points is None): points = point
        else: points = np.concatenate([points, point], axis=0)
    return points


poly_0 = binary_mask_to_polygon(cat_mask+dog_mask, tolerance=0)
poly_100 = binary_mask_to_polygon(cat_mask+dog_mask, tolerance=100)


poly0_0 = get_paired_coord(poly_0[0])		# poly_0[0]是小狗，poly[1]是小猫
poly100_0 = get_paired_coord(poly_100[0])

p0_img = img
p0_points = np.array(poly0_0, dtype=np.int32)
cv2.polylines(p0_img, [p0_points], True, (255, 0, 0), 3)
cv2.imwrite("poly_dog_0.jpeg", p0_img)

p100_img = cv2.imread("cat-dog.jpeg")
p100_points = np.array(poly100_0, dtype=np.int32)
cv2.polylines(p100_img, [p100_points], True, (255, 0, 0), 3)
cv2.imwrite("poly_dog_100.jpeg", p100_img)

在这里插入图片描述

tolerance=0

在这里插入图片描述

tolerance=100

与我们的预期相符，tolerance=0 时不会有精度损失，而当 tolerance=100 时可以看到进度损失已经比较大了。

RLE编码

mask 大概是这种形式：

mask=np.array(
        [
            [0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 1, 0, 0, 1, 0],
            [0, 0, 1, 1, 1, 1, 1, 0],
            [0, 0, 1, 1, 1, 1, 1, 0],
            [0, 0, 1, 1, 1, 1, 1, 0],
            [0, 0, 1, 0, 0, 0, 1, 0],
            [0, 0, 1, 0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 0, 0]
        ]
    )

可以看到其实是有很多信息冗余的，因为只有0，1两种元素，RLE编码就是将相同的数据进行压缩计数，同时记录当前数据出现的初始为位置和对应的长度，例如：[0,1,1,1,0,1,1,0,1,0] 编码之后为1,3,5,2,8,1。其中的奇数位表示数字1出现的对应的index，而偶数位表示它对应的前面的坐标位开始数字1重复的个数。

RLE全称（run-length encoding），翻译为游程编码，又译行程长度编码，又称变动长度编码法（run coding），在控制论中对于二值图像而言是一种编码方法，对连续的黑、白像素数(游程)以不同的码字进行编码。游程编码是一种简单的非破坏性资料压缩法，其好处是加压缩和解压缩都非常快。其方法是计算连续出现的资料长度压缩之。

RLE是COCO数据集的规范格式之一，也是许多图像分割比赛指定提交结果的格式。

mask转rle编码，这里我们借助 pycocotools 工具包：

def singleMask2rle(mask):
    rle = mask_util.encode(np.array(mask[:, :, None], order='F', dtype="uint8"))[0]
    rle["counts"] = rle["counts"].decode("utf-8")
    return rle