[BCNet] Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers

[Draft] BCNet - CVPR 2021


Introduction

Mask R-CNN-style instance segmentation first predicts bounding boxes and then extracts instance masks.
However, because each mask is regressed from an ROI feature extracted independently for each instance, this design produces many segmentation errors on overlapping objects, especially when the overlapping objects belong to the same class.
- To address this, models that add NMS and extra post-processing have been proposed, but they over-smooth boundaries or leave small gaps between instances.
- As in (d) of Fig x, networks for amodal/occlusion mask prediction such as ASN focus only on the occluded object (occludee), which limits their performance.
  
  Instance Segmentation on COCO (source: arXiv:2103.12340)
This paper proposes a Bilayer Occluder-Occludee structure, which handles the occluder and the occludee in separate layers and exploits the interaction between the two layers.

Proposed Method

Architecture of BCNet with bilayer occluder-occludee relational modeling (source: arXiv:2103.12340)

Two heavily occluded instances in an image end up with the same bounding box, and their contours are hard to identify.
To overcome this limitation, the paper proposes the BCNet architecture, which extends existing two-stage instance segmentation methods. BCNet consists of the following components.

(1) A backbone and FPN for ROI feature extraction

(2) An object detection head that predicts the bounding box of each instance proposal (FCOS is used)

(3) An occlusion-aware mask head built from a Bilayer GCN (Graph Convolutional Network)

→ Defined as two layers, one for the occluder and one for the occludee, each performing mask and contour prediction.

1. Bilayer Occluder-Occludee Modeling

(1) Bilayer GCN Structure for Instance Segmentation

An object with a high overlap ratio may be split into pieces by the object overlapping it, or may appear small because of the overlap. To handle occlusion well, a GCN, which can capture long-range relationships, is used as the basic block of the mask head.
Given an adjacency graph $\mathcal{G}=\langle\mathcal{V},\mathcal{E}\rangle$ with nodes $\mathcal{V}$ and edges $\mathcal{E}$, the graph convolution operation can be written as follows.

$$ Z=\sigma(AXW_g)+X $$

Here $X\in \mathbb{R}^{N\times K}$ is the input feature, and $N=H\times W$ is the number of pixel grids within the ROI region. $A\in \mathbb{R}^{N\times N}$ is the adjacency matrix, and $W_g\in \mathbb{R}^{K\times K'}$ is a learnable weight matrix.
To construct the adjacency matrix $A$, the pairwise similarity between every pair of graph nodes is defined by dot-product similarity.

$$ \begin{aligned} A_{ij}&=\text{softmax}(F(\mathbf{x}_i,\mathbf{x}_j)) \\ F(\mathbf{x}_i,\mathbf{x}_j)&=\theta(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \end{aligned} $$

Here $\theta$ and $\phi$ are transformation functions implemented as $1\times 1$ convolutions, learned so that a higher similarity between two features yields a higher edge confidence.
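The two formulas above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`adjacency`, `graph_conv`) and the weight names (`W_theta`, `W_phi`) are hypothetical, the $1\times 1$ convolutions are modeled as bias-free per-node linear maps, the softmax is assumed to normalize over $j$ (each row of $A$), $\sigma$ is assumed to be ReLU, and $K'=K$ is assumed so that the residual addition is well defined.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def adjacency(X, W_theta, W_phi):
    """A_ij = softmax_j(theta(x_i)^T phi(x_j)) for node features X of shape (N, K)."""
    theta = X @ W_theta  # (N, K): a 1x1 conv acts as a per-node linear map (bias omitted)
    phi = X @ W_phi      # (N, K)
    return softmax(theta @ phi.T, axis=-1)  # (N, N), normalized over j (assumption)


def graph_conv(X, A, W_g):
    """Z = sigma(A X W_g) + X, with ReLU assumed for sigma and K' = K."""
    return np.maximum(A @ X @ W_g, 0.0) + X


# Toy ROI: a 4x4 grid of pixels with 8 channels each (values are random).
rng = np.random.default_rng(0)
height, width, K = 4, 4, 8
N = height * width
X = rng.standard_normal((N, K))
W_theta = rng.standard_normal((K, K)) * 0.1
W_phi = rng.standard_normal((K, K)) * 0.1
W_g = rng.standard_normal((K, K)) * 0.1

A = adjacency(X, W_theta, W_phi)
Z = graph_conv(X, A, W_g)
print(A.shape, Z.shape)
```

Because every row of `A` sums to 1, each output node is a weighted average over all $N$ pixels of the ROI, which is how the block can relate pixels that are far apart.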

The code confirms that the adjacency matrix has the same form as attention.

```python
# <https://github.com/lkeab/BCNet/blob/main/detectron2/modeling/roi_heads/mask_head.py#L411>

# x: B,C,H,W
# x_query: B,C,HW
x_query_bound = self.query_transform_bound(x).view(B, C, -1)

# x_query: B,HW,C
x_query_bound = torch.transpose(x_query_bound, 1, 2)

# x_key: B,C,HW
x_key_bound = self.key_transform_bound(x).view(B, C, -1)

# x_value: B,C,HW
x_value_bound = self.value_transform_bound(x).view(B, C, -1)

# x_value: B,HW,C
x_value_bound = torch.transpose(x_value_bound, 1, 2)

# W = Q^T K: B,HW,HW
x_w_bound = torch.matmul(x_query_bound, x_key_bound) * self.scale
x_w_bound = F.softmax(x_w_bound, dim=-1)

# x_relation = WV: B,HW,C
x_relation_bound = torch.matmul(x_w_bound, x_value_bound)

# x_relation = B,C,HW
x_relation_bound = torch.transpose(x_relation_bound, 1, 2)

# x_relation = B,C,H,W
x_relation_bound = x_relation_bound.view(B,C,H,W)
x_relation_bound = self.output_transform_bound(x_relation_bound)
x_relation_bound = self.blocker_bound(x_relation_bound)

x = x + x_relation_bound
```

The output feature of the bilayer GCN structure proposed by the authors is composed as follows.

$$ \begin{aligned} Z^1&=\sigma(A^1X_fW_g^1)+X_f \\ X_f&=Z^0W_f^0+X_{roi} \\ Z^0&=\sigma(A^0 X_{roi}W_g^0)+X_{roi} \end{aligned} $$

Here $\mathcal{G}^i$ is the $i$-th graph (with adjacency matrix $A^i$ and weights $W_g^i$), $X_{roi}$ is the input ROI feature, and $W_f^0$ denotes the weights of the FCN layer applied in the mask head.
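To make the data flow of the three equations concrete, the following NumPy sketch chains two graph-convolution layers in the order $Z^0 \rightarrow X_f \rightarrow Z^1$. It is a minimal sketch under stated assumptions rather than the BCNet code: the names `gcn_layer` and `bilayer_gcn` are hypothetical, $\sigma$ is assumed to be ReLU, the FCN layer is reduced to a single matrix $W_f^0$ as the equation is written, and each adjacency matrix is rebuilt from the layer's own input via the dot-product similarity.

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable softmax over the last axis (each row sums to 1)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gcn_layer(X, W_theta, W_phi, W_g):
    """One graph-convolution layer: Z = ReLU(A X W_g) + X (ReLU is an assumption)."""
    A = softmax_rows((X @ W_theta) @ (X @ W_phi).T)  # (N, N) adjacency from X itself
    return np.maximum(A @ X @ W_g, 0.0) + X


def bilayer_gcn(X_roi, params0, params1, W_f0):
    """Chain the two layers; params0/params1 are (W_theta, W_phi, W_g) tuples."""
    Z0 = gcn_layer(X_roi, *params0)  # Z^0 = sigma(A^0 X_roi W_g^0) + X_roi
    X_f = Z0 @ W_f0 + X_roi          # X_f = Z^0 W_f^0 + X_roi
    Z1 = gcn_layer(X_f, *params1)    # Z^1 = sigma(A^1 X_f W_g^1) + X_f
    return Z0, Z1


# Toy input: N = 16 nodes (a 4x4 ROI grid) with K = 8 channels, random values.
rng = np.random.default_rng(0)
N, K = 16, 8
X_roi = rng.standard_normal((N, K))


def make_params():
    """Random (W_theta, W_phi, W_g) for one layer; purely illustrative."""
    return tuple(rng.standard_normal((K, K)) * 0.1 for _ in range(3))


Z0, Z1 = bilayer_gcn(X_roi, make_params(), make_params(), rng.standard_normal((K, K)) * 0.1)
print(Z0.shape, Z1.shape)
```

Note that $X_{roi}$ enters twice: once as the input of the first layer and once as the residual added to $Z^0 W_f^0$, so the second layer sees both the original ROI feature and the first layer's output.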