AdaBins: Depth Estimation using Adaptive Bins

Resource

Problem Statement

Conventional CNN-based encoder-decoder architectures do not capture enough global context → global context is extracted only in the few layers where the features have been abstracted down to low resolution ⇒ using global information at high-resolution features should give better performance.
In depth estimation, the depth range differs from image to image. A model trained on a dataset with one depth range may lose depth resolution or accuracy on images with a different depth range. (ill-posed problem)

The depth range differs from image to image.

Contribution

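Proposes an architecture building block that performs global processing of the scene. The block divides the estimated depth range into bins whose widths vary from image to image, and the final depth is estimated as a linear combination of the bin center values.
Among supervised monocular depth estimation (MDE) methods, outperforms prior work on all metrics on the NYU Depth v2 and KITTI datasets.
Investigates and analyzes how various modifications to the proposed AdaBins block affect depth estimation accuracy.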

Related Works

DORN
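- Bins the depth values, then derives the final depth map with an ordinal regressor
- The error tends to grow with depth. → Larger depths incur a larger loss, so training concentrates on large depths. (Details may be lost.)
- Applies spacing-increasing discretization (SID) to minimize the effect of the depth interval on training → uses fixed bins (the bin edges are the same for every image). For a depth range $[\alpha, \beta]$ split into $K$ intervals, the thresholds $t_i$, $i=0,\dots,K$, are
$$ t_i=e^{\log(\alpha)+\frac{\log (\beta/\alpha)\cdot i}{K}} $$
- Splits depth into intervals, then uses ordinal regression (an ordinal classifier) to estimate, for each interval, whether the depth lies beyond it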
  
  https://towardsdatascience.com/deep-ordinal-logistic-regression-1afd0645e591
  
  $$ \mathcal{L}(\mathcal{X}, \Theta)=-\frac{1}{N}\sum_{w=0}^{W-1}\sum_{h=0}^{H-1}\Psi(w,h,\mathcal{X},\Theta) $$
  
  Here $\mathcal{X}=\varphi(I,\Phi)$ is the feature map, $Y=\psi(\mathcal{X}, \Theta)$ is the ordinal regressor output of size $W \times H \times 2K$, and $N=W \times H$. The pixelwise ordinal loss $\Psi$ that is averaged in $\mathcal{L}$ is defined as follows.
  
  $$ \Psi(w,h,\mathcal{X}, \Theta)=\sum_{k=0}^{l_{(w,h)}-1}\log \left( \mathcal{P}_{(w,h)}^k \right)+\sum_{k=l_{(w,h)}}^{K-1}\log\left( 1-\mathcal{P}_{(w,h)}^k \right) $$
  
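  The probability $\mathcal{P}_{(w,h)}^k$ that the predicted label at pixel $(w,h)$ exceeds $k$, that is, that the depth lies in one of the intervals above the $k$-th, is defined as follows (a softmax over each pair of outputs).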
  
  $$ \mathcal{P}_{(w,h)}^k=P\left( \hat{l}_{(w,h)}>k \mid \mathcal{X}, \Theta\right) $$
  
  $$ \mathcal{P}_{(w,h)}^k=\frac{e^{y_{(w,h,2k+1)}}}{e^{y_{(w,h,2k)}}+e^{y_{(w,h,2k+1)}}} $$
  
  The predicted depth and label can then be computed from the ordinal regression output.
  
  $$ \hat{d}_{(w,h)}=\frac{t_{\hat{l}_{(w,h)}}+t_{\hat{l}_{(w,h)}+1}}{2}-\xi $$
  
  $$ \hat{l}_{(w,h)}=\sum_{k=0}^{K-1}\eta\left(\mathcal{P}_{(w,h)}^k \geq 0.5\right) $$
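
The DORN steps above (SID thresholds, pairwise softmax, ordinal loss, decoding) can be sketched in NumPy as follows. This is only an illustrative sketch, not the DORN implementation: the function names are hypothetical, $\eta$ is taken to be the indicator function, the `eps` term and the clipping of the label to $K-1$ are guards added here, and the regressor output `y` is assumed to be stored as an array of shape `(H, W, 2K)`.

```python
import numpy as np

def sid_thresholds(alpha: float, beta: float, K: int) -> np.ndarray:
    """SID thresholds t_i = exp(log(alpha) + log(beta/alpha) * i / K), i = 0..K."""
    i = np.arange(K + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / K)

def ordinal_probs(y: np.ndarray) -> np.ndarray:
    """Pairwise softmax: P^k = e^{y[2k+1]} / (e^{y[2k]} + e^{y[2k+1]}).

    y has shape (H, W, 2K); the result has shape (H, W, K).
    """
    y0, y1 = y[..., 0::2], y[..., 1::2]
    m = np.maximum(y0, y1)              # subtract the max for numerical stability
    e0, e1 = np.exp(y0 - m), np.exp(y1 - m)
    return e1 / (e0 + e1)

def ordinal_loss(P: np.ndarray, label: np.ndarray, eps: float = 1e-8) -> float:
    """L = -(1/N) * sum over pixels of Psi, with Psi as defined above.

    P: (H, W, K) probabilities, label: (H, W) integer ground-truth labels.
    eps is an added guard against log(0).
    """
    K = P.shape[-1]
    below = np.arange(K) < label[..., None]          # True for k = 0 .. l-1
    psi = np.where(below, np.log(P + eps), np.log(1.0 - P + eps)).sum(axis=-1)
    return float(-psi.mean())

def decode_depth(P: np.ndarray, t: np.ndarray, xi: float = 0.0):
    """Label = number of k with P^k >= 0.5; depth = midpoint of its interval minus xi."""
    K = P.shape[-1]
    label = (P >= 0.5).sum(axis=-1)
    label = np.minimum(label, K - 1)    # added guard: keep t[label + 1] in range
    depth = 0.5 * (t[label] + t[label + 1]) - xi
    return label, depth
```

For example, with `t = sid_thresholds(1.0, 80.0, 4)` the interval widths grow with depth, so the loss is spread more evenly between near and far pixels than with uniform bins.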

Methodology

1. Overview

Proposes a method that uses self-attention to compute adaptive bins that change dynamically with the features of the input scene.
Depth values obtained by classification are discretized and therefore look visually poor, so the final depth map is computed as a linear combination of the bin centers.
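
The linear combination of bin centers can be sketched as below. The sketch assumes the bin widths are normalized to sum to 1 over a depth range `[d_min, d_max]` and that each pixel has a softmax distribution over the bins. The function names and the hand-written inputs are hypothetical, and the self-attention part that would predict the widths and per-pixel scores is not modeled.

```python
import numpy as np

def bin_centers(widths: np.ndarray, d_min: float, d_max: float) -> np.ndarray:
    """Turn per-image bin widths (N,) into bin centers (N,) inside [d_min, d_max]."""
    b = widths / widths.sum()                         # normalize widths to sum to 1
    edges = d_min + (d_max - d_min) * np.concatenate([[0.0], np.cumsum(b)])
    return 0.5 * (edges[:-1] + edges[1:])             # midpoint of each bin

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)           # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_from_bins(logits: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Depth per pixel = sum_k p_k * c_k, with p = softmax(logits) over the N bins.

    logits: (H, W, N) per-pixel scores, centers: (N,). Returns (H, W).
    """
    p = softmax(logits, axis=-1)
    return p @ centers
```

Because the output is a probability-weighted average rather than the center of the single most likely bin, the predicted depth can take any value between the centers, which avoids the visible discretization steps of a pure classification output.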