Semantic Segmentation Survey

2019.4.19
hei4

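About today's presentation
● I will share semantic segmentation methods from FCN onward
● Pre-NN methods, and NN methods that predate FCN, are not covered
● The selection of methods is my own judgment, but I tried to pick
  methods that I think strongly influenced later research and
  methods that are SOTA (as of 2019.4.19)
● SOTA status was judged using Papers with Code
SOTA!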

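Image recognition with neural networks
classification: image (H pix x W pix, RGB = 3 channels) → NN → one score per class (1 pix x 1 pix x number-of-classes channels), e.g. car, ship, airplane, helicopter
semantic segmentation: image (H pix x W pix, RGB = 3 channels) → NN → an image of per-class scores (H pix x W pix x number-of-classes channels; each pixel is a vector with one dimension per class)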
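
A minimal sketch of the output-shape difference described above, assuming a toy 2 x 2 image and 3 hypothetical classes. The score values are made up for illustration. Classification yields one score vector per image, semantic segmentation yields one score vector per pixel, and the label map is the per-pixel argmax.

```python
def argmax(scores):
    """Return the index of the largest score in a 1-D list."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Classification: one score vector for the whole image (1 x 1 x num_classes).
image_scores = [0.1, 0.7, 0.2]  # illustrative scores for 3 classes
image_label = argmax(image_scores)

# Semantic segmentation: one score vector per pixel (H x W x num_classes).
pixel_scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]],
    [[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]],
]
label_map = [[argmax(px) for px in row] for row in pixel_scores]

print(image_label)  # 1
print(label_map)    # [[0, 1], [2, 1]]
```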

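Challenges in semantic segmentation
● A wide receptive field is needed to capture the context (meaning) of an object
● Downsampling (pooling, strides, ...) is an effective way to build a wide receptive field, but
  the loss of resolution caused by downsampling discards object contour information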
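
The trade-off above can be made concrete with the usual receptive-field recurrence for stacked layers. This is a sketch under assumed layer configurations (the two stacks below are hypothetical, not taken from any of the surveyed networks): each layer widens the receptive field by (kernel - 1) times the current input-pixel spacing between outputs, and a stride multiplies that spacing.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked (kernel_size, stride) layers."""
    rf, jump = 1, 1  # jump = spacing in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Four 3x3 convolutions without any downsampling (hypothetical stack).
no_downsampling = [(3, 1)] * 4
# The same four convolutions with a 2x2, stride-2 pooling after each of the first three.
with_pooling = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

print(receptive_field(no_downsampling))  # 9
print(receptive_field(with_pooling))     # 38
```

In this toy comparison the pooled stack sees 38 input pixels instead of 9, but its output has only 1/8 of the input resolution.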

FCN (2015) [1]
Fully Convolutional Networks (FCN)
● No fully connected layers; the network is built entirely from convolutions
● With no fully connected layers, input images of arbitrary size are accepted
(Figure label: 32x upsampling)

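FCN
● Upsampling by 32x in one step leaves almost no contours
● Intermediate feature maps are used to upsample in stages,
  combining them by element-wise addition
● (What if this were extended further, to -4s or -2s ...?)
  → U-Net comes close to that
(Figure labels: 32x upsampling; 2x; 2x; 16x; 8x)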
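
The staged upsampling with element-wise addition can be sketched as follows. This is a toy, single-class illustration: nearest-neighbour upsampling stands in for the upsampling layers, and the score values and map names are made up.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D list (stand-in for an upsampling layer)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Element-wise addition of two equally sized 2-D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Toy score maps at three scales (illustrative integer values).
score_32s = [[10]]                      # coarsest map: 1/32 resolution
score_16s = [[1, 2], [3, 4]]            # from an intermediate layer: 1/16
score_8s = [[0] * 4 for _ in range(4)]  # from an earlier layer: 1/8
score_8s[0][0] = 5

fused_16s = add(upsample2x(score_32s), score_16s)  # 2x up, then add
fused_8s = add(upsample2x(fused_16s), score_8s)    # 2x up again, then add

print(fused_16s)    # [[11, 12], [13, 14]]
print(fused_8s[0])  # [16, 11, 12, 12]
```

A final 8x upsampling of `fused_8s` (not shown) would bring the result back to the input resolution.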

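U-Net (2015) [2]
● Symmetric Encoder-Decoder structure
● Encoder feature maps are
  concatenated (channel-wise concat) on the Decoder side
● Contour information lost to downsampling is
  recovered from the feature maps
  taken before downsampling
● Unlike FCN (element-wise addition), U-Net uses concat
SOTA!
Medical Image Segmentation
on ISBI 2012 EM Segmentation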
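
To illustrate the difference between concat and element-wise addition, here is a small sketch with channel-first feature maps stored as nested lists. The shapes and values are hypothetical. Concat keeps both sets of channels side by side, while addition merges them into the same number of channels.

```python
def concat_channels(encoder_map, decoder_map):
    """Channel-wise concat of two channel-first maps (lists of H x W channels)."""
    return encoder_map + decoder_map

def add_channels(a, b):
    """Element-wise addition; both maps must have the same shape."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]

encoder_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]          # 2 channels, 2 x 2
decoder_map = [[[10, 10], [10, 10]], [[20, 20], [20, 20]]]  # 2 channels, 2 x 2

concat_out = concat_channels(encoder_map, decoder_map)  # U-Net style skip
add_out = add_channels(encoder_map, decoder_map)        # FCN style skip

print(len(concat_out))  # 4
print(len(add_out))     # 2
```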

U-Net
● Won the ISBI cell tracking challenge 2015 by a wide margin
  over prior methods (sliding-window approach)
● Also had a large influence on later research on general image segmentation

Approach ① to the challenge
● Refine the network architecture

DeconvNet (2015) [3]
● Symmetric Encoder-Decoder structure
● The Encoder is VGG-16 (including the fully connected layers ← important)
(Figure labels: fully connected; feature vector; VGG-16)

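DeconvNet
● The positions selected by max pooling in the Encoder are recorded
● They are used for unpooling in the Decoder
● DeconvNet alone is not especially accurate;
  the paper argues for its benefit as an ensemble with FCN, stating that
  DeconvNet is good at capturing contours, while
  FCN is good at capturing the overall shape:
"our deconvolution network is appropriate to capture the fine-details of an object,
whereas FCN is typically good at extracting the overall shape of an object."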
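
Recording max-pooling positions and reusing them for unpooling can be sketched like this. It is a single-channel toy with 2 x 2 windows. The function names are hypothetical, and filling the non-maximum positions with zeros is the usual convention, assumed here.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling (stride 2) that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]), 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, position = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

def unpool_2x2(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; everything else is 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (i, j) in zip(pooled_row, index_row):
            out[i][j] = value
    return out

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2(fmap)
restored = unpool_2x2(pooled, indices, 4, 4)

print(pooled)    # [[4, 5], [6, 7]]
print(restored)  # [[0, 0, 0, 0], [4, 0, 0, 5], [0, 0, 7, 0], [6, 0, 0, 0]]
```

The restored map is sparse, but each maximum returns to its original location, which is how the contour positions survive the round trip.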

DeconvNet
(Figure labels: convolution; upsampling)

SegNet (2017) [4]
● Encoder-Decoder structure whose Encoder reuses VGG-16.
  The Encoder's pooling indices are used for unpooling in the Decoder
● Differs from DeconvNet in having no fully connected layers
● Reported higher accuracy than FCN and DeconvNet on the SUN RGB-D dataset
(Figure label: VGG-16 without fully connected layers)

DeepUNet (2018) [5]
(Figure labels: U connection (concat); Plus connection (element-wise addition))

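DeepUNet
● Adding residual structure to U-Net makes deeper network architectures possible
● Developed for a binary classification task (land / sea) on aerial imagery
● (On that task) reported higher accuracy than U-Net and SegNet
(Figure labels: DownBlock (Encoder side); UpBlock (Decoder side))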

Attention U-Net (2018) [6]
● Encoder-side feature maps are converted into attention-weighted maps
● The attention-weighted maps are concatenated on the Decoder side
(Figure labels: attention-weighted map; Attention Gate; gating signal)
SOTA!
Pancreas Segmentation
on CT-150

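Attention U-Net
(Figure: Attention Gate. Labels: gating signal (low-resolution feature map); high-resolution feature map; low resolution; sigmoid to [0, 1]; upsampling; high resolution; attention-weighted map)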
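
A heavily simplified, single-channel sketch of the gate in the figure, under several assumptions: scalar weights `w_x`, `w_g`, `psi` and `bias` are hypothetical stand-ins for the learned 1x1 convolutions, a 2x2 average replaces the strided convolution that brings the high-resolution map to the gate's resolution, and nearest-neighbour upsampling replaces interpolation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def attention_gate(x, g, w_x=1.0, w_g=1.0, psi=1.0, bias=0.0):
    """x: high-resolution map (2H x 2W); g: gating signal (H x W)."""
    h, w = len(g), len(g[0])
    alpha = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Bring x down to the gate resolution (2x2 average, an assumption).
            x_low = sum(x[2 * i + di][2 * j + dj]
                        for di in range(2) for dj in range(2)) / 4.0
            q = max(0.0, w_x * x_low + w_g * g[i][j])  # additive term + ReLU
            alpha[i][j] = sigmoid(psi * q + bias)      # coefficient in [0, 1]
    # Upsample the coefficients to the resolution of x and rescale x with them.
    return [[x[i][j] * alpha[i // 2][j // 2] for j in range(2 * w)]
            for i in range(2 * h)]

x = [[1.0] * 4 for _ in range(4)]  # high-resolution feature map (toy values)
g = [[8.0, 0.0], [0.0, 0.0]]       # gate is active only in the top-left region
attended = attention_gate(x, g, bias=-4.0)

print(attended[0][0] > 0.9)  # True: region kept
print(attended[3][3] < 0.1)  # True: region suppressed
```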

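Attention U-Net
● The figure shows, on 3D abdominal CT data, attention concentrating on
  the pancreas, kidneys, and spleen as training progresses
● Reported higher accuracy than U-Net on the segmentation task of the TCIA Pancreas-CT dataset
(Figure panels: epoch 3; epoch 6; epoch 10; epoch 60; epoch 150)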

Approach ② to the challenge
● Refine the modules

PSPNet (2017) [7]
Pyramid Scene Parsing Network
● After ResNet produces the feature map, Spatial Pyramid Pooling (SPP)
  extracts global context at several grid sizes
● Reported higher accuracy than FCN and SegNet in the "ImageNet scene parsing challenge 2016"
● Highly accurate, but apparently very slow (source: ESPNet)
(Figure labels: ResNet, output at 1/8 of the input size; Spatial Pyramid Pooling: SPP + conv + upsample + concat)
SOTA!
Real-Time Semantic Segmentation
on Cityscapes
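
The "SPP + conv + upsample + concat" pipeline can be sketched on a single-channel 6 x 6 map. The assumptions: bin sizes 1, 2, 3, 6 come from my reading of the PSPNet paper rather than from the slide, the conv step (channel reduction) is omitted, and nearest-neighbour upsampling stands in for interpolation.

```python
def avg_pool_to_bins(fmap, bins):
    """Average-pool a square 2-D map into a bins x bins grid (size must divide evenly)."""
    cell = len(fmap) // bins
    return [[sum(fmap[i * cell + di][j * cell + dj]
                 for di in range(cell) for dj in range(cell)) / (cell * cell)
             for j in range(bins)] for i in range(bins)]

def upsample_nearest(grid, size):
    """Nearest-neighbour upsampling of a square grid back to size x size."""
    cell = size // len(grid)
    return [[grid[i // cell][j // cell] for j in range(size)] for i in range(size)]

def pyramid_pooling(fmap, bin_sizes=(1, 2, 3, 6)):
    """Return the input map plus one context map per bin size (channel-first list)."""
    size = len(fmap)
    channels = [fmap]
    for bins in bin_sizes:
        channels.append(upsample_nearest(avg_pool_to_bins(fmap, bins), size))
    return channels

fmap = [[float(i * 6 + j) for j in range(6)] for i in range(6)]
out = pyramid_pooling(fmap)

print(len(out))      # 5 channels: input + 4 pyramid levels
print(out[1][0][0])  # 17.5, the global average (1 x 1 bin)
print(out[2][0][0])  # 7.0, the average of the top-left 3 x 3 block (2 x 2 bins)
```

The 1 x 1 level gives every pixel the same global summary, while finer levels keep progressively more local context.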

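DeepLab-v2 (2017) [8]
● The ASPP (Atrous Spatial Pyramid Pooling) module achieves a wide receptive field without lowering the resolution
(Figure labels: ASPP module; CRF (Conditional Random Fields))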

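DeepLab-v2
● atrous convolution: another name for dilated convolution.
  Widening the spacing between kernel taps builds a wide receptive field without reducing resolution
● The ASPP module combines atrous convolutions with different spacings,
  so it can extract features with different receptive fields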
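
A 1-D sketch of atrous (dilated) convolution, assuming a toy signal and an all-ones 3-tap kernel. With rate r, a kernel of size k spans (k - 1) * r + 1 input samples while keeping the same k weights. Running several rates side by side mimics the idea of combining different spacings; the real module works on 2-D feature maps with learned weights.

```python
def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D convolution (cross-correlation) with `rate` spacing between taps."""
    span = (len(kernel) - 1) * rate + 1  # effective kernel size
    return [sum(k * signal[start + t * rate] for t, k in enumerate(kernel))
            for start in range(len(signal) - span + 1)]

signal = list(range(10))
kernel = [1, 1, 1]

# Same 3 weights, increasingly wide receptive fields (spans 3, 5, 7).
branches = {rate: atrous_conv1d(signal, kernel, rate) for rate in (1, 2, 3)}

print(branches[1])  # [3, 6, 9, 12, 15, 18, 21, 24]
print(branches[2])  # [6, 9, 12, 15, 18, 21]
print(branches[3])  # [9, 12, 15, 18]
```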

DeepLab-v3+ (2018) [9]
● v2 had no Decoder; inference relied on upsampling and CRF post-processing
● v3+ adds a simple, lightweight Decoder (with a slight U-Net flavor)
SOTA!
Semantic Segmentation
on PASCAL VOC 2012

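DeepLab-v3+
● In addition, as in Xception, atrous convolution is factored into separable
  (depth-wise & point-wise) convolutions for higher accuracy
● (What about network size? Making it separable should make it lighter ...)
● Reported higher performance than PSPNet on the PASCAL VOC 2012 dataset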
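
As a rough check on the question of size, the weight count of one standard convolution can be compared with its depth-wise + point-wise factorization. The layer sizes below are illustrative, not taken from DeepLab-v3+, and biases are ignored. The dilation rate does not enter the count, since it changes tap spacing rather than the number of weights.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise k x k conv followed by a point-wise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256  # illustrative layer sizes
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)

print(standard)   # 589824
print(separable)  # 67840
```

For this single layer the separable form needs roughly 8.7x fewer weights. Whether the whole network ends up smaller also depends on the backbone and the rest of the architecture.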

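ESPNet (2018) [10]
Efficient Spatial Pyramid Networks
● Uses the Efficient Spatial Pyramid (ESP) module
  to make the network lighter and faster
● Downsampled copies of the input image
  are fed in partway through the network
  (effect verified in an ablation study)
● The Encoder side is trained alone first,
  then the Decoder is added and trained
(Figure labels: ESP module; 1/2 size; 1/4 size)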

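ESPNet
(Figure: ASPP module from DeepLab vs. ESP module from ESPNet. Labels: output N channels; N channels; output N channels; d = N/K channels)
● The ESP module achieves its light weight
  by using concat
● To keep gridding artifacts from appearing,
  Hierarchical Feature Fusion (HFF)
  is used to suppress them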
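
A back-of-the-envelope weight count for the d = N/K reduction shown in the figure, based on my understanding of the ESP module: a 1 x 1 convolution reduces M input channels to d = N/K, then K parallel n x n dilated convolutions each map d to d channels, and their outputs are concatenated back to N channels. The sizes M, N, K, n below are illustrative, and biases and HFF (which adds no weights) are ignored.

```python
def standard_conv_params(n, m, n_out):
    """Weights in a single n x n convolution from m to n_out channels."""
    return n * n * m * n_out

def esp_module_params(n, m, n_out, k):
    """Weights in a reduce (1 x 1) step plus K parallel n x n branches on d channels."""
    d = n_out // k                # channels per branch, d = N / K
    reduce = m * d                # 1 x 1 convolution: M -> d
    branches = k * n * n * d * d  # K dilated n x n convolutions: d -> d
    return reduce + branches

n, m, n_out, k = 3, 128, 128, 4  # illustrative sizes (N divisible by K)
standard = standard_conv_params(n, m, n_out)
esp = esp_module_params(n, m, n_out, k)

print(standard)  # 147456
print(esp)       # 40960
```

Under these assumptions the module uses 3.6x fewer weights than one standard 3 x 3 convolution with the same input and output width, because each branch works on only d channels and the full width N is recovered by concat.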

So which method is best?

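Accuracy vs. network size [10]
Results on the Cityscapes dataset
PSPNet is highly accurate
FCN is worse than PSPNet in both size and accuracy
v3+ should be more accurate than v2 (but what about its size?)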

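Accuracy vs. speed [10]
Results on the Cityscapes dataset. Speed was measured on a GeForce GTX 960M
ESPNet is fast
SegNet is worse than ESPNet in both speed and accuracy
v3+ should be more accurate than v2 (but what about its speed?)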

Summary
● Surveyed semantic segmentation from FCN onward
● For accuracy, DeepLab-v3+ or PSPNet
● For speed or light weight, ESPNet
● However ...

R2U-Net (2018) [11]
There is also something called R2U-Net ...
SOTA!
Lung Nodule Segmentation
on LUNA

Auto-DeepLab (2019) [12]
There is also something called Auto-DeepLab ...

To Be Continued ...
There are many other methods,
but that is all for today

References
● [1] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
● [2] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." International Conference on Medical image computing and computer-assisted intervention. Springer, Cham, 2015.
● [3] Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation." Proceedings of the IEEE international conference on computer vision. 2015.
● [4] Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla. "Segnet: A deep convolutional encoder-decoder architecture for image segmentation." IEEE transactions on pattern analysis and machine intelligence 39.12 (2017): 2481-2495.
● [5] Li, Ruirui, et al. "DeepUNet: a deep fully convolutional network for pixel-level sea-land segmentation." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 99 (2018): 1-9.
● [6] Oktay, Ozan, et al. "Attention U-Net: learning where to look for the pancreas." arXiv preprint arXiv:1804.03999 (2018).
● [7] Zhao, Hengshuang, et al. "Pyramid scene parsing network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
● [8] Chen, Liang-Chieh, et al. "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs." IEEE transactions on pattern analysis and machine intelligence 40.4 (2018): 834-848.
● [9] Chen, Liang-Chieh, et al. "Encoder-decoder with atrous separable convolution for semantic image segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
● [10] Mehta, Sachin, et al. "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
● [11] Alom, Md Zahangir, et al. "Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation." arXiv preprint arXiv:1802.06955 (2018).
● [12] Liu, Chenxi, et al. "Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation." arXiv preprint arXiv:1901.02985 (2019).
