Pyramid Scene Parsing Network introduces the Pyramid Pooling Module to improve semantic segmentation. The module captures context at different regions and scales by performing average pooling at different pyramid levels on the final convolutional feature map. Experiments on ADE20K and PASCAL VOC datasets show the Pyramid Pooling Module improves mean Intersection-over-Union by over 4% compared to global average pooling, achieving state-of-the-art performance.
[unofficial] Pyramid Scene Parsing Network (CVPR 2017)
1. Pyramid Scene Parsing Network
Hengshuang Zhao¹, Jianping Shi², Xiaojuan Qi¹, Xiaogang Wang¹, Jiaya Jia¹
¹The Chinese University of Hong Kong, ²SenseTime Group Limited
Presentation: Shunta Saito
Slide: Powered by Deckset
(c) Preferred Networks 1
2. Summary
• Introduces the Pyramid Pooling Module for a better grasp of global context with sub-region awareness
3. Why did I choose this paper?
• Presented in CVPR 2017
• 1st place in the ImageNet Scene Parsing Challenge 2016 (ADE20K)
• Was in 1st place on the Cityscapes leaderboard
• Now it's in 2nd place (I noticed this last week!)
4. Agenda
1. Common building blocks in semantic segmentation
2. Major Issue
3. Prior Work
4. Pyramid Pooling Module
5. Experimental results
5. Semantic Segmentation
• Predict pixel-wise labels for natural images
• Each pixel in an image belongs to an object class
• So it's not instance-aware!
6. Common Building Blocks (1)
Fully convolutional network (FCN)¹
• A deep convolutional neural network that doesn't include any fully-connected layers
• Almost all recent methods are based on FCN
• Typically pre-trained on ImageNet in a classification setting
¹ "Fully Convolutional Networks for Semantic Segmentation", PAMI 2016
7. Common Building Blocks (2)
Dilated convolution²
• Widens the receptive field without reducing feature map resolution
• Important for leveraging the global context prior efficiently
² "Multi-Scale Context Aggregation by Dilated Convolutions", ICLR 2016
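As a toy illustration (not from the slides), a 1-D dilated convolution in NumPy shows how spacing the kernel taps `dilation` apart widens the receptive field without downsampling the signal or adding weights:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """1-D convolution with taps spaced `dilation` apart (no padding,
    stride 1). The receptive field of each output grows with the
    dilation rate while the kernel keeps the same number of weights."""
    k = len(w)
    span = dilation * (k - 1) + 1          # receptive field of one output
    n_out = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(n_out)
    ])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

# dilation=1: ordinary convolution, receptive field 3
print(dilated_conv1d(x, w, 1))   # sums of samples i, i+1, i+2
# dilation=2: receptive field 5, same three weights
print(dilated_conv1d(x, w, 2))   # sums of samples i, i+2, i+4
```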
8. Common Building Blocks (3)
Multi-scale feature ensemble
• Higher-layer features contain more semantic meaning and less location information
• Combining multi-scale features can improve performance³
³ "Hypercolumns for Object Segmentation and Fine-grained Localization", CVPR 2015
9. Common Building Blocks (4)
Conditional random field (CRF)
• Post-processing to refine the segmentation result (DeepLab⁴)
• Some follow-up methods refined the network via end-to-end modeling (DPN⁵, CRF-as-RNN⁶, Detections and Superpixels⁷)
⁴ "Semantic image segmentation with deep convolutional nets and fully connected CRFs", ICLR 2015
⁵ "Semantic image segmentation via deep parsing network", ICCV 2015
⁶ "Conditional random fields as recurrent neural networks", ICCV 2015
⁷ "Higher order conditional random fields in deep neural networks", ECCV 2016
10. Common Building Blocks (5)
Global average pooling (GAP)
• ParseNet⁸ showed that global average pooling with FCN can improve semantic segmentation results
• But the global descriptors used in that paper are not representative enough for challenging datasets like ADE20K
⁸ "ParseNet: Looking wider to see better", ICLR 2016
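To see why a global descriptor can be too coarse, here is a minimal NumPy sketch: GAP collapses every spatial position of a (C, H, W) feature map into one C-dimensional vector, discarding where objects are in the image.

```python
import numpy as np

# A toy (C, H, W) backbone feature map: 2 channels, 4x4 spatial grid.
feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)

# Global average pooling: every spatial position contributes equally
# to a single C-dimensional descriptor, so spatial layout is lost.
global_desc = feat.mean(axis=(1, 2))
print(global_desc)   # one value per channel: [ 7.5 23.5]
```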
11. Major Issue (1)
Mismatched relationship
• Co-occurrent visual patterns imply context
• e.g., an airplane is likely to fly in the sky, not over a road
• Lacking the ability to collect contextual information increases the chance of misclassification
• In the figure on the right, FCN predicts the boat in the yellow box as a "car" based on its appearance
12. Major Issue (2)
Confusing Classes
• Major datasets contain confusing classes: field and earth; mountain and hill; wall, house, building, and skyscraper, etc.
• Even expert human annotators still make 17.6% pixel error on ADE20K⁹
• FCN predicts the object in the box as part skyscraper and part building, but the whole object should be either skyscraper or building, not both
• Utilizing the relationship between classes is important
⁹ "Semantic understanding of scenes through the ADE20K dataset", CVPR 2017
13. Major Issue (3)
Inconspicuous Classes
• Small objects like streetlights and signboards are inconspicuous and hard to find, while they may be important
• Big objects may be predicted discontinuously; FCN couldn't correctly label the pillow, whose appearance is similar to the sheet's
• To improve performance on very small or very big objects, sub-regions should be paid more attention
14. Summary of Issues
• Use co-occurrent visual patterns as context
• Consider the relationship between classes
• Pay more attention to sub-regions
15. Prior Work
Global Average Pooling (GAP)¹⁰
• The receptive field of ResNet is already larger than the input image, so GAP sounds like a good way to summarize all the information
• But the pixels in an image may belong to various objects of different sizes, so directly fusing them into a single vector may lose the spatial relations and cause ambiguity
¹⁰ "ParseNet: Looking wider to see better", ICLR 2016
16. Prior Work
Spatial Pyramid Pooling (SPP)¹¹
• Apply pooling with different kernel/stride sizes to the feature maps
• Then flatten and concatenate the pooling results to make a fixed-length representation
• There is still context information loss
¹¹ "Spatial pyramid pooling in deep convolutional networks for visual recognition", ECCV 2014
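The mechanics can be sketched in a few lines of NumPy (the pyramid levels below are illustrative, not the ones from the SPP paper): pooling the same map into grids of several sizes and flattening them yields the same output length for any input resolution.

```python
import numpy as np

def adaptive_avg_pool(fmap, out_size):
    """Average a (C, H, W) map into an out_size x out_size grid of bins."""
    c, h, w = fmap.shape
    pooled = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            hs, he = i * h // out_size, (i + 1) * h // out_size
            ws, we = j * w // out_size, (j + 1) * w // out_size
            pooled[:, i, j] = fmap[:, hs:he, ws:we].mean(axis=(1, 2))
    return pooled

def spp(fmap, levels=(1, 2, 4)):
    """Flatten and concatenate pooled maps into a fixed-length vector."""
    return np.concatenate([adaptive_avg_pool(fmap, s).ravel() for s in levels])

# Different spatial sizes give the same output length: C * (1 + 4 + 16)
v1 = spp(np.random.rand(8, 24, 24))
v2 = spp(np.random.rand(8, 17, 31))
print(v1.shape, v2.shape)   # both (168,)
```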
17. Pyramid Pooling Module
• A hierarchical global prior, containing information at different scales and varying among different sub-regions
• The Pyramid Pooling Module builds this global scene prior on top of the final-layer feature map
18. Pyramid Pooling Module
• Use a 1x1 convolution to reduce the number of channels of each pooled map
• Then upsample them (bilinearly) to the original size and concatenate them all with the input feature map
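A minimal NumPy sketch of this forward pass (with random weights standing in for the learned 1x1 convolution, and nearest-neighbor upsampling as a simplification of the bilinear interpolation the paper uses):

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_avg_pool(fmap, s):
    """Average a (C, H, W) map into an s x s grid of sub-region means."""
    c, h, w = fmap.shape
    out = np.zeros((c, s, s))
    for i in range(s):
        for j in range(s):
            hs, he = i * h // s, (i + 1) * h // s
            ws, we = j * w // s, (j + 1) * w // s
            out[:, i, j] = fmap[:, hs:he, ws:we].mean(axis=(1, 2))
    return out

def upsample_nearest(fmap, h, w):
    """Nearest-neighbor resize back to (h, w); the paper uses bilinear."""
    _, ph, pw = fmap.shape
    return fmap[:, np.arange(h) * ph // h][:, :, np.arange(w) * pw // w]

def pyramid_pooling(fmap, levels=(1, 2, 3, 6), reduced=2):
    """Concatenate the input map with pooled-and-upsampled context maps."""
    c, h, w = fmap.shape
    outs = [fmap]
    for s in levels:
        pooled = adaptive_avg_pool(fmap, s)
        # Random weights stand in for the learned 1x1 channel-reduction conv.
        w1 = rng.standard_normal((reduced, c)) / np.sqrt(c)
        pooled = np.einsum('oc,chw->ohw', w1, pooled)
        outs.append(upsample_nearest(pooled, h, w))
    return np.concatenate(outs, axis=0)

feat = rng.standard_normal((8, 12, 12))
out = pyramid_pooling(feat)
print(out.shape)   # (8 + 4 * 2, 12, 12) = (16, 12, 12)
```

The channel count grows by `reduced` per pyramid level, and the spatial size matches the input so a final convolution can fuse local features with the multi-scale context.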
19. Implementation details (1)
• Average pooling is performed at four pyramid levels: 1x1, 2x2, 3x3, and 6x6 bins (kernel size = stride)
• A pre-trained ResNet with dilated convolutions is used as the feature extractor (the output is 1/8 the size of the input image)
• They use two losses:
1. softmax loss between the final layer output and the labels
2. softmax loss between an intermediate ResNet output and the labels¹² (weighted by 0.4)
¹² "Relay backpropagation for effective learning of deep convolutional neural networks", ECCV 2016
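The two-loss setup can be sketched as follows (a simplified stand-in: plain pixel-wise softmax cross-entropy on toy logit arrays, not the authors' code):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Pixel-wise softmax cross-entropy, averaged over all positions.

    logits: (K, H, W) class scores; labels: (H, W) integer class ids.
    """
    z = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    return -log_p[labels, np.arange(h)[:, None], np.arange(w)].mean()

def total_loss(final_logits, aux_logits, labels, aux_weight=0.4):
    # Main loss on the final prediction plus the auxiliary loss on an
    # intermediate ResNet output, weighted by 0.4 as in the paper.
    return (softmax_cross_entropy(final_logits, labels)
            + aux_weight * softmax_cross_entropy(aux_logits, labels))
```

With uniform logits over K classes, `softmax_cross_entropy` reduces to log K, which is a handy sanity check for the implementation.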
21. Implementation details (3)
Dataset      Training iterations
ADE20K       150K
PASCAL VOC   30K
Cityscapes   90K

Data augmentation:
• Random mirror
• Random resize between 0.5 and 2
• Random rotation between -10 and 10 degrees
• Random Gaussian blur (ADE20K and PASCAL VOC only)
22. Implementation details (4)
• An appropriately large "cropsize" can yield good performance
• The "batchsize" in the batch normalization layers is of great importance:

Dataset      Cropsize    Batchsize
ADE20K       473 x 473   16
PASCAL VOC   473 x 473   16
Cityscapes   713 x 713   16
23. Implementation details (5)
Multi-node Batch Normalization
• To increase the effective "batchsize" in the batch normalization layers, they used a custom BN layer that gathers data from multiple GPUs via OpenMPI
• We have Akiba-san's implementation of multi-node batch normalization!
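The core of the idea is that exact whole-batch statistics can be computed from per-worker partial sums, the kind of reduction an MPI allreduce provides. A minimal single-process sketch (the communication itself is omitted):

```python
import numpy as np

def global_batch_stats(per_gpu_batches):
    """Combine per-GPU partial sums into batch-norm statistics over
    ALL samples. Each worker only needs to share three scalars per
    channel: (count, sum, sum of squares)."""
    n = sum(b.size for b in per_gpu_batches)
    s = sum(b.sum() for b in per_gpu_batches)
    sq = sum((b ** 2).sum() for b in per_gpu_batches)
    mean = s / n
    var = sq / n - mean ** 2          # E[x^2] - (E[x])^2
    return mean, var

# Two "GPUs" each holding half of one channel's activations.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
mean, var = global_batch_stats([a, b])
full = np.concatenate([a, b])
print(mean, var)   # matches full.mean(), full.var()
```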
24. ImageNet Scene Parsing Challenge 2016
• Dataset: ADE20K
• 150 classes and 1,038 image-level labels
• 20,000/2,000/3,000 pixel-level labeled images for train/val/test
25. Ablation Study for the Pyramid Pooling Module
• Average pooling works better than max pooling in all settings
• Pooling with pyramid parsing outperforms pooling with a single global operation
• With dimension reduction (DR; reducing the number of channels after pyramid pooling), performance is further enhanced
26. Ablation Study for the Auxiliary Loss
• They set the auxiliary loss weight between 0 and 1 and compared the final results
• A weight of 0.4 yields the best performance
27. Ablation Study for the Depth of ResNet
• Deeper is better
28. More Detailed Performance Analysis

Additional processing            Improvement (mIoU, %)
Data augmentation (DA)           +1.54
Auxiliary loss (AL)              +1.41
Pyramid pooling module (PSP)     +4.45
Deeper ResNet (50 → 269)         +2.13
Multi-scale testing (MS)         +1.13

• For multi-scale testing, they create predictions at 6 different scales (0.5, 0.75, 1, 1.25, 1.5, and 1.75) and take their average.
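The multi-scale averaging step can be sketched as follows; `predict_fn` is a hypothetical model wrapper (not from the paper) that takes a resized image and returns per-class probabilities already mapped back to the original resolution, and the resize here is nearest-neighbor for simplicity:

```python
import numpy as np

def rescale_nearest(img, s):
    """Nearest-neighbor resize of an (H, W) image by scale factor s."""
    h, w = img.shape
    nh, nw = max(1, round(h * s)), max(1, round(w * s))
    return img[np.arange(nh) * h // nh][:, np.arange(nw) * w // nw]

def multi_scale_predict(img, predict_fn,
                        scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Run the model at several input scales and average the (K, H, W)
    class probabilities it returns at the original resolution."""
    return np.mean([predict_fn(rescale_nearest(img, s)) for s in scales],
                   axis=0)

# Toy stand-in model: uniform probabilities over 3 classes at full size.
img = np.random.rand(8, 8)
avg = multi_scale_predict(img, lambda x: np.full((3, 8, 8), 1.0 / 3))
print(avg.shape)   # (3, 8, 8)
```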
29. Results on PASCAL VOC 2012
• Extended with the Semantic Boundaries Dataset (SBD)¹³, they used 10,582, 1,449, and 1,456 images for train/val/test
• Mismatched relationship: for "aeroplane" and "sky" in the second and third rows, PSPNet finds the missing parts
• Confusing classes: for the "cows" in row one, the baseline model treats them as "horse" and "dog", while PSPNet corrects these errors
• Inconspicuous objects: for "person", "bottle", and "plant" in the following rows, PSPNet performs well on these small-object classes compared to the baseline model
¹³ "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
30. Results on PASCAL VOC 2012
• Comparing PSPNet with previous best-performing methods on the test set under two settings, i.e., with or without pre-training on the MS-COCO dataset
31. Results on Cityscapes
• The Cityscapes dataset consists of 2,975, 500, and 1,525 train/val/test images (19 classes)
• 20,000 coarsely annotated images are also available (in the table below, ‡ means they are used)
32. Thank you for your attention
• The official repository doesn't include any training code
• My own implementation for both training and testing is ready:
• mitmul/chainer-pspnet: https://github.com/mitmul/chainer-pspnet
• Now I'm training a model to ensure reproducibility
• Once the reproduction work is finished, I'll send the code to ChainerCV
• In the semantic segmentation task,
• input images are large (713 for PSPNet on Cityscapes)
• an appropriate batchsize, e.g., 16 or so, is important for batch normalization
• As the authors said, distributed batch normalization seems to be important in multi-GPU training
• So ChainerMN is now a necessary tool for such large-scale datasets and deep models
• It means that we need more GPU machines connected with InfiniBand