Semantic mapping of road scenes, PhD thesis. The main aim of the thesis is to investigate and propose solutions to the scene-understanding problem of finding 'what' objects are present in the world and 'where' they are located.
1. PhD Thesis Defence
Sunando Sengupta
Oxford Brookes University
Semantic Mapping of Road Scenes
Supervisors – Prof. Philip Torr and Prof. David Duce
16/06/2014
2. Outline
Introduction
The Labelling problem
Dense Semantic Map (chap. 3)
Dense 3D Semantic Modelling (chap. 4)
Mesh Based Inference (chap. 5)
Hierarchical CRF on an Octree Graph (chap. 6)
Conclusion
3. Objective
Holy grail of computer vision
What are the objects present in the scene
Where are they located
Biological vision accomplishes both tasks through human visual perception.
Computers (or humans through them) try to solve the same problem through an information-processing route:
Gather sensor data (images, GPS, IMU, …)
Represent the data in a map
Recognise objects in the map
This thesis investigates this problem and proposes solutions to address it.
Mapping and recognition can happen simultaneously or sequentially
Chap 1, Sec 1.2
4. Objective - Visually
Input image of a street scene, person cleaning, some cars in the
background, and buildings in the horizon.
The goal: place the appropriate objects at the right distance from the camera, at the correct size.
Chap 1, Sec 1.2
Image courtesy: Antonio Torralba,
http://6.869.csail.mit.edu/fa13/
5. Why it is important
Numerous applications from robotics, entertainment,
engineering, medical…
Self driving cars
Engineering
Robots for manipulation
Humanoids
Assistive vision for impaired
Entertainment
The aim: a vision-based system that produces a semantically consistent scene representation from visual inputs
Chap 1, Sec 1.2
6. Essentially a hard problem
Large variation in the image formation
Scene Variation
Varying scene type and geometry
Object level variation
Large number of object classes
Individual Object location and orientation
Object shape and appearance
Depth/occlusions
Illumination
Shadows
Motion blur
Chap 1, Sec 1.2
7. Thesis - Contributions
This thesis provides solutions for large scale outdoor
urban semantic mapping.
Large-scale dense overhead semantic mapping.
Semantics from local images fused to form a global ground-plane map
First attempt to generate such a map.
~15 km of semantic mapping
One of the first large-scale semantic maps
Presented as oral in IEEE IROS 2012
Chap 1, Sec 1.3
8. Thesis - Contributions
Dense semantic reconstruction
Dense 3D semantic reconstruction from kms of
stereo images.
Online sequential volumetric reconstruction to
accommodate arbitrarily long road scenes.
Presented as oral in IEEE ICRA 2013.
Mesh based inference for scene labelling
Improved labelling accuracy and consistency.
Depth sensitive classifier fusion.
25x faster in inference time (than image labelling).
Presented as poster in CVPR 2013.
Chap 1, Sec 1.3
9. Thesis - Contributions
Hierarchical CRF on an Octree Graph
Unified framework to determine free and
occupied regions in a scene along with
object class labels.
Robust PN potential over octree volumes
Datasets (available online)
Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
Chap 1, Sec 1.3
10. Publications
Related to Thesis
S. Sengupta, P. Sturgess, L. Ladicky, P. H. S. Torr: Automatic dense visual semantic mapping from street-level imagery. IEEE/RSJ IROS 2012 (Chapter 3)
S. Sengupta, E. Greveson, A. Shahrokni, P. H. S. Torr: Urban 3D Semantic Modelling Using Stereo Vision. IEEE ICRA 2013 (Chapter 4)
S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. IEEE CVPR 2013 (*Joint first authors, Chapter 5)
S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. SUNw: Scene Understanding Workshop, held in conjunction with CVPR 2013 (*Joint first authors, invited paper)
Datasets
Yotta Labeled road scene dataset.
KITTI object labelling. (Datasets available at http://www.robots.ox.ac.uk/~tvg/projects )
Other publications
Z. Zhang, P. Sturgess, S. Sengupta, N. Crook, P. H. S. Torr: Efficient discriminative learning of parametric nearest neighbor classifiers. IEEE CVPR 2012
L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. IJCV 2012 (invited paper)
L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimisation for Object Class Segmentation and Dense Stereo Reconstruction. BMVC 2010 (BMVA Best Science Paper)
Chap 1, Sec 1.4
11. Many computer vision tasks can be modelled as labelling problems
Assign each site in a discrete set a label from a label set
e.g. each pixel is associated with an object class label
The labelling problem
Chap 2, Sec 2.1
12.
What are the Labels
Discrete or continuous
Discrete
Image pixels assigned to object classes like Cars, humans, buildings, pavement,
trees etc.
Foreground/background labels
Indoor/outdoor labels…
Continuous range
Depth: Pixels can take a set of disparity labels
Optical flow
Chap 2, Sec 2.1
13.
CRF Framework
A set of random variables X = {x1, x2, …, xN}, one per pixel, each taking a label from the label set
The aim is to associate every random variable with a label
The conditional probability of the labelling x given the data D is P(x|D) = (1/Z) exp(−E(x)), where E(x) is the Gibbs energy
The MAP labelling x* of the random field is x* = argmin_x E(x)
Chap 2, Sec 2.2
14.
CRF modelling for image labelling
• The pixel labelling problem can be formulated as a pairwise/higher-order CRF whose energy is
E(x) = Σ_{i∈V} ψ_i(x_i) + Σ_{i∈V, j∈N_i} ψ_ij(x_i, x_j)
• The image is represented as a graph G = {V, E}
• V is the set of nodes of the graph
• N_i is the neighbourhood of node i
• The unary potential ψ_i measures the cost of assigning a particular label to pixel i
• Generated from the response of a boosted classifier over a region around each pixel
Chap 2, Sec 2.2
15.
CRF modelling for image labelling
• The pairwise (smoothness) term depends on inter-pixel observations and should be discontinuity-preserving across object boundaries
• Takes the contrast-sensitive Potts form ψ_ij(x_i, x_j) = 0 if x_i = x_j, and g(i, j) otherwise, where g(i, j) decreases with the image contrast between pixels i and j
• Higher-order potentials are defined on groups of pixels that are conditionally dependent on each other
• Robust P^N and hierarchical P^N models [1]
• The final labelling is obtained by minimising the energy E
Chap 2, Sec 2.2
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
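The pairwise CRF energy above can be sketched in code. A minimal illustration, assuming a 4-connected grid and a common exponential form for the contrast weight g(i, j); the exact weighting used in the thesis may differ:

```python
import numpy as np

def crf_energy(labels, unary, image, theta=0.1, lam=1.0):
    """Pairwise CRF energy: sum of per-pixel unary costs plus a
    contrast-sensitive Potts penalty over 4-connected neighbours."""
    H, W = labels.shape
    # Unary term: cost of each pixel taking its assigned label.
    e = sum(unary[labels[i, j], i, j] for i in range(H) for j in range(W))
    # Pairwise term: penalise label changes, less so across strong edges.
    for i in range(H):
        for j in range(W):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < H and nj < W and labels[i, j] != labels[ni, nj]:
                    g = lam * np.exp(-theta * (image[i, j] - image[ni, nj]) ** 2)
                    e += g
    return e
```

A uniform labelling pays no pairwise cost, while every label discontinuity in a smooth image region pays the full Potts penalty.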
16.
Quite hard
The energy minimisation is quite hard (a large number of random variables with interconnections).
Possible solutions: simulated annealing, ICM, but slow.
Approximate algorithms exist for certain energy functions for the multi-label problem.
Move-making algorithms[1]
α-expansion: for each label α, allow every random variable to retain its existing label or change to α, using graph cuts.
αβ-swap: considers a pair of labels (α, β) at each iteration, allowing pixels labelled α or β to swap between the two, using graph cuts.
Chap 2, Sec 2.2
[1] Boykov et al. Fast Approximate Energy Minimization via Graph Cuts, ICCV 1999
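Of the simple minimisers mentioned above, ICM is the easiest to sketch. A toy illustration on a Potts energy, shown only as a baseline; the thesis uses graph-cut based moves, not ICM:

```python
import numpy as np

def icm(unary, lam=1.0, n_iters=10):
    """Iterated Conditional Modes: greedily relabel each pixel to the label
    minimising its local (unary + Potts pairwise) cost, holding its
    neighbours fixed. Converges only to a local minimum."""
    L, H, W = unary.shape
    labels = unary.argmin(axis=0)          # start from the unary winner
    for _ in range(n_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                costs = unary[:, i, j].copy()
                for ni, nj in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts: pay lam for each disagreeing neighbour.
                        costs += lam * (np.arange(L) != labels[ni, nj])
                best = costs.argmin()
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break
    return labels
```

With a strong enough smoothness weight, an isolated noisy label is flipped to agree with its neighbours.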
17. Stereo
Early attempts to explain depth began in the Renaissance
The images formed at the left and right eyes can be used to obtain a disparity/depth map
Stereo sketch by Jacopo Chimenti da Empoli,
Italy , around 1600 AD
Leonardo da Vinci, Optical Studies
on Binocular vision
Chap 2, Sec 2.3
18. Depth from a Sequence of Images
Structure from motion for sparse 3D reconstruction[1]
Visual hull/silhouette-based volume carving[3]
Elevation/height/2.5D maps[2]
TSDF/voxel-based fusion[4]
Chap 2, Sec 2.3
[1] S. Agarwal et al. Building Rome in a day. Commun. ACM, 2011.
[2] F. Erbs et al. Stixmentation – probabilistic stixel based traffic scene labeling. BMVC 2012.
[3] Y. Furukawa et al. Carved visual hulls for image-based modeling. IJCV, 2009.
[4] R. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.
19. Dense Semantic Mapping
Generate an overhead view of an urban region.
Every pixel in the map view is associated with an object class label
Object classes: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
Chap 3, Sec 3.1
Street images captured inexpensively from a vehicle with multiple mounted cameras[1].
[1] Yotta DCL, “Yotta dcl case studies,” Available: http://www.yottadcl.com/surveys/case-studies/
Dense Semantic Mapping
21. Semantic Mapping Framework
The semantic mapping framework comprises two stages
Street-level image acquisition
Chap 3, Sec 3.3
22. Semantic Mapping Framework
The semantic mapping framework comprises two stages
Semantic image segmentation at street level.
Street-level image acquisition
Image segmentation
23. The semantic mapping framework comprises two stages
Semantic image segmentation at street level.
Ground-plane labelling at a global level.
First attempt to do an overhead mapping from street-level images.
Semantic Mapping Framework
Street-level image acquisition
Image segmentation
Ground-plane labelling
24. Street-level Image Segmentation
Label every pixel in the image with an object class label
Input Output
Raw Image Labelled Image
Automatic
Labeller
Object Class Labels
Chap 3, Sec 3.3.1
25. Street-level Image Segmentation
CRF based image labeller
Each pixel is a node in a grid graph G = (V,E).
Each node is a random variable x taking a label from
label set.
(Figure: input image → CRF construction → final segmentation)
26. Semantic Image Segmentation - CRF
Total energy
E(x) = Σ_{i∈V} ψ_i(x_i) + Σ_{i∈V, j∈N_i} ψ_ij(x_i, x_j) + Σ_{c∈C} ψ_c(x_c) = Epix + Epair + Eregion
Optimal labelling given as x* = argmin_x E(x)
27. Total energy E = Epix + Epair + Eregion
Epix - Model individual pixel’s cost of taking a label.
Computed via the dense boosting approach
Multi feature variant of texton boost[1]
Semantic Image Segmentation - CRF
(e.g. classifier scores for a pixel x: Car 0.2, Road 0.3)
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
28. Total energy E = Epix + Epair + Eregion
Epair - Model each pixel neighbourhood interactions.
Encourages label consistency in adjacent pixels
Sensitive to edges in images.
Contrast-sensitive Potts model: ψ_ij(x_i, x_j) = 0 if x_i = x_j, and g(i, j) otherwise, where g(i, j) is lower across strong image edges
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
29. Total energy E = Epix + Epair + Eregion
Eregion - Models the behaviour of a group of pixels.
Classifies a region
Encourages all the pixels in a region to take the same label.
Groups of pixels are given by multiple mean-shift segmentations
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
30.
Energy minimisation using the alpha-expansion algorithm[1]
Input Image → Road Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Semantic Image Segmentation - CRF
31.
Input Image → Building Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
32. Input Image → Sky Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
33. Input Image → Pavement Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
34. Input Image → Final Solution
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
35. Ground Plane Labelling
Combine many labellings from street level imagery.
Automatic Labeller
Input: street-level labellings → Output: labelled ground plane
36. Ground Plane CRF
A CRF defined over the ground plane.
Each ground plane pixel (zi) is a random variable taking a
label from the label set.
Energy for the ground plane CRF is
E(Z) = Eg_pix + Eg_pair
Chap 3, Sec 3.3.2
38. Ground Plane Pixel Cost
(Figure: homography from image to ground plane; classes Road, Pavement, Post/Pole)
A ground plane region is estimated.
39. • Each point in the image projects to a unique point on the ground plane
– defining a homography
Ground Plane Pixel Cost
40. • The image labelling is mapped to the ground plane
– via the homography.
Ground Plane Pixel Cost
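The image-to-ground-plane mapping can be illustrated with a generic 3×3 homography. The matrix below is purely illustrative, not the calibrated homography used in the thesis:

```python
import numpy as np

def project_to_ground(H, pixels):
    """Map image pixels (u, v) to ground-plane coordinates via a 3x3
    homography, using homogeneous coordinates."""
    pts = np.hstack([pixels, np.ones((len(pixels), 1))])   # (u, v, 1)
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                  # dehomogenise

# Example: a homography that scales by 2 and shifts u by 1 (illustrative).
H = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
```

With a real calibrated homography, the same routine carries each labelled image pixel onto its ground-plane cell.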
41. • Labels projected from many views are combined in a
histogram.
• The normalised histogram gives the naïve probability of
the ground plane pixel taking a label.
Ground Plane Pixel Cost
43. Ground Plane labelling
A histogram is built for every ground plane pixel, giving Eg_pix
A pairwise cost (Eg_pair) is added to induce smoothness
Contrast-sensitive Potts model
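The histogram-based unary cost can be sketched as follows. Taking the negative log of the normalised histogram is one standard way to turn the vote probability into a cost; the exact cost used in the thesis may differ:

```python
import numpy as np

def ground_unary(votes, n_classes, eps=1e-6):
    """Fuse per-view label projections for one ground-plane pixel: build a
    histogram of projected labels, normalise it into a probability, and
    take the negative log as the unary cost Eg_pix."""
    hist = np.bincount(votes, minlength=n_classes).astype(float)
    prob = hist / hist.sum()
    return -np.log(prob + eps)   # low cost for frequently projected labels

votes = np.array([0, 0, 0, 1])   # e.g. 3 views vote 'road', 1 votes 'pavement'
cost = ground_unary(votes, n_classes=3)
```

The most frequently projected label receives the lowest cost, and labels never projected receive a very large cost.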
49. Experiments - Dataset
Subset of the images captured by the van
~15 km of track, 8000 images from each camera.
Pixel-level labelled ground truth images. Dataset
available[1].
13 object categories: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
Training: 44 images; testing: 42 images.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticMap/index.php
Chap 3, Sec 3.4.1
50. SIS Results
Input Images, output of our image level CRF, ground truths.
Used Automatic Labelling environment[1]
[1] The Automatic Labelling Environment, L Ladicky, PHS Torr. Code available
http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm
(Figure rows: input, semantic segmentation, ground truth)
52. Ground plane Map Evaluation
(Figure rows: street images, back-projected map results, ground truth)
• We back-project the ground plane map into image domain
and evaluate the results.
• Global pixel accuracy of 83%
54. Chapter Summary
Presented a method to generate
overhead view semantic mapping.
Experiments on large tracks (~15 km), which can be scaled up to country-wide mapping
Dataset available[1].
However, a flat-world assumption does not represent the 3D scene properly; our aim is to perform a semantic metric reconstruction of the world.
[1] http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticMap/index.php
55. Urban 3D Semantic Modelling Using Stereo Vision
(Figure: input stereo image sequence → dense 3D semantic model)
Given a sequence of stereo images we generate a dense 3D semantic model
Chap 4, Sec 4.1
62. Camera Estimation
Use the feature tracks to
estimate camera poses.
Use bundle adjustment
[a] Andreas Geiger et al. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR 2012
65. Depth-Map Estimation
Semiglobal block matching[1] for disparity estimation
Per-pixel depth computed as z = B × f / d
[1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
B – Baseline
f - Focal Length
d – pixel disparity
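The depth computation z = B × f / d is direct to implement. The baseline and focal length in the example are illustrative, KITTI-like values, not the actual calibration:

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px):
    """Per-pixel depth from a stereo disparity map via z = B * f / d.
    Zero disparities (no stereo match) are mapped to infinity."""
    d = np.asarray(disparity, dtype=float)
    with np.errstate(divide="ignore"):
        return np.where(d > 0, baseline_m * focal_px / d, np.inf)
```

For example, with a 0.54 m baseline and a 721 px focal length (illustrative), a 10 px disparity corresponds to roughly 39 m depth.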
66. Depth Fusion
Depth estimates are fused using
camera poses.
Fused into truncated signed
distance (TSDF) volumetric
representation[1].
Surface mesh generated through the marching tetrahedra algorithm.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
Chap 4, Sec 4.3.2
67. Depth fusion using TSDF Volume [1]
The entire space is divided into a grid of voxels.
For each voxel, compute the truncated signed distance:
positive (increasing) when the voxel lies in free space in front of the surface,
negative when it lies behind the surface,
zero when it lies on the surface.
Performed for all depth maps.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
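The per-voxel truncated signed distance and its fusion across depth maps can be sketched as follows; this is a simplified, per-ray view of the Curless & Levoy scheme, and the truncation value is illustrative:

```python
import numpy as np

def tsdf_value(voxel_depth, surface_depth, trunc=0.3):
    """Truncated signed distance for a voxel along a camera ray:
    positive in free space in front of the measured surface, negative
    behind it, zero on it, truncated to [-trunc, trunc]."""
    return float(np.clip(surface_depth - voxel_depth, -trunc, trunc))

def fuse(tsdf, weight, new_sd):
    """Weighted running average used to fuse measurements from many
    depth maps into one TSDF estimate per voxel."""
    fused = (tsdf * weight + new_sd) / (weight + 1)
    return fused, weight + 1
```

Repeated fusion averages out per-frame depth noise, which is why more depth maps yield a smoother extracted surface.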
70. Fusing multiple depth maps
Fusing a larger number of depth maps produces a smoother surface
Chap 4, Sec 4.3.2
71. Incremental Volume Update
Road scenes are generally described by arbitrarily long image sequences.
A 3×3×1 volume of voxel grids is initialised
Vehicle path ~1km
72. Incremental Volume Update
Need to map large sequences
A 3×3×1 volume of voxel grids is initialised
Volumes are incrementally added as the vehicle moves out of the region
Allows mapping of arbitrarily long sequences
Important for outdoor scenes
Vehicle path ~1km
74. Semantic Model Generation
We use conditional random field framework (CRF)
• Each pixel is a node in a grid graph G = (V,E) having a random
variable x taking a label from label set.
• Total energy E = Epix + Epair + Eregion
• Epix - Model individual pixel’s cost of taking a label.
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
(Figure: input image → CRF construction[1] → image segmentation)
Chap 4, Sec 4.4.1
(e.g. classifier scores for a pixel x: Fence 0.2, Road 0.3)
75. Semantic Image Segmentation
Epair - Models each pixel's neighbourhood interactions.
Encourages label consistency in
adjacent pixels and sensitive to edges.
Contrast sensitive Potts model
Both colour and depth images are used
Eregion - Model behaviour of a group of pixels
Groupings through superpixels
77. Mesh Face Labelling
A histogram of labels is
built for each mesh face
(Zf ), by projecting the
points from the face into
labelled images.
Majority label is
considered as the label of
the face.
Chap 4, Sec 4.4.2
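The majority-vote face labelling can be sketched as follows; the labels collected here stand in for the points of one face projected into the labelled images:

```python
import numpy as np

def face_label(projected_labels):
    """Label one mesh face by projecting its points into the labelled
    images and taking the majority vote over the collected labels."""
    votes = np.bincount(projected_labels)
    return int(votes.argmax())
```

Each face therefore takes the single label that dominates its histogram of projected labels.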
78. Semantic Model
Top: Left – Surface reconstruction, Right – Semantic model
Bottom: Left - input image, Right- object label set
79. Evaluation
KITTI Object Labelled Datasets: Manually labelled images for object
class training (available for download). [1]
The model is projected back using the estimated camera poses to create labelled images.
Points in the model far away from the camera are ignored in the projection.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticUrbanModelling/index.php
Chap 4, Sec 4.5
82. Long Sequence
A 1 km dense reconstruction overlaid on a Google map, showing the path of the vehicle.
83. Chapter Conclusion
Large scale dense semantic reconstruction
Sequential volume update for accommodating long sequences
Labelled dataset released.
Labelling performed at the image level results in semantic inconsistency, redundant labelling and a slow overall inference process.
Object layout in the scene helps in labelling
83
84. Chapter 5 - Mesh Based Scene Labelling
Motivation
Redundancy: individual street-level image labelling means ~0.5M pixels per image to process (a scene of 100-150 images ≈ 75M pixels): slow
Inconsistency in labelling
Utilising structure through mesh connectivity
Solution: perform labelling on the mesh
Chap 5, Sec 5.1
85. Mesh labelling Framework
Depth maps fused into mesh.
Every mesh location associated
with set of image pixels across a
set of images.
Obtain a combined appearance
score from these pixels through
a depth sensitive fusion of
scores.
Define CRF on mesh and
perform inference on the structure.
(Figure: mesh-based labelling framework)
86. CRF over Scene Mesh
We use conditional random field framework (CRF) defined
over the mesh locations.
• Each mesh vertex is a node in a graph G = (V,E), where E is
defined according to mesh neighbourhood.
• Each node is a random variable x taking a label from label set.
Chap 5, Sec 5.3
87. Unary Score
Total energy E(x) = Σ_i ψ_i(x_i) + Σ_{ij} ψ_ij(x_i, x_j)
Each mesh vertex is registered to a set of pixels from K images; their class-wise classifier scores are combined as ψ_i = f(pixel scores)
'f' can be 'max', 'average' or 'weighted'.
'weighted' - weighs the class scores inversely by the 3D distance of the pixel from the respective camera centre.
Chap 5, Sec 5.3.1
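The three fusion modes for f can be sketched as follows; the inverse-distance weighting is one plausible reading of 'weighted', not necessarily the exact thesis formula:

```python
import numpy as np

def fuse_scores(scores, depths, mode="weighted"):
    """Combine per-pixel class scores from the K images registered to one
    mesh vertex. 'max' and 'average' are plain reductions; 'weighted'
    weights each pixel's scores inversely by its 3D distance from the
    camera centre, so closer observations count more."""
    scores = np.asarray(scores, dtype=float)   # shape (K, n_classes)
    if mode == "max":
        return scores.max(axis=0)
    if mode == "average":
        return scores.mean(axis=0)
    w = 1.0 / np.asarray(depths, dtype=float)  # inverse-distance weights
    return (w[:, None] * scores).sum(axis=0) / w.sum()
```

With two registered pixels at depths 1 m and 3 m, the nearer pixel's scores dominate the weighted fusion.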
88. The pairwise term is defined on the mesh connectivity.
It takes a distance-weighted Potts form, where Z_i and Z_j are the 3D locations of mesh vertices i and j.
Thus mesh locations close to each other are encouraged to take the same label.
Pairwise
92. Observations
Improved accuracy for mesh based inference over image
based labelling and projecting the labels
The pairwise connection respecting mesh connectivity
improves labelling
(Figure columns: image, ground truth, unary only, unary + pairwise)
93. Timing performance
Labelling over mesh improves performance in inference
stage.
A scene of 150 images at resolution 1281×376 ≈ 75 million pixels
Mesh: 704K vertices and 1.27M faces
25x speedup in inference at our operating point
Further speedup possible by computing classifier
response only for registered pixels to mesh.
94. Inference Time with varying mesh size
Mesh created for the same scene with finer granularity.
95. Note: a ground-truth mesh is generated for each granularity
Varying the mesh granularity produces smaller mesh faces and affects the pairwise cost
Accuracy with varying mesh granularity
96. Scene editing
Labelling the 3D structure helps categorise 3D regions.
This enables active scene editing, e.g. moving a vehicle on the road.
Chap 5, Sec 5.4
98. Chapter Conclusions
Presented a mesh-based inference method for scene labelling.
Inference on the mesh provides a more accurate and faster approach to scene labelling.
Presented a classifier-score combination method which improves accuracy.
Up to 25× faster in the inference stage for outdoor scenes.
Applications: scene editing can be performed once the scene is labelled.
However, the mesh representation is limiting for various robotic tasks, which we try to overcome in the next chapter.
99. Chapter 6 - Hierarchical CRF on an Octree Graph
Computer vision: attempts to recognise scenes have been studied extensively.
Robotics: efficient and accurate 3D representations of scenes for various robotic tasks, but little on understanding semantics.
Aim: bring the two together for recognition in an efficient representation, and present a method which
Jointly performs recognition and infers occupancy.
Uses hierarchical constraints to perform scene labelling
Uses an efficient 3D representation for determining occupied, free and unknown areas.
Chap 6, Sec 6.1
100. Good 3D representation
Why
Needed for further processing tasks
Robotics domain – mapping, grasping/manipulation, navigation
Graphics domain – efficient rendering over graphics processing unit and
visualization
What
Should map accurately
Occupied: Objects present in the world,
Free: required for collision avoidance, path planning.
Unmapped: unknown areas in the scene need to be avoided.
Efficiency: any 3D volume should be identifiable as free/occupied/unmapped efficiently.
101. Existing 3D representations
Storing 3D measurements from sensors as point clouds: cannot map free and unknown areas
Mesh: same limitations as point clouds
Stixels/height maps/2.5D: one height value per 2D grid cell, but free areas are not accurately mapped
Fixed-size grid of voxels: voxels are not indexed, which makes it inefficient
Octree-based volumetric representation: introduced more than three decades ago, accurately represents 3D space with efficient indexing of volumes
102. Octomap - representation
Octree representation
Every voxel/volume is divided into 8 sub-volumes, allowing fast indexing of voxels
Advantageous in comparison to point clouds, surface maps, and elevation/2.5D representations
Used widely across computer science
Hardware friendly (CPU, GPU, FPGA)
OctoMap [a] proposed in 2013
Probabilistic representation of occupied, free and unknown regions
Based on the octree 3D representation
Demonstrated to map large areas through fusion of depth estimates.
[a] A. Hornung et al. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
103. Multi-resolution approaches in Computer vision
The multi-resolution approach is used for recognition, classification and detection
Information at the pixel level, pairs of pixels, or groups of pixels is combined together
Robust PN model [1] - penalised label inconsistency over a
group of pixels.
Grouping determined through unsupervised image segmentation
Here we extend the multi-resolution image-based classification approach to 3D volumes indexed through an octree
[1] P. Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency
105. Semantic Octree - framework
Generate point clouds and class hypotheses for every pixel
Chap 6, Sec 6.3
106. Semantic Octree - framework
Fused into an octree using the estimated camera poses
Octree: each volume is subdivided into 8 sub-volumes
Leaf nodes (xi) are the smallest-sized voxels
Any internal node (xc) gives a natural grouping of 3D space
Chap 6, Sec 6.3
107. Perform inference over 3D voxels to give labelled scene.
Semantic Octree - framework
Chap 6, Sec 6.3
108. CRF graph on octree voxels
The octree divides the space into sub-volumes indexed through a tree with nodes
τint : internal nodes of the tree (xc)
τleaf : leaf-level voxels (xi)
A random variable is defined for every leaf voxel
Every internal node is associated with a set of leaf voxels, resulting in a clique
The label set comprises the object class labels together with a free-space label
Final energy: unary terms over the leaf voxels plus Robust P^N terms over the cliques
Chap 6, Sec 6.3
109. Octree volume update
All voxels are initially set to unknown, with occupancy probability P(xi) = 0.5 and log odds l(xi) = 0
For each 3D point (obtained from stereo pairs), the voxels' log odds are updated in a ray-casting manner
Log odds are updated for all 3D points of every stereo pair
The final occupancy probability is recovered from the accumulated log odds
Unary score for leaf voxels
Chap 6, Sec 6.3.1
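The log-odds occupancy update can be sketched as follows; the hit/miss increments are illustrative, OctoMap-style values, not necessarily those used in the thesis:

```python
import math

def logodds(p):
    """Convert a probability to log-odds form."""
    return math.log(p / (1.0 - p))

def update_voxel(l, hit, l_hit=0.85, l_miss=-0.4):
    """Occupancy update in log-odds form: start at P = 0.5 (l = 0), add
    l_hit when a ray endpoint falls in the voxel and l_miss when a ray
    passes through it (the voxel was observed to be free)."""
    return l + (l_hit if hit else l_miss)

def probability(l):
    """Recover the occupancy probability from accumulated log odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Working in log odds makes each sensor update a simple addition, and the probability can be recovered at any time.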
110. Unary score for leaf voxels
Each occupied voxel xi is associated with a set of 3D points
The corresponding image pixels denoted as
Pixel scores combined together
Given the initial occupancy P(xi), the unary is given as:
Thus a voxel initially estimated as occupied has a high cost for taking the free label, and vice versa
Chap 6, Sec 6.3.1
111. Hierarchical tree potential
Robust PN potential applied over hierarchical groupings of voxels
Penalise label inconsistency within the grouping of voxels
Takes the form of a cost growing with the number of voxels that disagree with the clique's dominant label
The maximum cost is truncated at γmax
The groupings of voxels correspond to internal nodes in the octree
Chap 6, Sec 6.3.2
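The Robust P^N cost over one clique of leaf voxels can be sketched as follows; the per-voxel penalty q and the truncation γmax are illustrative values:

```python
import numpy as np

def robust_pn(clique_labels, gamma_max=1.0, q=0.3):
    """Robust P^N cost for one octree clique (the leaf voxels under an
    internal node): the penalty grows linearly with the number of voxels
    disagreeing with the clique's dominant label, truncated at gamma_max."""
    labels = np.asarray(clique_labels)
    n_disagree = len(labels) - np.bincount(labels).max()
    return min(q * n_disagree, gamma_max)
```

A fully consistent clique pays nothing, a few dissenting voxels pay proportionally, and a heterogeneous clique pays only the truncated maximum, which is what makes the potential robust.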
112. Experiments
An octree with 16 levels is defined
Smallest voxel resolution = (8×8×8) cm³
Maximum mapped volume = (2^16 × 8 cm)³ ≈ 5.24³ km³
Hierarchical groupings of voxels corresponding to internal nodes at levels 13-15 are considered
114. Quantitative evaluation :
Performed by projecting into image domain
Observations
Small objects tend to get decimated by octree quantisation, hence reduced accuracy
A mesh-based representation is better at representing surfaces
Non-uniform grouping of volumes (k-d tree) could be used to improve results
Results
116. Chapter Conclusion
A method to jointly infer object class labels and occupancy mapping was proposed
Efficient representation of 3D space for further operations like navigation and manipulation
The octree introduces a quantisation error, which could be addressed by grouping volumes through a k-d tree
117. Thesis - Conclusions
This thesis covered the aspects of scene understanding
and proposed solutions for dense semantic mapping and
reconstruction
Chapter 3 – Large-scale dense semantic mapping
Overhead semantic view of an urban region
Experiments generating a ~15 km map
One of the first large-scale semantic maps
Presented as oral in IEEE IROS 2012
Chap 7, Sec 7.1
118. Thesis - Conclusions
Chapter 4 – Dense semantic reconstruction
Dense semantic reconstruction from kms of
stereo images.
Online volumetric reconstruction to
accommodate arbitrarily long road scenes.
Presented as oral in IEEE ICRA 2013
Chapter 5 – Mesh based inference for scene labelling
Improved labelling accuracy (pairwise connections
respect mesh connectivity) and consistency.
Depth sensitive classifier fusion.
25x faster in inference time
Presented as poster in CVPR 2013
119. Conclusions
Chapter 6 – Hierarchical CRF on an Octree Graph
Unified framework to determine 3D volume occupancy along with object class labels in the scene.
Efficient representation
Robust PN potential over octree volumes
Datasets (available publicly)
Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
120. Way forward
Transfer learning – many datasets with many labellings exist; aim to learn from multiple sources and apply the result to new test cases.
Lifelong learning – an agent needs to identify objects irrespective of changes in the environment
Exploit high-level attributes
Investigate an end-to-end real-time pipeline for dense recognition and reconstruction
Exploit scene dynamics – DVS (dynamic vision sensors) report only changed pixels through efficient sensing.
Chap 7, sec 7.2
121. Thank you
Acknowledgements
Supervisors: Philip Torr and David Duce
Thesis Examiners: Gabriel Brostow and Nigel Crook
Collaborators: Paul Sturgess, Lubor Ladicky, Ali Shahrokni, Eric
Greeveson, Julien Valentin, Ziming Zhang, Johnathan Warrell, Chris
Russell, Yalin Bastanlar, William Clocksin, Vibhav Vineet, Mike Sapi.
122. References
Lubor Ladicky et al. Associative hierarchical CRFs for object class image segmentation. ICCV 2009; PAMI 13
Pushmeet Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency. IJCV 2009
Paul Sturgess et al. Combining Appearance and Structure from Motion Features for Road Scene Understanding. BMVC 2009
Lubor Ladicky et al. Joint optimisation for object class segmentation and dense stereo reconstruction. BMVC 2010; IJCV 2012
Richard A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.