Semantic mapping of road scenes, PhD thesis. The main aim of the thesis is to investigate and propose solutions to the scene-understanding problem of finding 'what' objects are present in the world and 'where' they are located.
1. PhD Thesis Defence
Sunando Sengupta
Oxford Brookes University
Semantic Mapping of Road Scenes
Supervisors – Prof. Philip Torr and Prof. David Duce
16/06/2014
2. Outline
Introduction
The Labelling problem
Dense Semantic Map (chap. 3)
Dense 3D Semantic Modelling (chap. 4)
Mesh Based Inference (chap. 5)
Hierarchical CRF on an Octree Graph (chap. 6)
Conclusion
3. Objective
Holy grail of computer vision
What are the objects present in the scene
Where are they located
Biological vision accomplishes both tasks through human visual perception.
Computers (or humans through them) try to solve the same problem through an information-processing route:
Gather sensor data (images, GPS, IMU, …)
Represent the data in a map
Recognise objects in the map
This thesis investigates this problem and proposes solutions to address it.
Mapping and recognition can happen simultaneously or sequentially
Chap 1, Sec 1.2
4. Objective - Visually
Input image of a street scene, person cleaning, some cars in the
background, and buildings in the horizon.
The goal: place the appropriate objects at the right distance from the camera, at the correct size.
Chap 1, Sec 1.2
Image courtesy: Antonio Torralba,
http://6.869.csail.mit.edu/fa13/
5. Why it is important
Numerous applications from robotics, entertainment,
engineering, medical…
Self driving cars
Engineering
Robots for manipulation
Humanoids
Assistive vision for impaired
Entertainment
The aim: a vision-based system that produces a semantically consistent scene representation from visual inputs
Chap 1, Sec 1.2
6. Essentially a hard problem
Large variation in the image formation
Scene Variation
Varying scene type and geometry
Object level variation
Large number of object classes
Individual Object location and orientation
Object shape and appearance
Depth/occlusions
Illumination
Shadows
Motion blur
Chap 1, Sec 1.2
7. Thesis - Contributions
This thesis provides solutions for large scale outdoor
urban semantic mapping.
Large-scale dense overhead semantic mapping.
Semantics from local images fused to form a global ground-plane map
First attempt to generate such a map.
~15 km of semantic mapping
One of the first large-scale semantic maps
Presented as oral in IEEE IROS 2012
Chap 1, Sec 1.3
8. Thesis - Contributions
Dense semantic reconstruction
Dense 3D semantic reconstruction from kms of
stereo images.
Online sequential volumetric reconstruction to
accommodate arbitrarily long road scenes.
Presented as oral in IEEE ICRA 2013.
Mesh based inference for scene labelling
Improved labelling accuracy and consistency.
Depth sensitive classifier fusion.
25x faster in inference time (than image labelling).
Presented as poster in CVPR 2013.
Chap 1, Sec 1.3
9. Thesis - Contributions
Hierarchical CRF on an Octree Graph
Unified framework to determine free and
occupied regions in a scene along with
object class labels.
Robust PN potential over octree volumes
Datasets (available online)
Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
Chap 1, Sec 1.3
10. Publications
Related to Thesis
S. Sengupta, P. Sturgess, L. Ladicky, P. H. S. Torr: Automatic dense visual semantic mapping from street-level imagery. IEEE/RSJ IROS 2012 (Chapter 3)
S. Sengupta, E. Greveson, A. Shahrokni, P. H. S. Torr: Urban 3D Semantic Modelling Using Stereo Vision. IEEE ICRA 2013 (Chapter 4)
S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. IEEE CVPR 2013 (*Joint first authors, Chapter 5)
S. Sengupta*, J. Valentin*, J. Warrell, A. Shahrokni, P. H. S. Torr: Mesh Based Semantic Modelling for Indoor and Outdoor Scenes. SUNw: Scene Understanding Workshop, held in conjunction with CVPR 2013 (*Joint first authors, invited paper)
Datasets
Yotta Labeled road scene dataset.
KITTI object labelling. (Datasets available at http://www.robots.ox.ac.uk/~tvg/projects )
Other publications
Z. Zhang, P. Sturgess, S. Sengupta, N. Crook, P. H. S. Torr: Efficient discriminative learning of parametric nearest neighbor classifiers. IEEE CVPR 2012
L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimization for Object Class Segmentation and Dense Stereo Reconstruction. IJCV 2012 (invited paper)
L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. F. Clocksin, P. H. S. Torr: Joint Optimisation for Object Class Segmentation and Dense Stereo Reconstruction. BMVC 2010 (BMVA Best Science Paper)
Chap 1, Sec 1.4
11. Many computer vision tasks can be modelled as labelling problems
Assign each site in a discrete set a label from a label set
e.g. each pixel is associated with an object class label
The labelling problem
Chap 2, Sec 2.1
12.
What are the Labels
Discrete or continuous
Discrete
Image pixels assigned to object classes like Cars, humans, buildings, pavement,
trees etc.
Foreground/background labels
Indoor/outdoor labels…
Continuous range
Depth: Pixels can take a set of disparity labels
Optical flow
Chap 2, Sec 2.1
13.
CRF Framework
A set of random variables X = {x1, x2, …, xN}, one per pixel, each taking a label from the label set
The aim is to associate every random variable with a label
The conditional probability of the labelling x given the data D is P(x|D) = (1/Z) exp(−E(x)), where E(x) is the Gibbs energy
The MAP labelling x* of the random field is x* = argmin_x E(x)
Chap 2, Sec 2.2
14.
CRF modelling for image labelling
• The pixel labelling problem can be formulated as a pairwise/higher-order CRF whose energy is
E(x) = Σ_{i∈V} ψ_i(x_i) + Σ_{i∈V, j∈N_i} ψ_ij(x_i, x_j)
• The image is represented as a graph G = {V, E}
• V is the set of nodes of the graph
• N_i is the neighbourhood of node i
• The unary potential ψ_i measures the cost of assigning a particular label to pixel i
• Generated from the response of a boosted classifier over a region around each pixel
Chap 2, Sec 2.2
15.
CRF modelling for image labelling
• The pairwise (smoothness) term depends on inter-pixel observations and should be discontinuity-preserving across object boundaries
• Takes the contrast-sensitive Potts form ψ_ij(x_i, x_j) = 0 if x_i = x_j, and g(i, j) otherwise, where g(i, j) decreases with the image contrast between pixels i and j
• Higher-order potentials are defined on groups of pixels that are conditionally dependent on each other
• Robust P^N and hierarchical P^N models [1]
• The final labelling is obtained by minimising the energy E
Chap 2, Sec 2.2
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
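The pairwise CRF energy above can be sketched in code. A minimal illustration, assuming a 4-connected grid and a common exponential form for the contrast weight g(i, j); the exact weighting used in the thesis may differ:

```python
import numpy as np

def crf_energy(labels, unary, image, theta=0.1, lam=1.0):
    """Pairwise CRF energy: sum of per-pixel unary costs plus a
    contrast-sensitive Potts penalty over 4-connected neighbours."""
    H, W = labels.shape
    # Unary term: cost of each pixel taking its assigned label.
    e = sum(unary[labels[i, j], i, j] for i in range(H) for j in range(W))
    # Pairwise term: penalise label changes, less so across strong edges.
    for i in range(H):
        for j in range(W):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < H and nj < W and labels[i, j] != labels[ni, nj]:
                    g = lam * np.exp(-theta * (image[i, j] - image[ni, nj]) ** 2)
                    e += g
    return e
```

A uniform labelling pays no pairwise cost, while every label discontinuity in a smooth image region pays the full Potts penalty.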
16.
Quite hard
The energy minimisation is quite hard (a large number of random variables with interconnections).
Possible solutions: simulated annealing, ICM, but slow.
Approximate algorithms exist for certain energy functions for the multi-label problem.
Move-making algorithms[1]
α-expansion: for each label α, allow every random variable to retain its existing label or change to α, using graph cuts.
αβ-swap: considers a pair of labels (α, β) at each iteration, allowing pixels labelled α or β to swap between the two, using graph cuts.
Chap 2, Sec 2.2
[1] Boykov et al. Fast Approximate Energy Minimization via Graph Cuts, ICCV 1999
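Of the simple minimisers mentioned above, ICM is the easiest to sketch. A toy illustration on a Potts energy, shown only as a baseline; the thesis uses graph-cut based moves, not ICM:

```python
import numpy as np

def icm(unary, lam=1.0, n_iters=10):
    """Iterated Conditional Modes: greedily relabel each pixel to the label
    minimising its local (unary + Potts pairwise) cost, holding its
    neighbours fixed. Converges only to a local minimum."""
    L, H, W = unary.shape
    labels = unary.argmin(axis=0)          # start from the unary winner
    for _ in range(n_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                costs = unary[:, i, j].copy()
                for ni, nj in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
                    if 0 <= ni < H and 0 <= nj < W:
                        # Potts: pay lam for each disagreeing neighbour.
                        costs += lam * (np.arange(L) != labels[ni, nj])
                best = costs.argmin()
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break
    return labels
```

With a strong enough smoothness weight, an isolated noisy label is flipped to agree with its neighbours.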
17. Stereo
Early attempts to explain depth began in the Renaissance
The images formed at the left and right eyes can be used to obtain a disparity/depth map
Stereo sketch by Jacopo Chimenti da Empoli,
Italy , around 1600 AD
Leonardo da Vinci, Optical Studies
on Binocular vision
Chap 2, Sec 2.3
18. Depth from a Sequence of Images
Structure from motion for sparse 3D reconstruction[1]
Visual hull/silhouette-based volume carving[3]
Elevation/height/2.5D maps[2]
TSDF/voxel-based fusion[4]
Chap 2, Sec 2.3
[1] S. Agarwal et al. Building Rome in a day. Commun. ACM, 2011.
[2] F. Erbs et al. Stixmentation – probabilistic stixel based traffic scene labeling. BMVC 2012.
[3] Y. Furukawa et al. Carved visual hulls for image-based modeling. IJCV, 2009.
[4] R. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.
19. Dense Semantic Mapping
Generate an overhead view of an urban region.
Every pixel in the map view is associated with an object class label
Object classes: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
Chap 3, Sec 3.1
Street images captured inexpensively from a vehicle with multiple mounted cameras[1].
[1] Yotta DCL, “Yotta dcl case studies,” Available: http://www.yottadcl.com/surveys/case-studies/
Dense Semantic Mapping
21. Semantic Mapping Framework
The semantic mapping framework comprises two stages
Street-level image acquisition
Chap 3, Sec 3.3
22. Semantic Mapping Framework
The semantic mapping framework comprises two stages
Semantic image segmentation at street level.
Street-level image acquisition
Image segmentation
23. The semantic mapping framework comprises two stages
Semantic image segmentation at street level.
Ground-plane labelling at a global level.
First attempt to do an overhead mapping from street-level images.
Semantic Mapping Framework
Street-level image acquisition
Image segmentation
Ground-plane labelling
24. Street-level Image Segmentation
Label every pixel in the image with an object class label
Input Output
Raw Image Labelled Image
Automatic
Labeller
Object Class Labels
Chap 3, Sec 3.3.1
25. Street-level Image Segmentation
CRF based image labeller
Each pixel is a node in a grid graph G = (V,E).
Each node is a random variable x taking a label from
label set.
(Figure: input image → CRF construction → final segmentation)
26. Semantic Image Segmentation - CRF
Total energy
E(x) = Σ_{i∈V} ψ_i(x_i) + Σ_{i∈V, j∈N_i} ψ_ij(x_i, x_j) + Σ_{c∈C} ψ_c(x_c) = Epix + Epair + Eregion
Optimal labelling given as x* = argmin_x E(x)
27. Total energy E = Epix + Epair + Eregion
Epix - Model individual pixel’s cost of taking a label.
Computed via the dense boosting approach
Multi feature variant of texton boost[1]
Semantic Image Segmentation - CRF
(e.g. classifier scores for a pixel x: Car 0.2, Road 0.3)
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
28. Total energy E = Epix + Epair + Eregion
Epair - Model each pixel neighbourhood interactions.
Encourages label consistency in adjacent pixels
Sensitive to edges in images.
Contrast-sensitive Potts model: ψ_ij(x_i, x_j) = 0 if x_i = x_j, and g(i, j) otherwise, where g(i, j) is lower across strong image edges
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
29. Total energy E = Epix + Epair + Eregion
Eregion - Models the behaviour of a group of pixels.
Classifies a region
Encourages all the pixels in a region to take the same label.
Groups of pixels are given by multiple mean-shift segmentations
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
30.
Energy minimisation using the alpha-expansion algorithm[1]
Input Image → Road Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Semantic Image Segmentation - CRF
31.
Input Image → Building Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
32. Input Image → Sky Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
33. Input Image → Pavement Expansion
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
34. Input Image → Final Solution
[1] Fast Approximate Energy Minimization via Graph Cuts. Yuri Boykov et al. ICCV 99
Solved using alpha-expansion algorithm[1]
Semantic Image Segmentation - CRF
35. Ground Plane Labelling
Combine many labellings from street level imagery.
Automatic Labeller
Input: street-level labellings → Output: labelled ground plane
36. Ground Plane CRF
A CRF defined over the ground plane.
Each ground plane pixel (zi) is a random variable taking a
label from the label set.
Energy for the ground plane CRF is
E(Z) = Eg_pix + Eg_pair
Chap 3, Sec 3.3.2
38. Ground Plane Pixel Cost
(Figure: homography from image to ground plane; classes Road, Pavement, Post/Pole)
A ground plane region is estimated.
39. • Each point in the image projects to a unique point on the ground plane
– defining a homography
Ground Plane Pixel Cost
40. • The image labelling is mapped to the ground plane
– via the homography.
Ground Plane Pixel Cost
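The image-to-ground-plane mapping can be illustrated with a generic 3×3 homography. The matrix below is purely illustrative, not the calibrated homography used in the thesis:

```python
import numpy as np

def project_to_ground(H, pixels):
    """Map image pixels (u, v) to ground-plane coordinates via a 3x3
    homography, using homogeneous coordinates."""
    pts = np.hstack([pixels, np.ones((len(pixels), 1))])   # (u, v, 1)
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                  # dehomogenise

# Example: a homography that scales by 2 and shifts u by 1 (illustrative).
H = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])
```

With a real calibrated homography, the same routine carries each labelled image pixel onto its ground-plane cell.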
41. • Labels projected from many views are combined in a
histogram.
• The normalised histogram gives the naïve probability of
the ground plane pixel taking a label.
Ground Plane Pixel Cost
43. Ground Plane labelling
A histogram is built for every ground plane pixel, giving Eg_pix
A pairwise cost (Eg_pair) is added to induce smoothness
Contrast-sensitive Potts model
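The histogram-based unary cost can be sketched as follows. Taking the negative log of the normalised histogram is one standard way to turn the vote probability into a cost; the exact cost used in the thesis may differ:

```python
import numpy as np

def ground_unary(votes, n_classes, eps=1e-6):
    """Fuse per-view label projections for one ground-plane pixel: build a
    histogram of projected labels, normalise it into a probability, and
    take the negative log as the unary cost Eg_pix."""
    hist = np.bincount(votes, minlength=n_classes).astype(float)
    prob = hist / hist.sum()
    return -np.log(prob + eps)   # low cost for frequently projected labels

votes = np.array([0, 0, 0, 1])   # e.g. 3 views vote 'road', 1 votes 'pavement'
cost = ground_unary(votes, n_classes=3)
```

The most frequently projected label receives the lowest cost, and labels never projected receive a very large cost.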
49. Experiments - Dataset
Subset of the images captured by the van
~15 km of track, 8000 images from each camera.
Pixel-level labelled ground truth images. Dataset
available[1].
13 object categories: Building, Road, Tree, Vegetation, Fence, Signage, Sky, Pavement, Car, Pedestrian, Bollard, Shop Sign, Post
Training: 44 images; testing: 42 images.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticMap/index.php
Chap 3, Sec 3.4.1
50. SIS Results
Input Images, output of our image level CRF, ground truths.
Used Automatic Labelling environment[1]
[1] The Automatic Labelling Environment, L Ladicky, PHS Torr. Code available
http://cms.brookes.ac.uk/staff/PhilipTorr/ale.htm
(Figure rows: input, semantic segmentation, ground truth)
52. Ground plane Map Evaluation
(Figure rows: street images, back-projected map results, ground truth)
• We back-project the ground plane map into image domain
and evaluate the results.
• Global pixel accuracy of 83%
54. Chapter Summary
Presented a method to generate
overhead view semantic mapping.
Experiments on large tracks (~15 km), which can be scaled up to country-wide mapping
Dataset available[1].
However, a flat-world assumption does not represent the 3D scene properly; our aim is to perform a semantic metric reconstruction of the world.
[1] http://cms.brookes.ac.uk/research/visiongroup/projects/SemanticMap/index.php
55. Urban 3D Semantic Modelling Using Stereo Vision
(Figure: input stereo image sequence → dense 3D semantic model)
Given a sequence of stereo images we generate a dense 3D semantic model
Chap 4, Sec 4.1
62. Camera Estimation
Use the feature tracks to
estimate camera poses.
Use bundle adjustment
[a] Andreas Geiger et al. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR 2012
65. Depth-Map Estimation
Semiglobal block matching[1] for disparity estimation
Per-pixel depth computed as z = B × f / d
[1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information. PAMI 2008.
B – Baseline
f - Focal Length
d – pixel disparity
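The depth computation z = B × f / d is direct to implement. The baseline and focal length in the example are illustrative, KITTI-like values, not the actual calibration:

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px):
    """Per-pixel depth from a stereo disparity map via z = B * f / d.
    Zero disparities (no stereo match) are mapped to infinity."""
    d = np.asarray(disparity, dtype=float)
    with np.errstate(divide="ignore"):
        return np.where(d > 0, baseline_m * focal_px / d, np.inf)
```

For example, with a 0.54 m baseline and a 721 px focal length (illustrative), a 10 px disparity corresponds to roughly 39 m depth.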
66. Depth Fusion
Depth estimates are fused using
camera poses.
Fused into truncated signed
distance (TSDF) volumetric
representation[1].
Surface mesh generated through the marching tetrahedra algorithm.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
Chap 4, Sec 4.3.2
67. Depth fusion using TSDF Volume [1]
The entire space is divided into a grid of voxels.
For each voxel, compute the truncated signed distance:
positive (increasing) when the voxel lies in free space in front of the surface,
negative when it lies behind the surface,
zero when it lies on the surface.
Performed for all depth maps.
[1] Brian Curless and Marc Levoy, A Volumetric Method for Building
Complex Models from Range Images Siggraph 96.
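The per-voxel truncated signed distance and its fusion across depth maps can be sketched as follows; this is a simplified, per-ray view of the Curless & Levoy scheme, and the truncation value is illustrative:

```python
import numpy as np

def tsdf_value(voxel_depth, surface_depth, trunc=0.3):
    """Truncated signed distance for a voxel along a camera ray:
    positive in free space in front of the measured surface, negative
    behind it, zero on it, truncated to [-trunc, trunc]."""
    return float(np.clip(surface_depth - voxel_depth, -trunc, trunc))

def fuse(tsdf, weight, new_sd):
    """Weighted running average used to fuse measurements from many
    depth maps into one TSDF estimate per voxel."""
    fused = (tsdf * weight + new_sd) / (weight + 1)
    return fused, weight + 1
```

Repeated fusion averages out per-frame depth noise, which is why more depth maps yield a smoother extracted surface.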
70. Fusing multiple depth maps
Fusing a larger number of depth maps produces a smoother surface
Chap 4, Sec 4.3.2
71. Incremental Volume Update
Road scenes are generally described by arbitrarily long image sequences.
A 3×3×1 volume of voxel grids is initialised
Vehicle path ~1km
72. Incremental Volume Update
Need to map large sequences
A 3×3×1 volume of voxel grids is initialised
Volumes are incrementally added as the vehicle moves out of the region
Allows mapping of arbitrarily long sequences
Important for outdoor scenes
Vehicle path ~1km
74. Semantic Model Generation
We use conditional random field framework (CRF)
• Each pixel is a node in a grid graph G = (V,E) having a random
variable x taking a label from label set.
• Total energy E = Epix + Epair + Eregion
• Epix - Model individual pixel’s cost of taking a label.
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical crfs for
object class image segmentation,” in ICCV, 2009.
(Figure: input image → CRF construction[1] → image segmentation)
Chap 4, Sec 4.4.1
(e.g. classifier scores for a pixel x: Fence 0.2, Road 0.3)
75. Semantic Image Segmentation
Epair - Models each pixel's neighbourhood interactions.
Encourages label consistency in
adjacent pixels and sensitive to edges.
Contrast sensitive Potts model
Both colour and depth images are used
Eregion - Model behaviour of a group of pixels
Groupings through superpixels
77. Mesh Face Labelling
A histogram of labels is
built for each mesh face
(Zf ), by projecting the
points from the face into
labelled images.
Majority label is
considered as the label of
the face.
Chap 4, Sec 4.4.2
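The majority-vote face labelling can be sketched as follows; the labels collected here stand in for the points of one face projected into the labelled images:

```python
import numpy as np

def face_label(projected_labels):
    """Label one mesh face by projecting its points into the labelled
    images and taking the majority vote over the collected labels."""
    votes = np.bincount(projected_labels)
    return int(votes.argmax())
```

Each face therefore takes the single label that dominates its histogram of projected labels.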
78. Semantic Model
Top: Left – Surface reconstruction, Right – Semantic model
Bottom: Left - input image, Right- object label set
79. Evaluation
KITTI Object Labelled Datasets: Manually labelled images for object
class training (available for download). [1]
The model is projected back using the estimated camera poses to create labelled images.
Points in the model far away from the camera are ignored in the projection.
[1] http://www.robots.ox.ac.uk/~tvg/projects/SemanticUrbanModelling/index.php
Chap 4, Sec 4.5
82. Long Sequence
A 1 km dense reconstruction overlaid on a Google map, showing the path of the vehicle.
83. Chapter Conclusion
Large scale dense semantic reconstruction
Sequential volume update for accommodating long sequences
Labelled dataset released.
Labelling performed at the image level results in semantic inconsistency, redundant labelling and a slow overall inference process.
Object layout in the scene helps in labelling
83
84. Chapter 5 - Mesh Based Scene Labelling
Motivation
Redundancy: individual street-level image labelling means ~0.5M pixels per image to process (a scene of 100-150 images ≈ 75M pixels): slow
Inconsistency in labelling
Utilising structure through mesh connectivity
Solution: perform labelling on the mesh
Chap 5, Sec 5.1
85. Mesh labelling Framework
Depth maps fused into mesh.
Every mesh location associated
with set of image pixels across a
set of images.
Obtain a combined appearance
score from these pixels through
a depth sensitive fusion of
scores.
Define CRF on mesh and
perform inference on the structure.
(Figure: mesh-based labelling framework)
86. CRF over Scene Mesh
We use conditional random field framework (CRF) defined
over the mesh locations.
• Each mesh vertex is a node in a graph G = (V,E), where E is
defined according to mesh neighbourhood.
• Each node is a random variable x taking a label from label set.
Chap 5, Sec 5.3
87. Unary Score
Total energy E(x) = Σ_i ψ_i(x_i) + Σ_{ij} ψ_ij(x_i, x_j)
Each mesh vertex is registered to a set of pixels from K images; their class-wise classifier scores are combined as ψ_i = f(pixel scores)
'f' can be 'max', 'average' or 'weighted'.
'weighted' - weighs the class scores inversely by the 3D distance of the pixel from the respective camera centre.
Chap 5, Sec 5.3.1
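The three fusion modes for f can be sketched as follows; the inverse-distance weighting is one plausible reading of 'weighted', not necessarily the exact thesis formula:

```python
import numpy as np

def fuse_scores(scores, depths, mode="weighted"):
    """Combine per-pixel class scores from the K images registered to one
    mesh vertex. 'max' and 'average' are plain reductions; 'weighted'
    weights each pixel's scores inversely by its 3D distance from the
    camera centre, so closer observations count more."""
    scores = np.asarray(scores, dtype=float)   # shape (K, n_classes)
    if mode == "max":
        return scores.max(axis=0)
    if mode == "average":
        return scores.mean(axis=0)
    w = 1.0 / np.asarray(depths, dtype=float)  # inverse-distance weights
    return (w[:, None] * scores).sum(axis=0) / w.sum()
```

With two registered pixels at depths 1 m and 3 m, the nearer pixel's scores dominate the weighted fusion.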
88. The pairwise term is defined on the mesh connectivity.
It takes a distance-weighted Potts form, where Z_i and Z_j are the 3D locations of mesh vertices i and j.
Thus mesh locations close to each other are encouraged to take the same label.
Pairwise
92. Observations
Improved accuracy for mesh based inference over image
based labelling and projecting the labels
The pairwise connection respecting mesh connectivity
improves labelling
(Figure columns: image, ground truth, unary only, unary + pairwise)
93. Timing performance
Labelling over mesh improves performance in inference
stage.
A scene of 150 images at resolution 1281×376 ≈ 75 million pixels
Mesh: 704K vertices and 1.27M faces
25x speedup in inference at our operating point
Further speedup possible by computing classifier
response only for registered pixels to mesh.
94. Inference Time with varying mesh size
Mesh created for the same scene with finer granularity.
95. Note: a ground-truth mesh is generated for each granularity
Varying the mesh granularity produces smaller mesh faces and affects the pairwise cost
Accuracy with varying mesh granularity
96. Scene editing
Labelling the 3D structure helps categorise 3D regions.
This enables active scene editing, e.g. moving a vehicle on the road.
Chap 5, Sec 5.4
98. Chapter Conclusions
Presented a mesh-based inference method for scene labelling.
Inference on the mesh provides a more accurate and faster approach to scene labelling.
Presented a classifier-score combination method which improves accuracy.
Up to 25× faster in the inference stage for outdoor scenes.
Applications: scene editing can be performed once the scene is labelled.
However, the mesh representation is limiting for various robotic tasks, which we try to overcome in the next chapter.
99. Chapter 6 - Hierarchical CRF on an Octree Graph
Computer vision: attempts to recognise scenes have been studied extensively.
Robotics: efficient and accurate 3D representations of scenes for various robotic tasks, but little on understanding semantics.
Aim: bring the two together for recognition in an efficient representation, and present a method which
Jointly performs recognition and infers occupancy.
Uses hierarchical constraints to perform scene labelling
Uses an efficient 3D representation for determining occupied, free and unknown areas.
Chap 6, Sec 6.1
100. Good 3D representation
Why
Needed for further processing tasks
Robotics domain – mapping, grasping/manipulation, navigation
Graphics domain – efficient rendering over graphics processing unit and
visualization
What
Should map accurately
Occupied: Objects present in the world,
Free: required for collision avoidance, path planning.
Unmapped: unknown areas in the scene need to be avoided.
Efficiency: any 3D volume should be identifiable as free/occupied/unmapped efficiently.
101. Existing 3D representations
Storing 3D measurements from sensors as point clouds: cannot map free and unknown areas
Mesh: same limitations as point clouds
Stixels/height maps/2.5D: one height value per 2D grid cell, but free areas are not accurately mapped
Fixed-size grid of voxels: voxels are not indexed, which makes it inefficient
Octree-based volumetric representation: introduced more than three decades ago, accurately represents 3D space with efficient indexing of volumes
102. Octomap - representation
Octree representation
Every voxel/volume is divided into 8 sub-volumes, allowing fast indexing of voxels
Advantageous in comparison to point clouds, surface maps, and elevation/2.5D representations
Used widely across computer science
Hardware friendly (CPU, GPU, FPGA)
OctoMap [a] proposed in 2013
Probabilistic representation of occupied, free and unknown regions
Based on the octree 3D representation
Demonstrated to map large areas through fusion of depth estimates.
[a] A. Hornung et al. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013.
103. Multi-resolution approaches in Computer vision
The multi-resolution approach is used for recognition, classification and detection
Information at the pixel level, pairs of pixels, or groups of pixels is combined together
Robust PN model [1] - penalised label inconsistency over a
group of pixels.
Grouping determined through unsupervised image segmentation
Here we extend the multi-resolution image-based classification approach to 3D volumes indexed through an octree
[1] P. Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency
105. Semantic Octree - framework
Generate point clouds and class hypotheses for every pixel
Chap 6, Sec 6.3
106. Semantic Octree - framework
Fused into an octree using the estimated camera poses
Octree: each volume is subdivided into 8 sub-volumes
Leaf nodes (xi) are the smallest-sized voxels
Any internal node (xc) gives a natural grouping of 3D space
Chap 6, Sec 6.3
107. Perform inference over 3D voxels to give labelled scene.
Semantic Octree - framework
Chap 6, Sec 6.3
108. CRF graph on octree voxels
The octree divides the space into sub-volumes indexed through a tree with nodes
τint : internal nodes of the tree (xc)
τleaf : leaf-level voxels (xi)
A random variable is defined for every leaf voxel
Every internal node is associated with a set of leaf voxels, resulting in a clique
The label set comprises the object class labels together with a free-space label
Final energy: unary terms over the leaf voxels plus Robust P^N terms over the cliques
Chap 6, Sec 6.3
109. Octree volume update
All voxels are initially set to unknown, with occupancy probability P(xi) = 0.5 and log odds l(xi) = 0
For each 3D point (obtained from stereo pairs), the voxels' log odds are updated in a ray-casting manner
Log odds are updated for all 3D points of every stereo pair
The final occupancy probability is recovered from the accumulated log odds
Unary score for leaf voxels
Chap 6, Sec 6.3.1
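The log-odds occupancy update can be sketched as follows; the hit/miss increments are illustrative, OctoMap-style values, not necessarily those used in the thesis:

```python
import math

def logodds(p):
    """Convert a probability to log-odds form."""
    return math.log(p / (1.0 - p))

def update_voxel(l, hit, l_hit=0.85, l_miss=-0.4):
    """Occupancy update in log-odds form: start at P = 0.5 (l = 0), add
    l_hit when a ray endpoint falls in the voxel and l_miss when a ray
    passes through it (the voxel was observed to be free)."""
    return l + (l_hit if hit else l_miss)

def probability(l):
    """Recover the occupancy probability from accumulated log odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Working in log odds makes each sensor update a simple addition, and the probability can be recovered at any time.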
110. Unary score for leaf voxels
Each occupied voxel xi is associated with a set of 3D points
The corresponding image pixels denoted as
Pixel scores combined together
Given the initial occupancy P(xi), the unary is given as:
Thus a voxel initially estimated as occupied has a high cost for taking the free label, and vice versa
Chap 6, Sec 6.3.1
111. Hierarchical tree potential
Robust PN potential applied over hierarchical groupings of voxels
Penalise label inconsistency within the grouping of voxels
Takes the form of a cost growing with the number of voxels that disagree with the clique's dominant label
The maximum cost is truncated at γmax
The groupings of voxels correspond to internal nodes in the octree
Chap 6, Sec 6.3.2
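The Robust P^N cost over one clique of leaf voxels can be sketched as follows; the per-voxel penalty q and the truncation γmax are illustrative values:

```python
import numpy as np

def robust_pn(clique_labels, gamma_max=1.0, q=0.3):
    """Robust P^N cost for one octree clique (the leaf voxels under an
    internal node): the penalty grows linearly with the number of voxels
    disagreeing with the clique's dominant label, truncated at gamma_max."""
    labels = np.asarray(clique_labels)
    n_disagree = len(labels) - np.bincount(labels).max()
    return min(q * n_disagree, gamma_max)
```

A fully consistent clique pays nothing, a few dissenting voxels pay proportionally, and a heterogeneous clique pays only the truncated maximum, which is what makes the potential robust.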
112. Experiments
An octree with 16 levels is defined
Smallest voxel resolution = (8×8×8) cm³
Maximum mapped volume = (2^16 × 8 cm)³ ≈ 5.24³ km³
Hierarchical groupings of voxels corresponding to internal nodes at levels 13-15 are considered
114. Quantitative evaluation :
Performed by projecting into image domain
Observations
Small objects tend to get decimated by octree quantisation, hence reduced accuracy
A mesh-based representation is better at representing surfaces
Non-uniform grouping of volumes (k-d tree) could be used to improve results
Results
116. Chapter Conclusion
A method to jointly infer object class labels and occupancy mapping was proposed
Efficient representation of 3D space for further operations like navigation and manipulation
The octree introduces a quantisation error, which could be addressed by grouping volumes through a k-d tree
117. Thesis - Conclusions
This thesis covered the aspects of scene understanding
and proposed solutions for dense semantic mapping and
reconstruction
Chapter 3 – Large-scale dense semantic mapping
Overhead semantic view of an urban region
Experiments generating a ~15 km map
One of the first large-scale semantic maps
Presented as oral in IEEE IROS 2012
Chap 7, Sec 7.1
118. Thesis - Conclusions
Chapter 4 – Dense semantic reconstruction
Dense semantic reconstruction from kms of
stereo images.
Online volumetric reconstruction to
accommodate arbitrarily long road scenes.
Presented as oral in IEEE ICRA 2013
Chapter 5 – Mesh based inference for scene labelling
Improved labelling accuracy (pairwise connections
respect mesh connectivity) and consistency.
Depth sensitive classifier fusion.
25x faster in inference time
Presented as poster in CVPR 2013
119. Conclusions
Chapter 6 – Hierarchical CRF on an Octree Graph
Unified framework to determine 3D volume occupancy along with object class labels in the scene.
Efficient representation
Robust PN potential over octree volumes
Datasets (available publicly)
Yotta labelled dataset: multiview street images (urban, rural,
highway) containing 8000+ images, with object class labellings
Kitti Labelled dataset: Object class labelling for publicly available
KITTI dataset
120. Way forward
Transfer learning – many datasets with many labellings exist; aim to learn from multiple sources and apply the result to new test cases.
Lifelong learning – an agent needs to identify objects irrespective of changes in the environment
Exploit high-level attributes
Investigate an end-to-end real-time pipeline for dense recognition and reconstruction
Exploit scene dynamics – DVS (dynamic vision sensors) report only changed pixels through efficient sensing.
Chap 7, sec 7.2
121. Thank you
Acknowledgements
Supervisors: Philip Torr and David Duce
Thesis Examiners: Gabriel Brostow and Nigel Crook
Collaborators: Paul Sturgess, Lubor Ladicky, Ali Shahrokni, Eric
Greeveson, Julien Valentin, Ziming Zhang, Johnathan Warrell, Chris
Russell, Yalin Bastanlar, William Clocksin, Vibhav Vineet, Mike Sapi.
122. References
Lubor Ladicky et al. Associative hierarchical CRFs for object class image segmentation. ICCV 2009; PAMI 13
Pushmeet Kohli et al. Robust Higher Order Potentials for Enforcing Label Consistency. IJCV 2009
Paul Sturgess et al. Combining Appearance and Structure from Motion Features for Road Scene Understanding. BMVC 2009
Lubor Ladicky et al. Joint optimisation for object class segmentation and dense stereo reconstruction. BMVC 2010; IJCV 2012
Richard A. Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. IEEE ISMAR 2011.