1. Fisheye/Omnidirectional View in Autonomous Driving III, Yu Huang
2. Outline
• DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing
• The OmniScape Dataset (ICRA 2020)
• Universal Semantic Segmentation for Fisheye Urban Driving Images
• Vehicle Re-ID for Surround-view Camera System
• SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
• Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
3. DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing
• This paper proposes a network adaptation framework to achieve Panoramic Annular Semantic Segmentation (PASS), which allows conventional pinhole-view image datasets to be re-used, enabling modern segmentation networks to adapt comfortably to panoramic images.
• Specifically, the proposed SwaftNet is adapted to enhance sensitivity to details by implementing attention-based lateral connections between the detail-critical encoder layers and the context-critical decoder layers. The paper benchmarks the performance of efficient segmenters on panoramic segmentation with an extended PASS dataset, demonstrating that the proposed real-time SwaftNet outperforms state-of-the-art efficient networks.
• Furthermore, the authors assess real-world performance by deploying the Detail-Sensitive PASS (DS-PASS) system on a mobile robot and an instrumented vehicle, as well as the benefit of panoramic semantics for visual odometry, showing the robustness and the potential to support diverse navigational applications.
4. DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing
Panoramic annular semantic segmentation. Left: raw annular image. First row on the right: unfolded panorama. Second row: panoramic segmentation of the baseline method, where the classification heatmap for pedestrians is blurry. Third row: detail-sensitive panoramic segmentation of the proposed method, where the heatmap and semantic map are detail-preserving.
5. DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing
The proposed framework for panoramic annular semantic segmentation. Each feature model (corresponding to the encoder in conventional architectures) is responsible for predicting the semantically meaningful high-level feature map of one panorama segment while interacting with its neighbors through cross-segment padding (indicated by the dotted arrows; a toy sketch follows below). The fusion model incorporates the feature maps and completes the panoramic segmentation. The proposed architecture follows the single-scale model of SwiftNet, based on a U-shape structure like U-Net and LinkNet.
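To make the cross-segment padding idea concrete, here is a toy PyTorch sketch; it is an assumption about the general mechanism rather than the authors' code: each segment is padded with the adjacent columns of its ring neighbors instead of zeros, so convolutions see continuous features across segment borders of the 360° panorama.

```python
# Toy sketch of cross-segment padding (an assumed mechanism, not the DS-PASS code):
# pad each panorama segment with the edge columns of its ring neighbors so that
# features stay continuous across segment borders of the 360-degree image.
import torch

def cross_segment_pad(segments, pad=1):
    """segments: list of (B, C, H, W) feature tensors ordered around the ring."""
    n = len(segments)
    padded = []
    for i, seg in enumerate(segments):
        left = segments[(i - 1) % n][..., -pad:]   # right edge of the left neighbor
        right = segments[(i + 1) % n][..., :pad]   # left edge of the right neighbor
        padded.append(torch.cat([left, seg, right], dim=-1))  # pad along width
    return padded
```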
6. DS-PASS: Detail-Sensitive Panoramic Annular Semantic Segmentation through SwaftNet for Surrounding Sensing
The proposed architecture with attention-based lateral connections to blend semantically rich deep layers with spatially detailed shallow layers. The down-sampling path with the SPP module (encoder) corresponds to the feature model in the previous figure, while the up-sampling path (decoder) corresponds to the fusion model. A rough sketch of such a connection follows.
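As a rough illustration of what an attention-based lateral connection can look like (the gating design below is an assumption, not the exact DS-PASS layer), a gate computed from the semantically rich deep feature reweights the spatially detailed shallow feature before fusion in the decoder:

```python
# Rough sketch of an attention-gated lateral (skip) connection; the gating
# design is an assumption, not the exact DS-PASS layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLateral(nn.Module):
    def __init__(self, ch_deep, ch_shallow):
        super().__init__()
        # 1x1 conv + sigmoid turns the deep feature into per-pixel gate weights
        self.gate = nn.Sequential(nn.Conv2d(ch_deep, ch_shallow, 1), nn.Sigmoid())

    def forward(self, deep, shallow):
        up = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                           align_corners=False)   # match the shallow resolution
        return shallow * self.gate(up)            # attended skip feature for the decoder
```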
8. The OmniScape Dataset
• Despite the utility and benefits of omnidirectional images in robotics and automotive applications, there are no datasets of omnidirectional images available with semantic segmentation, depth maps, and dynamic properties.
• This is due to the time cost and human effort required to annotate ground-truth images.
• This paper presents a framework for generating omnidirectional images using images acquired from a virtual environment.
• For this purpose, it demonstrates the relevance of the proposed framework on two well-known simulators: the CARLA simulator, an open-source simulator for autonomous driving research, and Grand Theft Auto V (GTA V), a very high-quality video game.
• It explains in detail the generated OmniScape dataset, which includes stereo fisheye and catadioptric images acquired from the two front sides of a motorcycle, together with semantic segmentation, depth maps, intrinsic parameters of the cameras, and the dynamic parameters of the motorcycle.
• It is worth noting that the case of two-wheeled vehicles is more challenging than cars due to the specific dynamics of these vehicles.
13. Universal Semantic Segmentation for Fisheye Urban Driving Images
• When performing semantic image segmentation, a wider field of view (FoV), as offered by fisheye cameras, helps to obtain more information about the surrounding environment, making automated driving safer and more reliable.
• However, large public fisheye datasets are not available, and the images captured by a fisheye camera with a large FoV come with large distortion, so commonly used semantic segmentation models cannot be utilized directly.
• In this paper, a seven-DoF augmentation method is proposed to transform rectilinear images into fisheye images in a more comprehensive way.
• In training, rectilinear images are transformed into fisheye images in seven DoF, which simulates fisheye images from different positions, orientations, and focal lengths.
• The results show that training with the seven-DoF augmentation improves the model's accuracy and robustness against differently distorted fisheye data.
• This seven-DoF augmentation provides a universal semantic segmentation solution for fisheye cameras in different autonomous driving applications.
• The code and configurations are released at https://github.com/Yaozhuwa/FisheyeSeg.
14. Universal Semantic Segmentation for Fisheye Urban Driving Images
Projection model of the fisheye camera. PW is a point on a rectilinear image placed on the x-y plane of the world coordinate system. θ is the angle of incidence of the point relative to the fisheye camera. P is the imaging point of PW on the fisheye image, with |OP| = fθ. The relative rotation and translation between the world coordinate system and the camera coordinate system contribute six degrees of freedom; the focal length f is the seventh. A minimal sketch of this projection follows.
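Here is a minimal sketch of this projection under the equidistant model |OP| = fθ (assuming an ideal principal point (cx, cy); the function name is illustrative). The 6-DoF rigid transform (R, t) plus the focal length f give the seven degrees of freedom of the augmentation.

```python
# Minimal sketch of the equidistant fisheye projection |OP| = f * theta.
# Assumptions: ideal principal point (cx, cy), no tangential distortion.
import numpy as np

def project_fisheye(P_w, R, t, f, cx, cy):
    """Project a 3D world point P_w onto the fisheye image."""
    P_c = R @ P_w + t                                     # world -> camera (6 DoF)
    theta = np.arctan2(np.hypot(P_c[0], P_c[1]), P_c[2])  # angle of incidence
    r = f * theta                                         # radial image distance
    phi = np.arctan2(P_c[1], P_c[0])                      # azimuth around the optical axis
    return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])
```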
15. Universal Semantic Segmentation for Fisheye Urban Driving Images
The six-DoF augmentation. Except for the first row, every image is transformed using a virtual fisheye camera with a focal length of 300 pixels. The letter in brackets indicates which axis the camera is panning along or rotating around.
16. Universal Semantic Segmentation for Fisheye Urban Driving Images
Synthetic fisheye images with different focal lengths f.
17. Universal Semantic Segmentation for Fisheye Urban Driving Images
Seven-DoF augmentation: ablation configurations.
1. Base Aug: random cropping + random flip + color jitter + z-aug with fixed focal length
2. RandF Aug: Base Aug + random focal length
3. RandR Aug: Base Aug + random rotation
4. RandT Aug: Base Aug + random translation
5. RandFR Aug: Base Aug + random focal length + random rotation
6. RandFT Aug: Base Aug + random focal length + random translation
7. Six-DoF Aug: Base Aug + random rotation + random translation
8. Seven-DoF Aug: Base Aug + random focal length + random rotation + random translation
A hypothetical sampler for the full seven-DoF configuration is sketched below.
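A hypothetical sampler for the full configuration (all ranges are illustrative assumptions, not the paper's values) draws the seven parameters per training image; they can then be fed to a virtual fisheye projection such as the one sketched earlier.

```python
# Hypothetical seven-DoF parameter sampler; all ranges are illustrative
# assumptions, not the values used in the paper.
import numpy as np

def sample_seven_dof(rng: np.random.Generator):
    f = rng.uniform(200.0, 400.0)             # random focal length, in pixels
    rot = rng.uniform(-0.2, 0.2, size=3)      # random rotation about x/y/z, in rad
    trans = rng.uniform(-0.5, 0.5, size=3)    # random translation along x/y/z, in m
    return f, rot, trans
```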
18. Vehicle Re-ID for Surround-view Camera System
• Vehicle re-identification (Re-ID) plays a critical role in the perception system of autonomous driving and has attracted increasing attention in recent years.
• However, no complete solution exists for the surround-view system mounted on a vehicle.
• There are two main challenges in this scenario: i) in a single-camera view, it is difficult to recognize the same vehicle across past image frames due to fisheye distortion, occlusion, truncation, etc.; ii) in a multi-camera view, the appearance of the same vehicle varies greatly across different camera viewpoints.
• Thus, an integral vehicle Re-ID solution is proposed to address these problems.
• Specifically, a quality evaluation mechanism is introduced to balance the effects of tracking-box drift and target consistency.
• Besides, a Re-ID network based on an attention mechanism is combined with a spatial constraint strategy to further boost the performance across different cameras.
• The code and annotated fisheye dataset will be released for the benefit of the community.
19. Vehicle Re-ID for Surround-view Camera System
The 360° surround-view camera system. Each arrow points to an image captured by the corresponding camera.
20. Vehicle Re-ID for Surround-view Camera System
Vehicles in a single fisheye camera view. (a) Features of the same vehicle change dramatically in consecutive frames, and vehicles tend to obscure each other. (b) Matching errors are caused by tracking results. (c) The vehicle center indicated by the orange box is stable, while the IoU of consecutive frames indicated by the yellow box decreases with movement; a toy quality check built on this observation is sketched below.
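A toy version of such a check (the threshold and function name are assumptions) gates a track on normalized center drift rather than on IoU alone, since the center stays stable while the IoU decays with motion:

```python
# Toy quality check motivated by the figure: the object center is stable in
# consecutive frames even when box IoU drops, so gate tracks on center drift.
# The threshold is an illustrative assumption.
import numpy as np

def track_is_reliable(prev_box, cur_box, max_drift=0.3):
    """Boxes given as (x, y, w, h) in pixels."""
    pc = np.array([prev_box[0] + prev_box[2] / 2.0, prev_box[1] + prev_box[3] / 2.0])
    cc = np.array([cur_box[0] + cur_box[2] / 2.0, cur_box[1] + cur_box[3] / 2.0])
    scale = max(prev_box[2], prev_box[3])     # normalize drift by the box size
    return float(np.linalg.norm(cc - pc)) / scale < max_drift
```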
21. Vehicle Re-ID for Surround-view Camera System
The overall framework of vehicle Re-ID in a single camera. Each object is assigned its own tracker to realize Re-ID in a single channel. Tracking templates are initialized with object detection results. All tracking outputs are post-processed by the quality evaluation module to deal with distorted or occluded objects.
22. Vehicle Re-ID for Surround-view Camera System
The overall framework of vehicle Re-ID in multiple cameras. For a new target, the Re-ID model first extracts features, then distance metrics are computed between this feature and the features in the gallery. Besides, the spatial constraint strategy is adopted to improve the association; an illustrative matching step is sketched below.
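As an illustration of this matching step (the cosine metric, threshold, and the boolean plausibility mask are assumptions, not the paper's exact procedure), a new target's feature is compared against the gallery, with the spatial constraint masking out implausible cross-camera candidates:

```python
# Illustrative gallery matching with a spatial constraint; metric, threshold,
# and the plausibility mask are assumptions, not the paper's exact procedure.
import numpy as np

def match_to_gallery(feat, gallery, spatially_plausible, max_dist=0.4):
    """feat: (D,), gallery: (N, D), spatially_plausible: (N,) bool mask."""
    feat = feat / np.linalg.norm(feat)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    dist = 1.0 - g @ feat                      # cosine distance to every entry
    dist[~spatially_plausible] = np.inf        # spatial constraint strategy
    best = int(np.argmin(dist))
    return best if dist[best] < max_dist else -1   # -1: register a new identity
```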
23. Vehicle Re-ID for Surround-view Camera System
Samples captured by different cameras. (a) The appearance of the same vehicle captured by different cameras varies greatly; the same color represents the same object. (b) Objects with a similar appearance may appear in the same camera view, as shown by the two black vehicles in green boxes.
24. Vehicle Re-ID for Surround-view Camera System
Illustration of the multi-camera Re-ID network, a two-branch parallel structure. The top branch is employed to make the network pay more attention to object regions, and the other extracts global features.
25. Vehicle Re-ID for Surround-view Camera System
Projection uncertainty of key points. Ellipse 1 and ellipse 2 are the uncertainty ranges of the front and left (right) cameras, respectively.
27. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
• This paper introduces a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images.
• The contributions of this work are threefold:
• Firstly, it introduces a novel distance estimation network architecture using a self-attention based encoder coupled with robust semantic feature guidance to the decoder, which can be trained in a one-stage fashion.
• Secondly, it integrates a generalized robust loss function, which improves performance significantly while removing the need for hyperparameter tuning of the reprojection loss (a sketch of such a loss follows below).
• Finally, it reduces the artifacts caused by dynamic objects violating the static-world assumption by using a semantic masking strategy.
• As there is limited prior work on fisheye cameras, the method is also evaluated on KITTI using a pinhole model.
• It achieves state-of-the-art performance among self-supervised methods without requiring an external scale estimation.
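The slide does not spell the loss out, but a generalized robust loss in this spirit is Barron's general and adaptive robust function; whether SynDistNet uses exactly this form is an assumption. For α ∉ {0, 2} it reads ρ(x, α, c) = (|α−2|/α)·[((x/c)²/|α−2| + 1)^(α/2) − 1]:

```python
# Sketch of a generalized robust loss (Barron-style); whether SynDistNet uses
# exactly this variant is an assumption. Valid for alpha not in {0, 2}; those
# limits (Cauchy-like and L2) need special-casing.
import torch

def general_robust_loss(x, alpha=1.0, c=1.0):
    """rho(x): approaches L2 as alpha->2, Charbonnier at alpha=1, Cauchy as alpha->0."""
    a = torch.as_tensor(float(alpha))
    b = torch.abs(a - 2.0)
    return (b / a) * (((x / c) ** 2 / b + 1.0) ** (a / 2.0) - 1.0)
```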
28. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
Overview of the joint prediction of distance D̂_t and semantic segmentation M_t from a single input image I_t. Compared to previous approaches, the semantically guided distance estimation produces sharper depth edges and reasonable distance estimates for dynamic objects.
29. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
• The self-supervised depth and distance estimation is developed within a self-supervised monocular structure-from-motion (SfM) framework, which requires two networks aiming at learning:
• 1. a monocular depth/distance model g_D : I_t → D̂_t, predicting a scale-ambiguous depth or distance (the equivalent of depth for general image geometries) D̂_t = g_D(I_t)(ij) per pixel ij in the target image I_t;
• 2. an ego-motion predictor g_T : (I_t, I_t') → T_{t→t'}, predicting a set of 6 degrees of freedom that implement a rigid transformation T_{t→t'} ∈ SE(3) between the target image I_t and the set of reference images I_t'. Typically t' ∈ {t + 1, t − 1}, i.e. the frames I_{t−1} and I_{t+1} are used as reference images, although using a larger window is possible. A minimal sketch of the resulting view-synthesis objective follows.
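Below is a minimal sketch of the view-synthesis objective (a pinhole model with intrinsics K is assumed for clarity; the paper generalizes depth to fisheye distance, and all names are illustrative): g_D's depth and g_T's pose warp the reference image into the target frame, and the photometric error supervises both networks.

```python
# Minimal sketch of the self-supervised SfM objective: synthesize the target
# view from a reference view using predicted depth and pose, then penalize the
# photometric error. A pinhole model with intrinsics K is assumed for clarity.
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each pixel (u, v) to a 3D point X = D(u, v) * K^-1 [u, v, 1]^T."""
    b, _, h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).float().reshape(3, -1)
    return depth.reshape(b, 1, -1) * (K_inv @ pix)         # (b, 3, h*w)

def synthesize_target(ref_img, depth_t, T_t2r, K):
    """Warp the reference image I_t' into the target frame via D_t and T_{t->t'}."""
    b, _, h, w = ref_img.shape
    pts = backproject(depth_t, torch.linalg.inv(K))        # points in the target camera
    R, t = T_t2r[:, :3, :3], T_t2r[:, :3, 3:]              # rigid motion in SE(3)
    proj = K @ (R @ pts + t)                               # project into the reference view
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    uv = torch.stack([2 * uv[:, 0] / (w - 1) - 1,          # normalize for grid_sample
                      2 * uv[:, 1] / (h - 1) - 1], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(ref_img, uv, align_corners=True)

def photometric_loss(target_img, ref_img, depth_t, T_t2r, K):
    """L1 photometric error between I_t and the synthesized view."""
    return (target_img - synthesize_target(ref_img, depth_t, T_t2r, K)).abs().mean()
```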
30. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
Overview of the proposed framework for the joint prediction of distance and semantic segmentation. The upper part (blue blocks) describes the individual steps for the depth estimation, while the green blocks describe the individual steps needed for the prediction of the semantic segmentation.
31. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
Visualization of the proposed network architecture to semantically guide the depth estimation. It utilizes a self-attention-based encoder and a semantically guided decoder using pixel-adaptive convolutions.
32. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
Quantitative performance comparison of the network with other self-supervised monocular methods for depths up to 80 m on KITTI. "Original" uses raw depth maps for evaluation, and "Improved" uses annotated depth maps. At test time, all methods except FisheyeDistanceNet, PackNet-SfM, and this method scale the estimated depths using the median ground-truth LiDAR depth.
33. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving
Qualitative comparison on the fisheye WoodScape dataset between the baseline model without the contributions and the proposed SynDistNet. SynDistNet can recover the distance of dynamic objects (left images), which eventually solves the infinite-distance issue. In the 3rd and 4th columns, one can see that semantic guidance helps to recover thin structures and resolve the distance of homogeneous areas, outputting sharp distance maps on raw fisheye images.
34. Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
• A multi-modal 360° framework for 3D object detection and tracking for autonomous vehicles is presented.
• The process is divided into four main stages.
• First, images are fed into a CNN to obtain instance segmentation of the surrounding road participants.
• Second, LiDAR-to-image association is performed for the estimated mask proposals.
• Then, the isolated points of every object are processed by a PointNet ensemble to compute their corresponding 3D bounding boxes and poses.
• Finally, a tracking stage based on the Unscented Kalman Filter is used to track the agents over time.
• The solution, based on a sensor-fusion configuration, provides accurate and reliable road-environment detection.
• A wide variety of tests of the system, deployed in an autonomous vehicle, have successfully assessed the suitability of the proposed perception stack in a real autonomous driving application.
35. Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
• The following sensors are employed:
• Five CMOS cameras equipped with 85° HFOV lenses.
• A 32-layer LiDAR scanner featuring a minimum vertical resolution of 0.33° and a range of 200 m (Velodyne Ultra Puck).
• Accurate synchronization and calibration between sensors are of paramount importance.
• Hence, all sensors are synchronized with the clock provided by a GPS receiver, and the cameras are externally triggered at a 10 Hz rate.
• Regarding calibration, the cameras' intrinsic parameters are obtained through the checkerboard-based approach by Zhang (sketched below), and the extrinsic parameters representing the relative position between sensors are estimated through a monocular-ready variant of the velo2cam method.
• The result of this automatic procedure is further validated by visual inspection.
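For reference, checkerboard-based intrinsic calibration can be reproduced with OpenCV's implementation of Zhang's method; the board geometry and square size below are illustrative assumptions, not the authors' setup.

```python
# Sketch of checkerboard intrinsic calibration via OpenCV's implementation of
# Zhang's method; the board size and square size are illustrative assumptions.
import cv2
import numpy as np

def calibrate_intrinsics(images, board=(9, 6), square_m=0.025):
    # 3D corner coordinates of the board in its own plane (z = 0)
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square_m
    obj_pts, img_pts, size = [], [], None
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        size = gray.shape[::-1]
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    # Closed-form homography-based init + nonlinear refinement (Zhang's method)
    rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
    return K, dist, rms
```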
36. Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
• The proposed solution is based on three pillars.
• First, visual data is employed to perform detection and instance-level semantic segmentation.
• Then, the LiDAR points whose image projection falls within each obstacle's bounding polygon are employed to estimate its 3D pose (see the association sketch below).
• Finally, the tracking stage provides consistency, thus mitigating occasional misdetections and enabling trajectory prediction.
• The combination of these three stages allows accurate and robust identification of the dynamic agents surrounding the vehicle.
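A minimal sketch of this LiDAR-to-image association (assuming known LiDAR-to-camera extrinsics (R, t) and pinhole intrinsics K; names are illustrative): keep the points whose projection lands inside a detection's instance mask.

```python
# Minimal sketch of LiDAR-to-image association: project LiDAR points with the
# calibrated extrinsics/intrinsics and keep those falling inside an instance
# mask. Names and the pinhole assumption are illustrative.
import numpy as np

def points_in_mask(points_lidar, R, t, K, mask):
    """points_lidar: (N, 3); mask: (H, W) boolean instance mask."""
    P_c = points_lidar @ R.T + t               # LiDAR frame -> camera frame
    front = P_c[:, 2] > 0                      # discard points behind the camera
    uvw = P_c[front] @ K.T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    h, w = mask.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    hit = np.zeros(len(points_lidar), dtype=bool)
    idx = np.flatnonzero(front)[ok]            # indices of in-image points
    hit[idx[mask[uv[ok, 1], uv[ok, 0]]]] = True
    return points_lidar[hit]                   # candidate points for F-PointNet
```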
37. Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
System overview. Images from all the cameras are processed by individual instances of Mask R-CNN, which provide detections endowed with a semantic mask. LiDAR points in these regions are used as input to several F-PointNets responsible for estimating a 3D bounding box and its position with respect to the car. Then, 3D detections from each camera are fused using an NMS procedure. A subsequent tracking stage provides consistency across frames and avoids temporary misdetections.
38. Towards Autonomous Driving: a Multi-Modal 360° Perception Proposal
Qualitative results of the proposed system on some typical traffic scenarios. From top to bottom: 3D detections in the rear-left, front-left, front, front-right, and rear-right cameras, and a bird's-eye-view representation.