This document summarizes a method for single-view 3D reconstruction using differentiable ray sampling. It discusses prior work using 3D or 2D supervision and their limitations. The proposed method uses a neural 3D representation that maps coordinates to occupancy. It introduces differentiable ray sampling to allow end-to-end training with only 2D images. Results on cars and chairs show the method achieves similar or better accuracy compared to prior work, with constant memory usage at high resolutions.
3. Single-view 3D reconstruction
● 3D supervision
○ A large number of 3D data samples are needed.
[Kato+ CVPR 2019]
[Diagram: input image → prediction model → output 3D geometry]
4. Single-view 3D reconstruction
● 2D supervision
○ End-to-end training with only 2D images.
○ A differentiable renderer is needed.
[Kato+ CVPR 2019]
[Diagram: input image → prediction model → 3D geometry → rendering → output image]
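A minimal sketch of this end-to-end, 2D-supervised loop. The "prediction model" and "differentiable renderer" below are hypothetical linear stand-ins purely for illustration (real systems use neural networks and a geometric renderer), and the L2 image loss is an assumed choice; the point is that gradients flow through the renderer back into the prediction model.

```python
import numpy as np

# Hypothetical toy stand-ins: a "prediction model" mapping an image to
# 3D geometry parameters, and a "differentiable renderer" mapping those
# parameters back to an image. Both are linear purely for illustration.
rng = np.random.default_rng(0)
IMG, P3D = 16, 8                               # flattened image / 3D param sizes
W_model = rng.standard_normal((IMG, P3D)) * 0.01
W_render = rng.standard_normal((P3D, IMG)) * 0.01

def loss_and_grads(img, target):
    geom = img @ W_model                       # image -> 3D geometry
    out = geom @ W_render                      # 3D geometry -> rendered image
    diff = out - target
    loss = 0.5 * np.sum(diff ** 2)             # L2 loss on 2D images only
    # Backprop through the renderer: the step that requires differentiability.
    g_render = np.outer(geom, diff)
    g_model = np.outer(img, diff @ W_render.T)
    return loss, g_model, g_render

img, target = rng.standard_normal(IMG), rng.standard_normal(IMG)
lr, losses = 0.005, []
for _ in range(300):
    loss, g_model, g_render = loss_and_grads(img, target)
    losses.append(loss)
    W_model -= lr * g_model                    # both networks are trained
    W_render -= lr * g_render                  # end-to-end from a 2D loss
print(losses[-1] < losses[0])                  # the 2D image loss decreased
```

No 3D supervision appears anywhere in the loop; only pairs of input and target images are used.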
5. Single-view 3D reconstruction
● 3D Geometry representation
1. [Kato+ CVPR 2017]
2. [Tulsiani+ CVPR 2018]
3. [Sitzmann+ arXiv 2019]

|                       | Mesh¹ | Voxel² | Neural 3D (SRN³) | Neural 3D (Ours) |
|-----------------------|-------|--------|------------------|------------------|
| initial shape         | ✕     | ◯      | ◯                | ◯                |
| memory vs. resolution | ◯     | ✕      | ◯                | ◯                |
| number of train views | ◯     | ◯      | (✕)              | ◯                |
| Accuracy (IoU)        | 0.71  | 0.73   | -                | ???              |
9. Ours
● DRC (Tulsiani+ CVPR 2017): voxel grid representation as a function
○ (x_i, y_i, z_i) → occupancy
○ 32^3 discrete inputs
○ Memory increases cubically with higher resolution
● Our idea: neural 3D representation
○ (x, y, z) → occupancy
○ Continuous input
○ Constant memory at high resolution
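The contrast above can be sketched in code. The MLP below is a hypothetical stand-in for the neural 3D representation (layer widths and the sigmoid output are assumptions, not the paper's actual architecture); the point is that memory is fixed by the network weights rather than by the query resolution.

```python
import numpy as np

# Hypothetical MLP: continuous (x, y, z) -> occupancy in [0, 1].
# Two hidden layers of width 64 are an illustrative choice.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 64)) * 0.1, np.zeros(64)
W3, b3 = rng.standard_normal((64, 1)) * 0.1, np.zeros(1)

def occupancy(xyz):
    """Map (N, 3) continuous coordinates to (N,) occupancy values."""
    h = np.maximum(xyz @ W1 + b1, 0.0)                      # ReLU
    h = np.maximum(h @ W2 + b2, 0.0)                        # ReLU
    return (1.0 / (1.0 + np.exp(-(h @ W3 + b3)))).ravel()   # sigmoid

# Unlike a 32^3 voxel grid, memory is fixed by the network weights:
# querying a 4^3 or a 256^3 grid of points uses the same parameters.
pts = np.stack(np.meshgrid(*[np.linspace(-1, 1, 4)] * 3), axis=-1).reshape(-1, 3)
occ = occupancy(pts)
print(occ.shape)  # (64,)
```

A voxel grid would need 8x the memory every time the resolution doubles; here only the number of queries changes.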
16. SRN (Sitzmann+ NIPS 2019)
[Diagram: input image → encoder → parameters → decoder: 3D network (x, y, z → SDF?) queried along each ray at marching distances d0, d1, d2, …, di → pixel generator → rendered image]
● The rendering part is also a network.
● → 50 images per object are required for training.
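The ray-marching distances d0, d1, d2, …, di in the figure can be illustrated with a heavily simplified sketch. Two loud assumptions: a fixed sphere-tracing step rule replaces SRN's learned LSTM stepper, and an analytic sphere SDF replaces the learned 3D network.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Signed distance to a sphere at the origin (stand-in for the
    learned 3D network, which SRN queries instead)."""
    return np.linalg.norm(p) - radius

def march_ray(origin, direction, n_steps=10):
    """Advance from d0 along the ray, stepping by the queried distance
    (classic sphere tracing; SRN *learns* the step length with an LSTM)."""
    d = 0.0                                   # d0
    for _ in range(n_steps):                  # produces d1, d2, ..., di
        p = origin + d * direction
        step = sphere_sdf(p)
        if step < 1e-4:                       # converged onto the surface
            break
        d += step
    return origin + d * direction             # final 3D point -> pixel generator

origin = np.array([0.0, 0.0, -2.0])
direction = np.array([0.0, 0.0, 1.0])         # unit ray through one pixel
hit = march_ray(origin, direction)
print(np.round(hit, 3))                       # ≈ [0, 0, -0.5]: front of the sphere
```

The final point of each ray is what the pixel generator turns into a color, which is why the whole rendering path is itself a network in SRN.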