Deep Depth Completion of a Single RGB-D Image
Homepage: http://deepcompletion.cs.princeton.edu/
Github:https://github.com/yindaz/DeepCompletionRelease
Paper:http://deepcompletion.cs.princeton.edu/paper.pdf
Abstract
Goal -- complete the depth channel of an RGB-D image
Problem -- commodity-grade depth cameras often fail to sense depth for shiny, bright, transparent, and distant surfaces
Method -- take an RGB image as input and predict dense surface normals and occlusion boundaries; these predictions are then combined with the raw depth observations provided by the RGB-D camera to solve for the depths of all pixels, including those missing in the original observation
Introduction
Goal: to complete the depth channel of an RGB-D image captured with a commodity camera, i.e., fill all the holes in the depth map
Previous depth inpainting methods fill the holes with hand-tuned approaches, e.g., extrapolating boundary surfaces or Markov image synthesis.
Deep networks have been used for depth estimation, but not yet for depth completion, because of the following difficulties:
- Training data
Large-scale training sets are not readily available for captured RGB-D images paired with "completed" depth images (e.g., where ground-truth depth is provided for holes)
As a result, depth estimation networks trained on raw captures can only reproduce the observed depth and cannot estimate the unobserved depth.
This paper introduces a new dataset of 105,432 RGB-D images, each paired with a completed depth image computed from large-scale surface reconstructions in 72 real environments.
- Depth representation
Directly regressing depth with an FCN does not work well, especially for regions as large as the missing area in Figure 1, because estimating accurate depth from a monocular color image is hard even for humans.
So this paper first uses a network to predict local differential properties of the depth: surface normals and occlusion boundaries.
- Deep network design
No previous work has trained an end-to-end network to complete depth from an RGB-D image.
One idea is to extend earlier color-to-depth networks, but they generally learn only to copy and interpolate the input depth.
It is also challenging for the network to learn how to adapt for misalignments of color and depth.
(What exactly does "misalignment" refer to here -- spatial misalignment? A pixel with color information does not necessarily have depth information?)
This paper takes only the color image as input and first predicts local surface normals and occlusion boundaries with supervision, since predicting local features from color is something deep networks handle well. The depth is then completed by solving a global optimization problem that combines these predictions with the input depth.
The coarse-scale structure of the scene is reconstructed through global optimization with regularization from the input depth.
Main Insight
- prediction of surface normals and occlusion boundaries only from color
- optimization of global surface structure from those predictions with soft constraints provided by observed depths
Benefits: smaller relative error, and the network is independent of the observed depth, so it does not need to be retrained for new depth sensors.
Related work
- Depth estimation
depth estimation from a monocular color image
- Classic methods:
Shape-from-shading
R. Zhang, P.-S. Tsai, J. E. Cryer, and M. Shah. Shape-from-shading: a survey. IEEE transactions on pattern analysis and machine intelligence, 21(8):690–706, 1999.
Shape-from-defocus
S. Suwajanakorn, C. Hernandez, and S. M. Seitz. Depth from focus with your mobile phone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3497–3506, 2015. 2
Others
based on hand-tuned models and/or assumptions about surface orientations
D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM transactions on graphics (TOG), 24(3):577– 584, 2005.
A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from single monocular images. In Advances in neural information processing systems, pages 1161–1168, 2006.
A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824– 840, 2009. 2
- Newer methods
based on DL
D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.
F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE transactions on pattern analysis and machine intelligence, 38(10):2024–2039, 2016.
A. Roy and S. Todorovic. Monocular depth estimation using neural regression forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5506–5514, 2016.
we focus on depth completion, where the explicit goal is to make novel predictions for pixels where the depth sensor has no return.
The methods above only aim to reproduce the raw depth acquired by commodity RGB-D cameras; this paper focuses on depth completion, which makes novel depth predictions for the pixels where the depth sensor has no return.
- Depth inpainting
filling holes in depth channels of RGB-D images
- Old methods
D. Herrera, J. Kannala, J. Heikkilä, et al. Depth map inpainting under a second-order smoothness prior. In Scandinavian Conference on Image Analysis, pages 555–566. Springer, 2013.
X. Gong, J. Liu, W. Zhou, and J. Liu. Guided depth enhancement via a fast marching method. Image and Vision Computing, 31(10):695–703, 2013.
J. Liu, X. Gong, and J. Liu. Guided inpainting and filtering for kinect depth maps. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2055–2058. IEEE, 2012.
M. Bertalmio, A. L. Bertozzi, and G. Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
J. Liu and X. Gong. Guided depth enhancement via anisotropic diffusion. In Pacific-Rim Conference on Multimedia, pages 408–417. Springer, 2013.
K. Matsuo and Y. Aoki. Depth image enhancement using local tangent plane approximations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3574–3583, 2015.
S. M. Muddala, M. Sjostrom, and R. Olsson. Depth-based inpainting for disocclusion filling. In 3DTV-Conference: The True Vision-Capture, Transmission and Display of 3D Video (3DTV-CON), 2014, pages 1–4. IEEE, 2014.
A. K. Thabet, J. Lahoud, D. Asmar, and B. Ghanem. 3d aware correction and completion of depth maps in piecewise planar scenes. In Asian Conference on Computer Vision, pages 226–241. Springer, 2014.
W. Chen, H. Yue, J. Wang, and X. Wu. An improved edge detection algorithm for depth map inpainting. Optics and Lasers in Engineering, 55:69–77, 2014.
H.-T. Zhang, J. Yu, and Z.-F. Wang. Probability contour guided depth map inpainting and superresolution using nonlocal total generalized variation. Multimedia Tools and Applications, pages 1–18, 2017.
Y. Zuo, Q. Wu, J. Zhang, and P. An. Explicit edge inconsistency evaluation model for color-guided depth map enhancement. IEEE Transactions on Circuits and Systems for Video Technology, 2016.
H. Xue, S. Zhang, and D. Cai. Depth image inpainting: Improving low rank matrix completion with low gradient regularization. IEEE Transactions on Image Processing, 26(9):4311–4320, 2017.
M. Kulkarni and A. N. Rajagopalan. Depth inpainting by tensor voting. JOSA A, 30(6):1155–1165, 2013.
M. Liu, X. He, and M. Salzmann. Building scene models by completing and hallucinating depth and semantics. In European Conference on Computer Vision, pages 258–274. Springer, 2016
J. T. Barron and J. Malik. Intrinsic scene properties from a single rgb-d image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 17–24, 2013.
M. Ciotta and D. Androutsos. Depth guided image completion for structure and texture synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 1199–1203.
D. Doria and R. J. Radke. Filling large holes in lidar data by inpainting depth gradients. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 65–72.
J. Gautier, O. Le Meur, and C. Guillemot. Depth-based image completion for view synthesis. In 3DTV Conference: The True Vision-capture, Transmission and Display of 3D Video (3DTV-CON), 2011, pages 1–4.
- DL methods:
Auto-encoder
A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixel-cnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016
GAN
D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
Prior work has not studied inpainting of depth images, which is a harder problem because depth images lack strong features and large-scale training data.
- Depth super-resolution
to improve the spatial resolution of depth images using high-resolution color
Markov random fields
[44] [12] [42] [51] [58]
O. Mac Aodha, N. D. Campbell, A. Nair, and G. J. Brostow. Patch based synthesis for single depth image super-resolution. In European Conference on Computer Vision, pages 71–84. Springer, 2012.
J. Diebel and S. Thrun. An application of markov random fields to range sensing. In Advances in neural information processing systems, pages 291–298, 2006
J. Lu, D. Min, R. S. Pahwa, and M. N. Do. A revisit to mrf-based depth map super-resolution and enhancement. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 985–988. IEEE, 2011
J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon. High quality depth map upsampling for 3d-tof cameras. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1623–1630. IEEE, 2011
E. Shabaninia, A. R. Naghsh-Nilchi, and S. Kasaei. High-order markov random field for single depth image super-resolution. IET Computer Vision, 2017.
Shape-from-shading
[23] [71]
Y. Han, J.-Y. Lee, and I. So Kweon. High quality shape from a single rgb-d image under uncalibrated natural illumination. In Proceedings of the IEEE International Conference on Computer Vision, pages 1617–1624, 2013.
L.-F. Yu, S.-K. Yeung, Y.-W. Tai, and S. Lin. Shading-based shape refinement of rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1415–1422, 2013
Segmentation
[41]
J. Lu and D. Forsyth. Sparse depth super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2245–2253, 2015
Dictionary methods
[18] [30] [45] [63]
W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Computer graphics and Applications, 22(2):56–65, 2002
M. Kiechle, S. Hawe, and M. Kleinsteuber. A joint intensity and depth co-sparse analysis model for depth map super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 1545–1552, 2013.
M. Mahmoudi and G. Sapiro. Sparse representations for range data restoration. IEEE Transactions on Image Processing, 21(5):2909–2915, 2012
I. Tosic and S. Drewes. Learning joint intensity-depth sparse representations. IEEE Transactions on Image Processing, 23(5):2122–2132, 2014
Although some of these methods could be used for depth completion, the focus of the two problems is different:
In super-resolution, the low-resolution measurements are assumed to be complete and regularly sampled. In contrast, our focus is on filling holes, which can be quite large and complex and thus require synthesis of large-scale content.
- Depth reconstruction from sparse samples
Other work has studied depth reconstruction from a color image augmented with a sparse set of depth measurements.
S. Hawe, M. Kleinsteuber, and K. Diepold. Dense disparity maps from sparse disparity measurements. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2126–2133
L.-K. Liu, S. H. Chan, and T. Q. Nguyen. Depth reconstruction from sparse samples: Representation, algorithm, and sampling. IEEE Transactions on Image Processing, 24(6):1983–1996, 2015
F. Ma and S. Karaman. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. arXiv preprint arXiv:1709.07492, 2017
However, the motivation of that work is to reduce sensing cost in certain settings (e.g., to save cost on a robot), not depth completion.
Method
Corresponding to the three difficulties mentioned in the Introduction, this work focuses on three questions:
- how can we get training data for depth completion
- what depth representation should we use
- how should cues from color and depth be combined
Dataset
to create a dataset of RGB-D images paired with completed depth images
- Straightforward approach:
to capture images with a low-cost RGB-D camera and align them to images captured simultaneously with a higher-cost depth sensor
However, this approach is expensive and time-consuming; public datasets of this type contain only a small number of indoor scenes.
- This paper's approach
utilize existing surface meshes reconstructed from multi-view RGB-D scans of large environments
for example: Matterport3D [6], ScanNet [10], SceneNN [28], and SUN3D [22, 67]
a) For each scene, we extract a triangle mesh M with ~1-6 million triangles per room from a global surface reconstruction using screened Poisson surface reconstruction.
b) for a sampling of RGB-D images in the scene, we render the reconstructed mesh M from the camera pose of the image viewpoint to acquire a completed depth image D*
This yields a dataset of paired RGB-D and D* images (a minimal rendering sketch follows the reference below).
Question: combining the multi-view RGB-D images requires registration between images, right? Or is the reconstructed mesh already provided by the original dataset? (The global surface reconstruction is indeed part of the existing dataset.)
See:
A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
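A minimal sketch of the render-to-complete step, assuming a trimesh/pyrender toolchain; the mesh file name, intrinsics, and camera pose below are placeholders, and the actual dataset pipeline may differ:

```python
# Sketch: render a completed depth image D* from a reconstructed scene mesh
# at a known camera pose. Assumes trimesh and pyrender are installed;
# "mesh.ply", the intrinsics, and the pose are hypothetical placeholders.
import numpy as np
import trimesh
import pyrender

mesh = trimesh.load("mesh.ply", force="mesh")        # reconstructed scene mesh M
scene = pyrender.Scene()
scene.add(pyrender.Mesh.from_trimesh(mesh))

# Intrinsics of the RGB-D frame and its 4x4 camera-to-world pose
# (pyrender uses the OpenGL convention: camera looks down -z).
camera = pyrender.IntrinsicsCamera(fx=577.0, fy=577.0, cx=320.0, cy=240.0)
camera_pose = np.eye(4)                              # placeholder pose
scene.add(camera, pose=camera_pose)

renderer = pyrender.OffscreenRenderer(viewport_width=640, viewport_height=480)
_, completed_depth = renderer.render(scene)          # D*: depth in meters, 0 where no surface is hit
renderer.delete()
```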
- Advantages of the completed depth images obtained this way (these properties help train the network):
a)
have fewer holes. On average, 64.6% of the pixels missing from the raw depth images are filled in by our reconstruction process
b)
the completed depth images generally replicate the resolution of the originals for close-up surfaces, but provide far better resolution for distant surfaces
Because the surface reconstruction is built at a 3D grid size comparable to the depth camera's resolution, the completed depth images usually lose no resolution. When projected onto the view plane, however, the same 3D resolution provides an effectively higher pixel resolution for surfaces far from the camera. The completed depth images can therefore leverage sub-pixel anti-aliasing when rendering the high-resolution mesh, obtaining finer resolution than the originals (note the details in the furniture in Figure 3).
c)
completed depth images generally have far less noise than the originals
Because the surface reconstruction algorithm combines noisy depth samples from many camera views by filtering and averaging, it essentially denoises the surfaces. This is especially important for distant observations (e.g., > 4 m), where the raw depth measurements are quantized and noisy.
The dataset contains 117,516 RGB-D images with rendered completions:
Training set: 105,432; Test set: 12,084
Depth Representation
what geometric representation is best for deep depth completion
- Straightforward approach:
to design a network that regresses completed depth from raw depth and color
However, it is difficult to predict absolute depth from monocular images, as it may require knowledge of object sizes, scene categories, etc.
This paper instead predicts per-pixel local properties: surface normals and occlusion boundaries.
- Indirect representation of depth
a) relative depth [7]
b) depth derivatives [5] -- evaluated in experiments, but not the best
c) depth derivatives & depth [35]
Why use surface normals and occlusion boundaries?
normals are differential surface properties, so they depend only on local neighborhoods of pixels;
they also relate strongly to local lighting variations directly observable in a color image
So prior work on dense prediction of surface normals from color images has worked well [1, 15, 34, 66, 75].
occlusion boundaries produce local patterns in pixels (e.g., edges), and so they usually can be robustly detected with a deep network [14, 75]
Then, how can depth be computed from surface normals and occlusion boundaries?
- In theory it is impossible: the depth relationships between different parts of the image cannot be inferred from normals alone (see Figure 4a).
- In real scenes, however, an image region is unlikely to be both enclosed by occlusion boundaries and contain no raw depth observations at all (Figure 4b).
So even large holes can be completed by solving for depths that are coherent with the predicted surface normals, where the normal constraints are weighted by the predicted occlusion boundaries and regularized by the observed raw depth.
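A short sketch of why normals constrain depth (my own notation, not the paper's exact formulation): with known camera intrinsics, each pixel's depth determines a 3D point, and requiring the vector between neighboring points to be orthogonal to the predicted normal ties neighboring depths together.

```latex
% Back-projection of pixel p = (u_p, v_p) with unknown depth D(p) and intrinsics K:
%   P(p) = D(p)\, K^{-1} (u_p, v_p, 1)^T
% Tangent between neighboring pixels p, q and the normal constraint:
%   v(p, q) = P(q) - P(p), \qquad \langle v(p, q),\, N(p) \rangle \approx 0
% Each such equation couples D(p) and D(q); chaining them across the image,
% anchored by the observed raw depths, propagates depth into the holes.
```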
Network Architecture and training
what is the best way to train a deep network to predict surface normals and occlusion boundaries for depth completion?
- Network architecture:
[75]
Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
The model is a fully convolutional neural network built on a VGG-16 backbone with a symmetric encoder and decoder.
The network is trained with surface normals and occlusion boundaries computed from the reconstructed meshes.
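A minimal sketch of a VGG-style symmetric encoder-decoder with two output heads (normals and boundary probability). This is not the authors' exact architecture from [75], which uses the full VGG-16 backbone; the channel sizes, depth, and shared trunk here are illustrative assumptions:

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # two 3x3 convs, VGG-style
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class NormalBoundaryNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = block(3, 64), block(64, 128), block(128, 256)
        self.dec3, self.dec2, self.dec1 = block(256, 128), block(128, 64), block(64, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.normal_head = nn.Conv2d(64, 3, 1)    # surface normal (3 channels)
        self.boundary_head = nn.Conv2d(64, 1, 1)  # occlusion boundary probability

    def forward(self, rgb):
        x = self.pool(self.enc1(rgb))             # 1/2 resolution
        x = self.pool(self.enc2(x))               # 1/4 resolution
        x = self.enc3(x)
        x = self.dec3(self.up(x))                 # back to 1/2
        x = self.dec2(self.up(x))                 # back to full resolution
        x = self.dec1(x)
        normals = nn.functional.normalize(self.normal_head(x), dim=1)  # unit-length normals
        boundary = torch.sigmoid(self.boundary_head(x))                # in [0, 1]
        return normals, boundary

# usage: normals, boundary = NormalBoundaryNet()(torch.randn(1, 3, 256, 320))
```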
- How to train it
The discussion below uses normal estimation as the example; occlusion boundary detection is handled analogously.
a) What loss should be used to train the network
Two choices: train only on holes vs. on all pixels:
- loss over all pixels (observed and hole pixels)
- loss for only unobserved pixels (holes) by masking out the gradients on other pixels during the back-propagation
Also: trained with rendered normals vs. raw normals?
See the paper's supplementary material for details.
Comparison results:
the models trained with all pixels perform better than the ones using only observed or only unobserved pixels, and ones trained with rendered normals perform better than with raw normals
b) What image channels should be input to the network
Experiments show that when RGB-D is used as input to predict normals, the predictions are poor for pixels in the holes (although they work for observed pixels). The conjecture is that such a network mostly predicts normals from the depth channel of the RGB-D input, so it fails where depth is missing.
The results in Figure 5 motivated the authors to predict surface normals from the color image only.
Separating "prediction without depth" from "optimization with depth" is compelling for two reasons:
- The network does not need to be retrained for different depth sensors. (==I don't see why changing the sensor would otherwise require retraining?==)
- The optimization generalizes to ==various kinds of depth observations as regularization==, including sparse depth samples [43].
Optimizations
The network above outputs a surface normal image N and an occlusion boundary image B (==what do they look like?==).
Depth completion is then posed as solving a system of equations.
The objective function is a weighted sum of squared errors with four terms:
- $E_D$: the distance between the estimated depth and the raw observed depth
- $E_N$: the consistency between the estimated depth and the predicted surface normals, measured by the dot product of the surface tangent and the normal
- $E_S$: encourages adjacent pixels to have similar depth values
- B: $B \in [0, 1]$ down-weights the normal terms based on the predicted probability that a pixel is on an occlusion boundary ($B(p)$)
==Question: at an occlusion boundary the tangent is in fact not perpendicular to the normal, so its weight is reduced? In the extreme case, the $E_N$ terms at occlusion boundaries are ignored entirely?==
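Putting the terms together, a plausible written-out form of the objective, reconstructed from the descriptions above (the exact form of the boundary weight $b(p)$ is an assumption, not taken from the paper):

```latex
E \;=\; \lambda_D E_D \;+\; \lambda_N E_N \;+\; \lambda_S E_S

E_D = \sum_{p \in T_{\mathrm{obs}}} \bigl( D(p) - D_0(p) \bigr)^2
      \quad\text{(stay close to the observed raw depth)}

E_N = \sum_{(p,q) \in \mathcal{N}} b(p)\,\bigl\langle v(p,q),\, N(p) \bigr\rangle^2
      \quad\text{(surface tangents orthogonal to predicted normals)}

E_S = \sum_{(p,q) \in \mathcal{N}} \bigl( D(p) - D(q) \bigr)^2
      \quad\text{(smoothness between neighboring pixels)}

% b(p) \in [0,1] decreases as the predicted boundary probability B(p) grows,
% e.g. b(p) = 1 - B(p) (assumed form).
```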
This objective function is non-linear, due to the normalization of the tangent vector $v(p, q)$ required for the dot product with the surface normal in $E_N$.
==Question: isn't a squared error already non-linear anyway?== (Note: a squared error of terms that are linear in the unknown depths still gives a linear least-squares system; it is the normalization of $v(p, q)$, which divides by the unknown depths, that makes the problem truly non-linear.)
The paper approximates this error term with ==a linear formulation by foregoing the vector normalization==, as suggested in [50].
目標(biāo)函數(shù)的矩陣形式 是 稀疏 且 對稱正定的浊服,所以可使用==a sparse Cholesky factorization [11] 稀疏 Cholesky 分解== 來求解 近似的目標(biāo)含函數(shù)
Surface normals and occlusion boundaries (and optionally depth derivatives) capture only ==local properties of the surface geometry==, which makes them relatively easy to estimate. Only through global optimization can they be combined to complete the depths of all pixels in a consistent solution.
Experimental Results
Unless otherwise specified, networks were pretrained on the SUNCG dataset [60, 75] and fine-tuned on the training split of our new dataset, using only color as input and a loss computed over all rendered pixels.
Optimizations were performed with $\lambda_D = 10^3$, $\lambda_N = 1$, and $\lambda_S = 10^{-3}$. Evaluations were performed on the test split of the new dataset.
Time cost:
| Task | Time | Hardware |
|---|---|---|
| normals & occlusion boundaries | ~0.3 s | NVIDIA TITAN X |
| solving equations | ~1.5 s | Intel Xeon 2.4 GHz CPU |
Ablation Studies
Evaluation metrics
- median error relative to the rendered depth (Rel)
- the Root Mean Squared Error in meters (RMSE)
- percentages of pixels with predicted depths falling within a threshold, $\delta = |d_{pred} - d_{true}| / d_{true}$, for thresholds $1.05, 1.10, 1.25, 1.25^2, 1.25^3$
(the metrics above measure depth error; the following measure surface normals)
- mean and median errors in degrees
- the percentages of pixels with predicted normal error within thresholds of 11.25, 22.5, and 30 degrees
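A small sketch of the depth metrics as listed above; the threshold metric follows the formula as written here (many depth papers instead use the max-ratio variant $\max(d_{pred}/d_{true}, d_{true}/d_{pred}) < t$), and the function and variable names are mine:

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """pred, gt: depth maps in meters; valid: boolean mask of pixels to evaluate."""
    if valid is None:
        valid = gt > 0
    p, g = pred[valid], gt[valid]
    rel_err = np.abs(p - g) / g                       # per-pixel relative error
    metrics = {
        "Rel": float(np.median(rel_err)),             # median relative error
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
    }
    for t in [1.05, 1.10, 1.25, 1.25 ** 2, 1.25 ** 3]:
        # fraction of pixels whose predicted depth falls within the threshold
        metrics[f"delta<{t:.4g}"] = float(np.mean(rel_err < t))
    return metrics
```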
The ablations examine how different test inputs, training data, loss functions, depth representations, and optimization methods affect the depth prediction results.
1. What data should be input to the network
Table 1 shows results with different inputs (in the table, an upward arrow means higher is better; a downward arrow means lower is better).
For example, the median normal error is 17.28 vs. 23.59, and the depth Rel is 0.089 vs. 0.09.
The ==supplementary material== also shows that this advantage persists under different loss settings (observed only vs. unobserved only).
作者認(rèn)為當(dāng)為observed depth時,網(wǎng)絡(luò)會學(xué)習(xí)進(jìn)行插值而不是在holes合成新的depth棉磨。
++This experimental result is what motivated splitting the whole method into a two-stage system!++
2. What depth representation is best
Train networks separately to predict depths (D), surface normals (N), and depth derivatives (DD), then use different combinations to complete the depth by optimizing Equation 1.
Table 2 -- note that the D here means depth predicted from depth.
Taking Rel as an example: N 0.089 < N+DD 0.092 < DD 0.100 < D 0.167.
作者認(rèn)為由于表面法線只代表了orientation of surfaces ,比較好預(yù)測乘瓤,詳見[31]环形;而==且他不隨深度的變化而變化,在不同的視圖里更一致==
3. Does prediction of occlusion boundaries help
Whether down-weighting the effect of surface normals near predicted occlusion boundaries helps the optimizer solve for better depths
In Table 2, "Yes" means the boundary weighting B is used and "No" means no down-weighting; comparing 0.089 vs. 0.110, an improvement of about 19%.
Are the surface normals ==in occlusion boundary regions noisy and inaccurate?== See Figure 6.
In Figure 6, the 2nd column shows the normals and occlusion boundaries output by the network; in the 2nd row, columns 3-4 compare results with and without the boundary weight. In the 1st row, columns 3-4 show surface normals computed from the output depth maps. The occlusion boundaries ==provide depth-discontinuity information and help keep boundaries sharp==, as seen in the normal maps computed from the depth.
4. How much observed depth is necessary
to test how much the depth completion method depends on the quantity of input depth
The input depth images are degraded by ==randomly masking different numbers of pixels== before they are given to the optimizer, which solves for completed depths from the predicted normals and boundaries.
Figure 7
The horizontal axis is the number of pixels in the image that have depth (i.e., not masked). The left plot shows the predicted depth accuracy for observed pixels, and the right plot for unobserved pixels.
As expected, the accuracy for unobserved pixels is lower than for observed pixels, but only a small fraction of input depth is needed (==2000 depth samples, about 2.5% of all pixels==) to get good results. This indirectly suggests that even other depth sensor designs with sparse measurements could obtain reasonably good predictions, ==without retraining the network (the network input is only color)==. But the ground-truth normals used during training come from rendered depth images, right??? At test time, at least, the method indeed does not depend heavily on the number of raw depth samples.
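A sketch of the degradation used in this experiment, assuming NumPy depth maps: keep a fixed number of randomly chosen valid depth pixels and zero out the rest before handing the map to the optimizer. Function and variable names are mine.

```python
import numpy as np

def keep_n_random_depths(depth, n_keep, rng=np.random.default_rng(0)):
    """Return a copy of `depth` with only n_keep randomly chosen valid pixels retained."""
    sparse = np.zeros_like(depth)
    valid = np.flatnonzero(depth > 0)                        # flat indices of pixels that have depth
    keep = rng.choice(valid, size=min(n_keep, valid.size), replace=False)
    sparse.flat[keep] = depth.flat[keep]
    return sparse

# e.g., simulate a sparse sensor: ~2000 samples out of a 320x256 depth image
# sparse_depth = keep_n_random_depths(raw_depth, 2000)
```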
Comparison to Baseline Methods
compares to baseline depth inpainting and depth estimation methods.
1. Comparison to Inpainting Methods
non-data-driven alternatives for depth inpainting
The focus of this study is to establish how well-known methods perform, providing a baseline for how hard the problem is on this new dataset.
Table 3
The compared methods in the table are joint bilateral filtering, the fast bilateral solver, and global edge-aware energy optimization.
The proposed method has the smallest Rel of all methods.
Figure 8 shows a comparison with joint bilateral filtering: the proposed method produces depth maps with more accurate boundaries.
2. Comparison to Depth Estimation Methods
Comparison with color-to-depth estimation methods:
[33] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016
[5] A. Chakrabarti, J. Shao, and G. Shakhnarovich. Depth from a single image by harmonizing overcomplete local network predictions. In Advances in Neural Information Processing Systems, pages 2658–2666, 2016
Table 4
The proposed method is the best on every metric, with improvements of 23-40%. Y means evaluation on observed depth; N means unobserved.
This also suggests that predicting normals is a good approach for the depth estimation problem as well.
Note that not only is the predicted depth more accurate, but comparing the surface normals computed from it also shows that this method learns better scene geometry.
Conclusion
Two main research contributions:
First, it proposes to complete depth with a two stage process where surface normals and occlusion boundaries are predicted from color, and then completed depths are solved from those predictions.
Second, it learns to complete depth images by supervised training on data rendered from large-scale surface reconstructions.
It builds a bridge between the color image and the depth information -- and the bridge is the normals!
Clearly, this is a game of trading time for image quality:
1. It is slow: for a 320x256 image, it takes about 0.3 s on an NVIDIA TITAN X GPU plus about 1.5 s on an Intel Xeon 2.4 GHz CPU.
2. It relies on high-performance hardware, making the cost hard to control.