<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Deep Unsupervised Visual Odometry Via Bundle Adjusted Pose Graph Optimization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/29/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10443218</idno>
					<idno type="doi">10.1109/ICRA48891.2023.10160703</idno>
					<title level='j'>IEEE International Conference on Robotics and Automation (ICRA)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Guoyu Lu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Unsupervised visual odometry is an active topic that has attracted extensive attention, benefiting from its label-free practical value and robustness in real-world scenarios. However, the performance of camera pose estimation and tracking through deep neural networks is still not as strong as in most other tasks, such as detection, segmentation and depth estimation, due to the lack of drift correction in the estimated trajectory and of map optimization in the recovered 3D scenes. In this work, we introduce pose graph and bundle adjustment optimization into our network training process, which iteratively updates both the motion and depth estimates from the deep learning network, and enforces the refined outputs to further meet the unsupervised photometric and geometric constraints. The integration of pose graph and bundle adjustment is easy to implement and significantly enhances the training effectiveness. Experiments on the KITTI dataset demonstrate that the introduced method achieves a significant improvement in motion estimation compared with other recent unsupervised monocular visual odometry algorithms.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>local neighboring frames, a graph-based pose optimization module and a pose-depth bundle-adjusted optimization. To keep the optimization efficient and practical, we update only selected keypoints of the depth map during optimization, while the network still infers the entire dense depth. Optimizing over all image pixels would make model training difficult to converge because of the very large number of parameters (e.g., hundreds of thousands of images with over one hundred thousand pixels each). To the best of our knowledge, our proposed network is one of the first approaches to enable online optimization in the unsupervised deep VO structure. An overview of our training pipeline is depicted in Fig. <ref type="figure">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Unsupervised Monocular VO Pipeline</head><p>Given monocular video sequences, we use geometric and photometric consistencies between the target frame and the reference views to train depth estimation and motion estimation. As illustrated in Fig. <ref type="figure">2</ref>, the self-supervision simultaneously constrains the depth inference network and the pose estimation network. The pose estimation network is trained on multiple adjacent local frames composed of a target frame I_t and the neighboring reference frames I_{t+1}, from which a group of relative poses is inferred. Simultaneously, the depth estimation network generates the corresponding depth map for each input frame. The initial depth maps and pose vectors are then optimized by the pose graph and bundle adjustment, which are detailed in Sec. III-B and Sec. III-C.</p><p>1) Multi-view Re-projection Loss: Given each pair of images I_t and I_{t+1}, the estimated depth map D_t, and the estimated camera motion T_{t&#8594;t+1}, we compute the per-pixel correspondence by projecting each pixel of the target image onto the reference images. Assuming known camera intrinsics K, the correspondence of the pixel p_t in I_{t+1} can be represented by the following equation:</p><p>To warp the target frame I_t to the reference frame I_{t+1} and constrain a smooth reconstruction &#296;_{t+1}, we compute the per-pixel minimum photometric loss across multiple reference frames, rather than the average photometric error <ref type="bibr">[11]</ref>, as:</p><p>where N is the number of frames and &#961; is a weighted combination of an L1 loss term and the structural similarity index measure (SSIM) loss <ref type="bibr">[14]</ref>, used to achieve robust image reconstruction, denoted as:</p><p>2) Moving Object Masking: Because the loss constraint in Eq. 2 assumes a static scene and a moving camera, objects with large motion and occlusions create non-rigid transformations that degrade the learning of camera pose and depth estimation. We therefore incorporate the depth inconsistency mask <ref type="bibr">[4]</ref> to exclude moving objects and regions. The depth inconsistency map at each pixel p is computed as:</p><p>where D^t_{t+1} is the synthesized depth at frame t+1 generated from I_t based on the estimated camera motion T_{t&#8594;t+1}, and D&#8242;_{t+1} is the bilinear interpolation of the estimated depth at frame t+1. The moving mask is then computed from the depth inconsistency map D_diff as:</p><p>where M_moving ranges from 0 to 1 and assigns small weights to regions containing moving and occluded objects. Because some scenes contain non-moving frames (e.g., when the camera stops), which may affect the training of camera motion estimation, we apply auto-masking so that the photometric loss is computed between neighboring moving frames only, filtering out those points whose relative motion is the same:</p><p>where M_auto is a binary mask and I&#8242;_t is the frame warped from I_{t+1} based on the estimated depth map D and the relative camera motion T.</p></div>
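<div xmlns="http://www.tei-c.org/ns/1.0"><p>The re-projection and masking steps described above can be illustrated with a minimal sketch. It assumes the standard pinhole re-projection p_{t+1} ~ K T_{t&#8594;t+1} D_t(p_t) K^{-1} p_t and a normalized depth difference in the style of <ref type="bibr">[4]</ref>; these forms, the function names (reproject, min_photometric_error, moving_mask) and all numeric values are illustrative assumptions, not the exact formulation of this paper.</p>
<p><code lang="python"><![CDATA[
```python
def reproject(pixel, depth, K, R, t):
    """Map a target pixel (u, v) with known depth into a reference view.

    K = (fx, fy, cx, cy) are assumed pinhole intrinsics; R (3x3 nested list)
    and t (length 3) are the assumed rotation and translation of T_{t->t+1}.
    Returns the reference pixel and the depth of the point in that view.
    """
    fx, fy, cx, cy = K
    u, v = pixel
    # Back-project: X = depth * K^-1 [u, v, 1]^T
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigid transform into the reference camera frame
    Y = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective projection with K
    return (fx * Y[0] / Y[2] + cx, fy * Y[1] / Y[2] + cy), Y[2]


def min_photometric_error(errors_per_reference):
    """Per-pixel minimum over reference frames (instead of the average)."""
    return min(errors_per_reference)


def moving_mask(d_synth, d_interp):
    """Assumed soft mask: 1 - |a - b| / (a + b), in [0, 1] for positive depths."""
    d_diff = abs(d_synth - d_interp) / (d_synth + d_interp)
    return 1.0 - d_diff


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (100.0, 100.0, 50.0, 40.0)  # hypothetical intrinsics
# A point at depth 10 seen at the principal point, camera shifted by 1 along x.
p_ref, z_ref = reproject((50.0, 40.0), 10.0, K, IDENTITY, [1.0, 0.0, 0.0])
```
]]></code></p>
<p>In this toy setting a consistent depth pair gives a mask weight of 1, while a large disagreement between the synthesized and interpolated depths pushes the weight toward 0, which is the down-weighting behaviour described for moving and occluded regions.</p></div>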
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Pose Graph Optimization</head><p>Pose estimates from a deep neural network normally suffer from relatively large drift. We propose to incorporate a pose graph to optimize each camera pose node c = [c_1, c_2, ..., c_n] computed from the estimated relative rigid camera transformations T = [T_1, T_2, ..., T_n]. Let z_{ij} = &#947;(c_i, c_j) + n_{ij} be the edge between each pair of camera pose vertices c_i and c_j, where the noise is modeled as zero-mean white Gaussian, n_{ij} &#8764; N(0, W_{ij}). The graph optimization is then described as the problem of maximizing the posterior probability of all points on the camera's trajectory, given the estimated camera pose c and the observed edge constraints &#947; between the pose nodes:</p><p>Under the Gaussian assumption, taking the natural logarithm of both sides of Eq. 7 converts the maximum likelihood estimation into the following least-squares minimization:</p><p>where Eq. 8 is a non-linear least-squares problem and e_{ij} is the error between z_{ij} and the estimated value &#947;(c_i, c_j).</p><p>Eq. 8 is solved with iterative Gauss-Newton. Specifically, the update for the estimated camera pose c(n) at the current time n is calculated from the second-order Taylor-series approximation as:</p><p>where k is the corresponding residual vector given below, &#915;^n_{ij} is the partial derivative of the edge constraint &#947; with respect to the estimated camera pose c, and J is the Jacobian matrix composed of all the computed Jacobians &#915; as:</p><p>Eq. 9 can be further simplified by applying QR factorization to J. Hence, Eq. 9 can be rewritten as:</p><p>Hence, &#948;c can be computed as:</p><p>Based on the pose graph optimization, the optimized camera pose c_update is obtained by adding a small correction &#948;c to the initial estimate c from the pose estimation network: c_update = c + &#948;c. The relative pose estimate is refined correspondingly to T_update = T + &#948;T.</p></div>
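<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the Gauss-Newton update with QR factorization described above, the following sketch optimizes a one-dimensional pose graph. The scalar edge model &#947;(c_i, c_j) = c_j - c_i, the identity noise weighting, the prior edge that anchors the first pose, and the helper names qr_solve and gauss_newton_step are all simplifying assumptions made for this example; the paper operates on full camera poses.</p>
<p><code lang="python"><![CDATA[
```python
def qr_solve(J, k):
    """Least-squares solution of J x = k via modified Gram-Schmidt QR."""
    m, n = len(J), len(J[0])
    cols = [[J[r][c] for r in range(m)] for c in range(n)]
    Q = []                                # orthonormal columns of Q
    R = [[0.0] * n for _ in range(n)]     # upper-triangular factor
    for j in range(n):
        v = cols[j][:]
        for i in range(len(Q)):
            R[i][j] = sum(Q[i][r] * v[r] for r in range(m))
            v = [v[r] - R[i][j] * Q[i][r] for r in range(m)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q.append([x / R[j][j] for x in v])
    # Back-substitution on R x = Q^T k
    y = [sum(Q[i][r] * k[r] for r in range(m)) for i in range(n)]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - tail) / R[i][i]
    return x


def gauss_newton_step(c, edges):
    """One update c + delta_c for edges (i, j, z) with model c_j - c_i.

    An edge with i = None is a prior on c_j (used here to fix the gauge).
    """
    J, k = [], []
    for i, j, z in edges:
        row = [0.0] * len(c)
        row[j] = 1.0
        predicted = c[j]
        if i is not None:
            row[i] = -1.0
            predicted -= c[i]
        J.append(row)             # Jacobian of the edge w.r.t. all poses
        k.append(z - predicted)   # residual between measurement and model
    delta = qr_solve(J, k)
    return [ci + di for ci, di in zip(c, delta)]


# Drifting network estimate: each step is 1.1, but a long-range edge says the
# total displacement between pose 0 and pose 3 is 3.0.
c_init = [0.0, 1.1, 2.2, 3.3]
edges = [(None, 0, 0.0), (0, 1, 1.1), (1, 2, 1.1), (2, 3, 1.1), (0, 3, 3.0)]
c_update = gauss_newton_step(c_init, edges)
```
]]></code></p>
<p>Because this toy edge model is linear, a single step already reaches the least-squares optimum, and the long-range edge pulls the accumulated drift of the last pose from 3.3 toward 3.0. With real rigid-body poses the step would be iterated and the residuals weighted by the edge covariances.</p></div>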
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Bundle Adjustment Integration</head><p>The pose graph optimization ignores the 3D point information, whereas the self-supervision from the unsupervised VO network constrains both the initial scene depth D and the refined relative camera pose T_update. We therefore propose to further refine both, for more precise poses and depths, by solving for them in a geometric bundle adjustment (BA) optimization. This process is formulated as minimizing the total energy E of the re-projection errors e on the image pixels p across all the frames as: <formula notation="tex" n="13">E = \sum_{i}\sum_{j} \left\| I(p_{i,j}) - I_i\big(\pi(T_{update,i}, M(D_j))\big) \right\| \quad (13)</formula> where the global energy E to be minimized is composed of a series of errors between the pixel intensity of the projected 3D points and the corresponding image pixel. Because it is not practical to optimize the entire depth map produced by the depth estimation network, we select only 2000 keypoints (ORB features in our setting) from the input image.</p><p>To minimize the global energy E over all depths at the selected keypoints and the corresponding camera motion, we define the parameter vector P and the measurement vector X as:</p><p>The estimated measurement vector X can be expressed as: </p><p>Therefore, the bundle adjustment optimization is equivalent to minimizing the squared &#931;_X^{-1} norm as:</p><p>where &#931; denotes the covariance matrix. The above equation can be solved with the Levenberg-Marquardt (LM) nonlinear least-squares algorithm:</p><p>The update vector for the LM algorithm becomes: </p><p>And the Jacobian matrix J is:</p><p>Therefore, the covariance matrix becomes:</p></div>
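<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Levenberg-Marquardt iteration mentioned above can be sketched on a deliberately reduced problem: refining the depth of a single keypoint while the camera motion is held fixed. This is an assumption-heavy toy, not the joint pose-depth bundle adjustment of the paper. It uses a geometric rather than photometric residual, a pure sideways translation so that the predicted image shift is fx * baseline / depth, an identity covariance, and a hypothetical function name lm_refine_depth.</p>
<p><code lang="python"><![CDATA[
```python
def lm_refine_depth(d0, observations, fx, mu=1e-3, iters=50):
    """Refine one keypoint depth with damped Gauss-Newton (LM) steps.

    observations: list of (baseline, observed_shift) pairs, one per reference
    frame; the assumed measurement model is shift = fx * baseline / depth.
    """
    def residuals(d):
        return [obs - fx * b / d for b, obs in observations]

    d = d0
    cost = sum(r * r for r in residuals(d))
    for _ in range(iters):
        r = residuals(d)
        # Jacobian of the model with respect to depth, one entry per frame
        J = [-fx * b / (d * d) for b, _ in observations]
        g = sum(j * ri for j, ri in zip(J, r))   # J^T eps
        H = sum(j * j for j in J)                # J^T J (scalar here)
        candidate = d + g / (H + mu)             # damped normal equation
        accepted = False
        if candidate > 0.0:
            new_cost = sum(x * x for x in residuals(candidate))
            if new_cost < cost:
                d, cost, accepted = candidate, new_cost, True
        # Relax damping after a successful step, increase it otherwise
        mu = mu / 10.0 if accepted else mu * 10.0
    return d


FX = 100.0            # hypothetical focal length in pixels
TRUE_DEPTH = 10.0     # hypothetical ground-truth depth for the toy data
obs = [(b, FX * b / TRUE_DEPTH) for b in (0.5, 1.0, 1.5)]
refined = lm_refine_depth(6.0, obs, FX)
```
]]></code></p>
<p>The damping term mu interpolates between a Gauss-Newton step (small mu) and a short gradient step (large mu). In a full bundle adjustment the scalar quantities g and H become the vector J^T &#931;^{-1} &#949; and the matrix J^T &#931;^{-1} J over all keypoint depths and camera motions.</p></div>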
		</body>
		</text>
</TEI>
