<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Fast and Robust Place Recognition Approach for Stereo Visual Odometry Using LiDAR Descriptors</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date when="2020-10-24">October 24, 2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10297589</idno>
					<idno type="doi">10.1109/IROS45743.2020.9341733</idno>
					<title level='j'>2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</title>
					<author>Jiawei Mo</author><author>Junaed Sattar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Place recognition is a core component of Simultaneous Localization and Mapping (SLAM) algorithms. Particularly in visual SLAM systems, previously-visited places are recognized by measuring the appearance similarity between images representing these locations. However, such approaches are sensitive to visual appearance change and also can be computationally expensive. In this paper, we propose an alternative approach adapting LiDAR descriptors for 3D points obtained from stereo-visual odometry for place recognition. 3D points are potentially more reliable than 2D visual cues (e.g., 2D features) against environmental changes (e.g., variable illumination) and this may benefit visual SLAM systems in long-term deployment scenarios. Stereo-visual odometry generates 3D points with an absolute scale, which enables us to use LiDAR descriptors for place recognition with high computational efficiency. Through extensive evaluations on standard benchmark datasets, we demonstrate the accuracy, efficiency, and robustness of using 3D points for place recognition over 2D methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>&#8226; Achieving lower computational cost over existing approaches. We evaluate the proposed method on the KITTI dataset <ref type="bibr">[4]</ref> and the Oxford RobotCar dataset <ref type="bibr">[5]</ref>. We demonstrate the robustness of our method against drastic visual appearance changes across seasons as recorded in the RobotCar dataset, and show that it achieves higher accuracy and computational efficiency over existing methods. Further performance improvement is achieved by augmenting the LiDAR descriptor with image intensity information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>In the field of vSLAM, ORB-SLAM2 [6] is a recent development that demonstrates high accuracy and computational efficiency. In ORB-SLAM2, loop closure is detected by Bag-of-Words (BoW) using ORB features <ref type="bibr">[7]</ref>. A vocabulary tree is used in BoW to speed up feature matching and subsequent place queries. However, if the features are highly repetitive (e.g., plants), BoW may fail; an example is given in Fig. <ref type="figure">6</ref>. Similarly, LSD-SLAM [8] adopts FAB-MAP [9] for place recognition. Other than BoW, Fisher vectors [10] and VLAD <ref type="bibr">[11]</ref> also focus on 2D features. On the other hand, global image descriptors are also used to decide the similarity between images for place recognition. GIST <ref type="bibr">[12]</ref> is one example which encodes spatial layout properties (spatial frequencies) of the scene. It exhibits high accuracy if the viewing angle does not significantly change.</p><p>Recently, researchers adopted deep learning to place recognition and achieved impressive performance (e.g., NetVLAD <ref type="bibr">[13]</ref> and <ref type="bibr">[14]</ref>). NetVLAD trained a convolutional neural network to extract learned features and proposed a generalized VLAD layer to describe the image automatically. Their accuracy is promising but their computational cost is usually high so that they are not widely used in real-time vSLAM systems.</p><p>Neither BoW nor GIST is robust against visual appearance change, which is not ideal for long term (e.g., from summer to winter) vSLAM applications, in addition to being computationally expensive. In ORB-SLAM2, place recognition runs in a separate execution thread to achieve real-time performance. Direct vSLAM systems (e.g., <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>) have become popular in the past decade, which achieve higher performance in certain scenarios. Adapting BoW into direct vSLAM systems is challenging because features are not selected with the goal of being matched across frames. In LSD-SLAM mentioned above, an additional set of features are detected and matched separately, which are used specifically for place recognition, at a higher computational cost.</p><p>In LDSO <ref type="bibr">[17]</ref>, the point selection strategy of its direct vSLAM system <ref type="bibr">[15]</ref> is tuned to flavor features that can be matched across frames to enable BoW. Our proposed approach for place recognition, however, is more elegant for direct vSLAM systems if stereo cameras are available.</p><p>A number of 3D place recognition methods have been designed for RGB-D cameras or LiDAR sensors. RGB-D Mapping <ref type="bibr">[18]</ref> uses ICP <ref type="bibr">[19]</ref> to detect loop closure and RANSAC <ref type="bibr">[20]</ref> to get an initial pose for ICP. For LiDAR, place recognition methods can be categorized into local descriptors and global descriptors. Local descriptors use a subset of the points and describe them in a local neighborhood. Examples are Spin Image <ref type="bibr">[21]</ref> and SHOT <ref type="bibr">[22]</ref>. Spin Image describes a keypoint by a histogram of points lying in each bin of a vertical cylinder centered at that keypoint. SHOT creates a sphere around a keypoint and describes that keypoint by the histogram of normal angles in each bin in the sphere. Global methods describe the entire set of points. These methods can be more computationally efficient. 
<p>Recent developments include NDT <ref type="bibr">[23]</ref>, M2DP <ref type="bibr">[24]</ref>, Scan Context <ref type="bibr">[25]</ref>, and DELIGHT <ref type="bibr">[26]</ref>. NDT classifies keypoints into line, plane, and sphere classes according to their neighborhoods; a histogram of these three classes is created to represent the point cloud. M2DP projects points onto multiple planes, and the histograms of point counts per bin on each projection plane are concatenated to form a signature of the point cloud. Scan Context aligns the point cloud with the vertical direction and represents it by the maximal height of each bin on the horizontal plane. DELIGHT focuses on LiDAR intensity: the scan sphere is divided into 16 parts and the histograms of LiDAR intensity in each part are concatenated to represent the point cloud.</p><p>Cieslewski et al. <ref type="bibr">[27]</ref> looked into the possibility of using the 3D points triangulated by Structure-from-Motion or vSLAM for place recognition. They proposed the NBLD descriptor <ref type="bibr">[27]</ref> for the 3D points from a vision-based system. A keypoint is described by its neighboring points in a vertical cylinder: the point density of each bin in the cylinder is calculated and compared with neighboring bins to create a binary descriptor of that keypoint. Ye et al. <ref type="bibr">[28]</ref> extended NBLD with a neural network. The vertical cylinder of NBLD is created in the same way; however, a neural network is trained to describe the cylinders instead of calculating point densities. These are novel approaches to adopting point cloud descriptors in vision-based systems for place recognition.</p><p>In this work, we adapt global LiDAR descriptors to stereo-visual odometry for robust and efficient place recognition under visual appearance change. Direct vSLAM systems can easily adopt the proposed approach for place recognition without modifying their point selection strategy.</p></div>
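<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the global descriptors surveyed above, the following is a minimal Python sketch in the spirit of Scan Context: the horizontal plane is divided into a polar grid and each cell keeps the maximum point height. This is a sketch under assumptions, not the published code; the grid resolution and radius are illustrative, and the actual method additionally defines a rotation-invariant, column-shifted comparison between descriptors.</p><code><![CDATA[
import numpy as np

def scan_context(points, num_rings=20, num_sectors=60, max_radius=80.0):
    """Scan Context-style global descriptor (illustrative sketch).

    Assumes `points` (N, 3) is already aligned with the vertical (z) axis.
    Returns a (num_rings, num_sectors) matrix of per-cell maximum heights.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                                # horizontal range
    theta = np.arctan2(y, x) + np.pi                  # angle in [0, 2*pi]
    keep = r < max_radius
    ring = np.minimum((r[keep] / max_radius * num_rings).astype(int),
                      num_rings - 1)
    sector = np.minimum((theta[keep] / (2 * np.pi) * num_sectors).astype(int),
                        num_sectors - 1)
    desc = np.full((num_rings, num_sectors), -np.inf)
    np.maximum.at(desc, (ring, sector), z[keep])      # max height per cell
    desc[np.isinf(desc)] = 0.0                        # empty cells default to 0
    return desc
]]></code></div>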
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. METHODOLOGY</head><p>Similar to the idea of <ref type="bibr">[27]</ref>, our method recognizes places based on the 3D points generated by visual odometry. The main difference is that the visual odometry in this work is running on stereo cameras. Specifically, we use SO-DSO <ref type="bibr">[29]</ref> as our stereo-visual odometry for its high accuracy and</p></div>			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: University of Minnesota. Downloaded on October 01,2021 at 03:05:29 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
