Loop closure detection is a critical component of large-scale simultaneous localization and mapping (SLAM) in loopy environments. This capability is challenging to achieve in long-term SLAM, when the environment's appearance varies significantly across times of day, months, and even seasons. In this paper, we introduce a novel formulation to learn an integrated long-term representation based upon both holistic and landmark information, which integrates two previous insights under a unified framework: (1) holistic representations outperform keypoint-based representations, and (2) landmarks as an intermediate representation provide informative cues to detect challenging locations. Our new approach learns the representation by projecting input visual data into a low-dimensional space, which preserves both the global consistency (to minimize representation error) and the local consistency (to preserve landmarks' pairwise relationships) of the input data. To solve the formulated optimization problem, a new algorithm is developed with theoretically guaranteed convergence. Extensive experiments have been conducted on two large-scale public benchmark data sets, and the results demonstrate the effectiveness of the proposed approach.
Visual Place Recognition via Robust ℓ2-Norm Distance Based Holism and Landmark Integration
Visual place recognition is essential for large-scale simultaneous localization and mapping (SLAM). Long-term robot operations across different times of day, months, and seasons introduce new challenges from significant variations in environment appearance. In this paper, we propose a novel method to learn a location representation that integrates the semantic landmarks of a place with its holistic representation. To make our new model robust against the drastic appearance variations caused by long-term visual changes, we formulate our objective using non-squared ℓ2-norm distances, which leads to a difficult optimization problem that minimizes the ratio of the ℓ2,1-norms of matrices. To solve our objective, we derive a new efficient iterative algorithm, whose convergence is rigorously guaranteed by theory. In addition, because our solution is strictly orthogonal, the learned location representations can have better place recognition capabilities. We evaluate the proposed method using two large-scale benchmark data sets, the CMU-VL and Nordland data sets. Experimental results validate the effectiveness of our new method in long-term visual place recognition applications.
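As a rough illustration of why a non-squared ℓ2-norm distance can be more robust than a squared one, the sketch below computes an ℓ2,1-norm and a ratio of two such norms with NumPy. It is not the paper's algorithm: the row-wise convention (sum of Euclidean norms of the rows), the function names `l21_norm` and `l21_ratio`, and the matrices `R` and `B` are all assumptions made for this example.

```python
import numpy as np


def l21_norm(M):
    """l2,1-norm under the row-wise convention (an assumption here):
    the sum of the non-squared Euclidean norms of the rows of M."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


def squared_frobenius(M):
    """Sum of squared entries, i.e. the sum of squared row norms."""
    return float(np.sum(M ** 2))


def l21_ratio(A, B, eps=1e-12):
    """Ratio of two l2,1-norms; eps guards against division by zero."""
    return l21_norm(A) / (l21_norm(B) + eps)


# Hypothetical residual matrix: one row per sample.
# The last row plays the role of an outlier with a large residual.
R = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [6.0, 8.0]])
B = np.eye(3, 2)  # hypothetical denominator matrix with two unit-norm rows

row_norms = np.linalg.norm(R, axis=1)
# Share of the total cost that comes from the outlier row:
share_l21 = row_norms[-1] / l21_norm(R)               # grows linearly
share_sq = row_norms[-1] ** 2 / squared_frobenius(R)  # grows quadratically

print(l21_norm(R), squared_frobenius(R), l21_ratio(R, B))
print(share_l21, share_sq)
```

In this toy case the outlier row accounts for a smaller share of the non-squared cost than of the squared cost, which is the intuition behind preferring non-squared distances when appearance changes produce a few very large residuals.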
- PAR ID: 10129573
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 33
- ISSN: 2159-5399
- Page Range / eLocation ID: 8034 to 8041
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Place recognition is a core component of Simultaneous Localization and Mapping (SLAM) algorithms. Particularly in visual SLAM systems, previously-visited places are recognized by measuring the appearance similarity between images representing these locations. However, such approaches are sensitive to visual appearance change and also can be computationally expensive. In this paper, we propose an alternative approach adapting LiDAR descriptors for 3D points obtained from stereo-visual odometry for place recognition. 3D points are potentially more reliable than 2D visual cues (e.g., 2D features) against environmental changes (e.g., variable illumination) and this may benefit visual SLAM systems in long-term deployment scenarios. Stereo-visual odometry generates 3D points with an absolute scale, which enables us to use LiDAR descriptors for place recognition with high computational efficiency. Through extensive evaluations on standard benchmark datasets, we demonstrate the accuracy, efficiency, and robustness of using 3D points for place recognition over 2D methods.
- Despite the advances in Human Activity Recognition, the ability to exploit the dynamics of human body motion in videos has yet to be achieved. In numerous recent works, researchers have used appearance and motion as independent inputs to infer the action that is taking place in a specific video. In this paper, we highlight that while using a novel representation of human body motion, we can benefit from appearance and motion simultaneously. As a result, better performance of action recognition can be achieved. We start with a pose estimator to extract the location and heat-map of body joints in each frame. We use a dynamic encoder to generate a fixed size representation from these body joint heat-maps. Our experimental results show that training a convolutional neural network with the dynamic motion representation outperforms state-of-the-art action recognition models. By modeling distinguishable activities as distinct dynamical systems and with the help of two-stream networks, we obtain the best performance on HMDB, JHMDB, UCF-101, and AVA datasets.
- While machine learning approaches to visual emotion recognition offer great promise, current methods consider training and testing models on small scale datasets covering limited visual emotion concepts. Our analysis identifies an important but long overlooked issue of existing visual emotion benchmarks in the form of dataset biases. We design a series of tests to show and measure how such dataset biases obstruct learning a generalizable emotion recognition model. Based on our analysis, we propose a webly supervised approach by leveraging a large quantity of stock image data. Our approach uses a simple yet effective curriculum guided training strategy for learning discriminative emotion features. We discover that the models learned using our large scale stock image dataset exhibit significantly better generalization ability than the existing datasets without the manual collection of even a single label. Moreover, visual representation learned using our approach holds a lot of promise across a variety of tasks on different image and video datasets.
- Loop closure detection is a fundamental problem for simultaneous localization and mapping (SLAM) in robotics. Most of the previous methods only consider one type of information, based on either visual appearances or spatial relationships of landmarks. In this paper, we introduce a novel visual-spatial information preserving multi-order graph matching approach for long-term loop closure detection. Our approach constructs a graph representation of a place from an input image to integrate visual-spatial information, including visual appearances of the landmarks and the background environment, as well as the second- and third-order spatial relationships between two and three landmarks, respectively. Furthermore, we introduce a new formulation that formulates loop closure detection as a multi-order graph matching problem to compute a similarity score directly from the graph representations of the query and template images, instead of performing conventional vector-based image matching. We evaluate the proposed multi-order graph matching approach based on two public long-term loop closure detection benchmark datasets, including the St. Lucia and CMU-VL datasets. Experimental results have shown that our approach is effective for long-term loop closure detection and it outperforms the previous state-of-the-art methods.