skip to main content


Search for: All records

Award ID contains: 2024688

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. The main challenge of monocular 3D object detection is the accurate localization of 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propose a stage-wise approach, which combines the information flow from 2D-to-3D (3D bounding box proposal generation with a single 2D image) and 3D-to-2D (proposal verification by denoising with 3D-to-2D contexts) in a topdown manner. Specifically, we first obtain initial proposals from off-the-shelf backbone monocular 3D detectors. Then, we generate a 3D anchor space by local-grid sampling from the initial proposals. Finally, we perform 3D bounding box denoising at the 3D-to-2D proposal verification stage. To effectively learn discriminative features for denoising highly overlapped proposals, this paper presents a method of using the Perceiver I/O model [20] to fuse the 3D-to-2D geometric information and the 2D appearance information. With the encoded latent representation of a proposal, the verification head is implemented with a self-attention module. Our method, named as MonoXiver, is generic and can be easily adapted to any backbone monocular 3D detectors. Experimental results on the well-established KITTI dataset and the challenging large-scale Waymo dataset show that MonoXiver consistently achieves improvement with limited computation overhead. 
    more » « less
    Free, publicly-accessible full text available October 1, 2024
  2. Decision-based black-box attacks often necessitate a large number of queries to craft an adversarial example. Moreover, decision-based attacks based on querying boundary points in the estimated normal vector direction often suffer from inefficiency and convergence issues. In this paper, we propose a novel query-efficient \b curvature-aware \b geometric decision-based \b black-box \b attack (CGBA) that conducts boundary search along a semicircular path on a restricted 2D plane to ensure finding a boundary point successfully irrespective of the boundary curvature. While the proposed CGBA attack can work effectively for an arbitrary decision boundary, it is particularly efficient in exploiting the low curvature to craft high-quality adversarial examples, which is widely seen and experimentally verified in commonly used classifiers under non-targeted attacks. In contrast, the decision boundaries often exhibit higher curvature under targeted attacks. Thus, we develop a new query-efficient variant, CGBA-H, that is adapted for the targeted attack. In addition, we further design an algorithm to obtain a better initial boundary point at the expense of some extra queries, which considerably enhances the performance of the targeted attack. Extensive experiments are conducted to evaluate the performance of our proposed methods against some well-known classifiers on the ImageNet and CIFAR10 datasets, demonstrating the superiority of CGBA and CGBA-H over state-of-the-art non-targeted and targeted attacks, respectively. 
    more » « less
    Free, publicly-accessible full text available October 1, 2024
  3. This paper presents a neural incremental Structure-from-Motion (SfM) approach, Level-S2fM, which estimates the camera poses and scene geometry from a set of uncalibrated images by learning coordinate MLPs for the implicit surfaces and the radiance fields from the established key-point correspondences. Our novel formulation poses some new challenges due to inevitable two-view and few-view configurations in the incremental SfM pipeline, which complicates the optimization of coordinate MLPs for volumetric neural rendering with unknown camera poses. Nevertheless, we demonstrate that the strong inductive basis conveying in the 2D correspondences is promising to tackle those challenges by exploiting the relationship between the ray sampling schemes. Based on this, we revisit the pipeline of incremental SfM and renew the key components, including two-view geometry initialization, the camera poses registration, the 3D points triangulation, and Bundle Adjustment, with a fresh perspective based on neural implicit surfaces. By unifying the scene geometry in small MLP networks through coordinate MLPs, our Level-S2fM treats the zero-level set of the implicit surface as an informative top-down regularization to manage the reconstructed 3D points, reject the outliers in correspondences via querying SDF, and refine the estimated geometries by NBA (Neural BA). Not only does our Level-S2fM lead to promising results on camera pose estimation and scene geometry reconstruction, but it also shows a promising way for neural implicit rendering without knowing camera extrinsic beforehand. 
    more » « less
    Free, publicly-accessible full text available June 1, 2024
  4. Vision Transformers (ViTs) are built on the assumption of treating image patches as “visual tokens” and learn patch-to-patch attention. The patch embedding based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViTs. To address these issues in ViT, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT starts with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance in all the three benchmarks than the SWin [32] and the PVTs [47], [48] by significant margins in ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https:/github.com/iVMCL/PaCaViT. 
    more » « less
    Free, publicly-accessible full text available June 1, 2024
  5. This paper studies the challenging two-view 3D reconstruction problem in a rigorous sparse-view configuration, which is suffering from insufficient correspondences in the input image pairs for camera pose estimation. We present a novel Neural One-PlanE RANSAC framework (termed NOPE-SAC in short) that exerts excellent capability of neural networks to learn one-plane pose hypotheses from 3D plane correspondences. Building on the top of a Siamese network for plane detection, our NOPE-SAC first generates putative plane correspondences with a coarse initial pose. It then feeds the learned 3D plane correspondences into shared MLPs to estimate the one-plane camera pose hypotheses, which are subsequently reweighed in a RANSAC manner to obtain the final camera pose. Because the neural one-plane pose minimizes the number of plane correspondences for adaptive pose hypotheses generation, it enables stable pose voting and reliable pose refinement with a few of plane correspondences for the sparse-view inputs. In the experiments, we demonstrate that our NOPE-SAC significantly improves the camera pose estimation for the two-view inputs with severe viewpoint changes, setting several new state-of-the-art performances on two challenging benchmarks, i.e., MatterPort3D and ScanNet, for sparse-view 3D reconstruction. The source code is released at https://github.com/IceTTTb/NopeSAC for reproducible research. 
    more » « less
  6. This article presents Holistically-Attracted Wireframe Parsing (HAWP), a method for geometric analysis of 2D images containing wireframes formed by line segments and junctions. HAWP utilizes a parsimonious Holistic Attraction (HAT) field representation that encodes line segments using a closed-form 4D geometric vector field. The proposed HAWP consists of three sequential components empowered by end-to-end and HAT-driven designs: (1) generating a dense set of line segments from HAT fields and endpoint proposals from heatmaps, (2) binding the dense line segments to sparse endpoint proposals to produce initial wireframes, and (3) filtering false positive proposals through a novel endpoint-decoupled line-of-interest aligning (EPD LOIAlign) module that captures the co-occurrence between endpoint proposals and HAT fields for better verification. Thanks to our novel designs, HAWPv2 shows strong performance in fully supervised learning, while HAWPv3 excels in self-supervised learning, achieving superior repeatability scores and efficient training (24 GPU hours on a single GPU). Furthermore, HAWPv3 exhibits a promising potential for wireframe parsing in out-of-distribution images without providing ground truth labels of wireframes. 
    more » « less
  7. The integration of DNN-contextualized binary-pattern-driven non-parametric cost volume and DNN cost aggregation leads to more robust and more generalizable stereo matching. Abstract: Stereo matching is a classic challenging problem in computer vision, which has recently witnessed remarkable progress by Deep Neural Networks (DNNs). This paradigm shift leads to two interesting and entangled questions that have not been addressed well. First, it is unclear whether stereo matching DNNs that are trained from scratch really learn to perform matching well. This paper studies this problem from the lens of white-box adversarial attacks. It presents a method of learning stereo-constrained photometrically-consistent attacks, which by design are weaker adversarial attacks, and yet can cause catastrophic performance drop for those DNNs. This observation suggests that they may not actually learn to perform matching well in the sense that they should otherwise achieve potentially even better after stereo-constrained perturbations are introduced. Second, stereo matching DNNs are typically trained under the simulation-to-real (Sim2Real) pipeline due to the data hungriness of DNNs. Thus, alleviating the impacts of the Sim2Real photometric gap in stereo matching DNNs becomes a pressing need. Towards joint adversarially robust and domain generalizable stereo matching, this paper proposes to learn DNN-contextualized binary-pattern-driven non-parametric cost-volumes. It leverages the perspective of learning the cost aggregation via DNNs, and presents a simple yet expressive design that is fully end-to-end trainable, without resorting to specific aggregation inductive biases. In experiments, the proposed method is tested in the SceneFlow dataset, the KITTI2015 dataset, and the Middlebury dataset. It significantly improves the adversarial robustness, while retaining accuracy performance comparable to state-of-the-art methods. It also shows a better Sim2Real generalizability. Our code and pretrained models are released at \href{https://github.com/kelkelcheng/AdversariallyRobustStereo}{this Github Repo}. 
    more » « less
  8. Human-robot collaboration (HRC) is an emerging research area that has gained tremendous attention from both academia and industry. Since some robot-related factors can elicit mental stress or have negative psychological effects on human workers, it is essential to understand these factors and maintain workers’ mental stress at a low level. Galvanic Skin Response (GSR) measures skin conductance and is known to be a physiological measurement that reflects short-term mental stress. Typically, skin conductance increases in response to greater mental stress. In this study, the mental stress caused by the hand-over activities of a collaborative robot was investigated using both GSR as an objective measurement and NASA-Task Load Index (TLX) as a subjective assessment. Several robot-related factors that may lead to mental stress were experimentally examined. GSR outcomes indicated that end effector approaching within workers’ view, low end effector speed, and constrained end effector trajectory led to a significantly lower skin conductance. Some aspects of the NASA-TLX also indicated that speed and trajectory significantly affected the scores. Yet, no significant differences were found between approaching directions regarding NASA-TLX scores. 
    more » « less
  9. Work-related musculoskeletal disorders (MSDs) are often observed in human-robot collaboration (HRC), a common work configuration in modern factories. In this study, we aim to reduce the risk of MSDs in HRC scenarios by developing a novel model-free reinforcement learning (RL) method to improve workers’ postures. Our approach follows two steps: first, we adopt a 3D human skeleton reconstruction method to calculate workers’ Rapid Upper Limb Assessment (RULA) scores; next, we devise an online gradient-based RL algorithm to dynamically improve the RULA score. Compared with previous model-based studies, the key appeals of the proposed RL algorithm are two-fold: (i) the model-free structure allows it to “learn” the optimal worker postures without need any specific biomechanical models of tasks or workers, and (ii) the data-driven nature makes it accustomed to arbitrary users by providing personalized work configurations. Results of our experiments confirm that the proposed method can significantly improve the workers’ postures. 
    more » « less