skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.

Title: Joint Learning of 3D Shape Retrieval and Deformation
We propose a novel technique for producing high-quality 3D models that match a given target object image or scan. Our method is based on retrieving an existing shape from a database of 3D models and then deforming its parts to match the target shape. Unlike previous approaches that independently focus on either shape retrieval or deformation, we propose a joint learning procedure that simultaneously trains the neural deformation module along with the embedding space used by the retrieval module. This enables our network to learn a deformation-aware embedding space, so that retrieved models are more amenable to match the target after an appropriate deformation. In fact, we use the embedding space to guide the shape pairs used to train the deformation module, so that it invests its capacity in learning deformations between meaningful shape pairs. Furthermore, our novel part-aware deformation module can work with inconsistent and diverse part-structures on the source shapes. We demonstrate the benefits of our joint training not only on our novel framework, but also on other state-of-the-art neural deformation modules proposed in recent years. Lastly, we also show that our jointly-trained method outperforms various non-joint baselines.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We introduce a new problem of retrieving 3D models that are deformable to a given query shape and present a novel deep deformation-aware embedding to solve this retrieval task. 3D model retrieval is a fundamental operation for recovering a clean and complete 3D model from a noisy and partial 3D scan. However, given a finite collection of 3D shapes, even the closest model to a query may not be satisfactory. This motivates us to apply 3D model deformation techniques to adapt the retrieved model so as to better fit the query. Yet, certain restrictions are enforced in most 3D deformation techniques to preserve important features of the original model that prevent a perfect fitting of the deformed model to the query. This gap between the deformed model and the query induces asymmetric relationships among the models, which cannot be handled by typical metric learning techniques. Thus, to retrieve the best models for fitting, we propose a novel deep embedding approach that learns the asymmetric relationships by leveraging location-dependent egocentric distance fields. We also propose two strategies for training the embedding network. We demonstrate that both of these approaches outperform other baselines in our experiments with both synthetic and real data. Our project page can be found at 
    more » « less
  2. Telecystoscopy can lower the barrier to access critical urologic diagnostics for patients around the world. A major challenge for robotic control of flexible cystoscopes and intuitive teleoperation is the pose estimation of the scope tip. We propose a novel real-time camera localization method using video recordings from a prior cystoscopy and 3D bladder reconstruction to estimate cystoscope pose within the bladder during follow-up telecystoscopy. We map prior video frames into a low-dimensional space as a dictionary so that a new image can be likewise mapped to efficiently retrieve its nearest neighbor among the dictionary images. The cystoscope pose is then estimated by the correspondence among the new image, its nearest dictionary image, and the prior model from 3D reconstruction. We demonstrate performance of our methods using bladder phantoms with varying fidelity and a servo-controlled cystoscope to simulate the use case of bladder surveillance through telecystoscopy. The servo-controlled cystoscope with 3 degrees of freedom (angulation, roll, and insertion axes) was developed for collecting cystoscope videos from bladder phantoms. Cystoscope videos were acquired in a 2.5D bladder phantom (bladder-shape cross-section plus height) with a panorama of a urothelium attached to the inner surface. Scans of the 2.5D phantom were performed in separate arc trajectories each of which is generated by actuation on the angulation with a fixed roll and insertion length. We further included variance in moving speed, imaging distance and existence of bladder tumors. Cystoscope videos were also acquired in a water-filled 3D silicone bladder phantom with hand-painted vasculature. Scans of the 3D phantom were performed in separate circle trajectories each of which is generated by actuation on the roll axis under a fixed angulation and insertion length. These videos were used to create 3D reconstructions, dictionary sets, and test data sets for evaluating the computational efficiency and accuracy of our proposed method in comparison with a method based on global Scale-Invariant Feature Transform (SIFT) features, named SIFT-only. Our method can retrieve the nearest dictionary image for 94–100% of test frames in under 55[Formula: see text]ms per image, whereas the SIFT-only method can only find the image match for 56–100% of test frames in 6000–40000[Formula: see text]ms per image depending on size of the dictionary set and richness of SIFT features in the images. Our method, with a speed of around 20 Hz for the retrieval stage, is a promising tool for real-time image-based scope localization in robotic cystoscopy when prior cystoscopy images are available. 
    more » « less
  3. Perceiving and manipulating 3D articulated objects (e.g., cabinets, doors) in human environments is an important yet challenging task for future home-assistant robots. The space of 3D articulated objects is exceptionally rich in their myriad semantic categories, diverse shape geometry, and complicated part functionality. Previous works mostly abstract kinematic structure with estimated joint parameters and part poses as the visual representations for manipulating 3D articulated objects. In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point that the perception system outputs more actionable guidance than kinematic structure estimation, by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals. We design an interaction-for-perception framework VAT-Mart to learn such actionable visual representations by simultaneously training a curiosity-driven reinforcement learning policy exploring diverse interaction trajectories and a perception module summarizing and generalizing the explored knowledge for pointwise predictions among diverse shapes. Experiments prove the effectiveness of the proposed approach using the large-scale PartNet-Mobility dataset in SAPIEN environment and show promising generalization capabilities to novel test shapes, unseen object categories, and real-world data. 
    more » « less
  4. Visual Question Answering (VQA) is a fundamental task in computer vision and natural language process fields. Although the “pre-training & finetuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both im- age and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work revealsa significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found in the link

    more » « less
  5. Existing visual instance retrieval (VIR) approaches attempt to learn a faithful global matching metric or discriminative feature embedding offline to cover enormous visual appearance variations, so as to directly use it online on various unseen probes for retrieval. However, their requirement for a huge set of positive training pairs is very demanding in practice and the performance is largely constrained for the unseen testing samples due to the severe data shifting issue. In contrast, this paper advocates a different paradigm: part of the learning can be performed online but with nominal costs, so as to achieve online metric adaptation for different query probes. By exploiting easily-available negative samples, we propose a novel solution to achieve the optimal local metric adaptation effectively and efficiently. The insight of our method is the local hard negative samples can actually provide tight constraints to fine tune the metric locally. Our local metric adaptation method is generally applicable to be used on top of any offline-learned baselines. In addition, this paper gives in-depth theoretical analyses of the proposed method to guarantee the reduction of the classification error both asymptotically and practically. Extensive experiments on various VIR tasks have confirmed our effectiveness and superiority. 
    more » « less