This content will become publicly available on February 26, 2026

Title: The FineView Dataset: A 3D Scanned Multi-View Object Dataset of Fine-Grained Category Instances
Nature and wildlife observation is the practice of noting both the occurrence and abundance of plant or animal species at a specific location and time. Common examples of this type of activity are bird watching (birding), insect collecting, and plant observation (botanizing), and these are widely accepted as both recreational and scientific activities in their respective fields. However, many highly similar species are difficult to disambiguate; identifying an observed specimen requires expert knowledge and experience in many cases. This hard problem is called Fine-grained Visual Categorization (FGVC) and focuses on differentiating between hard-to-distinguish object classes. Examples of such fine-level classification include discriminating between similar species of plants and animals or identifying the make and model of vehicles, instead of recognizing these objects at a coarse level. An FGVC example of butterflies is shown in Figure 1. These two species have similar colors and shapes, but the patterns on the wings are distinct. When presented with near-identical poses as in the figure, this classification can be performed very effectively by a machine. However, in more extreme conditions of pose, illumination, occlusion, etc., the task becomes much harder. While machines struggle in such scenarios, humans can still find the needed visual cues and differences by factoring in the pose of the butterfly and comparing patterns on common parts; in part, because humans can infer an object's rough 3D shape, understand the lighting and camera angle, and even envision what it would look like from another pose. Humans have developed a 3D understanding of a butterfly because we have seen moving butterflies previously. What if machines had the same information about the object? Information such as object pose, camera angle, object texture, and part labels would undoubtedly help improve performance on the FGVC task.
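
The abstract's closing claim is that per-view information such as object pose and camera angle should help an FGVC model. As a purely illustrative sketch (not the FineView dataset's tooling or the paper's method), the PyTorch snippet below shows one way such metadata could be fed to a classifier; the `MetadataAwareFGVC` name, the 6-dimensional pose/camera vector, and the ResNet-50 backbone are all assumptions.

```python
# Illustrative sketch only: a fine-grained classifier that consumes per-view
# metadata (e.g. camera azimuth/elevation, object pose) alongside the image,
# reflecting the abstract's claim that such information could aid FGVC.
import torch
import torch.nn as nn
import torchvision.models as models

class MetadataAwareFGVC(nn.Module):
    def __init__(self, num_classes: int, meta_dim: int = 6):
        super().__init__()
        backbone = models.resnet50(weights=None)   # image feature extractor (assumed)
        feat_dim = backbone.fc.in_features          # 2048 for ResNet-50
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # small MLP that embeds the pose/camera metadata vector
        self.meta_mlp = nn.Sequential(
            nn.Linear(meta_dim, 64), nn.ReLU(), nn.Linear(64, 64)
        )
        self.classifier = nn.Linear(feat_dim + 64, num_classes)

    def forward(self, image: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        img_feat = self.backbone(image)             # (B, 2048)
        meta_feat = self.meta_mlp(meta)             # (B, 64)
        return self.classifier(torch.cat([img_feat, meta_feat], dim=1))

# usage with dummy data
model = MetadataAwareFGVC(num_classes=200)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 6))
print(logits.shape)  # torch.Size([2, 200])
```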
Award ID(s):
1651832
PAR ID:
10630063
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISSN:
2642-9381
ISBN:
979-8-3315-1083-1
Page Range / eLocation ID:
5623 to 5634
Format(s):
Medium: X
Location:
Tucson, AZ, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. 3D hand pose estimation in everyday egocentric images is challenging for several reasons: poor visual signal (occlusion from the object of interaction, low resolution & motion blur), large perspective distortion (hands are close to the camera), and lack of 3D annotations outside of controlled settings. While existing methods often use hand crops as input to focus on fine-grained visual information to deal with poor visual signal, the challenges arising from perspective distortion and lack of 3D annotations in the wild have not been systematically studied. We focus on this gap and explore the impact of different practices, i.e., crops as input, incorporating camera information, auxiliary supervision, and scaling up datasets. We provide several insights that are applicable to both convolutional and transformer models, leading to better performance. Based on our findings, we also present WildHands, a system for 3D hand pose estimation in everyday egocentric images. Zero-shot evaluation on 4 diverse datasets (H2O, AssemblyHands, Epic-Kitchens, Ego-Exo4D) demonstrates the effectiveness of our approach across 2D and 3D metrics, where we beat past methods by 7.4% – 66%. In system-level comparisons, WildHands achieves the best 3D hand pose on the ARCTIC egocentric split, outperforms FrankMocap across all metrics and HaMeR on 3 out of 6 metrics while being 10× smaller and trained on 5× less data.
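
One of the practices this abstract highlights is incorporating camera information alongside hand crops. The following is a minimal sketch of that general idea, not WildHands' actual architecture: it encodes the viewing rays of the crop corners from the camera intrinsics and concatenates them with crop features, so the regressor can account for the perspective distortion a bare crop hides. The `crop_corner_rays` helper, the ResNet-18 backbone, and the 21-joint output are assumptions.

```python
# Hypothetical sketch: conditioning a crop-based 3D hand pose regressor on
# camera information via the viewing rays of the crop corners.
import torch
import torch.nn as nn
import torchvision.models as models

def crop_corner_rays(box: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """box: (B, 4) crop in pixels [x1, y1, x2, y2]; K: (B, 3, 3) intrinsics.
    Returns (B, 8): normalized ray directions (x/z, y/z) of the four corners."""
    x1, y1, x2, y2 = box.unbind(dim=1)
    corners = torch.stack([x1, y1, x2, y1, x1, y2, x2, y2], dim=1).view(-1, 4, 2)
    pix = torch.cat([corners, torch.ones_like(corners[..., :1])], dim=-1)  # homogeneous
    rays = torch.einsum('bij,bkj->bki', torch.inverse(K), pix)             # back-project
    return (rays[..., :2] / rays[..., 2:3]).flatten(1)

class CameraAwareHandPose(nn.Module):
    def __init__(self, num_joints: int = 21):
        super().__init__()
        self.num_joints = num_joints
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features          # 512 for ResNet-18
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim + 8, num_joints * 3)  # 3D joint coordinates

    def forward(self, crop, box, K):
        feat = torch.cat([self.backbone(crop), crop_corner_rays(box, K)], dim=1)
        return self.head(feat).view(-1, self.num_joints, 3)

# usage with dummy data
model = CameraAwareHandPose()
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]]).repeat(2, 1, 1)
box = torch.tensor([[100., 80., 260., 240.]]).repeat(2, 1)
out = model(torch.randn(2, 3, 224, 224), box, K)
print(out.shape)  # torch.Size([2, 21, 3])
```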
  2. Humans often use natural language instructions to control and interact with robots for task execution. This poses a significant challenge to robots, which must not only parse and understand human instructions but also achieve semantic understanding of an unknown environment and its constituent elements. To address this challenge, this study presents a vision-language model (VLM)-driven approach to scene understanding of an unknown environment to enable robotic object manipulation. Given language instructions, a pretrained vision-language model built on open-sourced Llama2-chat (7B) as the language model backbone is adopted for image description and scene understanding, which translates visual information into text descriptions of the scene. Next, a zero-shot-based approach to fine-grained visual grounding and object detection is developed to extract and localise objects of interest in the scene. After 3D reconstruction and pose estimation of the object, a code-writing large language model (LLM) is adopted to generate high-level control codes and link language instructions with robot actions for downstream tasks. The performance of the developed approach is experimentally validated through table-top object manipulation by a robot.
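
This abstract describes a four-stage pipeline (VLM scene description, zero-shot grounding and detection, 3D reconstruction and pose estimation, and LLM code generation). Below is a hedged orchestration sketch of that flow; every function name and signature is a placeholder standing in for the unspecified components, not the paper's implementation.

```python
# Illustrative pipeline sketch only; function names, signatures, and models
# are placeholders for the components described in the abstract.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    label: str
    box: tuple                      # 2D bounding box (x1, y1, x2, y2)
    pose: Optional[tuple] = None    # 6-DoF object pose once estimated

def describe_scene(image, instruction: str) -> str:
    """Stage 1: a VLM (the paper uses a Llama2-chat-7B-based model) turns the
    image into a text description of the scene, conditioned on the instruction."""
    raise NotImplementedError("call your vision-language model here")

def ground_objects(image, description: str) -> list:
    """Stage 2: zero-shot visual grounding / detection of objects of interest."""
    raise NotImplementedError("call your open-vocabulary detector here")

def estimate_pose(image, det: Detection) -> Detection:
    """Stage 3: 3D reconstruction and pose estimation for one detected object."""
    raise NotImplementedError("call your pose estimator here")

def write_robot_code(instruction: str, detections: list) -> str:
    """Stage 4: a code-writing LLM maps the instruction and grounded object
    poses to high-level robot control code."""
    raise NotImplementedError("call your code-writing LLM here")

def run(image, instruction: str) -> str:
    description = describe_scene(image, instruction)
    detections = [estimate_pose(image, d) for d in ground_objects(image, description)]
    return write_robot_code(instruction, detections)
```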
  3. Monocular estimation of 3D human pose has attracted increased attention with the availability of large ground-truth motion capture datasets. However, the diversity of training data available is limited and it is not clear to what extent methods generalize outside the specific datasets they are trained on. In this work we carry out a systematic study of the diversity and biases present in specific datasets and their effect on cross-dataset generalization across a compendium of 5 pose datasets. We specifically focus on systematic differences in the distribution of camera viewpoints relative to a body-centered coordinate frame. Based on this observation, we propose an auxiliary task of predicting the camera viewpoint in addition to pose. We find that models trained to jointly predict viewpoint and pose systematically show significantly improved cross-dataset generalization.
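
The proposed auxiliary task above is to predict the camera viewpoint relative to a body-centered frame alongside the 3D pose. A minimal multi-task sketch follows; the two-head layout, the 3-vector viewpoint parameterization, and the loss weight `lam` are assumptions rather than the paper's choices.

```python
# Minimal sketch of the auxiliary-viewpoint idea: a shared feature feeds a
# pose head and a viewpoint head, and both are supervised jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseWithViewpoint(nn.Module):
    def __init__(self, feat_dim: int = 2048, num_joints: int = 17):
        super().__init__()
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)  # 3D joint coordinates
        self.view_head = nn.Linear(feat_dim, 3)                # assumed viewpoint vector

    def forward(self, feat):
        return self.pose_head(feat), self.view_head(feat)

def joint_loss(pred_pose, pred_view, gt_pose, gt_view, lam: float = 0.1):
    # main pose regression loss plus auxiliary viewpoint regression loss
    return F.mse_loss(pred_pose, gt_pose) + lam * F.mse_loss(pred_view, gt_view)

# usage with dummy features and targets
feat = torch.randn(4, 2048)
pose, view = PoseWithViewpoint()(feat)
loss = joint_loss(pose, view, torch.randn(4, 51), torch.randn(4, 3))
loss.backward()
```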
  4. Fine-Grained Visual Classification (FGVC) datasets have small sample sizes, along with significant intra-class variation and inter-class similarity. While prior work has addressed intra-class variation using localization and segmentation techniques, inter-class similarity may also affect feature learning and reduce classification performance. In this work, we address this problem using a novel optimization procedure for end-to-end neural network training on FGVC tasks. Our procedure, called Pairwise Confusion (PC), reduces overfitting by intentionally introducing confusion in the activations. With PC regularization, we obtain state-of-the-art performance on six of the most widely used FGVC datasets and demonstrate improved localization ability. PC is easy to implement, does not need excessive hyperparameter tuning during training, and does not add significant overhead during test time.
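
Pairwise Confusion is described above as intentionally introducing confusion in the activations to counter inter-class similarity and overfitting. The sketch below implements one plausible reading of that idea: a cross-entropy loss plus a term that pulls the predicted distributions of different-class sample pairs closer together. The pairing scheme, the different-class mask, and the weight `lam` are assumptions and may differ from the paper's exact formulation.

```python
# Hedged sketch of confusion-style regularization: alongside cross-entropy,
# penalize the distance between predicted class distributions of sample pairs
# from different classes, deliberately softening overconfident predictions.
import torch
import torch.nn.functional as F

def pairwise_confusion_loss(logits: torch.Tensor, targets: torch.Tensor,
                            lam: float = 10.0) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    half = logits.size(0) // 2
    p1, p2 = probs[:half], probs[half:2 * half]            # pair up the two batch halves
    diff_class = (targets[:half] != targets[half:2 * half]).float()
    confusion = (diff_class * (p1 - p2).pow(2).sum(dim=1)).mean()
    return ce + lam * confusion

# usage with dummy logits and labels
logits = torch.randn(8, 200, requires_grad=True)
loss = pairwise_confusion_loss(logits, torch.randint(0, 200, (8,)))
loss.backward()
```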