NSF PAR Search | NSF Public Access Repository

Probres: Probabilistic jump diffusion for open-world egocentric activity recognition

Kundu, Sanjoy; Vellamcheti, Shanmukha; Aakur, Sathyanarayanan N (October 2025, IEEE International Conference on Computer Vision (ICCV), 2025)

Free, publicly-accessible full text available October 30, 2026
Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning

Kundu, Sanjoy; Trehan, Shubham; Aakur, Sathyanarayanan N (November 2024, Springer)
Leonardis, Aleš; Ricci, Elisa; Roth, Stefan; Russakovsky, Olga; Sattler, Torsten; Varol, Gül (Eds.)
Learning to infer labels in an open world, i.e., in an environment where the target “labels” are unknown, is an important characteristic for achieving autonomy. Foundation models, pre-trained on enormous amounts of data, have shown remarkable generalization skills through prompting, particularly in zero-shot inference. However, their performance is limited by the correctness of the target label’s search space, i.e., the candidate labels provided in the prompt. In an open world, this target search space can be unknown or exceptionally large, which severely restricts their performance. To tackle this challenging problem, we propose a two-step, neuro-symbolic framework called ALGO (Action Learning with Grounded Object recognition) that uses symbolic knowledge stored in large-scale knowledge bases to infer activities in egocentric videos with limited supervision. First, we propose a neuro-symbolic prompting approach that uses object-centric vision-language models as a noisy oracle to ground objects in the video through evidence-based reasoning. Second, driven by prior commonsense knowledge, we discover plausible activities through an energy-based symbolic pattern theory framework and learn to ground knowledge-based action (verb) concepts in the video. Extensive experiments on four publicly available datasets (EPIC-Kitchens, GTEA Gaze, GTEA Gaze Plus, and Charades-Ego) demonstrate its performance on open-world activity inference. ALGO can also be extended to zero-shot inference, where it demonstrates competitive performance.
Full Text Available
EASE: Embodied Active Event Perception via Self-Supervised Energy Minimization

https://doi.org/10.1109/LRA.2025.3583626

Chen, Zhou; Kundu, Sanjoy; Baweja, Harsimran S; Aakur, Sathyanarayanan N (August 2025, IEEE Robotics and Automation Letters)

Free, publicly-accessible full text available August 1, 2026
IS-GGT: Iterative Scene Graph Generation with Generative Transformers

https://doi.org/10.1109/CVPR52729.2023.00609

Kundu, Sanjoy; Aakur, Sathyanarayanan N. (June 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection. Current approaches take a generation-by-classification approach, in which the scene graph is generated by labeling all possible edges between objects in a scene, which adds computational overhead. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while remaining competitive with unbiased SGG approaches.
Full Text Available
Knowledge guided learning: Open world egocentric action recognition with zero supervision

https://doi.org/10.1016/j.patrec.2022.03.007

Aakur, Sathyanarayanan N.; Kundu, Sanjoy; Gunti, Nikhil (April 2022, Pattern Recognition Letters)

Full Text Available
