NSF PAR Search | NSF Public Access Repository


ExpertAF: Expert Actionable Feedback from Video

https://doi.org/10.1109/CVPR52734.2025.01268

Ashutosh, Kumar; Nagarajan, Tushar; Pavlakos, Georgios; Kitani, Kris; Grauman, Kristen (June 2025, IEEE)

Feedback is essential for learning a new skill or improving one’s current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D’s [29] videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method is able to reason across multi-modal input combinations to output full-spectrum, actionable coaching (expert commentary, expert video retrieval, and expert pose generation), outperforming strong vision-language models on both established metrics and human preference studies.
Free, publicly-accessible full text available June 10, 2026
Helping People Through Space and Time: Assistance as a Perspective on Human-Robot Interaction

https://doi.org/10.3389/frobt.2021.720319

Newman, Benjamin A.; Aronson, Reuben M.; Kitani, Kris; Admoni, Henny (January 2022, Frontiers in Robotics and AI)

As assistive robotics has expanded to many task domains, comparing assistive strategies among the varieties of research becomes increasingly difficult. To begin to unify the disparate domains into a more general theory of assistance, we present a definition of assistance, a survey of existing work, and three key design axes that occur in many domains and benefit from the examination of assistance as a whole. We first define an assistance perspective that focuses on understanding a robot that is in control of its actions but subordinate to a user’s goals. Next, we use this perspective to explore design axes that arise from the problem of assistance more generally and explore how these axes have comparable trade-offs across many domains. We investigate how the assistive robot handles other people in the interaction, how the robot design can operate in a variety of action spaces to enact similar goals, and how assistive robots can vary the timing of their actions relative to the user’s behavior. While these axes are by no means comprehensive, we propose them as useful tools for unifying assistance research across domains and as examples of how taking a broader perspective on assistance enables more cross-domain theorizing about assistance.
Full Text Available
MGpi: A Computational Model of Multiagent Group Perception and Interaction

https://doi.org/10.5555/3398761.3398900

Sanghvi, Navyata; Yonetani, Ryo; Kitani, Kris (May 2020, AAMAS '20: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems)

Toward enabling next-generation robots capable of socially intelligent interaction with humans, we present a computational model of interactions in a social environment of multiple agents and multiple groups. The Multiagent Group Perception and Interaction (MGpi) network is a deep neural network that predicts the appropriate social action to execute in a group conversation (e.g., speak, listen, respond, leave), taking into account neighbors' observable features (e.g., location of people, gaze orientation, distraction). A central component of MGpi is the Kinesic-Proxemic-Message (KPM) gate, which performs social signal gating to extract important information from a group conversation. In particular, the KPM gate filters incoming social cues from nearby agents by observing their body gestures (kinesics) and spatial behavior (proxemics). The MGpi network and its KPM gate are learned via imitation learning, using demonstrations from our designed social interaction simulator. Further, we demonstrate the efficacy of the KPM gate as a social attention mechanism, achieving state-of-the-art performance on the task of group identification without using explicit group annotations, layout assumptions, or manually chosen parameters.
Full Text Available
Virtual navigation for blind people: Transferring route knowledge to the real-World

https://doi.org/10.1016/j.ijhcs.2019.102369

Guerreiro, João; Sato, Daisuke; Ahmetovic, Dragan; Ohn-Bar, Eshed; Kitani, Kris M.; Asakawa, Chieko (March 2020, International Journal of Human-Computer Studies)

Full Text Available
ADA: Adversarial Data Augmentation for Object Detection

Behpour, Sima; Kitani, Kris; Ziebart, Brian D. (January 2019, IEEE Winter Conf. on Applications of Computer Vision)

The use of random perturbations of ground truth data, such as random translation or scaling of bounding boxes, is a common heuristic used for data augmentation that has been shown to prevent overfitting and improve generalization. Since the design of data augmentation is largely guided by reported best practices, it is difficult to understand if those design choices are optimal. To provide a more principled perspective, we develop a game-theoretic interpretation of data augmentation in the context of object detection. We aim to find optimal adversarial perturbations of the ground truth data (i.e., the worst-case perturbations) that force the object bounding box predictor to learn from the hardest distribution of perturbed examples for better test-time performance. We establish that the game-theoretic solution (Nash equilibrium) provides both an optimal predictor and an optimal data augmentation distribution. We show that our adversarial method of training a predictor can significantly improve test-time performance for the task of object detection. On the ImageNet, Pascal VOC, and MS-COCO object detection tasks, our adversarial approach improves performance by about 16%, 5%, and 2%, respectively, compared to the best performing data augmentation methods.
Full Text Available
Future Near-Collision Prediction from Monocular Video: Feasibility, Dataset, and Challenges

Manglik, Aashi; Weng, Xinshuo; Ohn-Bar, Eshed; Kitani, Kris (April 2019, IEEE/RSJ International Conference on Intelligent Robots and Systems)

We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without relying on explicit geometric depth estimates or velocity information to predict future collisions. While previous work has focused on detecting immediate collision in the context of navigating Unmanned Aerial Vehicles, the detection was limited to a binary variable (i.e., collision or no collision). We propose a more fine-grained approach to collision forecasting by predicting the exact time to collision in milliseconds, which is more helpful for collision avoidance in the context of dynamic path planning. To evaluate our method, we have collected a novel large-scale dataset of over 13,000 indoor video segments, each showing a trajectory of at least one person ending in close proximity (a near collision) with the camera mounted on a mobile suitcase-shaped platform. Using this dataset, we do extensive experimentation on different temporal windows as input using an exhaustive list of state-of-the-art convolutional neural networks (CNNs). Our results show that our proposed multi-stream CNN is the best model for predicting time to near-collision. The average prediction error of our time to near-collision is 0.75 seconds across our test environments.
Full Text Available
A-EXP4: Online Social Policy Learning for Adaptive Robot-Pedestrian Interaction

https://doi.org/10.1109/IROS40897.2019.8967737

Jin, Pengju; Ohn-Bar, Eshed; Kitani, Kris; Asakawa, Chieko (January 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS))

We study self-supervised adaptation of a robot's policy for social interaction, i.e., a policy for active communication with surrounding pedestrians through audio or visual signals. Inspired by the observation that humans continually adapt their behavior when interacting under varying social contexts, we propose Adaptive EXP4 (A-EXP4), a novel online learning algorithm for adapting the robot-pedestrian interaction policy. To address limitations of bandit algorithms in adaptation to unseen and highly dynamic scenarios, we employ a mixture model over the policy parameter space. Specifically, a Dirichlet Process Gaussian Mixture Model (DPMM) is used to cluster the parameters of sampled policies and maintain a mixture model over the clusters, hence effectively discovering policies that are suitable to the current environmental context in an unsupervised manner. Our simulated and real-world experiments demonstrate the feasibility of A-EXP4 in accommodating interaction with different types of pedestrians while jointly minimizing social disruption through the adaptation process. While the A-EXP4 formulation is kept general for application in a variety of domains requiring continual adaptation of a robot's policy, we specifically evaluate the performance of our algorithm using a suitcase-inspired assistive robotic platform. In this concrete assistive scenario, the algorithm observes how audio signals produced by the navigational system affect the behavior of pedestrians and adapts accordingly. Consequently, we find that A-EXP4 effectively adapts the interaction policy for gently clearing a navigation path in crowded settings, resulting in a significant reduction in empirical regret compared to the EXP4 baseline.
Full Text Available
NavCog3 in the Wild: Large-scale Blind Indoor Navigation Assistant with Semantic Features

https://doi.org/10.1145/3340319

Sato, Daisuke; Oh, Uran; Guerreiro, João; Ahmetovic, Dragan; Naito, Kakuya; Takagi, Hironobu; Kitani, Kris; Asakawa, Chieko (September 2019, ACM Transactions on Accessible Computing)

NavCog3 is a smartphone turn-by-turn navigation assistant system that we designed specifically to enable independent navigation for people with visual impairments. Using off-the-shelf Bluetooth beacons installed in the surrounding environment and a commodity smartphone carried by the user, NavCog3 achieves unparalleled localization accuracy in real-world large-scale scenarios. By leveraging its accurate localization capabilities, NavCog3 guides the user through the environment and signals the presence of semantic features and points of interest in the vicinity (e.g., doorways, shops). To assess the capability of NavCog3 to promote independent mobility of individuals with visual impairments, we deployed and evaluated the system in two challenging real-world scenarios. The first scenario demonstrated the scalability of the system, which was permanently installed in a five-story shopping mall spanning three buildings and a public underground area. During the study, 10 participants traversed three fixed routes, and 43 participants traversed free-choice routes across the environment. The second scenario validated the system’s usability in the wild in a hotel complex temporarily equipped with NavCog3 during a conference for individuals with visual impairments. In the hotel, almost 14.2h of system usage data were collected from 37 unique users who performed 280 travels across the environment, for a total of 30,200m
Full Text Available
HARMONIC: A multimodal dataset of assistive human–robot collaboration

https://doi.org/10.1177/02783649211050677

Newman, Benjamin A.; Aronson, Reuben M.; Srinivasa, Siddhartha S.; Kitani, Kris; Admoni, Henny (December 2021, The International Journal of Robotics Research)

We present the Human And Robot Multimodal Observations of Natural Interactive Collaboration (HARMONIC) dataset. This is a large multimodal dataset of human interactions with a robotic arm in a shared autonomy setting designed to imitate assistive eating. The dataset provides human, robot, and environmental data views of 24 different people engaged in an assistive eating task with a 6-degree-of-freedom (6-DOF) robot arm. From each participant, we recorded video of both eyes, egocentric video from a head-mounted camera, joystick commands, electromyography from the forearm used to operate the joystick, third-person stereo video, and the joint positions of the 6-DOF robot arm. Also included are several features that come as a direct result of these recordings, such as eye gaze projected onto the egocentric video, body pose, hand pose, and facial keypoints. These data streams were collected specifically because they have been shown to be closely related to human mental states and intention. This dataset could be of interest to researchers studying intention prediction, human mental state modeling, and shared autonomy. Data streams are provided in a variety of formats such as video and human-readable CSV and YAML files.
Variability in Reactions to Instructional Guidance during Smartphone-Based Assisted Navigation of Blind Users

https://doi.org/10.1145/3264941

Ohn-Bar, Eshed; Guerreiro, João; Kitani, Kris; Asakawa, Chieko (September 2018, Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies)

Full Text Available
