Creators/Authors contains: "Kitani, Kris"

  1. As assistive robotics has expanded to many task domains, comparing assistive strategies among the varieties of research becomes increasingly difficult. To begin to unify the disparate domains into a more general theory of assistance, we present a definition of assistance, a survey of existing work, and three key design axes that occur in many domains and benefit from the examination of assistance as a whole. We first define an assistance perspective that focuses on understanding a robot that is in control of its actions but subordinate to a user’s goals. Next, we use this perspective to explore design axes that arise from the problem of assistance more generally and explore how these axes have comparable trade-offs across many domains. We investigate how the assistive robot handles other people in the interaction, how the robot design can operate in a variety of action spaces to enact similar goals, and how assistive robots can vary the timing of their actions relative to the user’s behavior. While these axes are by no means comprehensive, we propose them as useful tools for unifying assistance research across domains and as examples of how taking a broader perspective on assistance enables more cross-domain theorizing about assistance.
  2. We present the Human And Robot Multimodal Observations of Natural Interactive Collaboration (HARMONIC) dataset. This is a large multimodal dataset of human interactions with a robotic arm in a shared autonomy setting designed to imitate assistive eating. The dataset provides human, robot, and environmental data views of 24 different people engaged in an assistive eating task with a 6-degree-of-freedom (6-DOF) robot arm. From each participant, we recorded video of both eyes, egocentric video from a head-mounted camera, joystick commands, electromyography from the forearm used to operate the joystick, third-person stereo video, and the joint positions of the 6-DOF robot arm. Also included are several features that come as a direct result of these recordings, such as eye gaze projected onto the egocentric video, body pose, hand pose, and facial keypoints. These data streams were collected specifically because they have been shown to be closely related to human mental states and intention. This dataset could be of interest to researchers studying intention prediction, human mental state modeling, and shared autonomy. Data streams are provided in a variety of formats such as video and human-readable CSV and YAML files.

  3. Toward enabling next-generation robots capable of socially intelligent interaction with humans, we present a computational model of interactions in a social environment of multiple agents and multiple groups. The Multiagent Group Perception and Interaction (MGpi) network is a deep neural network that predicts the appropriate social action to execute in a group conversation (e.g., speak, listen, respond, leave), taking into account neighbors' observable features (e.g., location of people, gaze orientation, distraction). A central component of MGpi is the Kinesic-Proxemic-Message (KPM) gate, which performs social signal gating to extract important information from a group conversation. In particular, the KPM gate filters incoming social cues from nearby agents by observing their body gestures (kinesics) and spatial behavior (proxemics). The MGpi network and its KPM gate are learned via imitation learning, using demonstrations from our designed social interaction simulator. Further, we demonstrate the efficacy of the KPM gate as a social attention mechanism, achieving state-of-the-art performance on the task of group identification without using explicit group annotations, layout assumptions, or manually chosen parameters.
  4. NavCog3 is a smartphone-based turn-by-turn navigation assistant that we developed to enable independent navigation for people with visual impairments. Using off-the-shelf Bluetooth beacons installed in the surrounding environment and a commodity smartphone carried by the user, NavCog3 achieves unparalleled localization accuracy in real-world large-scale scenarios. By leveraging its accurate localization capabilities, NavCog3 guides the user through the environment and signals the presence of semantic features and points of interest in the vicinity (e.g., doorways, shops). To assess the capability of NavCog3 to promote independent mobility of individuals with visual impairments, we deployed and evaluated the system in two challenging real-world scenarios. The first scenario demonstrated the scalability of the system, which was permanently installed in a five-story shopping mall spanning three buildings and a public underground area. During the study, 10 participants traversed three fixed routes, and 43 participants traversed free-choice routes across the environment. The second scenario validated the system’s usability in the wild in a hotel complex temporarily equipped with NavCog3 during a conference for individuals with visual impairments. In the hotel, almost 14.2 h of system usage data were collected from 37 unique users who performed 280 travels across the environment, for a total of 30,200m
  5. We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without relying on explicit geometric depth estimates or velocity information to predict future collisions. While previous work has focused on detecting immediate collision in the context of navigating Unmanned Aerial Vehicles, the detection was limited to a binary variable (i.e., collision or no collision). We propose a more fine-grained approach to collision forecasting by predicting the exact time to collision in terms of milliseconds, which is more helpful for collision avoidance in the context of dynamic path planning. To evaluate our method, we have collected a novel large-scale dataset of over 13,000 indoor video segments, each showing a trajectory of at least one person ending in close proximity (a near collision) with the camera mounted on a mobile suitcase-shaped platform. Using this dataset, we experiment extensively with different temporal windows as input using an exhaustive list of state-of-the-art convolutional neural networks (CNNs). Our results show that our proposed multi-stream CNN is the best model for predicting time to near-collision. The average prediction error of our time to near collision is 0.75 seconds across our test environments.
  6. The use of random perturbations of ground truth data, such as random translation or scaling of bounding boxes, is a common heuristic used for data augmentation that has been shown to prevent overfitting and improve generalization. Since the design of data augmentation is largely guided by reported best practices, it is difficult to understand if those design choices are optimal. To provide a more principled perspective, we develop a game-theoretic interpretation of data augmentation in the context of object detection. We aim to find optimal adversarial perturbations of the ground truth data (i.e., the worst-case perturbations) that force the object bounding box predictor to learn from the hardest distribution of perturbed examples for better test-time performance. We establish that the game-theoretic solution (Nash equilibrium) provides both an optimal predictor and an optimal data augmentation distribution. We show that our adversarial method of training a predictor can significantly improve test-time performance for the task of object detection. On the ImageNet, Pascal VOC, and MS-COCO object detection tasks, our adversarial approach improves performance by about 16%, 5%, and 2%, respectively, compared to the best performing data augmentation methods.
  7. We study self-supervised adaptation of a robot's policy for social interaction, i.e., a policy for active communication with surrounding pedestrians through audio or visual signals. Inspired by the observation that humans continually adapt their behavior when interacting under varying social context, we propose Adaptive EXP4 (A-EXP4), a novel online learning algorithm for adapting the robot-pedestrian interaction policy. To address limitations of bandit algorithms in adaptation to unseen and highly dynamic scenarios, we employ a mixture model over the policy parameter space. Specifically, a Dirichlet Process Gaussian Mixture Model (DPMM) is used to cluster the parameters of sampled policies and maintain a mixture model over the clusters, hence effectively discovering policies that are suitable to the current environmental context in an unsupervised manner. Our simulated and real-world experiments demonstrate the feasibility of A-EXP4 in accommodating interaction with different types of pedestrians while jointly minimizing social disruption through the adaptation process. While the A-EXP4 formulation is kept general for application in a variety of domains requiring continual adaptation of a robot's policy, we specifically evaluate the performance of our algorithm using a suitcase-inspired assistive robotic platform. In this concrete assistive scenario, the algorithm observes how audio signals produced by the navigational system affect the behavior of pedestrians and adapts accordingly. Consequently, we find A-EXP4 to effectively adapt the interaction policy for gently clearing a navigation path in crowded settings, resulting in a significant reduction in empirical regret compared to the EXP4 baseline.
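
For readers unfamiliar with the EXP4 baseline named in item 7, the following is a minimal sketch of the textbook EXP4 bandit update (expert advice mixed into an action distribution, then exponential reweighting with an importance-weighted reward). It is not the A-EXP4 algorithm and not the authors' implementation: the function names, the two-expert toy setup, the reward rule, and the value of `gamma` are all illustrative assumptions.

```python
import math
import random


def exp4_probabilities(weights, advice, gamma):
    """Mix expert advice into one distribution over actions.

    weights: one positive weight per expert.
    advice:  advice[i][a] is expert i's probability for action a.
    gamma:   exploration rate in (0, 1].
    """
    n_actions = len(advice[0])
    total = sum(weights)
    probs = []
    for a in range(n_actions):
        mixed = sum(w * adv[a] for w, adv in zip(weights, advice)) / total
        # Blend the weighted advice with uniform exploration.
        probs.append((1.0 - gamma) * mixed + gamma / n_actions)
    return probs


def exp4_update(weights, advice, probs, action, reward, gamma):
    """Return new expert weights after observing `reward` in [0, 1]."""
    n_actions = len(probs)
    # Importance-weighted estimate; it is zero for all unplayed actions.
    est = reward / probs[action]
    return [
        w * math.exp(gamma * adv[action] * est / n_actions)
        for w, adv in zip(weights, advice)
    ]


def simulate(rounds=200, gamma=0.1, seed=0):
    """Toy run (assumed setup): expert 0 always advises the rewarded action."""
    rng = random.Random(seed)
    advice = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical fixed experts
    weights = [1.0, 1.0]
    for _ in range(rounds):
        probs = exp4_probabilities(weights, advice, gamma)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1.0 if action == 0 else 0.0  # hypothetical reward rule
        weights = exp4_update(weights, advice, probs, action, reward, gamma)
    return weights


if __name__ == "__main__":
    print(simulate())
```

In this toy run the weight of the expert that recommends the rewarded action grows while the other stays fixed, so the mixed distribution shifts toward the rewarded action. According to the abstract, A-EXP4 additionally maintains a mixture model over policy parameters, which this sketch does not attempt to reproduce.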