Interest in physical therapy and individual exercises such as yoga/dance has increased alongside the well-being trend, and people globally enjoy such exercises at home/office via video streaming platforms. However, such exercises are hard to follow without expert guidance. Even if experts can help, it is almost impossible to give personalized feedback to every trainee remotely. Thus, automated pose correction systems are required more than ever, and we introduce a new captioning dataset named FixMyPose to address this need. We collect natural language descriptions of correcting a “current” pose to look like a “target” pose. To support a multilingual setup, we collect descriptions in both English and Hindi. The collected descriptions have interesting linguistic properties such as egocentric relations to the environment objects, analogous references, etc., requiring an understanding of spatial relations and commonsense knowledge about postures. Further, to avoid ML biases, we maintain a balance across characters with diverse demographics, who perform a variety of movements in several interior environments (e.g., homes, offices). From our FixMyPose dataset, we introduce two tasks: the pose-correctional-captioning task and its reverse, the target-pose-retrieval task. During the correctional-captioning task, models must generate the descriptions of how to move from the current to the target pose…
ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments
For embodied agents, navigation is an important ability but not an isolated goal. Agents are also expected to perform specific tasks after reaching the target location, such as picking up objects and assembling them into a particular arrangement. We combine Vision-and-Language Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ARRAMON. During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language (English) instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment. To support this task, we implement a 3D dynamic environment simulator and collect a dataset with human-written navigation and assembling instructions, and the corresponding ground truth trajectories. We also filter the collected instructions via a verification stage, leading to a total of 7.7K task instances (30.8K instructions and paths). We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.
- Journal Name: Findings of the Association for Computational Linguistics: EMNLP 2020
- Page Range or eLocation-ID: 3910 to 3927
- Sponsoring Org: National Science Foundation
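The ArraMon abstract above names nDTW as one of its path metrics without defining it. As a rough illustration only, the sketch below follows the commonly cited definition of normalized Dynamic Time Warping, exp(-DTW(R, Q) / (|R| * d_th)), for 2D paths. The function names, the 2D point representation, and the default threshold of 3.0 are assumptions made for this example; the abstract does not state which variant or threshold the authors used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def dtw(reference: Sequence[Point], query: Sequence[Point]) -> float:
    """Classic dynamic time warping cost between two paths (Euclidean distance)."""
    n, m = len(reference), len(query)
    # cost[i][j] = minimal cumulative cost of aligning the first i reference
    # points with the first j query points.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(reference[i - 1], query[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # advance along the reference only
                cost[i][j - 1],      # advance along the query only
                cost[i - 1][j - 1],  # advance along both
            )
    return cost[n][m]


def ndtw(reference: Sequence[Point], query: Sequence[Point],
         threshold: float = 3.0) -> float:
    """Normalized DTW in (0, 1]; 1.0 means the paths align perfectly.

    `threshold` (d_th) is an assumed success distance, not a value from the paper.
    """
    if not reference or not query:
        raise ValueError("both paths must contain at least one point")
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))
```

For example, `ndtw(path, path)` returns 1.0 for any non-empty path, and the score decays toward 0 as the predicted path drifts away from the ground truth trajectory.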
More Like this
Despite the phenomenal advances in the computational power and functionality of electronic systems, human-machine interaction has largely been limited to simple control panels, keyboard, mouse and display. Consequently, these systems either rely critically on close human guidance or operate almost independently from the user. An exemplar technology integrated tightly into our lives is the smartphone. However, the term “smart” is a misnomer, since it has fundamentally no intelligence to understand its user. The users still have to type, touch or speak (to some extent) to express their intentions in a form accessible to the phone. Hence, intelligent decision making is still almost entirely a human task. A life-changing experience can be achieved by transforming machines from passive tools to agents capable of understanding human physiology and what their user wants. This can advance human capabilities in unimagined ways by building a symbiotic relationship to solve real world problems cooperatively. One of the high-impact application areas of this approach is assistive internet of things (IoT) technologies for physically challenged individuals. The Annual World Report on Disability reveals that 15% of the world population lives with disability, while 110 to 190 million of these people have difficulty in functioning. Quality…
Many real-world tasks require agents to coordinate their behavior to achieve shared goals. Successful collaboration requires not only adopting the same communicative conventions, but also grounding these conventions in the same task-appropriate conceptual abstractions. We investigate how humans use natural language to collaboratively solve physical assembly problems more effectively over time. Human participants were paired up in an online environment to reconstruct scenes containing two block towers. One participant could see the target towers, and sent assembly instructions for the other participant to reconstruct. Participants provided increasingly concise instructions across repeated attempts on each pair of towers, using more abstract referring expressions that captured each scene's hierarchical structure. To explain these findings, we extend recent probabilistic models of ad hoc convention formation with an explicit perceptual learning mechanism. These results shed light on the inductive biases that enable intelligent agents to coordinate upon shared procedural abstractions.
Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Ed.)
The Temple University Hospital Seizure Detection Corpus (TUSZ) has been in distribution since April 2017. It is a subset of the TUH EEG Corpus (TUEG) and the most frequently requested corpus from our 3,000+ subscribers. It was recently featured as the challenge task in the Neureka 2020 Epilepsy Challenge. A summary of the development of the corpus is shown below in Table 1. The TUSZ Corpus is a fully annotated corpus, which means every seizure event that occurs within its files has been annotated. The data is selected from TUEG using a screening process that identifies files most likely to contain seizures. Approximately 7% of the TUEG data contains a seizure event, so it is important we triage TUEG for high yield data. One hour of EEG data requires approximately one hour of human labor to complete annotation using the pipeline described below, so it is important from a financial standpoint that we accurately triage data. A summary of the labels being used to annotate the data is shown in Table 2. Certain standards are put into place to optimize the annotation process while not sacrificing consistency. Due to the nature of EEG recordings, some records…
We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task. Significant progress has been made in recent years, especially for tasks with short horizons. However, when it comes to long-horizon tasks with extended sequences of actions, an agent can easily ignore some instructions or get stuck in the middle of the long instructions and eventually fail the task. To address this challenge, we propose a model-agnostic milestone-based task tracker (M-TRACK) to guide the agent and monitor its progress. Specifically, we propose a milestone builder that tags the instructions with navigation and interaction milestones which the agent needs to complete step by step, and a milestone checker that systematically checks the agent’s progress in its current milestone and determines when to proceed to the next. On the challenging ALFRED dataset, our M-TRACK leads to a notable 33% and 52% relative improvement in unseen success rate over two competitive base models.
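The builder-and-checker idea in the abstract above can be pictured with a small sketch. This is not the M-TRACK implementation: the `Milestone` and `MilestoneTracker` classes, the `build_milestones` helper, and the `near` / `holding` state keys are all hypothetical names invented for illustration, and the real milestone builder tags free-form instructions rather than taking pre-tagged pairs as input.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Dict[str, object]  # hypothetical snapshot of what the agent observes


@dataclass
class Milestone:
    kind: str                          # "navigation" or "interaction"
    target: str                        # object or location the milestone refers to
    is_done: Callable[[State], bool]   # completion test on the agent state


def build_milestones(tagged: Sequence[Tuple[str, str]]) -> List[Milestone]:
    """Toy 'builder': turn pre-tagged (kind, target) pairs into milestones."""
    milestones = []
    for kind, target in tagged:
        # Assumed convention: navigation is done when the agent is near the
        # target; interaction is done when the agent holds the target.
        key = "near" if kind == "navigation" else "holding"
        milestones.append(
            Milestone(kind, target, lambda s, k=key, t=target: s.get(k) == t)
        )
    return milestones


class MilestoneTracker:
    """Toy 'checker': tracks the current milestone and decides when to advance."""

    def __init__(self, milestones: Sequence[Milestone]) -> None:
        self.milestones = list(milestones)
        self.index = 0

    @property
    def current(self) -> Optional[Milestone]:
        if self.index < len(self.milestones):
            return self.milestones[self.index]
        return None

    @property
    def finished(self) -> bool:
        return self.index >= len(self.milestones)

    def update(self, state: State) -> bool:
        """Advance by at most one milestone; return True if it advanced."""
        milestone = self.current
        if milestone is not None and milestone.is_done(state):
            self.index += 1
            return True
        return False
```

A caller would build the tracker once per task, for example `MilestoneTracker(build_milestones([("navigation", "counter"), ("interaction", "mug")]))`, and call `update(state)` after each agent step so that only the current milestone is checked until it is satisfied.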