skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on November 12, 2025

Title: CareCorpus+: Expanding and Augmenting Caregiver Strategy Data to Support Pediatric Rehabilitation
Caregiver strategy classification in pediatric rehabilitation contexts is strongly motivated by real-world clinical constraints but highly under-resourced and seldom studied in natural language processing settings. We introduce a large dataset of 4,037 caregiver strategies in this setting, a five-fold increase over the nearest contemporary dataset. These strategies are manually categorized into clinically established constructs with high agreement (đťś…=0.68-0.89). We also propose two techniques to further address identified data constraints. First, we manually supplement target task data with publicly relevant data from online child health forums. Next, we propose a novel data augmentation technique to generate synthetic caregiver strategies with high downstream task utility. Extensive experiments showcase the quality of our dataset. They also establish evidence that both the publicly available data and the synthetic strategies result in large performance gains, with relative F1 increases of 22.6% and 50.9%, respectively.  more » « less
Award ID(s):
2125411
PAR ID:
10636435
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
6912 to 6927
Format(s):
Medium: X
Location:
Miami, Florida, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. This work summarizes the results of the first Competition on Harvesting Raw Tables from Infographics (ICDAR 2019 CHART-Infographics). The complex process of automatic chart recognition is divided into multiple tasks for the purpose of this competition, including Chart Image Classification (Task 1), Text Detection and Recognition (Task 2), Text Role Classification (Task 3), Axis Analysis (Task 4), Legend Analysis (Task 5), Plot Element Detection and Classification (Task 6.a), Data Extraction (Task 6.b), and End-to-End Data Extraction (Task 7). We provided a large synthetic training set and evaluated submitted systems using newly proposed metrics on both synthetic charts and manually-annotated real charts taken from scientific literature. A total of 8 groups registered for the competition out of which 5 submitted results for tasks 1-5. The results show that some tasks can be performed highly accurately on synthetic data, but all systems did not perform as well on real world charts. The data, annotation tools, and evaluation scripts have been publicly released for academic use. 
    more » « less
  2. Abstract Customizing participation-focused pediatric rehabilitation interventions is an important but also complex and potentially resource intensive process, which may benefit from automated and simplified steps. This research aimed at applying natural language processing to develop and identify a best performing predictive model that classifies caregiver strategies into participation-related constructs, while filtering out non-strategies. We created a dataset including 1,576 caregiver strategies obtained from 236 families of children and youth (11–17 years) with craniofacial microsomia or other childhood-onset disabilities. These strategies were annotated to four participation-related constructs and a non-strategy class. We experimented with manually created features (i.e., speech and dependency tags, predefined likely sets of words, dense lexicon features (i.e., Unified Medical Language System (UMLS) concepts)) and three classical methods (i.e., logistic regression, naĂŻve Bayes, support vector machines (SVM)). We tested a series of binary and multinomial classification tasks applying 10-fold cross-validation on the training set (80%) to test the best performing model on the held-out test set (20%). SVM using term frequency-inverse document frequency (TF-IDF) was the best performing model for all four classification tasks, with accuracy ranging from 78.10 to 94.92% and a macro-averaged F1-score ranging from 0.58 to 0.83. Manually created features only increased model performance when filtering out non-strategies. Results suggest pipelined classification tasks (i.e., filtering out non-strategies; classification into intrinsic and extrinsic strategies; classification into participation-related constructs) for implementation into participation-focused pediatric rehabilitation interventions like Participation and Environment Measure Plus (PEM+) among caregivers who complete the Participation and Environment Measure for Children and Youth (PEM-CY). 
    more » « less
  3. The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki–Miyaura and Buchwald–Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions. 
    more » « less
  4. Answering complex questions that require making latent decisions is a challenging task, especially when limited supervision is available. Recent works leverage the capabilities of large language models (LMs) to perform complex question answering in a few-shot setting by demonstrating how to output intermediate rationalizations while solving the complex question in a single pass. We introduce “Successive Prompting” where, we iteratively break down a complex task into a simple task, solve it, and then repeat the process until we get the final solution. Successive prompting decouples the supervision for decomposing complex questions from the supervision for answering simple questions, allowing us to (1) have multiple opportunities to query in-context examples at each reasoning step (2) learn question decomposition separately from question answering, including using synthetic data, and (3) use bespoke (fine-tuned) components for reasoning steps where a large LM does not perform well. The intermediate supervision is typically manually written, which can be expensive to collect. We introduce a way to generate synthetic dataset which can be used to bootstrap model’s ability to decompose and answer intermediate questions. Our best model (with successive prompting) achieves an improvement in F1 of ~5% when compared with a state-of-the-art model with synthetic augmentations and few-shot version of the DROP dataset. 
    more » « less
  5. null (Ed.)
    Modelling persuasion strategies as predictors of task outcome has several real-world applications and has received considerable attention from the computational linguistics community. However, previous research has failed to account for the resisting strategies employed by an individual to foil such persuasion attempts. Grounded in prior literature in cognitive and social psychology, we propose a generalised framework for identifying resisting strategies in persuasive conversations. We instantiate our framework on two distinct datasets comprising persuasion and negotiation conversations. We also leverage a hierarchical sequence-labelling neural architecture to infer the aforementioned resisting strategies automatically. Our experiments reveal the asymmetry of power roles in non-collaborative goal-directed conversations and the benefits accrued from incorporating resisting strategies on the final conversation outcome. We also investigate the role of different resisting strategies on the conversation outcome and glean insights that corroborate with past findings. We also make the code and the dataset of this work publicly available at this https URL. 
    more » « less