-
We study the problem of Open-Vocabulary Constructs (OVCs), i.e., constructs not known beforehand, in the context of converting natural language (NL) specifications into formal languages (e.g., temporal logic or code). Models fare poorly on OVCs due to a lack of necessary knowledge a priori. In such situations, a domain expert can provide correct constructs at inference time based on their preferences or domain knowledge. Our goal is to effectively reuse this inference-time, expert-provided knowledge for future parses without retraining the model. We present dynamic knowledge-augmented parsing (DKAP), where, in addition to the input sentence, the model receives (dynamically growing) expert knowledge as a key-value lexicon that associates NL phrases with correct OVC constructs. We propose ROLEX, a retrieval-augmented parsing approach that uses this lexicon. A retriever and a generator are trained to find and use the key-value store to produce the correct parse. A key challenge lies in curating data for this retrieval-augmented parser. We utilize synthetic data generation and data augmentation techniques on annotated (NL sentence, FL statement) pairs to train the augmented parser. To improve training effectiveness, we propose multiple strategies to teach models to focus on the relevant subset of retrieved knowledge. Finally, we introduce a new evaluation paradigm modeled after the DKAP problem and simulate the scenario across three formalization tasks (NL2LTL, NL2Code, and NL2CMD). Our evaluations show that DKAP is a difficult challenge and that ROLEX helps improve the performance of baseline models by using dynamic expert knowledge effectively.
Free, publicly-accessible full text available October 27, 2025
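To make the DKAP setup concrete, the sketch below pairs an input sentence with a dynamically growing lexicon of (NL phrase, formal construct) entries, retrieves the entries most relevant to the sentence, and hands them to a generator. This is only a hypothetical illustration: the token-overlap retriever, the placeholder generator, and all example entries are assumptions, not the actual ROLEX system.

    # Minimal sketch of dynamic knowledge-augmented parsing (illustrative only).
    # The expert lexicon maps NL phrases to formal constructs and can grow at
    # inference time; `generate_parse` stands in for any seq2seq model.
    from typing import Dict, List, Tuple

    def retrieve(sentence: str, lexicon: Dict[str, str], k: int = 3) -> List[Tuple[str, str]]:
        """Rank lexicon entries by naive token overlap with the input sentence."""
        words = set(sentence.lower().split())
        scored = [
            (len(words & set(phrase.lower().split())), phrase, construct)
            for phrase, construct in lexicon.items()
        ]
        scored.sort(reverse=True)
        return [(p, c) for s, p, c in scored[:k] if s > 0]

    def generate_parse(sentence: str, hints: List[Tuple[str, str]]) -> str:
        """Placeholder for a retrieval-augmented generator (e.g., an LLM call)."""
        hint_block = "; ".join(f"{p} => {c}" for p, c in hints)
        return f"<parse of {sentence!r} using hints: {hint_block}>"

    # Expert knowledge accumulated at inference time (made-up NL2LTL entries).
    lexicon = {"eventually reach the goal": "F goal",
               "never enter the error state": "G !error"}

    sentence = "The robot must eventually reach the goal region."
    print(generate_parse(sentence, retrieve(sentence, lexicon)))

Under this framing, adding a new expert correction is just another dictionary insert, which is what lets the parser reuse inference-time knowledge without retraining.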
-
Free, publicly-accessible full text available October 7, 2025
-
The events in a narrative are understood as a coherent whole via the underlying states of their participants. Often, these participant states are not explicitly mentioned but are instead left to be inferred by the reader. A model that understands narratives should likewise infer these implicit states, and even reason about the impact of changes to these states on the narrative. To facilitate this goal, we introduce PASTA, a new crowdsourced English-language Participant States dataset. The dataset contains inferable participant states; a counterfactual perturbation to each state; and the changes to the story that would be necessary if the counterfactual were true. We introduce three state-based reasoning tasks that test for the ability to infer when a state is entailed by a story, to revise a story conditioned on a counterfactual state, and to explain the most likely state change given a revised story. Experiments show that today's LLMs can reason about states to some degree, but there is large room for improvement, especially on problems requiring access to, and the ability to reason with, diverse types of knowledge (e.g., physical, numerical, and factual knowledge).
Free, publicly-accessible full text available March 5, 2025
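The record layout below is a hypothetical sketch of how a PASTA-style example and the three tasks fit together; the field names and the toy story are illustrative assumptions, not the dataset's actual schema.

    # Illustrative PASTA-style record (field names are assumptions).
    from dataclasses import dataclass

    @dataclass
    class PastaExample:
        story: str            # original narrative
        state: str            # implicit participant state inferable from the story
        counterfactual: str   # perturbed version of that state
        revised_story: str    # minimal rewrite consistent with the counterfactual

    ex = PastaExample(
        story="Maya grabbed her umbrella before stepping outside.",
        state="It is raining.",
        counterfactual="It is sunny.",
        revised_story="Maya grabbed her sunglasses before stepping outside.",
    )

    # The three tasks then map onto this record roughly as:
    #   1. entailment:  story -> is `state` inferable?
    #   2. revision:    (story, counterfactual) -> revised_story
    #   3. explanation: (story, revised_story) -> most likely state change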
-
Knowledge about outcomes is critical for complex event understanding but is hard to acquire. We show that by pre-identifying a participant in a complex event, crowdworkers are able to (1) infer the collective impact of salient events that make up the situation, (2) annotate the volitional engagement of participants in causing the situation, and (3) ground the outcome of the situation in state changes of the participants. By creating a multi-step interface and a careful quality-control strategy, we collect a high-quality annotated dataset of 8K short newswire narratives and ROCStories with high inter-annotator agreement (0.74-0.96 weighted Fleiss kappa). Our dataset, POQue (Participant Outcome Questions), enables the exploration and development of models that address multiple aspects of semantic understanding. Experimentally, we show that current language models lag behind human performance in subtle ways through our task formulations that target abstract and specific comprehension of a complex event, its outcome, and a participant's influence over the event culmination.
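For orientation on the agreement statistic reported above, here is a minimal sketch of the standard (unweighted) Fleiss kappa; the paper reports a weighted variant, which additionally credits partial agreement between categories, and the toy ratings below are an assumption for illustration.

    def fleiss_kappa(ratings):
        """ratings[i][j] = number of raters assigning item i to category j."""
        n_items = len(ratings)
        n_raters = sum(ratings[0])  # raters per item (assumed constant)
        # Per-item observed agreement P_i and overall category proportions p_j.
        p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in ratings]
        totals = [sum(col) for col in zip(*ratings)]
        p_j = [t / (n_items * n_raters) for t in totals]
        p_bar = sum(p_i) / n_items          # mean observed agreement
        p_e = sum(p * p for p in p_j)       # chance agreement
        return (p_bar - p_e) / (1 - p_e)

    # Toy example: 4 items, 3 raters, 2 categories -> kappa = 0.333.
    print(round(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]), 3))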
-
Can NLP assist in building formal models for verifying complex systems? We study this challenge in the context of parsing Network File System (NFS) specifications. We define a semantic-dependency problem over SpecIR, a representation language we introduce to model sentences appearing in NFS specification documents (RFCs) as semantic dependency structures, and present an annotated dataset of 1,198 sentences. We develop and evaluate semantic-dependency parsing systems for this problem. Evaluations show that even when using a state-of-the-art language model, there is significant room for improvement, with the best models achieving F1 scores of only 60.5 and 33.3 on the named-entity-recognition and dependency-link-prediction sub-tasks, respectively. We also release additional unlabeled data and other domain-related texts. Experiments show that these additional resources improve the F1 measure when used with simple domain-adaptation and transfer-learning approaches, suggesting fruitful directions for further research.
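The sketch below shows one plausible way the two sub-task scores could be computed: NER predictions as typed spans and dependency predictions as typed links, scored by F1 over exact set matches. The labels, link format, and matching rules are assumptions for illustration; SpecIR's actual evaluation may differ.

    # Set-based F1 over exact matches (a common scoring scheme; assumed here).
    def f1(pred: set, gold: set) -> float:
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)
        precision, recall = tp / len(pred), tp / len(gold)
        return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

    # Hypothetical annotations for one NFS-spec sentence.
    gold_entities = {("client", "Agent"), ("file handle", "Object")}
    pred_entities = {("client", "Agent"), ("handle", "Object")}
    gold_links = {("release", "file handle", "arg1")}
    pred_links = {("release", "file handle", "arg1")}

    print(f1(pred_entities, gold_entities))  # 0.5: one of two spans matched
    print(f1(pred_links, gold_links))        # 1.0: the link is recovered exactly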