We study the task of conducting structured reasoning as generating a reasoning graph from natural language input using large language models (LLMs). Previous approaches have explored various prompting schemes, yet they suffer from error propagation due to the autoregressive nature and single-pass-based decoding, which lack error correction capability. Additionally, relying solely on a single sample may result in the omission of true nodes and edges. To counter this, we draw inspiration from self-consistency (SC), which involves sampling a diverse set of reasoning chains and taking the majority vote as the final answer. To tackle the substantial challenge of applying SC on generated graphs, we propose MIDGARD (MInimum Description length Guided Aggregation of Reasoning in Directed acyclic graph) that leverages Minimum Description Length (MDL)-based formulation to identify consistent properties among the different graph samples generated by an LLM. This formulation helps reject properties that appear in only a few samples, which are likely to be erroneous, while enabling the inclusion of missing elements without compromising precision. Our method demonstrates superior performance than comparisons across various structured reasoning tasks, including argument structure extraction, explanation graph generation, inferring dependency relations among actions for everyday tasks, and semantic graph generation from natural texts. 
                        more » 
                        « less   
                    This content will become publicly available on February 4, 2026
                            
                            LLM+AL: Bridging Large Language Models and Action Languages for Complex Reasoning about Actions
                        
                    
    
            Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we introduce a method that bridges the natural language understanding capability of LLMs with the symbolic reasoning capability of action languages---formal languages for reasoning about actions. Our approach, termed {\sf LLM+AL}, leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's expertise in automated reasoning based on encoded knowledge. We compare {\sf LLM+AL} against state-of-the-art LLMs, including {\sc ChatGPT-4}, {\sc Claude 3 Opus}, {\sc Gemini Ultra 1.0}, and {\sc o1-preview}, using benchmarks for complex reasoning about actions. Our findings indicate that while all methods exhibit various errors, {\sf LLM+AL}, with relatively simple human corrections, consistently leads to correct answers, whereas using LLMs alone does not yield improvements even after human intervention. {\sf LLM+AL} also contributes to automated generation of action languages. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2006747
- PAR ID:
- 10564711
- Publisher / Repository:
- AAAI Press
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples.more » « less
- 
            While large language models (LLMs) equipped with techniques like chain-of-thought prompting have demonstrated impressive capabilities, they still fall short in their ability to reason robustly in complex settings. However, evaluating LLM reasoning is challenging because system capabilities continue to grow while benchmark datasets for tasks like logical deduction have remained static. We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm, enabling the construction of complex reasoning instances that challenge GPT-4 (e.g., murder mysteries roughly 1000 words in length) and which can be scaled further as more capable LLMs are released. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning; this makes it simultaneously much more challenging than other synthetically-crafted benchmarks while remaining realistic and tractable for human annotators to solve with high accuracy. We evaluate a range of LLMs and prompting techniques on this dataset and characterize the gaps that remain for techniques like chain-of-thought to perform robust reasoning.more » « less
- 
            Mills, Caitlin; Alexandron, Giora; Taibi, Davide; Lo_Bosco, Giosuè; Paquette, Luc (Ed.)Short answer assessment is a vital component of science education, allowing evaluation of students' complex three-dimensional understanding. Large language models (LLMs) that possess human-like ability in linguistic tasks are increasingly popular in assisting human graders to reduce their workload. However, LLMs' limitations in domain knowledge restrict their understanding in task-specific requirements and hinder their ability to achieve satisfactory performance. Retrieval-augmented generation (RAG) emerges as a promising solution by enabling LLMs to access relevant domain-specific knowledge during assessment. In this work, we propose an adaptive RAG framework for automated grading that dynamically retrieves and incorporates domain-specific knowledge based on the question and student answer context. Our approach combines semantic search and curated educational sources to retrieve valuable reference materials. Experimental results in a science education dataset demonstrate that our system achieves an improvement in grading accuracy compared to baseline LLM approaches. The findings suggest that RAG-enhanced grading systems can serve as reliable support with efficient performance gains.more » « less
- 
            Large Language Models (LLMs) have achieved remarkable success in natural language tasks, yet understanding their reasoning processes re- mains a significant challenge. We address this by introducing XplainLLM, a dataset accom- panying an explanation framework designed to enhance LLM transparency and reliability. Our dataset comprises 24,204 instances where each instance interprets the LLM’s reasoning behavior using knowledge graphs (KGs) and graph attention networks (GAT), and includes explanations of LLMs such as the decoder- only Llama-3 and the encoder-only RoBERTa. XplainLLM also features a framework for gener- ating grounded explanations and the debugger- scores for multidimensional quality analysis. Our explanations include why-choose and why- not-choose components, reason-elements, and debugger-scores that collectively illuminate the LLM’s reasoning behavior. Our evaluations demonstrate XplainLLM’s potential to reduce hallucinations and improve grounded explana- tion generation in LLMs. XplainLLM is a re- source for researchers and practitioners to build trust and verify the reliability of LLM outputs. Our code and dataset are publicly available.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
