Title: How Much Do Prompting Methods Help LLMs on Quantitative Reasoning with Irrelevant Information?
Real-world quantitative reasoning problems are complex, often including extra information irrelevant to the question (“IR noise” for short). State-of-the-art (SOTA) prompting methods have improved Large Language Models’ (LLMs’) quantitative reasoning on grade-school Math Word Problems (MWPs). To assess how well these SOTA methods handle IR noise, we constructed four new datasets, each consisting of 300 problems drawn from one of the four public datasets MAWPS, ASDiv, SVAMP, and GSM8K, with IR noise added. We call this collection “MPN” (Math Word Problems with IR Noise) and use it to evaluate SOTA prompting methods. We also propose Noise Reduction Prompting (NRP) and its variant (NRP+) to reduce the impact of IR noise. Findings: our IR noise significantly degrades the performance of Chain-of-Thought (CoT) prompting on three different backend models: ChatGPT (gpt-3.5-turbo-0613), PaLM2, and Llama3-8B-instruct. Among them, ChatGPT achieves the best accuracy on MPN both with and without IR noise. With IR noise, the performance of CoT, Least-To-Most Prompting, Progressive-Hint Prompting, and Program-aided Language Models with ChatGPT is significantly degraded, each with an average accuracy drop of more than 12%. NRP is the least affected by the noise, with an average accuracy drop of only around 1.9%, and NRP+ and NRP perform comparably in the presence of IR noise.
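To make the idea concrete, below is a minimal Python sketch of one way a noise-reduction prompting pipeline could be realized: first ask the model to strip likely-irrelevant sentences, then apply CoT to the cleaned problem. The prompt wording and the `call_llm` callable are illustrative assumptions, not the paper's exact NRP prompts or interface.

```python
# A minimal "filter irrelevant information, then solve" sketch in the spirit of
# noise-reduction prompting. Prompt text and `call_llm` are assumptions.

FILTER_PROMPT = (
    "The following math word problem may contain sentences that are irrelevant "
    "to the question. Rewrite the problem, keeping only the information needed "
    "to answer it.\n\nProblem: {problem}\n\nCleaned problem:"
)

SOLVE_PROMPT = (
    "Solve the following math word problem. Let's think step by step, then give "
    "the final numeric answer on the last line.\n\nProblem: {problem}"
)

def noise_reduction_prompting(problem: str, call_llm) -> str:
    """Two LLM calls: (1) strip likely-irrelevant sentences, (2) solve with CoT."""
    cleaned = call_llm(FILTER_PROMPT.format(problem=problem))
    return call_llm(SOLVE_PROMPT.format(problem=cleaned))
```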
Award ID(s):
2152117
PAR ID:
10611681
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400704369
Page Range / eLocation ID:
2128 to 2137
Subject(s) / Keyword(s):
Prompt Engineering, Large Language Model, Trustworthy Information Extraction, Quantitative Reasoning, Math Word Problem
Format(s):
Medium: X
Location:
Boise ID USA
Sponsoring Org:
National Science Foundation
More Like this
  1. While Chain-of-Thought (CoT) prompting boosts Language Models’ (LMs) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (i.e., faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver, respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.
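As a concrete illustration of the two stages, the sketch below translates a question into executable Python (serving as the symbolic reasoning chain) and then runs it deterministically. The prompt text and the `call_llm` callable are assumptions for illustration; the framework pairs the LM with different deterministic solvers depending on the domain.

```python
# A minimal translate-then-solve sketch of the Faithful CoT pattern, using
# executable Python as the symbolic reasoning chain. Prompt and `call_llm`
# are illustrative assumptions, not the paper's exact setup.

TRANSLATE_PROMPT = (
    "Translate the question into a Python function solve() that returns the "
    "answer. Use comments to tie each step back to the question.\n\n"
    "Question: {question}\n\nPython:"
)

def faithful_cot(question: str, call_llm):
    code = call_llm(TRANSLATE_PROMPT.format(question=question))  # stage 1: Translation
    namespace = {}
    exec(code, namespace)        # stage 2: deterministic Problem Solving
    return namespace["solve"]()  # the executed chain is the explanation of the answer
```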
  2. Language models are achieving impressive performance on various tasks by aggressively adopting inference-time prompting techniques, such as zero-shot and few-shot prompting. In this work, we introduce EchoPrompt, a simple yet effective approach that prompts the model to rephrase its queries before answering them. EchoPrompt is tailored for four scenarios, including standard and chain-of-thought prompting, in both zero-shot and few-shot settings. Experimental results show that EchoPrompt yields substantial improvements across all these settings for four families of causal language models. These improvements are observed across various numerical reasoning (e.g., GSM8K, SVAMP), reading comprehension (e.g., DROP), and logical reasoning (e.g., coin flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. Our empirical results indicate that EchoPrompt is an effective technique that enhances in-context learning performance.
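A minimal zero-shot sketch of the rephrase-then-answer idea is shown below; the instruction wording and the `call_llm` callable are assumptions for illustration rather than the paper's exact prompt.

```python
# A minimal sketch of EchoPrompt's "rephrase, then answer" idea in a zero-shot
# CoT setting. The trailing instruction is an assumed paraphrase, not
# necessarily the exact suffix used in the paper.

def echo_prompt(question: str, call_llm) -> str:
    prompt = (
        f"Q: {question}\n"
        "A: Let's repeat the question and also think step by step."
    )
    return call_llm(prompt)
```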
  3. The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often faces challenges due to inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address this issue, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompass execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a spectrum of machine configurations, the study implemented a consistent seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. However, contrary to our expectations, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. In conclusion, this study provides empirical evidence suggesting that the deployment of high-CPU machine configurations, in synergy with the utilization of the GPT-4o-Mini model and CoT prompting techniques, yields demonstrably more efficient and accurate LLM-generated Python code, particularly within computationally intensive application scenarios. 
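For a concrete sense of the metrics involved, the sketch below measures wall-clock time and peak memory of a generated Python snippet using only the standard library; it is a simplified stand-in under stated assumptions, not the EffiBench harness or the study's exact metric definitions.

```python
# Profile an LLM-generated Python snippet: wall-clock time via time.perf_counter
# and peak allocated memory via tracemalloc. Simplified illustration only.
import time
import tracemalloc

def profile_snippet(code: str) -> dict:
    """Run `code` in a fresh namespace and report elapsed time and peak memory."""
    namespace = {}
    tracemalloc.start()
    start = time.perf_counter()
    exec(code, namespace)                       # run the generated snippet
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()   # bytes allocated at the peak
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_bytes": peak}

# Example: profile a trivial generated snippet.
print(profile_snippet("total = sum(i * i for i in range(10_000))"))
```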
  4. We present a conversational AI tutor (CAIT) for the purpose of aiding students on middle school math problems. CAIT was created using the CLASS framework: it is an LLM obtained by fine-tuning Vicuna on a conversational dataset generated by prompting ChatGPT with problems and explanations from ASSISTments. CAIT is trained to generate scaffolding questions, provide hints, and correct mistakes on math problems. We find that CAIT identifies 60% of correct answers as correct, generates effective sub-problems 33% of the time, and has a positive sentiment 72% of the time, with the remaining 28% of interactions being neutral. This paper discusses the hurdles to further integration of CAIT into ASSISTments, namely the need for improved accuracy and more effective sub-problems, and establishes CAIT as a proof of concept that the CLASS framework can be applied to create an effective mathematics tutorbot.
  5. Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output, often referred to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT explanations as the LLM’s process for solving a task. This level of transparency into LLMs’ predictions would yield significant safety benefits. However, we find that CoT explanations can systematically misrepresent the true reason for a model’s prediction. We demonstrate that CoT explanations can be heavily influenced by adding biasing features to model inputs—e.g., by reordering the multiple-choice options in a few-shot prompt to make the answer always “(A)”—which models systematically fail to mention in their explanations. When we bias models toward incorrect answers, they frequently generate CoT explanations rationalizing those answers. This causes accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task, model explanations justify giving answers in line with stereotypes without mentioning the influence of these social biases. Our findings indicate that CoT explanations can be plausible yet misleading, which risks increasing our trust in LLMs without guaranteeing their safety. Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods. 
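The kind of biasing perturbation described in item 5 can be illustrated with a short sketch that reorders each few-shot exemplar's answer options so the correct one always appears as "(A)"; the data layout here is an assumption for illustration, not the paper's exact prompt-construction code.

```python
# Reorder a multiple-choice exemplar so the correct option is always labeled (A),
# the biasing feature discussed above. Data structures are illustrative.

def bias_to_option_a(options: list[str], correct_index: int) -> list[str]:
    """Move the correct option to position 0 (label '(A)'), keep the rest in order."""
    return [options[correct_index]] + [
        opt for i, opt in enumerate(options) if i != correct_index
    ]

def format_exemplar(question: str, options: list[str], correct_index: int) -> str:
    opts = bias_to_option_a(options, correct_index)
    lines = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(opts)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer: (A)"

print(format_exemplar("Which number is prime?", ["9", "7", "15"], correct_index=1))
```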