Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition and multiplication, as well as elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and that simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and parameter scaling. Additionally, we discuss the challenges associated with length generalization. Our work highlights the importance of high-quality, instructive data that accounts for the particular characteristics of the next-word prediction loss for rapidly eliciting arithmetic capabilities.
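The abstract contrasts conventional training samples with reformatted and chain-of-thought style ones but does not show the formats themselves. The sketch below is an illustrative guess at what such samples could look like for addition (the exact templates are not given in the abstract): a plain format, a reversed-answer format that emits the sum least-significant digit first (matching the order in which carries are computed), and a chain-of-thought format spelling out one digit-and-carry step per line.

```python
def plain_sample(a: int, b: int) -> str:
    # Conventional format: "a+b=c" with the answer most-significant digit first.
    return f"{a}+{b}={a + b}"

def reversed_sample(a: int, b: int) -> str:
    # Reversed-answer format: the sum is written least-significant digit first,
    # so each output digit depends only on digits already seen plus the carry.
    return f"{a}+{b}={str(a + b)[::-1]}"

def cot_sample(a: int, b: int) -> str:
    # Chain-of-thought format: one digit-and-carry step per line.
    # (Hypothetical layout; the paper's actual template may differ.)
    da, db = str(a)[::-1], str(b)[::-1]
    carry, steps, digits = 0, [], []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        digits.append(s % 10)
        steps.append(f"{x}+{y}+{carry}={s}, digit {s % 10}, carry {s // 10}")
        carry = s // 10
    if carry:
        digits.append(carry)
    answer = "".join(map(str, reversed(digits)))
    return f"{a}+{b}:\n" + "\n".join(steps) + f"\nanswer {answer}"
```

All three functions describe the same pair of operands; only the surface form of the target string changes, which is the kind of formatting intervention the abstract credits with the accuracy gains.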
This content will become publicly available on May 1, 2025
Backtracking Mathematical Reasoning of Language Models to the Pretraining Data
In-context learning and chain-of-thought prompting have demonstrated surprising performance improvements on mathematical reasoning benchmarks. Therefore, understanding the underlying factors enabling these capabilities is crucial. However, the specific aspects of pretraining data that equip models with mathematical reasoning capabilities remain largely unexplored and have not been studied systematically. In this study, we identify subsets of model pretraining data that contribute to the model's math reasoning ability, and evaluate their effect on several mathematical operations (e.g., addition, multiplication) and tasks (e.g., the ASDiv dataset). We measure the importance of such subsets by continually training the model on pretraining data subsets, and then quantifying the change in performance on the mathematical benchmark. If a subset results in improved performance, we conjecture that it contributes to the model's overall mathematical ability. Our results unveil that while training on math-only data contributes to simple arithmetic abilities, it does not solely explain performance on more complex reasoning abilities like chain-of-thought reasoning. We also find that code data contributes to chain-of-thought reasoning while reducing arithmetic performance.
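The subset-importance procedure the abstract describes can be summarized in a few lines of bookkeeping. In this sketch, `continual_train` and `evaluate` are hypothetical placeholders standing in for the paper's actual continual-training and benchmark-evaluation pipeline; only the scoring logic is shown.

```python
def subset_importance(model, subsets, continual_train, evaluate):
    """Score each pretraining-data subset by the benchmark change it induces.

    A positive score means continuing training on that subset improved the
    mathematical benchmark relative to the untouched model.
    """
    baseline = evaluate(model)
    scores = {}
    for name, data in subsets.items():
        tuned = continual_train(model, data)   # continue training on this subset
        scores[name] = evaluate(tuned) - baseline
    return scores
```

Under this scoring, the abstract's findings would correspond to math-only subsets receiving positive scores on arithmetic tasks, and code subsets receiving positive scores on chain-of-thought benchmarks but negative ones on arithmetic.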
 Award ID(s):
 2046873
 NSF-PAR ID:
 10526345
 Publisher / Repository:
 Tiny Papers at the International Conference on Learning Representations (ICLR) and NeurIPS ATTRIB Workshop
 Date Published:
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Pretrained Language Models (LMs) have demonstrated the ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% most frequent terms in comparison to the bottom 10%. Overall, although LMs appear successful at few-shot numerical reasoning, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.
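The headline comparison in this abstract, accuracy on the top 10% most frequent terms versus the bottom 10%, reduces to a simple decile split. The sketch below is an illustrative reconstruction (the function name and interface are not from the paper): given each test instance's term frequency in the pretraining corpus and a 0/1 correctness flag, it reports the accuracy gap between the most- and least-frequent deciles.

```python
def decile_accuracy_gap(freqs, correct):
    """Accuracy on the top-10%-frequency instances minus the bottom 10%.

    freqs   -- pretraining-corpus frequency of each test instance's terms
    correct -- 1 if the model answered that instance correctly, else 0
    """
    assert len(freqs) == len(correct) and freqs
    order = sorted(range(len(freqs)), key=lambda i: freqs[i])
    k = max(1, len(freqs) // 10)      # decile size (at least one instance)
    bottom, top = order[:k], order[-k:]
    acc = lambda idx: sum(correct[i] for i in idx) / len(idx)
    return acc(top) - acc(bottom)
```

A gap near 0 would indicate frequency-independent reasoning; the abstract reports gaps above 0.7 (absolute) in some settings.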

Prior work has combined chain-of-thought prompting in large language models (LLMs) with programmatic representations to perform effective and transparent reasoning. While such an approach works well for tasks that only require forward reasoning (e.g., straightforward arithmetic), it is less effective for constraint solving problems that require more sophisticated planning and search. In this paper, we propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of LLMs. We use an LLM to generate a declarative task specification rather than an imperative program, and leverage an off-the-shelf automated theorem prover to derive the final answer. This approach has two key advantages. The declarative specification is closer to the problem description than the reasoning steps are, so the LLM can parse it out of the description more accurately. Furthermore, by offloading the actual reasoning task to an automated theorem prover, our approach can guarantee the correctness of the answer with respect to the parsed specification and avoid planning errors in the solving process. We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm. In particular, SatLM outperforms program-aided LMs by 23% on a challenging subset of the GSM arithmetic reasoning dataset; SatLM also achieves a new SoTA on LSAT and BoardgameQA, surpassing previous models that are trained on the respective training sets.
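The declarative-versus-imperative distinction can be made concrete with a toy word problem, e.g. "Alice has 3 more apples than Bob; together they have 11." SatLM hands such constraints to an automated theorem prover; as a dependency-free stand-in for a real solver, this sketch (not the paper's actual pipeline) enumerates a finite domain. The point is that the specification lists facts parsed from the problem, and the solving is offloaded entirely.

```python
def solve(constraints, domain):
    """Return the first (alice, bob) pair in `domain` satisfying every constraint."""
    for alice in domain:
        for bob in domain:
            if all(c(alice, bob) for c in constraints):
                return alice, bob
    return None

# Declarative specification: relations parsed from the problem statement,
# not a step-by-step solution plan.
constraints = [
    lambda a, b: a == b + 3,   # Alice has 3 more apples than Bob
    lambda a, b: a + b == 11,  # together they have 11
]

print(solve(constraints, range(12)))  # (7, 4)
```

An imperative (program-aided) solution would instead commit to an ordering of arithmetic steps, e.g. `bob = (11 - 3) // 2; alice = bob + 3`, which is where the planning errors the abstract mentions can creep in.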

This study characterises a previously unstudied facet of a major causal model of math anxiety. The model posits that impaired "basic number abilities" can lead to math anxiety, but what constitutes a basic number ability remains underdefined. Previous work has raised the idea that our perceptual ability to represent quantities approximately without using symbols constitutes one of the basic number abilities. Indeed, several recent studies tested how participants with math anxiety estimate and compare non-symbolic quantities. However, little is known about how participants with math anxiety perform arithmetic operations (addition and subtraction) on non-symbolic quantities. This is an important question because poor arithmetic performance on symbolic numbers is one of the primary signatures of high math anxiety. To test the question, we recruited 92 participants and asked them to complete a math anxiety survey, two measures of working memory, a timed symbolic arithmetic test, and a non-symbolic "approximate arithmetic" task. We hypothesised that if impaired ability to perform operations was a potential causal factor to math anxiety, we should see relationships between math anxiety and both symbolic and approximate arithmetic. However, if math anxiety relates to precise or symbolic representation, only a relationship between math anxiety and symbolic arithmetic should appear. Our results show no relationship between math anxiety and the ability to perform operations with approximate quantities, suggesting that difficulties performing perceptually based arithmetic operations do not constitute a basic number ability linked to math anxiety.

Despite the significant advancements in the field of Natural Language Processing (NLP), Large Language Models (LLMs) have shown limitations in performing complex tasks that require arithmetic, commonsense, and symbolic reasoning. Reasoning frameworks like ReAct, Chain-of-thought (CoT), Tree-of-thoughts (ToT), etc. have shown success but with limitations in solving long-form complex tasks. To address this, we propose a knowledge-sharing and collaborative multi-agent assisted framework on LLMs that leverages the capabilities of existing reasoning frameworks and the collaborative skills of multi-agent systems (MASs). The objectives of the proposed framework are to overcome the limitations of LLMs, enhance their reasoning capabilities, and improve their performance in complex tasks. It involves generating natural language rationales and in-context few-shot learning via prompting, and integrates the reasoning techniques with efficient knowledge-sharing and communication-driven agent networks. The potential benefits of the proposed framework include saving time and money, improved efficiency for computationally intensive reasoning, and the ability to incorporate multiple collaboration strategies for dynamically changing environments.