General mathematical reasoning is computationally undecidable, but humans
routinely solve new problems. Moreover, discoveries developed over centuries
are taught to subsequent generations quickly. What structure enables this, and
how might that inform automated mathematical reasoning? We posit that central
to both puzzles is the structure of procedural abstractions underlying
mathematics. We explore this idea in a case study on 5 sections of beginning
algebra on the Khan Academy platform. To define a computational foundation, we
introduce Peano, a theorem-proving environment where the set of valid actions
at any point is finite. We use Peano to formalize introductory algebra problems
and axioms, obtaining well-defined search problems. We observe existing
reinforcement learning methods for symbolic reasoning to be insufficient to
solve harder problems. Adding the ability to induce reusable abstractions
(""tactics"") from its own solutions allows an agent to make steady progress,
solving all problems. Furthermore, these abstractions induce an order to the
problems, seen at random during training. The recovered order has significant
agreement with the expert-designed Khan Academy curriculum, and
second-generation agents trained on the recovered curriculum learn
significantly faster. These results illustrate the synergistic role of
abstractions and curricula in the cultural transmission of mathematics.
more »
« less
This content will become publicly available on December 7, 2025
PUTNAMBENCH: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition
We present PUTNAMBENCH, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems. PUTNAMBENCH consists of 1697 hand-constructed formalizations of 640 theorems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. All the theorems have formalizations in Lean 4 and Isabelle; a substantial subset also has Coq formalizations. Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PUTNAMBENCH to evaluate several established neural and symbolic theorem-provers. These approaches can only solve a handful of the PUTNAMBENCH problems, establishing the benchmark as a difficult open challenge for research on neural theorem-proving. PUTNAMBENCH is available at https://github.com/trishullab/PutnamBench.
more »
« less
- Award ID(s):
- 2212559
- PAR ID:
- 10549551
- Publisher / Repository:
- Neural Information Processing Systems (NeurIPS), 2024
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper describes a large set of related theorem prov- ing problems obtained by translating theorems from the HOL4 standard library into multiple logical formalisms. The formalisms are in higher- order logic (with and without type variables) and first-order logic (possi- bly with types, and possibly with type variables). The resultant problem sets allow us to run automated theorem provers that support different logical formalisms on corresponding problems, and compare their perfor- mances. This also results in a new “grand unified” large theory bench- mark that emulates the ITP/ATP hammer setting, where systems and metasystems can use multiple formalisms in complementary ways, and jointly learn from the accumulated knowledge.more » « less
-
Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that Coq-Hammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.more » « less
-
Cook, S. ; Katz, B. ; Moore-Russo, D. (Ed.)This paper presents six categories of undergraduate student explanations and justifications regarding the question of whether a converse proof proves a conditional theorem. Two categories of explanation led students to judge that converse proofs cannot so prove, which is the normative interpretation. These judgments depended upon students spontaneously seeking uniform rules of proving across various theorems or assigning a direction to the theorems and proof. The other four categories of explanation led students to affirm that converse proofs prove. We emphasize the rationality of these non-normative explanations to suggest the need for further work to understand how we can help students understand the normative rules of logic.more » « less
-
Interactive theorem provers are computer programs that check whether mathematical statements are correct. We show how the mathematics of theories in chemical physics can be written in the language of the Lean theorem prover, allowing chemical theory to be made even more rigorous and providing insight into the mathematics behind a theory. We use Lean to precisely define the assumptions and derivations of the Langmuir and BET theories of adsorption. We can also go further and create a network of definitions that build off of each other. This allows us to define a common basis for equations of motion or thermodynamics and derive many statements about them, like the kinematic equations of motion or gas laws such as Boyle's law. This approach could be extended beyond chemistry, and we propose the creation of a library of formally-proven theories in all fields of science. Furthermore, the rigorous logic of theorem provers complements the generative capabilities of AI models that generate code; we anticipate their integration to be valuable for automating the discovery of new scientific theories.more » « less