Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effortintensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proofscript space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that CoqHammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of stateoftheart proofscript synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 opensource software projects.
more »
« less
This content will become publicly available on April 28, 2026
QEDCartographer: Automating Formal Verification Using RewardFree Reinforcement Learning
Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proofsynthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling rewardfree search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 opensource Coq projects. QEDCartographer fully automatically proves 21.4% of the testset theorems. Previous searchbased proofsynthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and nonlearningbased CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proofsynthesis tools' search mechanisms.
more »
« less
 Award ID(s):
 2210243
 NSFPAR ID:
 10542916
 Publisher / Repository:
 IEEE
 Date Published:
 Format(s):
 Medium: X
 Location:
 Ottawa, Ontario, Canada
 Sponsoring Org:
 National Science Foundation
More Like this


Formally verifying system properties is one of the most effective ways of improving system quality, but its high manual effort requirements often render it prohibitively expensive. Tools that automate formal verification by learning from proof corpora to synthesize proofs have just begun to show their promise. These tools are effective because of the richness of the data the proof corpora contain. This richness comes from the stylistic conventions followed by communities of proof developers, together with the powerful logical systems beneath proof assistants. However, this richness remains underexploited, with most work thus far focusing on architecture rather than on how to make the most of the proof data. This article systematically explores how to most effectively exploit one aspect of that proof data: identifiers. We develop the Passport approach, a method for enriching the predictive Coq model used by an existing proofsynthesis tool with three new encoding mechanisms for identifiers: category vocabulary indexing, subword sequence modeling, and path elaboration. We evaluate our approach’s enrichment effect on three existing base tools: ASTactic, Tac, and Tok. In headtohead comparisons, Passport automatically proves 29% more theorems than the bestperforming of these base tools. Combining the three tools enhanced by the Passport approach automatically proves 38% more theorems than combining the three base tools. Finally, together, these base tools and their enhanced versions prove 45% more theorems than the combined base tools. Overall, our findings suggest that modeling identifiers can play a significant role in improving proof synthesis, leading to higherquality software.more » « less

Interactive proofs of theorems often require auxiliary helper lemmas to prove the desired theorem. Existing approaches for automatically synthesizing helper lemmas fall into two broad categories. Some approaches are goaldirected, producing lemmas specifically to help a user make progress from a given proof state, but they have limited expressiveness in terms of the lemmas that can be produced. Other approaches are highly expressive, able to generate arbitrary lemmas from a given grammar, but they are completely undirected and hence not amenable to interactive usage. In this paper, we develop an approach to lemma synthesis that is both goaldirected and expressive. The key novelty is a technique for reducing lemma synthesis to a datadriven program synthesis problem, whereby examples for synthesis are generated from the current proof state. We also describe a technique to systematically introduce new variables for lemma synthesis, as well as techniques for filtering and ranking candidate lemmas for presentation to the user. We implement these ideas in a tool called lfind, which can be run as a Coq tactic. In an evaluation on four benchmark suites, lfind produces useful lemmas in 68% of the cases where a human prover used a lemma to make progress. In these cases lfind synthesizes a lemma that either enables a fully automated proof of the original goal or that matches the humanprovided lemma.more » « less

Cook, S. ; Katz, B. ; MooreRusso, D. (Ed.)This paper presents six categories of undergraduate student explanations and justifications regarding the question of whether a converse proof proves a conditional theorem. Two categories of explanation led students to judge that converse proofs cannot so prove, which is the normative interpretation. These judgments depended upon students spontaneously seeking uniform rules of proving across various theorems or assigning a direction to the theorems and proof. The other four categories of explanation led students to affirm that converse proofs prove. We emphasize the rationality of these nonnormative explanations to suggest the need for further work to understand how we can help students understand the normative rules of logic.more » « less

Tauman_Kalai, Yael (Ed.)Connections between proof complexity and circuit complexity have become major tools for obtaining lower bounds in both areas. These connections  which take the form of interpolation theorems and querytocommunication lifting theorems  translate efficient proofs into small circuits, and vice versa, allowing tools from one area to be applied to the other. Recently, the theory of TFNP has emerged as a unifying framework underlying these connections. For many of the proof systems which admit such a connection there is a TFNP problem which characterizes it: the class of problems which are reducible to this TFNP problem via queryefficient reductions is equivalent to the tautologies that can be efficiently proven in the system. Through this, proof complexity has become a major tool for proving separations in blackbox TFNP. Similarly, for certain monotone circuit models, the class of functions that it can compute efficiently is equivalent to what can be reduced to a certain TFNP problem in a communicationefficient manner. When a TFNP problem has both a proof and circuit characterization, one can prove an interpolation theorem. Conversely, many lifting theorems can be viewed as relating the communication and query reductions to TFNP problems. This is exciting, as it suggests that TFNP provides a roadmap for the development of further interpolation theorems and lifting theorems. In this paper we begin to develop a more systematic understanding of when these connections to TFNP occur. We give exact conditions under which a proof system or circuit model admits a characterization by a TFNP problem. We show:  Every wellbehaved proof system which can prove its own soundness (a reflection principle) is characterized by a TFNP problem. Conversely, every TFNP problem gives rise to a wellbehaved proof system which proves its own soundness.  Every wellbehaved monotone circuit model which admits a universal family of functions is characterized by a TFNP problem. Conversely, every TFNP problem gives rise to a wellbehaved monotone circuit model with a universal problem. As an example, we provide a TFNP characterization of the Polynomial Calculus, answering a question from [Mika Göös et al., 2022], and show that it can prove its own soundness.more » « less