Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that CoqHammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X.
Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.
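The abstract's key insight, that a prover acts as an infallible oracle so any proof found by any of a diverse set of models can be verified and kept, can be sketched as follows. This is an illustrative sketch only, not Diva's actual code; the names `try_prove` and `check` stand in for a model-guided search and the theorem prover's proof checker.

```python
# Hypothetical sketch: combining diverse proof-synthesis models.
# Because the theorem prover infallibly judges correctness, a proof found
# by any model can be checked and kept; the union of proved theorems
# across models can only grow as more diverse models are added.

def prove_with_ensemble(theorems, models, try_prove, check):
    """Return a dict mapping each theorem to the first verified proof found."""
    proved = {}
    for theorem in theorems:
        for model in models:
            proof = try_prove(model, theorem)  # model-guided biased search
            if proof is not None and check(theorem, proof):  # oracle check
                proved[theorem] = proof
                break  # one verified proof suffices; try the next theorem
    return proved
```

Models that fail on a theorem cost only search time, never correctness, which is why diversity pays off in this setting.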
This content will become publicly available on August 11, 2025
Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving
Recent advances in Automated Theorem Proving have shown the effectiveness of leveraging a (large) language model that generates tactics (i.e., proof steps) to search through proof states. The current model, while trained solely on successful proof paths, faces a discrepancy at the inference stage, as it must sample and try various tactics at each proof state until finding success, unlike its training, which does not incorporate learning from failed attempts. Intuitively, a tactic that leads to a failed search path would indicate that similar tactics should receive less attention during the following trials. In this paper, we demonstrate the benefit of training models that additionally learn from failed search paths. Facing the lack of such trial-and-error data in existing open-source theorem-proving datasets, we curate a dataset on intuitionistic propositional logic theorems and formalize it in Lean, such that we can reliably check the correctness of proofs. We compare our model trained on relatively short trial-and-error information (TRIALMASTER) with models trained only on the correct paths and discover that the former solves more unseen theorems with fewer trial searches.
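One way to picture "learning from failed search paths" is a training loss that labels each recorded (proof state, tactic) pair by whether it lay on a successful or failed path, and weights the two cases differently. This is a toy sketch under that assumption, not TRIALMASTER's actual training objective; `fail_weight` and the function names are illustrative.

```python
# Toy illustration of trial-and-error training: tactics on successful
# paths get a standard negative-log-likelihood term, while tactics on
# failed paths get a term that pushes their probability *down*.

def weighted_nll(log_probs, labels, fail_weight=0.5):
    """Average loss over recorded tactics.

    log_probs: model log-probability assigned to each recorded tactic.
    labels: 1 for a tactic on a successful path, 0 for a failed one.
    """
    loss = 0.0
    for lp, ok in zip(log_probs, labels):
        if ok:
            loss += -lp                # reward successful-path tactics
        else:
            loss += fail_weight * lp   # penalize failed-path tactics
    return loss / len(log_probs)
```

A model trained only on correct paths corresponds to dropping the `fail_weight` term entirely, which is exactly the discrepancy the paper highlights.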
 Award ID(s):
 1955457
 NSFPAR ID:
 10519923
 Publisher / Repository:
 The 62nd Annual Meeting of the Association for Computational Linguistics
 Sponsoring Org:
 National Science Foundation
More Like this


Recently there has been significant interest in using machine learning to improve the accuracy of cardinality estimation. This work has focused on improving average estimation error, but not all estimates matter equally for downstream tasks like query optimization. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, FlowLoss, for learning cardinality estimation models. FlowLoss approximates the optimizer's cost model and search algorithm with analytical functions, which it uses to optimize explicitly for better query plans. At the heart of FlowLoss is a reduction of query optimization to a flow routing problem on a certain "plan graph", in which different paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB), which contains the ground-truth cardinalities for subplans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with FlowLoss improves the plan costs and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test-set queries closely match the training queries, models trained with both loss functions perform well. However, the Q-Error-trained model degrades significantly when evaluated on slightly different queries (e.g., similar but unseen query templates), while the FlowLoss-trained model generalizes better to such situations, achieving 4-8x better 99th-percentile runtimes on unseen templates with the same model architecture and training data.
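The contrast the abstract draws, between an error metric that treats all misestimates alike and one weighted by impact on the optimizer's plan, can be illustrated with a toy comparison. This is not the paper's FlowLoss (which reduces optimization to flow routing on a plan graph); `plan_cost_sensitivity` here is a hypothetical stand-in for how much a subplan's cardinality influences the final plan cost.

```python
# Toy contrast between Q-Error and a cost-aware error measure.

def q_error(est, true):
    """Classic Q-Error: symmetric ratio of estimate to ground truth.
    A 2x overestimate and a 2x underestimate score identically,
    regardless of whether the optimizer cares about that subplan."""
    return max(est / true, true / est)

def cost_weighted_error(est, true, plan_cost_sensitivity):
    """Weight the relative estimation error by how strongly this
    subplan's cardinality influences the chosen plan's cost (a crude
    stand-in for approximating the optimizer's cost model)."""
    return plan_cost_sensitivity * abs(est - true) / true
```

Under the second measure, a large misestimate on a subplan the optimizer never uses contributes almost nothing, which is the intuition behind optimizing for plan quality rather than average accuracy.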

We present MLCERT, a novel system for doing practical mechanized proof of the generalization of learning procedures, bounding expected error in terms of training or test error. MLCERT is mechanized in that we prove generalization bounds inside the theorem prover Coq; thus the bounds are machine checked by Coq's proof checker. MLCERT is practical in that we extract learning procedures defined in Coq to executable code; thus procedures with proved generalization bounds can be trained and deployed in real systems. MLCERT is well documented and open source; thus we expect it to be usable even by those without Coq expertise. To validate MLCERT, which is compatible with external tools such as TensorFlow, we use it to prove generalization bounds on neural networks trained using TensorFlow on the extended MNIST data set.

Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse-reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.
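"Reward-free search" guided by a learned score over proof states, rather than a sparse terminal reward, can be sketched as a best-first search. This is a generic illustration, not QEDCartographer's implementation; `expand`, `score`, and the state interface are assumptions.

```python
# Hypothetical sketch of score-directed proof search: each proof state
# is ranked by a learned estimate of how close it is to being fully
# proved, so search is directed without a sparse success-only reward.
import heapq

def best_first_search(initial_state, expand, score, budget=1000):
    """Best-first search over proof states; lower score = more promising.
    Returns a proved state, or None if the node budget is exhausted."""
    frontier = [(score(initial_state), 0, initial_state)]
    counter = 1  # tie-breaker so states themselves are never compared
    for _ in range(budget):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if state.is_proved():
            return state
        for nxt in expand(state):  # apply candidate tactics
            heapq.heappush(frontier, (score(nxt), counter, nxt))
            counter += 1
    return None
```

The quality of `score` is exactly what the paper's combination of supervised and reinforcement learning would improve; with an uninformative score this degenerates into the undirected search it criticizes.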

Tauman Kalai, Yael (Ed.)
Connections between proof complexity and circuit complexity have become major tools for obtaining lower bounds in both areas. These connections, which take the form of interpolation theorems and query-to-communication lifting theorems, translate efficient proofs into small circuits, and vice versa, allowing tools from one area to be applied to the other. Recently, the theory of TFNP has emerged as a unifying framework underlying these connections. For many of the proof systems which admit such a connection, there is a TFNP problem which characterizes it: the class of problems which are reducible to this TFNP problem via query-efficient reductions is equivalent to the tautologies that can be efficiently proven in the system. Through this, proof complexity has become a major tool for proving separations in black-box TFNP. Similarly, for certain monotone circuit models, the class of functions that the model can compute efficiently is equivalent to what can be reduced to a certain TFNP problem in a communication-efficient manner. When a TFNP problem has both a proof and a circuit characterization, one can prove an interpolation theorem. Conversely, many lifting theorems can be viewed as relating the communication and query reductions to TFNP problems. This is exciting, as it suggests that TFNP provides a roadmap for the development of further interpolation theorems and lifting theorems. In this paper, we begin to develop a more systematic understanding of when these connections to TFNP occur. We give exact conditions under which a proof system or circuit model admits a characterization by a TFNP problem. We show: (1) every well-behaved proof system which can prove its own soundness (a reflection principle) is characterized by a TFNP problem; conversely, every TFNP problem gives rise to a well-behaved proof system which proves its own soundness; (2) every well-behaved monotone circuit model which admits a universal family of functions is characterized by a TFNP problem; conversely, every TFNP problem gives rise to a well-behaved monotone circuit model with a universal problem. As an example, we provide a TFNP characterization of the Polynomial Calculus, answering a question from [Mika Göös et al., 2022], and show that it can prove its own soundness.