- Award ID(s):
- 2210243
- PAR ID:
- 10461012
- Date Published:
- Journal Name:
- ACM Transactions on Programming Languages and Systems
- Volume:
- 45
- Issue:
- 2
- ISSN:
- 0164-0925
- Page Range / eLocation ID:
- 1 to 30
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.more » « less
-
Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning using precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that Coq-Hammer does not, and 364 theorems no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models, and identify which dimensions lead to the most useful diversity. Further, we develop an optimization to speed up Diva's execution by 40X. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.more » « less
-
Tauman_Kalai, Yael (Ed.)Connections between proof complexity and circuit complexity have become major tools for obtaining lower bounds in both areas. These connections - which take the form of interpolation theorems and query-to-communication lifting theorems - translate efficient proofs into small circuits, and vice versa, allowing tools from one area to be applied to the other. Recently, the theory of TFNP has emerged as a unifying framework underlying these connections. For many of the proof systems which admit such a connection there is a TFNP problem which characterizes it: the class of problems which are reducible to this TFNP problem via query-efficient reductions is equivalent to the tautologies that can be efficiently proven in the system. Through this, proof complexity has become a major tool for proving separations in black-box TFNP. Similarly, for certain monotone circuit models, the class of functions that it can compute efficiently is equivalent to what can be reduced to a certain TFNP problem in a communication-efficient manner. When a TFNP problem has both a proof and circuit characterization, one can prove an interpolation theorem. Conversely, many lifting theorems can be viewed as relating the communication and query reductions to TFNP problems. This is exciting, as it suggests that TFNP provides a roadmap for the development of further interpolation theorems and lifting theorems. In this paper we begin to develop a more systematic understanding of when these connections to TFNP occur. We give exact conditions under which a proof system or circuit model admits a characterization by a TFNP problem. We show: - Every well-behaved proof system which can prove its own soundness (a reflection principle) is characterized by a TFNP problem. Conversely, every TFNP problem gives rise to a well-behaved proof system which proves its own soundness. - Every well-behaved monotone circuit model which admits a universal family of functions is characterized by a TFNP problem. Conversely, every TFNP problem gives rise to a well-behaved monotone circuit model with a universal problem. As an example, we provide a TFNP characterization of the Polynomial Calculus, answering a question from [Mika Göös et al., 2022], and show that it can prove its own soundness.more » « less
-
In spite of decades of research in bug detection tools, there is a surprising dearth of ground-truth corpora that can be used to evaluate the efficacy of such tools. Recently, systems such as LAVA and EvilCoder have been proposed to automatically inject bugs into software to quickly generate large bug corpora, but the bugs created so far differ from naturally occurring bugs in a number of ways. In this work, we propose a new automated bug injection system, Apocalypse, that uses formal techniques—symbolic execution, constraint-based program synthesis and model counting—to automatically inject fair (can potentially be discovered by current bug-detection tools), deep (requiring a long sequence of dependencies to be satisfied to fire), uncorrelated (each bug behaving independent of others), reproducible (a trigger input being available) and rare (can be triggered by only a few program inputs) bugs in large software code bases. In our evaluation, we inject bugs into thirty Coreutils programs as well as the TCAS test suite. We find that bugs synthesized by Apocalypse are highly realistic under a variety of metrics, that they do not favor a particular bug-finding strategy (unlike bugs produced by LAVA), and that they are more difficult to find than manually injected bugs, requiring up around 240× more tests to discover with a state-of-the-art symbolic execution tool.more » « less
-
Generating a high-quality explainable summary of a multi-review corpus can help people save time in reading the reviews. With natural language processing and text clustering, people can generate both abstractive and extractive summaries on a corpus containing up to 967 product reviews (Moody et al. 2022). However, the overall quality of the summaries needs further improvement. Noticing that online reviews in the corpus come from a diverse population, we take an approach of removing irrelevant human factors through pre-processing. Apply available pre-trained models together with reference based and reference free metrics, we filter out noise in each review automatically prior to summary generation. Our computational experiments evident that one may significantly improve the overall quality of an explainable summary from such a pre-processed corpus than from the original one. It is suggested of applying available high-quality pre-trained tools to filter noises rather than start from scratch. Although this work is on the specific multi-review corpus, the methods and conclusions should be helpful for generating summaries for other multi-review corpora.