NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning

Sanchez-Stern, Alex; Varghese, Abhishek; Kaufman, Zhanna; Zhang, Dylan; Ringer, Talia; Brun, Yuriy (April 2025, IEEE)

Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 26% shorter proofs 27% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 31.8% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.
more » « less
Free, publicly-accessible full text available April 28, 2026
Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Panchal, Kunjal; Parikh, Nisarg; Choudhary, Sunav; Zhang, Lijun; Brun, Yuriy; Guan, Hui (December 2024, NeurIPS)

Full Text Available
Attack-Resilient Image Watermarking Using Stable Diffusion

Zhang, Lijun; Liu, Xiao; Viros_i_Martin, Antoni; Xiong_Bearfield, Cindy; Brun, Yuriy; Hui, Guan (December 2024, NeurIPS)

Full Text Available
Automated Program Repair, What Is It Good For? Not Absolutely Nothing!

https://doi.org/10.1145/3597503.3639095

Eladawy, Hadeel; Le_Goues, Claire; Brun, Yuriy (April 2024, ACM)

Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This paper conducts a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally-occurring defects in real-world, open-source, Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers either have access to the project's tests, or, also, to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated, which can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Correct APR suggestions increase the odds of debugging success by 14,000%, but deceptive suggestions decrease the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.
more » « less
Full Text Available
Baldur: Whole-Proof Generation and Repair with Large Language Models

https://doi.org/10.1145/3611643.3616243

First, Emily; Rabe, Markus; Ringer, Talia; Brun, Yuriy (November 2023, ACM)
My Model is Unfair, Do People Even Care? Visual Design Affects Trust and Perceived Bias in Machine Learning

https://doi.org/10.1109/TVCG.2023.3327192

Gaba, Aimen; Kaufman, Zhanna; Cheung, Jason; Shvakel, Marie; Hall, Kyle Wm; Brun, Yuriy; Bearfield, Cindy Xiong (January 2024, IEEE transactions on visualization and computer graphics)

Machine learning technology has become ubiquitous, but, unfortunately, often exhibits bias. As a consequence, disparate stakeholders need to interact with and make informed decisions about using machine learning models in everyday systems. Visualization technology can support stakeholders in understanding and evaluating trade-offs between, for example, accuracy and fairness of models. This paper aims to empirically answer “Can visualization design choices affect a stakeholder's perception of model bias, trust in a model, and willingness to adopt a model?” Through a series of controlled, crowd-sourced experiments with more than 1,500 participants, we identify a set of strategies people follow in deciding which models to trust. Our results show that men and women prioritize fairness and performance differently and that visual design choices significantly affect that prioritization. For example, women trust fairer models more often than men do, participants value fairness more when it is explained using text than as a bar chart, and being explicitly told a model is biased has a bigger impact than showing past biased performance. We test the generalizability of our results by comparing the effect of multiple textual and visual design choices and offer potential explanations of the cognitive mechanisms behind the difference in fairness perception and trust. Our research guides design considerations to support future work developing visualization systems for machine learning.
more » « less
Full Text Available
Rango: Adaptive Retrieval-Augmented Proving for Automated Software Verification

https://doi.org/10.1109/ICSE55347.2025.00161

Thompson, Kyle; Saavedra, Nuno; Carrott, Pedro; Fisher, Kevin; Sanchez-Stern, Alex; Brun, Yuriy; Ferreira, João F; Lerner, Sorin; First, Emily (April 2025, IEEE)

Free, publicly-accessible full text available April 26, 2026
Better Automatic Program Repair by Using Bug Reports and Tests Together

https://doi.org/10.1109/ICSE48619.2023.00109

Motwani, Manish; Brun, Yuriy (May 2023, Proceedings of the 45th International Conference on Software Engineering (ICSE))

Full Text Available
Understanding Why and Predicting When Developers Adhere to Code-Quality Standards

https://doi.org/10.1109/ICSE-SEIP58684.2023.00045

Motwani, Manish; Brun, Yuriy (May 2023, Proceedings of the Software Engineering in Practice Track at the 45th International Conference on Software Engineering (ICSE SEIP))

Full Text Available
Passport: Improving Automated Formal Verification Using Identifiers

https://doi.org/10.1145/3593374

Sanchez-Stern, Alex; First, Emily; Zhou, Timothy; Kaufman, Zhanna; Brun, Yuriy; Ringer, Talia (June 2023, ACM Transactions on Programming Languages and Systems)

Formally verifying system properties is one of the most effective ways of improving system quality, but its high manual effort requirements often render it prohibitively expensive. Tools that automate formal verification by learning from proof corpora to synthesize proofs have just begun to show their promise. These tools are effective because of the richness of the data the proof corpora contain. This richness comes from the stylistic conventions followed by communities of proof developers, together with the powerful logical systems beneath proof assistants. However, this richness remains underexploited, with most work thus far focusing on architecture rather than on how to make the most of the proof data. This article systematically explores how to most effectively exploit one aspect of that proof data: identifiers. We develop the Passport approach, a method for enriching the predictive Coq model used by an existing proof-synthesis tool with three new encoding mechanisms for identifiers: category vocabulary indexing, subword sequence modeling, and path elaboration. We evaluate our approach’s enrichment effect on three existing base tools: ASTactic, Tac, and Tok. In head-to-head comparisons, Passport automatically proves 29% more theorems than the best-performing of these base tools. Combining the three tools enhanced by the Passport approach automatically proves 38% more theorems than combining the three base tools. Finally, together, these base tools and their enhanced versions prove 45% more theorems than the combined base tools. Overall, our findings suggest that modeling identifiers can play a significant role in improving proof synthesis, leading to higher-quality software.
more » « less
Full Text Available

« Prev Next »

Search for: All records