Model-finders such as SAT-solvers are attractive for producing concrete models, either as sample instances or as counterexamples when properties fail. However, the generated model is arbitrary. To address this, several research efforts have proposed principled forms of output from model-finders. These include minimal and maximal models, unsat cores, and proof-based provenance of facts. While these methods enjoy elegant mathematical foundations, they have not been subjected to rigorous evaluation with users to assess their utility. This paper presents user studies of these three forms of output performed on advanced students. We find that most of the output forms fail to be effective, and in some cases even actively mislead users. To make such studies feasible to run frequently and at scale, we also show how such studies can be posed on the crowdsourcing site Mechanical Turk.
Applying cognitive principles to model-finding output: the positive value of negative information
Model-finders, such as SAT/SMT-solvers and Alloy, are used widely both directly and embedded in domain-specific tools. They support both conventional verification and, unlike other verification tools, property-free exploration. To do this effectively, they must produce output that helps users with these tasks. Unfortunately, the output of model-finders has seen relatively little rigorous human-factors study. Conventionally, these tools tend to show one satisfying instance at a time. Drawing inspiration from the cognitive science literature, we investigate two aspects of model-finder output: how many instances to show at once, and whether all instances must actually satisfy the input constraints. Using both controlled studies and open-ended talk-alouds, we show that there is benefit to showing negative instances in certain settings; the impact of multiple instances is less clear. Our work is a first step in a theoretically grounded approach to understanding how users engage cognitively with model-finder output, and how those tools might better support users in doing so.
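As a toy illustration of the positive/negative-instance idea (my own sketch, not the paper's study materials; the spec and variable names are hypothetical), a brute-force propositional model finder can enumerate both satisfying and violating assignments of a small constraint set:

```python
from itertools import product

# Hypothetical toy spec (not from the paper): two implication
# constraints over booleans a, b, c.
def satisfies(a, b, c):
    return (not a or b) and (not b or c)  # a -> b  and  b -> c

# Enumerate every assignment, separating satisfying ("positive")
# instances from violating ("negative") ones -- the kind of near-miss
# output whose value the paper investigates.
positive = [m for m in product([False, True], repeat=3) if satisfies(*m)]
negative = [m for m in product([False, True], repeat=3) if not satisfies(*m)]
```

A real model finder would, of course, use a SAT/SMT backend rather than enumeration; the point is only that "negative instances" are assignments deliberately drawn from outside the solution set.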
- PAR ID: 10602559
- Publisher / Repository: Association for Computing Machinery (ACM)
- Date Published:
- Journal Name: Proceedings of the ACM on Programming Languages
- Volume: 6
- Issue: OOPSLA1
- ISSN: 2475-1421
- Format(s): Medium: X
- Size(s): p. 1-29
- Sponsoring Org: National Science Foundation
More Like this
- Use and reuse of an ontology requires prior ontology verification, which encompasses, at least, proving that the ontology is internally consistent and consistent with representative datasets. First-order logic (FOL) model finders are among the only available tools to aid us in this undertaking, but proving consistency of FOL ontologies is theoretically intractable while also rarely succeeding in practice, with FOL model finders scaling even worse than FOL theorem provers. This issue is further exacerbated when verifying FOL ontologies against datasets, which requires constructing models with larger domain sizes. This paper presents a first systematic study of the general feasibility of SAT-based model finding with FOL ontologies. We use select spatial ontologies and carefully controlled synthetic datasets to identify key measures that determine the size and difficulty of the resulting SAT problems. We experimentally show that these measures are closely correlated with the runtimes of Vampire and Paradox, two state-of-the-art model finders. We propose a definition elimination technique and demonstrate that it can be a highly effective measure for reducing the problem size and improving the runtime and scalability of model finding.
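A back-of-the-envelope sketch of why dataset-driven domain growth hurts (my own illustration; the paper identifies more refined measures, and the predicate names below are hypothetical): grounding a FOL signature over a finite domain contributes roughly domain_size ** arity ground atoms per predicate, so the SAT encoding grows quickly as datasets force larger domains.

```python
# Count the ground atoms produced when grounding a FOL signature
# over a finite domain: one atom per tuple of domain elements,
# i.e. domain_size ** arity atoms per predicate.
def ground_atoms(predicates, domain_size):
    """predicates: list of (name, arity) pairs."""
    return sum(domain_size ** arity for _name, arity in predicates)

# Hypothetical spatial-ontology signature.
spec = [("PartOf", 2), ("Region", 1), ("Connected", 2)]
small = ground_atoms(spec, 10)  # 210 atoms at domain size 10
large = ground_atoms(spec, 20)  # 820 atoms at domain size 20
```

Doubling the domain roughly quadruples the atoms contributed by each binary predicate, before any clauses over those atoms are even counted.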
- Model-finding tools like the Alloy Analyzer produce concrete examples of how a declarative specification can be satisfied. These formal tools are useful in a wide range of domains: software design, security, networking, and more. By producing concrete examples, they assist in exploring system behavior and can help find surprising faults. Specifications usually have many potential candidate solutions, but model-finders tend to leave the choice of which examples to present entirely to the underlying solver. This paper closes that gap by exploring notions of coverage for the model-finding domain, yielding a novel, rigorous metric for output quality. These ideas are realized in the tool CompoSAT, which interposes itself between Alloy's constraint-solving and presentation stages to produce ensembles of examples that maximize coverage. We show that high-coverage ensembles like those CompoSAT produces are useful for, among other things, detecting overconstraint, a particularly insidious form of specification error. We detail the underlying theory and implementation of CompoSAT and evaluate it on numerous specifications.
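The idea of a coverage-maximizing ensemble can be sketched as a generic greedy selection (an illustration of the general idea only; CompoSAT's actual coverage metric and algorithm are defined in the paper, and the instance names below are hypothetical):

```python
# Greedy selection of an ensemble of instances that together maximize
# coverage: repeatedly pick the instance that covers the most
# not-yet-covered elements, stopping when no instance adds anything new.
def greedy_ensemble(instances, coverage):
    """instances: list of models; coverage(m): set of elements m covers."""
    covered, ensemble = set(), []
    remaining = list(instances)
    while remaining:
        best = max(remaining, key=lambda m: len(coverage(m) - covered))
        gain = coverage(best) - covered
        if not gain:
            break  # every remaining instance is redundant
        ensemble.append(best)
        covered |= gain
        remaining.remove(best)
    return ensemble

# Hypothetical coverage sets for three candidate instances.
cov = {"m1": {1, 2}, "m2": {2, 3}, "m3": {3}}
picked = greedy_ensemble(["m1", "m2", "m3"], lambda m: cov[m])
```

Here "m3" is dropped because everything it covers is already covered by the first two picks, which is exactly the redundancy an ensemble-based presentation tries to avoid.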
- As artificial intelligence (AI) methods are increasingly used to develop new guidance intended for operational use by forecasters, it is critical to evaluate whether forecasters deem the guidance trustworthy. Past trust-related AI research suggests that certain attributes (e.g., understanding how the AI was trained, interactivity, and performance) contribute to users perceiving the AI as trustworthy. However, little research has been done to examine the role of these and other attributes for weather forecasters. In this study, we conducted 16 online interviews with National Weather Service (NWS) forecasters to examine (i) how they make guidance use decisions and (ii) how the AI model technique used, training, input variables, performance, and developers, as well as interacting with the model output, influenced their assessments of the trustworthiness of new guidance. The interviews pertained to either a random forest model predicting the probability of severe hail or a 2D convolutional neural network model predicting the probability of storm mode. When taken as a whole, our findings illustrate how forecasters' assessment of AI guidance trustworthiness is a process that occurs over time rather than automatically or at first introduction. We recommend developers center end users when creating new AI guidance tools, making end users integral to their thinking and efforts. This approach is essential for the development of useful and used tools. The details of these findings can help AI developers understand how forecasters perceive AI guidance and inform AI development and refinement efforts. Significance Statement: We used a mixed-methods quantitative and qualitative approach to understand how National Weather Service (NWS) forecasters 1) make guidance use decisions within their operational forecasting process and 2) assess the trustworthiness of prototype guidance developed using artificial intelligence (AI). When taken as a whole, our findings illustrate that forecasters' assessment of AI guidance trustworthiness is a process that occurs over time rather than automatically and suggest that developers must center the end user when creating new AI guidance tools to ensure that the developed tools are useful and used.
- Numerous artificial intelligence-based weather prediction (AIWP) models have emerged over the past 2 years, mostly in the private sector. There is an urgent need to evaluate these models from a meteorological perspective, but access to the output of these models is limited. We detail two new resources to facilitate access to AIWP model output data in the hope of accelerating the investigation of AIWP models by the meteorological community. First, a 3-yr (and growing) reforecast archive beginning in October 2020, containing twice-daily 10-day forecasts for FourCastNet v2-small, Pangu-Weather, and GraphCast Operational, is now available via an Amazon Simple Storage Service (S3) bucket through NOAA's Open Data Dissemination (NODD) program (https://noaa-oar-mlwp-data.s3.amazonaws.com/index.html). This reforecast archive was initialized with both NOAA's Global Forecast System (GFS) and ECMWF's Integrated Forecasting System (IFS) initial conditions in the hope that users can begin to perform feature-based verification of impactful meteorological phenomena. Second, real-time output for these three models is visualized on our web page (https://aiweather.cira.colostate.edu) along with output from the GFS and the IFS. This allows users to easily compare output between each AIWP model and traditional, physics-based models, with the goal of familiarizing users with the characteristics of AIWP models and determining whether the output aligns with expectations, is physically consistent and reasonable, and/or is trustworthy. We view these two efforts as a first step toward evaluating whether these new AIWP tools have a place in forecast operations.
