Every program should be accompanied by a specification that describes important aspects of the code's behavior, but writing good specifications is often harder than writing the code itself. This paper addresses the problem of synthesizing specifications automatically, guided by user-supplied inputs of two kinds: i) a query posed about a set of function definitions, and ii) a domain-specific language L in which the extracted properties are to be expressed (we call properties in the language L-properties). The method synthesizes a set of L-properties for the query. Each synthesized property is a best L-property: there is no other L-property that is strictly more precise. Furthermore, the set is exhaustive: no L-property can be added to it to make the conjunction more precise. We implemented our method in a tool, Spyro. The ability to modify both the query and L provides a Spyro user with ways to customize the kind of specification to be synthesized. We use this ability to show that Spyro can be used in a variety of applications, such as mining program specifications, performing abstract-domain operations, and synthesizing algebraic properties of program modules.
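To make the "best L-property" idea concrete, the sketch below enumerates a toy DSL of candidate properties for a single query function, keeps the candidates consistent with the function's behavior, and then discards any property that is strictly weaker than another survivor. The query function, candidate set, and finite sampling are illustrative assumptions, not Spyro's actual interface (which uses a solver rather than enumeration over a finite universe).

```python
from itertools import product

# Toy "query": the function whose behavior we want to specify.
def query_fn(x, y):
    return max(x, y)

# Toy DSL L: each L-property is a named predicate over (inputs, output).
CANDIDATES = {
    "out >= x":             lambda x, y, o: o >= x,
    "out >= y":             lambda x, y, o: o >= y,
    "out == x or out == y": lambda x, y, o: o == x or o == y,
    "out >= x - 1":         lambda x, y, o: o >= x - 1,  # weaker than "out >= x"
}

# Sample the query to approximate its behavior.
SAMPLES = [(x, y, query_fn(x, y)) for x, y in product(range(-3, 4), repeat=2)]

def holds(prop):
    return all(prop(*s) for s in SAMPLES)

def implies(p, q):
    # p is at least as precise as q on every point of a small finite universe.
    universe = product(range(-3, 4), repeat=3)
    return all(q(*pt) for pt in universe if p(*pt))

# Step 1: keep only properties consistent with the query's behavior.
sound = {n: p for n, p in CANDIDATES.items() if holds(p)}

# Step 2: keep only *best* properties: no strictly more precise one survives.
best = {
    n: p for n, p in sound.items()
    if not any(implies(q, p) and not implies(p, q)
               for m, q in sound.items() if m != n)
}
print(sorted(best))  # "out >= x - 1" is dropped: strictly weaker than "out >= x"
```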
Automatically Generating Precise Oracles from Structured Natural Language Specifications
Software specifications often use natural language to describe the desired behavior, but such specifications are difficult to verify automatically. We present Swami, an automated technique that extracts test oracles and generates executable tests from structured natural language specifications. Swami focuses on exceptional behavior and boundary conditions, which often cause field failures but which developers often fail to write tests for manually. In an evaluation on the official JavaScript specification (ECMA-262), 98.4% of the tests Swami generated were precise to the specification. Using Swami to augment developer-written test suites improved coverage and identified 1 previously unknown defect and 15 missing JavaScript features in Rhino, 1 previously unknown defect in Node.js, and 18 semantic ambiguities in the ECMA-262 specification.
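As a rough illustration of the oracle-extraction idea, the sketch below matches one structured-sentence pattern for exceptional behavior ("If &lt;condition&gt;, throw a &lt;ErrorType&gt; exception") and emits executable JavaScript test strings. The regex template and the condition-to-boundary-input mapping are simplified stand-ins invented for this example, not Swami's actual rules.

```python
import re

# One structured-specification pattern an extraction rule might target:
#   "If <condition>, throw a <ErrorType> exception."
RULE = re.compile(r"If (?P<cond>.+?), throw a (?P<err>\w+) exception", re.I)

# Hypothetical mapping from spec conditions to concrete boundary inputs.
BOUNDARY_INPUTS = {
    "Type(O) is not Object": ["42", "'str'", "null", "undefined"],
}

def generate_tests(method, sentence):
    """Turn one exceptional-behavior sentence into executable JS tests."""
    m = RULE.search(sentence)
    if not m:
        return []
    tests = []
    for arg in BOUNDARY_INPUTS.get(m.group("cond"), []):
        tests.append(
            f"try {{ {method}({arg}); print('FAIL: no error thrown'); }}\n"
            f"catch (e) {{ if (!(e instanceof {m.group('err')})) "
            f"print('FAIL: wrong error type'); }}"
        )
    return tests

spec = "If Type(O) is not Object, throw a TypeError exception."
for t in generate_tests("Object.getOwnPropertyNames", spec):
    print(t, end="\n\n")
```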
- PAR ID: 10108231
- Date Published: 2019
- Journal Name: 41st International Conference on Software Engineering (ICSE)
- Page Range / eLocation ID: 188-199
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Conformance tests are critical for finding security weaknesses in carrier network systems. However, building a conformance test procedure from specifications is challenging, as indicated by the slow progress made by the 3GPP, particularly in developing security-related tests, even with a large amount of resources already committed. A unique challenge in building the procedure is that a testing system often cannot directly invoke the condition event in a security requirement or directly observe the occurrence of the operation expected to be triggered by the event. Addressing this issue requires finding an event chain that, once initiated, leads to a chain reaction through which the testing system can either indirectly trigger the target event or indirectly observe the occurrence of the expected event. To solve this problem and make progress toward fully automated conformance test generation, we developed a new approach called Contester, which utilizes natural language processing and machine learning to build an event dependency graph from a 3GPP specification and then performs automated reasoning on the graph to discover the event chains for a given security requirement. Contester converts such event chains into a conformance testing procedure, which a testing system then executes to evaluate the compliance of user equipment (UE) with the security requirement. Our evaluation shows that, given 22 security requirements from the LTE NAS specifications, Contester generated over a hundred test procedures in just 25 minutes. After running these procedures on 22 popular UEs, including the iPhone 13, Pixel 5a, and IoT devices, our approach uncovered 197 security requirement violations, 190 of them never reported before, exposing these devices to serious security risks such as man-in-the-middle (MITM), fake base-station, and replay attacks.
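A minimal sketch of the event-chain discovery step, assuming the event dependency graph has already been mined from the specification: a breadth-first search from events the testing system can directly control to the target event. The events and edges below are placeholders; Contester's actual graph construction relies on NLP and machine learning over 3GPP text.

```python
from collections import deque

# Hypothetical event dependency graph: an edge (a -> b) means that
# initiating event a leads, via a chain reaction, to event b.
EDGES = {
    "UE powers on":                         ["UE sends ATTACH_REQUEST"],
    "UE sends ATTACH_REQUEST":              ["network sends AUTHENTICATION_REQUEST"],
    "network sends AUTHENTICATION_REQUEST": ["UE sends AUTHENTICATION_RESPONSE"],
}

# Events the testing system can directly invoke.
CONTROLLABLE = {"UE powers on"}

def find_event_chain(target):
    """BFS from each controllable event to the target condition event."""
    for start in CONTROLLABLE:
        queue, parent = deque([start]), {start: None}
        while queue:
            ev = queue.popleft()
            if ev == target:
                chain = []                    # reconstruct the chain
                while ev is not None:
                    chain.append(ev)
                    ev = parent[ev]
                return list(reversed(chain))
            for nxt in EDGES.get(ev, []):
                if nxt not in parent:
                    parent[nxt] = ev
                    queue.append(nxt)
    return None

print(find_event_chain("UE sends AUTHENTICATION_RESPONSE"))
```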
-
Programmers often leverage data structure libraries that provide useful and reusable abstractions. Modular verification of programs that make use of these libraries naturally relies on specifications that capture important properties about how the library expects these data structures to be accessed and manipulated. However, these specifications are often missing or incomplete, making it hard for clients to be confident they are using the library safely. When library source code is also unavailable, as is often the case, the challenge of inferring meaningful specifications is further exacerbated. In this paper, we present a novel data-driven abductive inference mechanism that infers specifications for library methods sufficient to enable verification of the library's clients. Our technique combines a data-driven, learning-based framework that postulates candidate specifications with SMT-provided counterexamples that refine these candidates, taking special care to prevent generating specifications that overfit to sampled tests. The resulting specifications form a minimal set of requirements on the behavior of library implementations that ensures safety of a particular client program. Our solution thus provides a new multi-abduction procedure for precise specification inference of data structure libraries guided by client-side verification tasks. Experimental results on a wide range of realistic OCaml data structure programs demonstrate the effectiveness of the approach.
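The counterexample-guided refinement idea can be sketched as a small loop over the Z3 Python bindings: candidate precondition atoms play the role of learned features, and SMT counterexamples drive refinement toward a specification sufficient for the client's safety condition. The client computation and the atom set are invented for illustration and do not reflect the paper's OCaml setting or its actual learning framework.

```python
from z3 import Int, Solver, And, Not, sat

x = Int("x")
y = x * x - 4            # client computation over a library-provided value
safety = y != 0          # client divides by y, so y must be nonzero

# Candidate atoms the learner may conjoin (a toy "feature" set).
atoms = [x != 2, x != -2, x > 0, x < 0]

spec = []                # inferred precondition, built as a conjunction
while True:
    s = Solver()
    if spec:
        s.add(And(*spec))
    s.add(Not(safety))
    if s.check() != sat:
        break            # no counterexample: spec is sufficient for safety
    cex = s.model()[x]   # SMT-provided counterexample
    # Abduce: conjoin the first atom that rules out this counterexample,
    # keeping the spec as weak as possible to avoid overfitting.
    for a in atoms:
        s2 = Solver()
        s2.add(x == cex, a)
        if s2.check() != sat:
            spec.append(a)
            break
    else:
        break            # no atom excludes the counterexample: grammar too weak

print(And(*spec))        # e.g. And(x != 2, x != -2)
```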
-
Highly-configurable software underpins much of our computing infrastructure. It enables extensive reuse, but opens the door to broken configuration specifications. The configuration specification language, Kconfig, is designed to prevent invalid configurations of the Linux kernel from being built. However, the astronomical size of the configuration space for Linux makes finding specification bugs difficult by hand or with random testing. In this paper, we introduce a software model checking framework for building Kconfig static analysis tools. We develop a formal semantics of the Kconfig language and implement the semantics in a symbolic evaluator called kclause that models Kconfig behavior as logical formulas. We then design and implement a bug finder, called kismet, that takes kclause models and leverages automated theorem proving to find unmet dependency bugs. kismet is evaluated for its precision, performance, and impact on kernel development for a recent version of Linux, which has over 140,000 lines of Kconfig across 28 architecture-specific specifications. Our evaluation finds 781 bugs (151 when considering sharing among Kconfig specifications) with 100% precision, spending between 37 and 90 minutes for each Kconfig specification, although it misses some bugs due to underapproximation. Compared to random testing, kismet finds substantially more true positive bugs in a fraction of the time.
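A minimal sketch of the kind of unmet-dependency query kismet poses over kclause-style formulas, again using Z3's Python bindings. An unmet dependency arises when a `select` force-enables an option whose own `depends on` clause can be unsatisfied; the three options and their constraints below are hypothetical, not drawn from the Linux kernel.

```python
from z3 import Bools, Solver, Implies, Not, sat

# Toy kclause-style model (hypothetical options, not real kernel configs):
#   config A: select B
#   config B: depends on C
A, B, C = Bools("A B C")

s = Solver()
s.add(Implies(A, B))   # "select B": enabling A force-enables B
s.add(A, Not(C))       # query: can A be on while B's dependency C is unmet?
if s.check() == sat:
    # A=True forces B=True even though C=False: an unmet-dependency bug.
    print("unmet dependency bug:", s.model())
```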
-
Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model on mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance.
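A highly simplified sketch of the testing loop, with a stub standing in for the fine-tuned Llama3-8B models: proposed parameter values are sent to the service, 5xx responses are recorded as potential defects, and 4xx error messages are collected as hints about inter-parameter dependencies. The base URL, endpoint, parameter names, and proposed values are all hypothetical.

```python
import requests

BASE_URL = "http://localhost:8080"   # hypothetical service under test

def propose_values(param):
    # Stand-in for the fine-tuned models: LlamaRestTest would query its
    # Llama3-8B-derived models for realistic and boundary values here.
    return {"market": ["US", "DE", "XX"], "limit": [0, 1, 50, -1]}.get(param, [""])

def test_operation(path, param_names):
    findings, dependency_hints = [], []
    for name in param_names:
        for value in propose_values(name):
            resp = requests.get(BASE_URL + path, params={name: value})
            if resp.status_code >= 500:
                # Internal server errors are reported as potential defects.
                findings.append((path, name, value, resp.status_code))
            elif resp.status_code == 400:
                # 4xx messages often reveal inter-parameter dependencies
                # (e.g. "limit requires market"); they would be fed back
                # to the model to refine subsequent inputs.
                dependency_hints.append(resp.text)
    return findings, dependency_hints

print(test_operation("/v1/search", ["market", "limit"]))
```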