This content will become publicly available on June 19, 2026

Title: LlamaRestTest: Effective REST API Testing with Small Language Models
Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on OpenAPI specifications. Although Large Language Models (LLMs) have shown promising test-generation abilities, their application to REST API testing remains mostly unexplored. We present LlamaRestTest, a novel approach that employs two custom LLMs, created by fine-tuning and quantizing the Llama3-8B model on mined datasets of REST API example values and inter-parameter dependencies, to generate realistic test inputs and uncover inter-parameter dependencies during the testing process by analyzing server responses. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results demonstrate that fine-tuning enables smaller models to outperform much larger models in detecting actionable parameter-dependency rules and generating valid inputs for REST API testing. We also evaluated different tool configurations, ranging from the base Llama3-8B model to fine-tuned versions, and explored multiple quantization techniques, including 2-bit, 4-bit, and 8-bit integer formats. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing, balancing effectiveness and efficiency. Furthermore, LlamaRestTest outperforms state-of-the-art REST API testing tools in code coverage achieved and internal server errors identified, even when those tools use RESTGPT-enhanced specifications. Finally, through an ablation study, we show that each component of LlamaRestTest contributes to its overall performance.
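As a rough illustration of the general idea (not LlamaRestTest's actual implementation), the sketch below prompts a locally loaded, quantized Llama model for a realistic parameter value and uses it to exercise a REST endpoint, flagging internal server errors. The model file, endpoint URL, and parameter description are hypothetical placeholders.

```python
# Hedged sketch: generate a test input with a quantized small LM, then probe an API.
# The model path, endpoint, and parameter description are hypothetical placeholders.
import requests
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(model_path="llama3-8b-finetuned-q4.gguf")  # hypothetical fine-tuned, 4-bit model

# Parameter description as it might appear in an OpenAPI specification.
param_description = "market: an ISO 3166-1 alpha-2 country code"
prompt = (
    "Suggest one realistic example value for this REST API parameter: "
    f"{param_description}\nValue:"
)
value = llm(prompt, max_tokens=8, stop=["\n"])["choices"][0]["text"].strip()

resp = requests.get(
    "https://api.example.com/v1/search",      # hypothetical endpoint
    params={"q": "test", "market": value},
)
if resp.status_code >= 500:                   # flag potential internal server errors
    print(f"5xx with market={value!r}: {resp.status_code}")
```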
Award ID(s): 2107125
PAR ID: 10639036
Author(s) / Creator(s):
Publisher / Repository: ACM
Date Published:
Journal Name: Proceedings of the ACM on Software Engineering
Volume: 2
Issue: FSE
ISSN: 2994-970X
Page Range / eLocation ID: 465 to 488
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
1. In the rapidly evolving domain of software engineering (SE), Large Language Models (LLMs) are increasingly leveraged to automate developer support. Open source LLMs have grown competitive with proprietary models such as GPT-4 and Claude-3, without the associated financial and accessibility constraints. This study investigates whether state of the art open source LLMs including Solar-10.7B, CodeLlama-7B, Mistral-7B, Qwen2-7B, StarCoder2-7B, and LLaMA3-8B can generate responses to technical queries that align with those crafted by human experts. Leveraging retrieval augmented generation (RAG) and targeted fine tuning, we evaluate these models across critical performance dimensions, such as semantic alignment and contextual fluency. Our results show that Solar-10.7B, particularly when paired with RAG and fine tuning, most closely replicates expert level responses, offering a scalable and cost-effective alternative to commercial models. This vision paper highlights the potential of open-source LLMs to enable robust and accessible AI-powered developer assistance in software engineering.
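To make the retrieval-augmented setup in the record above concrete, here is a minimal sketch of the general RAG pattern rather than the authors' exact pipeline: retrieve the snippet most similar to a developer query and prepend it to the model prompt. The documents, query, and the commented generate() call are illustrative stand-ins.

```python
# Minimal RAG sketch: TF-IDF retrieval of the most relevant snippet, then prompt assembly.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Use synchronized blocks to guard shared mutable state in Java.",
    "A NullPointerException usually means a reference was never initialized.",
]
query = "Why am I getting a NullPointerException?"

vec = TfidfVectorizer().fit(docs + [query])
scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
context = docs[scores.argmax()]               # most relevant snippet for the query

prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# answer = generate(prompt)  # hypothetical call to a local model such as Solar-10.7B
```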
2. Automatic Program Repair (APR) has garnered significant attention as a practical research domain focused on automatically fixing bugs in programs. While existing APR techniques primarily target imperative programming languages like C and Java, there is a growing need for effective solutions applicable to declarative software specification languages. This paper systematically investigates the capacity of Large Language Models (LLMs) to repair declarative specifications in Alloy, a declarative formal language used for software specification. We designed six different repair settings, encompassing single-agent and dual-agent paradigms, utilizing various LLMs. These configurations also incorporate different levels of feedback, including an auto-prompting mechanism for generating prompts autonomously using LLMs. Our study reveals that the dual-agent setup with auto-prompting outperforms the other settings, albeit with a marginal increase in the number of iterations and token usage. This dual-agent setup demonstrated superior effectiveness compared to state-of-the-art Alloy APR techniques when evaluated on a comprehensive set of benchmarks. This work is the first to empirically evaluate LLM capabilities to repair declarative specifications, while taking into account recent trending LLM concepts such as LLM-based agents, feedback, auto-prompting, and tools, thus paving the way for future agent-based techniques in software engineering.
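The dual-agent, auto-prompting loop described above can be pictured schematically as follows; the two agent functions and the Alloy Analyzer wrapper are hypothetical stubs, not the paper's code.

```python
# Schematic dual-agent repair loop with auto-prompting (all functions are stubs).
def repair_agent(prompt: str) -> str:
    """First LLM call: returns a candidate Alloy specification (stubbed)."""
    return "pred acyclic { no n: Node | n in n.^next }"   # placeholder candidate

def prompting_agent(spec: str, feedback: str) -> str:
    """Second LLM call: writes the next prompt from analyzer feedback (stubbed)."""
    return f"The candidate failed with: {feedback}\nRevise this specification:\n{spec}"

def check_with_alloy_analyzer(spec: str):
    """Hypothetical wrapper around the Alloy Analyzer; returns (passed, feedback)."""
    return True, ""

prompt = "Repair this faulty Alloy predicate so that 'acyclic' holds: ..."
for _ in range(5):                              # bounded repair iterations
    candidate = repair_agent(prompt)
    passed, feedback = check_with_alloy_analyzer(candidate)
    if passed:
        break
    prompt = prompting_agent(candidate, feedback)  # auto-generated follow-up prompt
```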
3. Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5x faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1x. Crucially, HALoS preserves the model quality of fully synchronous SGD, matching or exceeding accuracy on standard language modeling and downstream benchmarks, while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
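A toy sketch of the hierarchical update flow described above, not HALoS itself: regional local parameter servers aggregate asynchronous worker gradients, and a global parameter server merges regional updates with momentum. The dimensions, learning rate, and momentum rule are illustrative assumptions.

```python
# Toy hierarchical parameter-server flow: LPS aggregation, then GPS merge with momentum.
import numpy as np

dim, lr, beta = 4, 0.1, 0.9
global_params = np.zeros(dim)
gps_momentum = np.zeros(dim)

def lps_accumulate(worker_grads):
    """A regional LPS averages whatever worker gradients have arrived (asynchronously)."""
    return np.mean(worker_grads, axis=0)

region_updates = [
    lps_accumulate([np.random.randn(dim) for _ in range(3)]),  # region A
    lps_accumulate([np.random.randn(dim) for _ in range(2)]),  # region B
]

for update in region_updates:        # GPS merges regional updates as they arrive
    gps_momentum = beta * gps_momentum + update
    global_params -= lr * gps_momentum
```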
4. Chemical reaction data has existed and still largely exists in unstructured forms. But curating such information into datasets suitable for tasks such as yield and reaction outcome prediction is impractical via manual curation and not possible to automate through programmatic means alone. Large language models (LLMs) have emerged as potent tools, showcasing remarkable capabilities in processing textual information and therefore could be extremely useful in automating this process. To address the challenge of unstructured data, we manually curated a dataset of structured chemical reaction data to fine-tune and evaluate LLMs. We propose a paradigm that leverages prompt-tuning, fine-tuning techniques, and a verifier to check the extracted information. We evaluate the capabilities of various LLMs, including LLAMA-2 and GPT models with different parameter counts, on the data extraction task. Our results show that prompt tuning of GPT-4 yields the best accuracy and evaluation results. Fine-tuning LLAMA-2 models with hundreds of samples does, however, enable them to better extract and organize scientific material according to user-defined schemas. This workflow demonstrates an adaptable approach for chemical reaction data extraction while also highlighting the challenges associated with nuance in chemical information. We open-sourced our code at GitHub.
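The extract-then-verify pattern described above can be sketched as follows; the schema, the llm_extract stub, and the example sentence are hypothetical, not the authors' released code.

```python
# Extract-then-verify sketch: parse LLM output as JSON and accept it only if it covers the schema.
import json

REQUIRED_FIELDS = {"reactants", "products", "yield_percent"}   # assumed user-defined schema

def llm_extract(text: str) -> str:
    """Stand-in for a prompt-tuned LLM call that returns structured JSON."""
    return '{"reactants": ["benzaldehyde"], "products": ["benzyl alcohol"], "yield_percent": 87}'

def verify(raw: str):
    """Verifier: accept the extraction only if it parses and covers the required fields."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return record if REQUIRED_FIELDS <= record.keys() else None

record = verify(llm_extract("Benzaldehyde was reduced to benzyl alcohol in 87% yield."))
print(record)
```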