skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 10:00 PM ET on Friday, February 6 until 10:00 AM ET on Saturday, February 7 due to maintenance. We apologize for the inconvenience.


Title: CAT-LM: Training Language Models on Aligned Code and Tests
Testing is an integral but often neglected part of the software development process. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code- under-test context when producing a test file. In this work, we propose the Aligned Code And Tests Language Model (CAT- LM), a GPT-style language model with 2.7 Billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability, coverage) allows it to efficiently produce tests that achieve coverage similar to ones written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger language models trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.  more » « less
Award ID(s):
1910067
PAR ID:
10487734
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Journal Name:
IEEE/ACM International Conference on Automated Software Engineering (ASE)
Format(s):
Medium: X
Location:
Luxembourg City, Luxembourg
Sponsoring Org:
National Science Foundation
More Like this
  1. Recently, there have been significant advances and wide-scale use of generative AI in natural language generation. Models such as OpenAI’s GPT3 and Meta’s LLaMA are widely used in chatbots, to summarize documents, and to generate creative content. These advances raise concerns about abuses of these models, especially in social media settings, such as large-scale generation of disinformation, manipulation campaigns that use AI-generated content, and personalized scams. We used stylometry (the analysis of style in natural language text) to analyze the style of AI-generated text. Specifically, we applied an existing authorship verification (AV) model that can predict if two documents are written by the same author on texts generated by GPT2, GPT3, ChatGPT and LLaMA. Our AV model was trained only on human-written text and was effectively used in social media settings to analyze cases of abuse. We generated texts by providing the language models with fanfiction snippets and prompting them to complete the rest of it in the same writing style as the original snippet. We then applied the AV model across the texts generated by the language models and the human written texts to analyze the similarity of the writing styles between these texts. We found that texts generated with GPT2 had the highest similarity to the human texts. Texts generated by GPT3 and ChatGPT were very different from the human snippet, and were similar to each other. LLaMA-generated texts had some similarity to the original snippet but also has similarities with other LLaMA-generated texts and texts from other models. We then conducted a feature analysis to identify the features that drive these similarity scores. This analysis helped us answer questions like which features distinguish the language style of language models and humans, which features are different across different models, and how these linguistic features change over different language model versions. The dataset and the source code used in this analysis have been made public to allow for further analysis of new language models. 
    more » « less
  2. Natural language generators for task-oriented dialog should be able to vary the style of the output utterance while still effectively realizing the system dialog actions and their associated semantics. While the use of neural generation for training the response generation component of conversational agents promises to simplify the process of producing high quality responses in new domains, to our knowledge, there has been very little investigation of neural generators for task-oriented dialog that can vary their response style and we know of no experiments on models that can generate responses that are different in style from those seen during training, while still maintaining semantic fidelity to the input meaning representation. Here, we show that a model that is trained to achieve a single stylistic personality target can produce outputs that combine stylistic targets. We carefully evaluate the multivoice outputs for both semantic fidelity and for similarities to and differences from the linguistic features that characterize the original training style. We show that contrary to our predictions, the learned models do not always simply interpolate model parameters, but rather produce styles that are distinct and novel from the personalities they were trained on. 
    more » « less
  3. Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter our obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language. Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer. 
    more » « less
  4. We recently proposed inline tests for validating individual program statements; they allow developers to provide test inputs, expected outputs, and test oracles immediately after a target statement. But, existing code can have many target statements. So, automatic generation of inline tests is an important next step towards increasing their adoption. We propose ExLi, the first technique for automatically generating inline tests. ExLi extracts inline tests from unit tests; it first records all variable values at a target statement while executing unit tests. Then, ExLi uses those values as test inputs and test oracles in an initial set of generated inline tests. Target statements that are executed many times could have redundant initial inline tests. So, ExLi uses a novel coverage-then-mutants based reduction process to remove redundant inline tests. We implement ExLi for Java and use it to generate inline tests for 718 target statements in 31 open-source programs. ExLi reduces 17,273 initially generated inline tests to 905 inline tests. The final set of generated inline tests kills up to 25.1% more mutants on target statements than developer written and automatically generated unit tests. That is, ExLi generates inline tests that can improve the fault-detection capability of the test suites from which they are extracted. 
    more » « less
  5. We recently proposed inline tests for validating individual program statements; they allow developers to provide test inputs, expected outputs, and test oracles immediately after a target statement. But, existing code can have many target statements. So, automatic generation of inline tests is an important next step towards increasing their adoption. We propose ExLi, the first technique for automatically generating inline tests. ExLi extracts inline tests from unit tests; it first records all variable values at a target statement while executing unit tests. Then, ExLi uses those values as test inputs and test oracles in an initial set of generated inline tests. Target statements that are executed many times could have redundant initial inline tests. So, ExLi uses a novel coverage-and-mutation based reduction process to remove redundant inline tests. We implement ExLi for Java and use it to generate inline tests for 718 target statements in 31 open-source programs. ExLi reduces 17,273 initially generated inline tests to 905 inline tests. The final set of generated inline tests kills up to 25.1% more mutants than developer written and automatically generated unit tests. That is, ExLi generates inline tests that can improve the fault-detection capability of the test suites from which they are extracted. 
    more » « less