Title: Statically Contextualizing Large Language Models with Typed Holes
Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate code context, particularly when working with definitions that are neither in the training data nor near the cursor. This paper demonstrates that tighter integration with the type and binding structure of the programming language in use, as exposed by its language server, can help address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server is able to identify the type and typing context of the hole that the programmer is filling, with Hazel's total syntax and type error correction ensuring that a meaningful program sketch is available whenever the developer requests a completion. This allows the system to prompt the LLM with codebase-wide contextual information that is not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer's goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server, which provides error localization and error messages. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications with accompanying unit tests that have been written from scratch to avoid data contamination, and that can easily be ported to new languages because they do not have large external library dependencies. These applications serve as challenge problems due to their extensive reliance on application-specific data structures. Through an ablation study, we examine the impact of contextualization with type definitions, function headers, and error messages, individually and in combination. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel, a low-resource language, we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource mainstream languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.
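To make the ChatLSP idea concrete, below is a minimal TypeScript sketch of what such a capability might look like on the wire. The method names, request/response shapes, and field names are illustrative assumptions for exposition; the paper's actual protocol extension may differ.

```typescript
// Hypothetical sketch of a ChatLSP-style capability. All names and shapes
// below are assumptions for illustration, not the protocol from the paper.

/** Position of the typed hole the completion targets. */
interface HolePosition {
  uri: string;       // document containing the hole
  line: number;
  character: number;
}

/** Request ("chat/context", hypothetical): ask the language server for
 *  static context around a hole, for inclusion in an LLM prompt. */
interface ChatContextParams {
  position: HolePosition;
}

interface ChatContextResult {
  expectedType: string;        // type the hole filling must have
  typingContext: string[];     // in-scope bindings with their types
  relevantTypeDefs: string[];  // definitions reachable from the expected type
  relevantHeaders: string[];   // headers of functions relevant to the goal
}

/** Request ("chat/check", hypothetical): report localized errors in a
 *  candidate completion so the client can ask the LLM to repair it. */
interface ChatCheckParams {
  position: HolePosition;
  candidate: string;
}

interface ChatCheckResult {
  diagnostics: { range: [number, number]; message: string }[];
}
```

A completion client could call the first request once to assemble a token-efficient prompt, then loop on the second, feeding error localizations and messages back to the LLM until the candidate type-checks.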
Award ID(s):
2238744
PAR ID:
10593047
Publisher / Repository:
Proceedings of the ACM on Programming Languages
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
8
Issue:
OOPSLA2
ISSN:
2475-1421
Page Range / eLocation ID:
468 to 498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language. Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.
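The three filtering stages compose into a simple pipeline. Below is a minimal TypeScript sketch of its shape; the helper functions standing in for the LLM calls, the test-case compiler, and the test runner are hypothetical, as is the coverage threshold.

```typescript
// Sketch of the MultiPL-T pipeline shape. synthesizeTests, coverage,
// translateCode, compileTests, and passes are hypothetical stand-ins for
// the LLM calls, coverage tool, and lightweight test compiler described
// in the abstract; the 0.8 coverage threshold is an assumption.

interface SourceItem { code: string; docComment: string }  // commented high-resource code
interface TargetItem { code: string; tests: string }       // validated low-resource item

declare function synthesizeTests(item: SourceItem): Promise<string[]>;      // LLM: source-language tests
declare function coverage(item: SourceItem, tests: string[]): number;        // fraction of lines covered
declare function translateCode(item: SourceItem, lang: string): Promise<string>; // LLM: translation
declare function compileTests(tests: string[], lang: string): string;        // rule-based test compiler
declare function passes(code: string, tests: string, lang: string): Promise<boolean>;

async function multiPLT(corpus: SourceItem[], lang: string): Promise<TargetItem[]> {
  const out: TargetItem[] = [];
  for (const item of corpus) {
    // 1) Synthesize unit tests in the source language; drop faulty or low-coverage items.
    const tests = await synthesizeTests(item);
    if (tests.length === 0 || coverage(item, tests) < 0.8) continue;
    // 2) Translate the code to the target low-resource language (may be wrong).
    const translated = await translateCode(item, lang);
    // 3) Compile the tests to the target language; keep only validated translations.
    const targetTests = compileTests(tests, lang);
    if (await passes(translated, targetTests, lang)) {
      out.push({ code: translated, tests: targetTests });
    }
  }
  return out;
}
```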
  2. Code completions produced by today’s large language models (LLMs) offer no formal guarantees. We propose proof-carrying code completions (PC3). In this paradigm, a high-resourced entity (the LLM provided by the server) must provide a code completion together with a proof of a chosen safety property which can be independently checked by a low-resourced entity (the user). In order to provide safety proofs without requiring the user to write specifications in formal logic, we statically generate preconditions for all dangerous function calls (i.e., functions that may violate the safety property) which must be proved by the LLM. To demonstrate the main ideas, we provide a prototype implementation in the program verification language Dafny, and a case study focusing on file system vulnerabilities. Unlike Python code generated by GPT-4, Dafny code generated by PC3 provably avoids a common weakness related to path traversal (CWE-35), using a single generation attempt (k = 1) and a modest number of tokens (3,350). Our tool is available as an open source repository at https://github.com/DavisPL/PCCC.
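To illustrate the kind of precondition such a system generates for a dangerous call, here is a small TypeScript sketch of the path-traversal (CWE-35) property. Note the shift in mechanism: PC3 discharges the precondition statically with a Dafny proof, whereas this sketch renders the same property as a runtime check, purely for illustration.

```typescript
import * as path from "path";
import * as fs from "fs";

// Illustrative only: PC3 proves this precondition statically in Dafny.
// Here the CWE-35 property is written as a runtime guard to show what the
// generated precondition asserts about a dangerous call like readFileSync.

/** Precondition: the resolved path must stay inside baseDir. */
function withinBase(baseDir: string, userPath: string): boolean {
  const resolved = path.resolve(baseDir, userPath);
  const base = path.resolve(baseDir) + path.sep;
  return resolved.startsWith(base);
}

function safeRead(baseDir: string, userPath: string): string {
  if (!withinBase(baseDir, userPath)) {
    throw new Error(`precondition violated: ${userPath} escapes ${baseDir}`);
  }
  // The dangerous call is only reachable once the precondition holds.
  return fs.readFileSync(path.resolve(baseDir, userPath), "utf8");
}
```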
  3. In a programmer's pursuit of using or creating new programming languages, finding errors in the syntax of code can present many issues. Languages with little to no documentation and incomprehensible exception handling and reports are frustrating to work with and can create confusion when trying to locate where in the code the program has faulted. In this paper we present CodeBlock, a parser generator and syntax checker for arbitrary programming languages. CodeBlock is a block-based grammar builder for any programming language that constructs a parsing expression grammar for the language based on user-built expressions. This grammar can then be used within the CodeBlock website or in the CodeBlock Node.JS application to test the syntax of either written code or files containing code in the language, reporting comprehensible error messages if errors in syntax are found. Our eventual goal is to incorporate CodeBlock into a compiler design tutoring system, called CompiTS, in which it will play a central role in teaching students how to design new programming languages and test the effectiveness of the new language using rapid prototyping and a translational approach to implementation. This is emerging research, and in this paper we focus only on the syntax checking component of the CompiTS system.
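For readers unfamiliar with parsing expression grammars, the sketch below shows the core PEG mechanism in TypeScript: literals, ordered choice, and sequencing, with errors reported at the furthest failing offset. The combinator names are illustrative and are not CodeBlock's actual API.

```typescript
// Minimal PEG-style parser combinators with comprehensible error reporting.
// Illustrative of the mechanism behind a grammar builder; not CodeBlock's API.

type Ok = { ok: true; pos: number };
type Err = { ok: false; pos: number; expected: string };
type Result = Ok | Err;
type Parser = (input: string, pos: number) => Result;

// Match a literal string at the current position.
const lit = (s: string): Parser => (input, pos) =>
  input.startsWith(s, pos)
    ? { ok: true, pos: pos + s.length }
    : { ok: false, pos, expected: `"${s}"` };

// Ordered choice (the defining PEG operator): try alternatives in order,
// remembering the failure that got furthest for error reporting.
const choice = (...ps: Parser[]): Parser => (input, pos) => {
  let furthest: Err = { ok: false, pos, expected: "" };
  for (const p of ps) {
    const r = p(input, pos);
    if (r.ok) return r;
    if (r.pos >= furthest.pos) furthest = r;
  }
  return furthest;
};

// Sequence: each part must match in turn.
const seq = (...ps: Parser[]): Parser => (input, pos) => {
  for (const p of ps) {
    const r = p(input, pos);
    if (!r.ok) return r;
    pos = r.pos;
  }
  return { ok: true, pos };
};

// A tiny grammar fragment, and the error message for a syntax mistake.
const stmt = seq(lit("let "), lit("x"), lit(" = "), lit("1"));
const r = stmt("let x := 1", 0);
if (!r.ok) console.log(`syntax error at offset ${r.pos}: expected ${r.expected}`);
```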
  4. Source code (Context) and its parsed abstract syntax tree (AST; Structure) are two complementary representations of the same computer program. Traditionally, designers of machine learning models have relied predominantly either on Structure or Context. We propose a new model, which jointly learns on Context and Structure of source code. In contrast to previous approaches, our model uses only language-agnostic features, i.e., source code and features that can be computed directly from the AST. Besides obtaining state-of-the-art results on monolingual code summarization on all five programming languages considered in this work, we propose the first multilingual code summarization model. We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages, where the strongest gains are on low-resource languages. Remarkably, multilingual training only from Context does not lead to the same improvements, highlighting the benefits of combining Structure and Context for representation learning on code.
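As a concrete example of the two representations, the sketch below uses the TypeScript compiler API to extract a token sequence (Context) and the AST node-kind sequence (Structure) from the same source. TypeScript stands in for any language with an AST; the paper's actual feature extraction and model are not reproduced here.

```typescript
import * as ts from "typescript";

// Context: the raw token sequence of the program text.
function contextTokens(source: string): string[] {
  const scanner = ts.createScanner(ts.ScriptTarget.Latest, /*skipTrivia*/ true,
                                   ts.LanguageVariant.Standard, source);
  const tokens: string[] = [];
  while (scanner.scan() !== ts.SyntaxKind.EndOfFileToken) {
    tokens.push(scanner.getTokenText());
  }
  return tokens;
}

// Structure: a pre-order traversal of AST node kinds, computable directly
// from the parse tree without any language-specific analysis.
function structureKinds(source: string): string[] {
  const sf = ts.createSourceFile("in.ts", source, ts.ScriptTarget.Latest, true);
  const kinds: string[] = [];
  const visit = (node: ts.Node) => {
    kinds.push(ts.SyntaxKind[node.kind]);
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return kinds;
}

const src = "function add(a: number, b: number) { return a + b; }";
console.log(contextTokens(src));   // ["function", "add", "(", "a", ":", ...]
console.log(structureKinds(src));  // ["SourceFile", "FunctionDeclaration", ...]
```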
  5. This paper systematically explores how Large Language Models (LLMs) generate explanations of code examples of the type used in intro-to-programming courses. As we show, the nature of code explanations generated by LLMs varies considerably based on the wording of the prompt, the target code examples being explained, the programming language, the temperature parameter, and the version of the LLM. Nevertheless, they are consistent in two major respects for Java and Python: the readability level, which hovers around the 7th-8th grade level, and lexical density, i.e., the relative size of the meaningful words with respect to the total explanation size. Furthermore, the explanations score very high on correctness but lower on three other metrics: completeness, conciseness, and contextualization.
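Lexical density is straightforward to compute; the TypeScript sketch below shows one common formulation, the share of content words among all words. The stopword list is a small illustrative subset, and the paper may define the metric slightly differently.

```typescript
// Lexical density: fraction of content (meaningful) words among all words.
// The stopword list is a small illustrative subset, not the paper's.
const STOPWORDS = new Set([
  "the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "this", "that",
]);

function lexicalDensity(explanation: string): number {
  const words = explanation.toLowerCase().match(/[a-z']+/g) ?? [];
  if (words.length === 0) return 0;
  const content = words.filter((w) => !STOPWORDS.has(w));
  return content.length / words.length;
}

// e.g. lexicalDensity("This loop sums the array elements") === 4/6 ≈ 0.67
```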