Title: StarCoder: May the Source be With You!
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python, can be prompted to achieve 40% pass@1 on HumanEval, and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
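As a minimal sketch of the infilling capability described above: StarCoder's fill-in-the-middle mode brackets a prompt with sentinel tokens so the model generates the missing middle span. The snippet assumes access to the bigcode/starcoder checkpoint on the Hugging Face Hub; the generation settings are illustrative, not the paper's.

```python
# Fill-in-the-middle (FIM) inference with StarCoder via transformers.
# Assumes access to the bigcode/starcoder checkpoint; settings illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = (
    "<fim_prefix>def print_hello():\n    "  # code before the hole
    "<fim_suffix>\n    return\n"            # code after the hole
    "<fim_middle>"                          # ask the model for the middle
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```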
Award ID(s):
2102288
NSF-PAR ID:
10483982
Editor(s):
Durrett, G
Publisher / Repository:
OpenReview
Date Published:
Journal Name:
Transactions on Machine Learning Research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Automating hardware design could eliminate a significant amount of human error from the engineering process and lead to fewer defects. Verilog is a popular hardware description language for modeling and designing digital systems, so generating Verilog code is a critical first step. Emerging large language models (LLMs) can write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. To do so, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising testbenches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that, across our problem scenarios, fine-tuning makes the LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). We release our training/evaluation scripts and LLM checkpoints as open-source contributions.
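As an illustration of the syntax-testing step such an evaluation flow needs, the sketch below compiles generated Verilog with the open-source Icarus Verilog compiler and treats a clean compile as syntactically valid. The helper is our own illustration, not the paper's exact tooling, and assumes iverilog is installed.

```python
# Syntax-check LLM-generated Verilog by compiling it with iverilog.
import subprocess
import tempfile

def verilog_syntax_ok(source: str) -> bool:
    """Return True if `source` compiles without errors under iverilog."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".v") as f:
        f.write(source)
        f.flush()
        # -t null: parse and elaborate fully, but emit no output file.
        result = subprocess.run(
            ["iverilog", "-t", "null", f.name],
            capture_output=True, text=True,
        )
    return result.returncode == 0

example = ("module counter(input clk, output reg [3:0] q);\n"
           "  always @(posedge clk) q <= q + 1;\n"
           "endmodule\n")
print(verilog_syntax_ok(example))
```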
  2. With recent advancements, large language models (LLMs) such as ChatGPT and Bard have shown the potential to disrupt many industries, from customer service to healthcare. Traditionally, humans interact with geospatial data through software (e.g., ArcGIS 10.3) and programming languages (e.g., Python). As a pioneering study, we explore the possibility of using an LLM as an interface to interact with geospatial datasets through natural language. To achieve this, we propose a framework to (1) train an LLM to understand the datasets, (2) generate geospatial SQL queries based on a natural language question, (3) send the SQL query to the backend database, and (4) parse the database response back into human language. As a proof of concept, a case study was conducted on real-world data to evaluate its performance on various queries. The results show that LLMs can be accurate in generating SQL code for most cases, including spatial joins, although there is still room for improvement. As all geospatial data can be stored in a spatial database, we hope that this framework can serve as a proxy to improve the efficiency of spatial data analyses and unlock the possibility of automated geospatial analytics.
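A minimal sketch of the four-step loop described above, assuming an OpenAI-style chat client and a PostGIS backend; the table schema, model name, and connection string are hypothetical stand-ins, not the paper's setup.

```python
# NL question -> LLM-generated PostGIS SQL -> database -> rows.
import psycopg2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHEMA_HINT = """Tables:
  counties(name text, geom geometry(MultiPolygon, 4326))
  hospitals(name text, geom geometry(Point, 4326))"""

def answer(question: str) -> list:
    # Steps 1-2: prime the LLM with the schema, ask for one SQL query.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Translate the question into a single PostGIS SQL "
                        f"query. Return only SQL.\n{SCHEMA_HINT}"},
            {"role": "user", "content": question},
        ],
    )
    sql = reply.choices[0].message.content
    # Steps 3-4: execute the query; the rows would then be verbalized.
    with psycopg2.connect("dbname=gis") as conn, conn.cursor() as cur:
        cur.execute(sql)
        return cur.fetchall()

print(answer("How many hospitals are in each county?"))
```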
  3. Michael Pradel (Ed.)
    Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python benchmarks, HumanEval and MBPP, to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
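Benchmarks in the HumanEval/MBPP family report pass@k, typically computed with the unbiased estimator of Chen et al. (2021): given n samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A short reference implementation:

```python
# Unbiased pass@k estimator (Chen et al., 2021) used by unit-test-driven
# code generation benchmarks such as HumanEval.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes) from n samples, c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 23 pass the tests.
print(f"pass@1  = {pass_at_k(200, 23, 1):.3f}")   # 0.115
print(f"pass@10 = {pass_at_k(200, 23, 10):.3f}")
```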
  4. The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license. 
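As a sketch of the near-duplicate filtering idea, the snippet below clusters files by MinHash signature using the datasketch library and keeps one representative per cluster; the whitespace tokenization and the 0.7 Jaccard threshold are illustrative choices, not the report's settings.

```python
# MinHash-based near-duplicate filtering of a code corpus (illustrative).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128

def minhash_of(text: str) -> MinHash:
    """Build a MinHash signature from the set of whitespace tokens."""
    m = MinHash(num_perm=NUM_PERM)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def filter_near_duplicates(files: dict) -> list:
    """Keep the first-seen representative of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=0.7, num_perm=NUM_PERM)
    kept = []
    for path, text in files.items():
        sig = minhash_of(text)
        if not lsh.query(sig):  # no sufficiently similar file kept so far
            lsh.insert(path, sig)
            kept.append(path)
    return kept
```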
  5. Abstract. Hutton et al. (2016) argued that computational hydrology can only be a proper science if the hydrological community makes sure that hydrological model studies are executed and presented in a reproducible manner. Hut, Drost, and van de Giesen replied that to achieve this hydrologists should not “re-invent the water wheel” but rather use existing technology from other fields (such as containers and ESMValTool) and open interfaces (such as the Basic Model Interface, BMI) to do their computational science (Hut et al., 2017). With this paper and the associated release of the eWaterCycle platform and software package (available on Zenodo: https://doi.org/10.5281/zenodo.5119389, Verhoeven et al., 2022), we are putting our money where our mouth is and providing the hydrological community with a “FAIR by design” (FAIR meaning findable, accessible, interoperable, and reproducible) platform to do science. The eWaterCycle platform separates the experiments done on the model from the model code. In eWaterCycle, hydrological models are accessed through a common interface (BMI) in Python and run inside software containers. In this way all models are accessed in a similar manner, facilitating easy switching of models, model comparison, and model coupling. Currently the following models and model suites are available through eWaterCycle: PCR-GLOBWB 2.0, wflow, Hype, LISFLOOD, MARRMoT, and WALRUS. While these models are written in different programming languages, they can all be run and interacted with from the Jupyter notebook environment within eWaterCycle. Furthermore, the pre-processing of input data for these models has been streamlined by making use of ESMValTool. Forcing for the models available in eWaterCycle from well-known datasets such as ERA5 can be generated with a single line of code. To illustrate the type of research that eWaterCycle facilitates, this paper includes five case studies: from a simple “hello world” where only a hydrograph is generated to a complex coupling of models in different languages. In this paper we set out the design choices made in building eWaterCycle and provide all the technical details needed to understand and work with the platform. For system administrators who want to install eWaterCycle on their infrastructure we offer a separate installation guide. For computational hydrologists who want to work with eWaterCycle we also provide a video explaining the platform from a user point of view (https://youtu.be/eE75dtIJ1lk, last access: 28 June 2022). With the eWaterCycle platform we are providing the hydrological community with a platform to conduct their research that is fully compatible with the principles of both Open Science and FAIR science.
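To make the BMI idea concrete, here is a minimal sketch of the model-agnostic run loop that a common interface enables: any model exposing BMI, whatever language it is implemented in, can be driven through the same calls. The variable name "discharge" is a hypothetical stand-in, and real eWaterCycle models wrap this interface behind a higher-level API.

```python
# Drive any BMI-conformant model to its end time, collecting one variable.
import numpy as np
from bmipy import Bmi  # the Basic Model Interface specification

def run_hydrograph(model: Bmi, config_file: str) -> list:
    """Run a BMI model step by step, recording 'discharge' at each step."""
    model.initialize(config_file)
    discharge = []
    # Buffer sized to the grid that holds the 'discharge' variable.
    buf = np.empty(model.get_grid_size(model.get_var_grid("discharge")))
    while model.get_current_time() < model.get_end_time():
        model.update()                      # advance one model time step
        model.get_value("discharge", buf)   # copy values into the buffer
        discharge.append(float(buf[0]))
    model.finalize()
    return discharge
```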