NSF PAR Search | NSF Public Access Repository

Vector embeddings by sequence similarity and context for improved compression, similarity search, clustering, organization, and manipulation of cDNA libraries

https://doi.org/10.1016/j.compbiolchem.2024.108251

Um, Daniel H; Knowles, David A; Kaiser, Gail E (February 2025, Computational Biology and Chemistry)

This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). By assigning a unique vector embedding to each short sequence, it is possible to more efficiently cluster and improve upon compression performance for the string representations of cDNA libraries. Furthermore, by studying alternative coordinate vector embeddings trained on the context of codon triplets, we can demonstrate clustering based on amino acid properties. Employing this sequence embedding method to encode barcodes and cDNA sequences, we can improve the time complexity of similarity searches. By pairing vector embeddings with an algorithm that determines the vector proximity in Euclidean space, this approach enables quicker and more flexible sequence searches.
Free, publicly-accessible full text available February 1, 2026
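
As an illustration of the search idea in the abstract above (Um, Knowles, and Kaiser), the sketch below embeds each sequence as a vector and returns the library entry closest to a query in Euclidean space. It is not the paper's method: the normalized k-mer frequency vector, the choice K = 2, and the names embed and nearest are assumptions standing in for the trained embeddings the abstract describes.

```python
import math
from itertools import product

K = 2  # k-mer length; an arbitrary choice for this sketch
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq):
    """Map a sequence to a normalized k-mer frequency vector.

    This is a stand-in for a trained embedding, not the paper's model.
    """
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:  # skip k-mers containing non-ACGT symbols
            counts[kmer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[k] / total for k in KMERS]


def nearest(query, library):
    """Return the library sequence whose embedding is closest to the query's."""
    q = embed(query)
    return min(library, key=lambda s: math.dist(q, embed(s)))


library = ["ACGTACGTAC", "TTTTGGGGCC", "ACGTACGAAC"]
best = nearest("ACGTACGTAC", library)
```

A linear scan is used here for brevity. For a large library, precomputing the embeddings and placing them in a spatial index would be the natural next step.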
Reinforest: Reinforcing Semantic Code Similarity for Cross-Lingual Code Search Models

https://doi.org/10.1109/SCAM63643.2024.00026

Saieva, Anthony; Chakraborty, Saikat; Kaiser, Gail (October 2024, IEEE)

This paper introduces a novel code-to-code search technique that enhances the performance of Large Language Models (LLMs) by including both static and dynamic features and by using both similar and dissimilar examples during training. We present the first code search method that encodes dynamic runtime information during training without the need to execute either the corpus under search or the search query at inference time, and the first code search technique that trains on both positive and negative reference samples. To validate the efficacy of our approach, we perform a set of studies demonstrating the capability of enhanced LLMs to perform cross-language code-to-code search. Our evaluation demonstrates that the effectiveness of our approach is consistent across various model architectures and programming languages. We outperform the state-of-the-art cross-language search tool by up to 44.7%. Moreover, our ablation studies reveal that even a single positive and negative reference sample in the training process results in substantial performance improvements, demonstrating that both similar and dissimilar references are important parts of code search. Importantly, we show that enhanced, well-crafted, fine-tuned models consistently outperform enhanced larger modern LLMs without fine-tuning, even when enhancing the largest available LLMs, highlighting the importance of open-source models. To ensure the reproducibility and extensibility of our research, we present an open-source implementation of our tool and training procedures called REINFOREST.
Full Text Available
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Ding, Yangruibo; Peng, Jinjun; Min, Marcus; Kaiser, Gail; Yang, Junfeng; Ray, Baishakhi (December 2024, Advances in Neural Information Processing Systems, NeurIPS 2024)

Full Text Available
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Ding, Yangruibo; Peng, Jinjun; Min, Marcus J; Kaiser, Gail; Yang, Junfeng; Ray, Baishakhi (September 2024, OpenReview.net)

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy, monologue reasoning, to train Code LLMs to reason about comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean Python corpus of fully executable code samples with functional descriptions and test cases. We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors using natural language, mimicking human verbal debugging, i.e., rubber-duck debugging. This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%). We also study the effectiveness of SemCoder's monologue-style execution reasoning compared to concrete scratchpad reasoning, showing that our approach integrates semantics from multiple dimensions more smoothly. Finally, we demonstrate the potential of applying learned semantics to improve Code LLMs' debugging and self-refining capabilities. Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.
Full Text Available
CYCLE: Learning to Self-Refine the Code Generation

https://doi.org/10.1145/3649825

Ding, Yangruibo; Min, Marcus J; Kaiser, Gail; Ray, Baishakhi (April 2024, Proceedings of the ACM on Programming Languages)

Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by the existing evaluations of code LMs, which focus only on the accuracy of the one-time prediction. When code LMs fail to implement the correct program, developers find it hard to debug and fix the faulty prediction, since they did not write it themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the CYCLE framework, which learns to self-refine a faulty generation according to the available feedback, such as the execution results reported by test suites. We evaluate CYCLE on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that CYCLE successfully maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with varied numbers of parameters (350M, 1B, 2B, and 3B), and the experiments show that CYCLE consistently boosts the code generation performance, by up to 63.5
Full Text Available
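
The feedback loop that the CYCLE abstract describes (generate, run the tests, feed the execution result back, regenerate) can be sketched as follows. This shows only the loop at inference time, not CYCLE's training procedure. The function stub_model is a hypothetical stand-in for a code LM, and run_tests, self_refine, and the entry parameter are names invented for this sketch.

```python
def run_tests(source, tests, entry="add"):
    """Execute candidate source; return a failure message, or None if all tests pass."""
    namespace = {}
    try:
        exec(source, namespace)
        for args, expected in tests:
            got = namespace[entry](*args)
            if got != expected:
                return f"{entry}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return f"{type(exc).__name__}: {exc}"
    return None


def stub_model(prompt, feedback=None):
    """Hypothetical code LM: a faulty first attempt, then a fix once feedback arrives."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def self_refine(prompt, tests, model, max_rounds=3):
    """Regenerate with test feedback until the tests pass or the budget runs out."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = model(prompt, feedback)
        feedback = run_tests(candidate, tests)
        if feedback is None:
            return candidate, round_no
    return None, max_rounds


tests = [((1, 2), 3), ((0, 0), 0)]
code, rounds = self_refine("Write add(a, b).", tests, stub_model)
```

With the stub above, the first candidate fails the first test and the second candidate passes, so the loop stops after two rounds.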
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Min, Marcus J; Ding, Yangruibo; Buratti, Luca; Pujar, Saurabh; Kaiser, Gail; Jana, Suman; Ray, Baishakhi (April 2024, OpenReview)

Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed an aspect distinct from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
Full Text Available
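
The self-consistency notion in the IdentityChain abstract (a model should agree with itself when it describes its own code and then re-implements its own description) can be illustrated with a simplified check. This is a sketch under stated assumptions, not the IdentityChain framework: the lookup-table "model", the two summarizers, and the names behaviour and is_self_consistent are all hypothetical, and behavioural agreement on a few inputs stands in for semantic equivalence.

```python
def behaviour(source, entry, inputs):
    """Run the function named entry from source on each input and collect outputs."""
    namespace = {}
    exec(source, namespace)
    return [namespace[entry](x) for x in inputs]


def is_self_consistent(spec, nl_to_code, code_to_nl, entry, inputs, steps=2):
    """Alternate spec -> code -> spec -> code and compare behaviour at each step."""
    code = nl_to_code(spec)
    reference = behaviour(code, entry, inputs)
    for _ in range(steps):
        spec = code_to_nl(code)   # the model describes its own code
        code = nl_to_code(spec)   # the model implements its own description
        if behaviour(code, entry, inputs) != reference:
            return False
    return True


# Hypothetical stand-ins for a Code LLM's two directions.
SPEC_TO_CODE = {
    "return the absolute value of x": "def f(x):\n    return abs(x)\n",
    "return x": "def f(x):\n    return x\n",
}


def nl_to_code(spec):
    return SPEC_TO_CODE[spec]


def faithful_summary(code):
    return "return the absolute value of x" if "abs" in code else "return x"


def lossy_summary(code):
    return "return x"  # drops the detail that matters


spec = "return the absolute value of x"
inputs = [-2, 0, 3]
consistent = is_self_consistent(spec, nl_to_code, faithful_summary, "f", inputs)
inconsistent = is_self_consistent(spec, nl_to_code, lossy_summary, "f", inputs)
```

The lossy summarizer produces a specification whose implementation disagrees with the original code on negative inputs, so the chain is flagged as inconsistent even though each individual step looks plausible.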
TRACED: Execution-aware Pre-training for Source Code

https://doi.org/10.1145/3597503.3608140

Ding, Yangruibo; Steenhoek, Benjamin; Pei, Kexin; Kaiser, Gail; Le, Wei; Ray, Baishakhi (February 2024, ACM)

Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax trees, dependency graphs, etc.). However, program semantics are not fully exposed until the program is actually executed. Without an understanding of program execution, statically pre-trained models fail to comprehensively capture dynamic code properties, such as branch coverage and runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during pre-training, enabling the model to statically estimate the dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED achieves relative improvements over statically pre-trained code models of 12.4% for complete execution path prediction and 25.2% for runtime variable value prediction. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.
Full Text Available
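
To make the dynamic properties named in the TRACED abstract concrete (which lines run, and what values variables hold at runtime), the sketch below records a simple execution trace for a Python function using the standard sys.settrace hook. It illustrates what such a trace can contain. It is not TRACED's trace format or tooling, and trace_execution and clamp are names invented for this example.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args); record (line offset within func, local variables) per executed line."""
    events = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            events.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, events


def clamp(x, limit):
    if x > limit:
        x = limit
    return x


result, events = trace_execution(clamp, 5, 3)
lines_hit = [offset for offset, _ in events]
```

Each event captures the local variables just before the line executes, so comparing the traces of clamp(5, 3) and clamp(1, 3) shows both the branch taken and the changing value of x.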
Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles

https://doi.org/10.1109/TSE.2022.3195640

Zhong, Ziyuan; Kaiser, Gail; Ray, Baishakhi (April 2023, IEEE Transactions on Software Engineering)

Full Text Available
CONCORD: Clone-Aware Contrastive Learning for Source Code

https://doi.org/10.1145/3597926.3598035

Ding, Yangruibo; Chakraborty, Saikat; Buratti, Luca; Pujar, Saurabh; Morari, Alessandro; Kaiser, Gail; Ray, Baishakhi (July 2023, 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA))

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., the Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. On the other hand, a clone that deviates by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
Full Text Available
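
The objective described in the CONCORD abstract, pulling benign clones together in representation space while pushing deviants away, can be illustrated with a generic contrastive loss. The abstract does not specify the loss, so the InfoNCE-style formula, the temperature value, and the toy 2-D vectors below are assumptions chosen for this sketch rather than CONCORD's actual objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def contrastive_loss(anchor, clone, deviants, temperature=0.1):
    """InfoNCE-style loss: low when the clone is near the anchor and deviants are far."""
    pos = math.exp(cosine(anchor, clone) / temperature)
    neg = sum(math.exp(cosine(anchor, d) / temperature) for d in deviants)
    return -math.log(pos / (pos + neg))


# Toy embeddings (hypothetical): a program, a benign clone, and a buggy deviant.
anchor = [1.0, 0.0]
clone = [0.9, 0.1]
deviant = [0.0, 1.0]

good = contrastive_loss(anchor, clone, [deviant])  # clone close, deviant far
bad = contrastive_loss(anchor, deviant, [clone])   # roles swapped
```

Minimizing a loss of this shape over an encoder's parameters rewards representations in which the desired layout holds: the first configuration yields a much smaller loss than the swapped one.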
Learning Computational Thinking Efficiently with Block-based Parsons Puzzles

Bender, Jeff; Dziena, Alex; Kaiser, Gail (November 2022, 30th International Conference on Computers in Education (ICCE))

To investigate learning system elements and progressions that affect computational thinking (CT) learning in block-based environments, we developed a Parsons Programming Puzzle (PPP) module within Scratch with scaffolding customized via a novel Blockly grammar. By varying the presentation and types of feedback encountered between and within subjects in a study of 579 adults, we identified features and scaffolding strategies that yield manageable cognitive load (CL), improved CT learning efficiency, and increased motivation for a general populace. Findings indicate that: 1) PPPs with feedback induce the lowest CL; 2) an isolated palette, correctness feedback, and fading correctness feedback increase learning efficiency; 3) fading scaffolding can increase CT motivation. We analyze 12 conditions to provide insight to those developing block-based PPP systems, with the aim of advancing equitable CT education for all.
Full Text Available