This content will become publicly available on September 15, 2026

Title: Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis
Large Language Models (LLMs) are increasingly used to automate software development, yet most prior evaluations focus on functional correctness or high-level languages such as Python. As one of the first systematic explorations of LLM-assisted software performance engineering, we present a comprehensive study of LLMs' ability to generate efficient C implementations of graph-analysis routines: code that must satisfy stringent runtime and memory constraints. This emerging field of LLM-assisted algorithm engineering holds significant promise, as these models may be capable of devising novel approaches that improve existing algorithms and their implementations. Eight state-of-the-art models (OpenAI ChatGPT o3 and o4-mini-high, Anthropic Claude 4 Sonnet and Sonnet Extended, Google Gemini 2.5 Flash and Pro, xAI Grok 3-Think, and DeepSeek DeepThink R1) are benchmarked using two distinct approaches. The first evaluates the ability of LLMs to generate algorithms that outperform existing benchmarks; the second assesses their capability to generate graph algorithms for integration into performance-critical systems. The results show that Claude Sonnet 4 Extended achieves the best performance in ready-to-use code generation and efficiency, outperforming human-written baselines in triangle counting. Although our findings demonstrate that contemporary LLMs excel at optimizing and integrating established algorithms, the potential for these models to eventually invent transformative algorithmic techniques remains a compelling frontier for future research. We provide prompts, generated code, and measurement scripts to promote reproducible research in this rapidly evolving domain. All of the source code is available on GitHub at https://github.com/Bader-Research/LLM-triangle-counting/.
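For context on the benchmarked task: a standard sequential triangle-counting kernel intersects the sorted adjacency lists of each edge's endpoints. The C sketch below is a minimal illustration of that textbook approach, not the paper's generated or baseline code; the compressed sparse row (CSR) layout (row_ptr, col_idx), sorted neighbor lists, and symmetric edge storage are assumptions made here for illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* Minimal triangle-counting sketch over a CSR graph (hypothetical
     * layout, not the paper's code). Assumes an undirected graph stored
     * symmetrically with sorted neighbor lists. Each triangle {u, v, w}
     * with u < v < w is counted exactly once via the edge (u, v). */
    static size_t common_above(const int32_t *a, size_t alen,
                               const int32_t *b, size_t blen,
                               int32_t floor_excl)
    {
        size_t i = 0, j = 0, c = 0;
        while (i < alen && j < blen) {
            if (a[i] < b[j])       i++;
            else if (a[i] > b[j])  j++;
            else {                            /* common neighbor w */
                if (a[i] > floor_excl) c++;   /* keep only w > v */
                i++; j++;
            }
        }
        return c;
    }

    size_t triangle_count(const int64_t *row_ptr, const int32_t *col_idx,
                          int32_t n)
    {
        size_t total = 0;
        for (int32_t u = 0; u < n; u++) {
            for (int64_t e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int32_t v = col_idx[e];
                if (v <= u) continue;   /* visit each edge once, as u < v */
                total += common_above(
                    col_idx + row_ptr[u], (size_t)(row_ptr[u + 1] - row_ptr[u]),
                    col_idx + row_ptr[v], (size_t)(row_ptr[v + 1] - row_ptr[v]),
                    v);
            }
        }
        return total;
    }

Production kernels usually layer degree-based vertex ordering, direction optimization, and parallelism on top of this skeleton; optimizations of that kind are the usual levers behind the runtime differences a study like this one measures.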
Award ID(s):
2109988 2402560 2453324
PAR ID:
10655199
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
1 to 7
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), demonstrating significant capabilities in processing and understanding text data. However, recent studies have identified limitations in LLMs' ability to manipulate, program, and reason about structured data, especially graphs. We introduce GraphEval36K, the first comprehensive graph dataset, comprising 40 graph coding problems and 36,900 test cases to evaluate the ability of LLMs on graph problem solving. Our dataset is categorized into eight primary and four sub-categories to ensure a thorough evaluation across different types of graphs. We benchmark ten LLMs, finding that private models outperform open-source ones, though the gap is narrowing. We also analyze the performance of LLMs across directed vs. undirected graphs, different kinds of graph concepts, and network models. Furthermore, to improve the usability of our evaluation framework, we propose Structured Symbolic Decomposition (SSD), an instruction-based method designed to enhance LLM performance on complex graph tasks. Results show that SSD improves the average passing rate of GPT-4, GPT-4o, Gemini-Pro and Claude-3-Sonnet by 8.38%, 6.78%, 29.28% and 25.28%, respectively.
  2. Introduction: Because developing integrated computer science (CS) curriculum is a resource-intensive process, there is interest in leveraging the capabilities of AI tools, including large language models (LLMs), to streamline this task. However, given the novelty of LLMs, little is known about their ability to generate appropriate curriculum content. Research Question: How do current LLMs perform on the task of creating appropriate learning activities for integrated computer science education? Methods: We tested two LLMs (Claude 3.5 Sonnet and ChatGPT-4o) by providing them with a subset of national learning standards for both CS and language arts and asking them to generate a high-level description of learning activities that met standards for both disciplines. Four humans rated the LLM output – using an aggregate rating approach – in terms of (1) whether it met the CS learning standard, (2) whether it met the language arts learning standard, (3) whether it was equitable, and (4) its overall quality. Results: For Claude AI, 52% of the activities met language arts standards, 64% met CS standards, and the average quality rating was middling. For ChatGPT, 75% of the activities met language arts standards, 63% met CS standards, and the average quality rating was low. Virtually all activities from both LLMs were rated as neither actively promoting nor inhibiting equitable instruction. Discussion: Our results suggest that LLMs are not (yet) able to create appropriate learning activities from learning standards. The activities were generally not usable by classroom teachers without further elaboration and/or modification. There were also grammatical errors in the output, something not common with LLM-produced text. Further, standards in one or both disciplines were often not addressed, and the quality of the activities was often low. We conclude with recommendations for the use of LLMs in curriculum development in light of these findings.
  3. As large language models (LLMs) expand the power of natural language processing to handle long inputs, rigorous and systematic analyses are necessary to understand their abilities and behavior. A salient application is summarization, due to its ubiquity and controversy (e.g., researchers have declared the death of summarization). In this paper, we use financial report summarization as a case study because financial reports are not only long but also use numbers and tables extensively. We propose a computational framework for characterizing multimodal long-form summarization and investigate the behavior of Claude 2.0/2.1, GPT-4/3.5, and Cohere. We find that GPT-3.5 and Cohere fail to perform this summarization task meaningfully. For Claude 2 and GPT-4, we analyze the extractiveness of the summary and identify a position bias in LLMs. This position bias disappears after shuffling the input for Claude, which suggests that Claude recognizes important information regardless of where it appears in the input. We also conduct a comprehensive investigation into the use of numeric data in LLM-generated summaries and offer a taxonomy of numeric hallucination. We employ prompt engineering to improve GPT-4's use of numbers, with limited success. Overall, our analyses highlight the strong capability of Claude 2 in handling long multimodal inputs compared to GPT-4. The generated summaries and evaluation code are available at https://github.com/ChicagoHAI/characterizing-multimodal-long-form-summarization.
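One quantity this line of work examines is how extractive a generated summary is. As a rough, self-contained illustration (not the paper's framework), the C sketch below scores a summary by the fraction of its whitespace-delimited, lower-cased tokens that also occur in the source; the tokenizer, the buffer limits, and the token-overlap definition of extractiveness are all simplifications made here.

    #include <ctype.h>
    #include <stddef.h>
    #include <string.h>

    /* Naive extractiveness sketch (a simplification, not the paper's
     * framework): the fraction of summary tokens that also occur in the
     * source text. Case-insensitive, whitespace tokenization, O(n*m). */
    #define MAX_TOKENS 4096
    #define MAX_LEN    64

    static size_t tokenize(const char *text, char toks[][MAX_LEN], size_t max)
    {
        size_t n = 0, len = 0;
        for (const char *p = text; ; p++) {
            if (*p != '\0' && !isspace((unsigned char)*p)) {
                if (n < max && len < MAX_LEN - 1)
                    toks[n][len++] = (char)tolower((unsigned char)*p);
            } else {
                if (len > 0) { toks[n][len] = '\0'; n++; len = 0; }
                if (*p == '\0') break;
            }
        }
        return n;
    }

    double extractiveness(const char *source, const char *summary)
    {
        /* static buffers keep the sketch short; this is not reentrant */
        static char src[MAX_TOKENS][MAX_LEN], sum[MAX_TOKENS][MAX_LEN];
        size_t nsrc = tokenize(source, src, MAX_TOKENS);
        size_t nsum = tokenize(summary, sum, MAX_TOKENS);
        if (nsum == 0) return 0.0;

        size_t hits = 0;
        for (size_t i = 0; i < nsum; i++)
            for (size_t j = 0; j < nsrc; j++)
                if (strcmp(sum[i], src[j]) == 0) { hits++; break; }
        return (double)hits / (double)nsum;
    }

A score near 1.0 indicates a highly extractive summary; a position bias of the kind the paper reports could then be probed by computing the same overlap against individual sections of the input.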
  4. Identifying, localizing, and resolving bugs in software engineering is challenging and costly. Approaches to resolving software bugs range from Large Language Model (LLM) code analysis and repair to automated code repair technology that aims to alleviate the technical burden of difficult-to-solve bugs. We propose RAGFix, which enhances LLMs' capabilities for bug localization and code repair using Retrieval Augmented Generation (RAG) based on dynamically collected Stack Overflow posts. These posts are searchable via a Question and Answer Knowledge Graph (KGQA). We evaluate our method on the HumanEvalFix benchmark for Python using relevant closed- and open-source models. Our approach facilitates error resolution in Python coding problems by creating a searchable, embedded knowledge graph representation of bug and solution information from Stack Overflow, interlinking bugs and solutions through semi-supervised graph construction methods. We use cosine similarity on embeddings based on LLM-synthesized summaries and algorithmic features describing the coding problem and potential solution to find relevant results that improve LLM in-context performance. Our results indicate that our system enhances small open-source models' ability to effectively repair code, particularly where these models have less parametric knowledge about relevant coding problems and can leverage nonparametric knowledge to provide accurate, actionable fixes.
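The retrieval step described here rests on a standard operation: comparing the embedding of a coding problem against stored embeddings by cosine similarity. A small self-contained version in C might look like the sketch below; the float layout and the zero-vector convention are choices made for this illustration, not details from RAGFix.

    #include <math.h>
    #include <stddef.h>

    /* Cosine similarity between two embedding vectors of dimension dim.
     * Accumulates in double for numerical stability; returns 0 for a
     * zero vector (an illustrative convention, not RAGFix's). */
    double cosine_similarity(const float *a, const float *b, size_t dim)
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (size_t i = 0; i < dim; i++) {
            dot += (double)a[i] * (double)b[i];
            na  += (double)a[i] * (double)a[i];
            nb  += (double)b[i] * (double)b[i];
        }
        if (na == 0.0 || nb == 0.0) return 0.0;
        return dot / (sqrt(na) * sqrt(nb));
    }

In a pipeline like the one described, retrieval then amounts to ranking candidate Stack Overflow entries by this score and placing the top hits in the model's context.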