

This content will become publicly available on November 16, 2026

Title: Generating Frequently Asked Questions from Technical Support Tickets using Large Language Models
High-performance computing is a driving force behind scientific innovation and discovery. However, as the number of users and the complexity of high-performance computing systems grow, so does the volume and variability of technical issues handled by support teams. The evolving nature of these issues presents a need for automated tools that can extract clear, accurate, and relevant frequently asked questions directly from support tickets. This need was addressed by developing a novel pipeline that incorporates semantic clustering, representation learning, and large language models. While prior research laid strong foundations across classification, clustering, and large language model-based question answering, our work augments these efforts by integrating semantic clustering, domain-specific summarization, and multi-stage generation into a scalable pipeline for autonomous technical support. To prioritize high-impact issues, the pipeline began by filtering tickets based on anomaly frequency and recency. It then leveraged an instruction-tuned large language model to clean and summarize each ticket into a structured issue-resolution pair. Next, unsupervised semantic clustering was performed to identify subclusters of semantically similar tickets within broader topic clusters. A large language model-based generation module was then applied to create frequently asked questions representing the most dominant issues. A structured evaluation by subject matter experts indicated that our approach transformed technical support tickets into understandable, factually sound, and pertinent frequently asked questions. The ability to extract fine-grained insights from raw ticket data enhances the scalability, efficiency, and responsiveness of technical support workflows in high-performance computing environments, ultimately enabling faster troubleshooting and more accessible pathways to scientific discovery.
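The staged pipeline described above (filter, summarize, cluster, generate) can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: bag-of-words count vectors stand in for learned sentence embeddings, a greedy single-pass similarity grouping stands in for the unsupervised semantic clustering stage, and a string template stands in for the instruction-tuned large language model that would generate the FAQ text.

```python
# Toy sketch of a ticket-to-FAQ pipeline (hypothetical; a real system
# would use learned sentence embeddings and an LLM for generation).
from collections import Counter
import math

def embed(text):
    """Stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(tickets, threshold=0.3):
    """Greedy single-pass clustering: attach each ticket to the first
    cluster whose representative (first member) is similar enough."""
    clusters = []
    for t in tickets:
        v = embed(t)
        for c in clusters:
            if cosine(v, embed(c[0])) >= threshold:
                c.append(t)
                break
        else:
            clusters.append([t])
    return clusters

def faq_from_dominant_cluster(tickets):
    """Pick the largest cluster and draft an FAQ entry from it.
    In the described pipeline, an LLM would write the actual Q&A text."""
    dominant = max(cluster(tickets), key=len)
    return {
        "question": f"Common issue ({len(dominant)} tickets): {dominant[0]}",
        "support": dominant,
    }

tickets = [
    "job fails with out of memory error on gpu node",
    "out of memory error when running job on gpu node",
    "cannot login with ssh key",
]
faq = faq_from_dominant_cluster(tickets)
```

On this toy input, the two out-of-memory tickets group together and surface as the dominant issue, while the unrelated SSH ticket forms its own cluster; the threshold and the first-member representative are arbitrary simplifications.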
Award ID(s):
2005632
PAR ID:
10639608
Publisher / Repository:
HUST SC Workshops ’25, November 16–21, 2025, St Louis, MO, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Transfer learning leverages feature representations of deep neural networks (DNNs) pretrained on source tasks with rich data to empower effective finetuning on downstream tasks. However, the pre-trained models are often prohibitively large for delivering generalizable representations, which limits their deployment on edge devices with constrained resources. To close this gap, we propose a new transfer learning pipeline, which leverages our finding that robust tickets can transfer better, i.e., subnetworks drawn with properly induced adversarial robustness can win better transferability over vanilla lottery ticket subnetworks. Extensive experiments and ablation studies validate that our proposed transfer learning pipeline can achieve enhanced accuracy-sparsity trade-offs across both diverse downstream tasks and sparsity patterns, further enriching the lottery ticket hypothesis. 
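As a loose illustration of drawing a sparse subnetwork ("ticket") for transfer, magnitude pruning over a weight matrix yields the binary mask that is then finetuned. This sketch is hypothetical and omits the adversarially induced robustness that the abstract credits for improved transferability; it only shows the mask-drawing step common to lottery-ticket methods.

```python
# Hypothetical sketch: draw a lottery-ticket-style binary mask by
# keeping the largest-magnitude weights of a dense matrix.
def prune_mask(weights, keep_fraction=0.5):
    """Return a 0/1 mask keeping roughly the top keep_fraction of
    weights by absolute magnitude (ties at the threshold are kept)."""
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    k = max(1, int(len(flat) * keep_fraction))
    threshold = flat[k - 1]
    return [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]

weights = [[0.9, -0.1], [0.05, -0.8]]
mask = prune_mask(weights, keep_fraction=0.5)
```

In a full pipeline the surviving weights would be rewound or finetuned on the downstream task while masked entries stay zero.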
  2. With each successive election since at least 1994, congressional elections in the United States have transitioned toward nationalized two-party government. Fewer voters split their tickets for different parties between President and Congress. Regional blocs and incumbency voting, key features of U.S. elections in the latter 20th century, appear to have given way to strong party discipline among candidates and nationalized partisanship among voters. Observers of modern American politics are therefore tempted to write off the importance of the swing voter, defined here as a voter who is indifferent between the two parties and thus likely to split their ticket or switch their party support. By assembling data from historical elections (1950-2020), surveys (2008-2018), and cast vote record data (2010-2018), and by developing statistical methods to analyze such data, I argue that although they comprise a smaller portion of the electorate, each swing voter is disproportionately decisive in modern American politics, a phenomenon I call the swing voter paradox. Historical comparisons across Congressional, state executive, and state legislative elections confirm the decline in aggregate measures of ticket splitting suggested in past work. But the same indicator has not declined nearly as much in county legislative or county sheriff elections (Chapter 1). Ticket splitters and party switchers tend to be voters with low news interest and ideologically moderate views. Consistent with a spatial voting model with valence, voters also become ticket splitters when incumbents run (Chapter 2). I then provide one of the first direct measures of ticket splitting in state and local offices using cast vote records. I find that ticket splitting is more prevalent in state and local elections (Chapter 3). This is surprising given the conventional wisdom that party labels serve as heuristics and down-ballot elections are low-information environments.
A major barrier for existing studies of the swing voter lies in measurement from incomplete electoral data. Traditional methods struggle to extract information about subgroups from large surveys or cast vote records because of small subgroup samples, multi-dimensional data, and systematic missingness. I therefore develop a procedure for reweighting surveys to small areas through expanding poststratification targets (Chapter 4), and a clustering algorithm for survey or ballot data with multiple offices to extract interpretable voting blocs (Chapter 5). I provide open-source software to implement both methods. These findings challenge a common characterization of modern American politics as one dominated by rigidly polarized parties and partisans. The picture that emerges instead is one where swing voters are rare but can dramatically decide the party in power, and where no single demographic group is the swing voter. Instead of entrenching elections into red states and blue states, nationalization may heighten the role of the persuadable voter.
  3. To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery. However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. Here we used an information extraction pipeline that won first place in the LitCoin Natural Language Processing Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of automated knowledge discovery, which was infeasible in previous studies. We designed an interpretable, probabilistic inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified around 1,200 candidate drugs in the first 4 months, and one-third of those discovered in the first 2 months were later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
  4. A vast proportion of scientific data remains locked behind dynamic web interfaces, often called the deep web—inaccessible to conventional search engines and standard crawlers. This gap between data availability and machine usability hampers the goals of open science and automation. While registries like FAIRsharing offer structured metadata describing data standards, repositories, and policies aligned with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles, they do not enable seamless, programmatic access to the underlying datasets. We present FAIRFind, a system designed to bridge this accessibility gap. FAIRFind autonomously discovers, interprets, and operationalizes access paths to biological databases on the deep web, regardless of their FAIR compliance. Central to our approach is the Deep Web Communication Protocol (DWCP), a resource description language that represents web forms, HyperText Markup Language (HTML) tables, and file-based data interfaces in a machine-actionable format. Leveraging large language models (LLMs), FAIRFind combines a specialized deep web crawler and web-form comprehension engine to transform passive web metadata into executable workflows. By indexing and embedding these workflows, FAIRFind enables natural language querying over diverse biological data sources and returns structured, source-resolved results. Evaluation across multiple open-source LLMs and database types demonstrates over 90% success in structured data extraction and high semantic retrieval accuracy. FAIRFind advances existing registries by turning linked resources from static references into actionable endpoints, laying a foundation for intelligent, autonomous data discovery across scientific domains. 
  5. Synthesizing information from collections of tables embedded within scientific and technical documents is increasingly critical to emerging knowledge-driven applications. Given their structural heterogeneity, highly domain-specific content, and diffuse context, inferring a precise semantic understanding of such tables is traditionally better accomplished through linking tabular content to concepts and entities in reference knowledge graphs. However, existing tabular data discovery systems are not designed to adequately exploit these explicit, human-interpretable semantic linkages. Moreover, given the prevalence of misinformation, the level of confidence in the reliability of tabular information has become an important, often overlooked, factor in the discovery over open datasets. We describe a preliminary implementation of a discovery engine that enables table-based semantic search and retrieval of tabular information from a linked knowledge graph of scientific tables. We discuss the viability of semantics-guided tabular data analysis operations, including on-the-fly table generation under reliability constraints, within discovery scenarios motivated by intelligence production from documents. 