<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>06/03/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10650806</idno>
					<idno type="doi">10.1109/IPDPS64566.2025.00036</idno>
					
					<author>Chris Egersdoerfer</author><author>Arnav Sareen</author><author>Jean Luca Bez</author><author>Suren Byna</author><author>Dongkuan DK Xu</author><author>Dong Dai</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>High-performance computing (HPC) clusters play a crucial role in enabling modern science, supporting a wide range of scientific applications and services across various domains, such as cosmology <ref type="bibr">[1]</ref>, quantum simulation <ref type="bibr">[19]</ref>, astronomy <ref type="bibr">[17]</ref>, climate modeling <ref type="bibr">[22]</ref>, and life sciences <ref type="bibr">[16]</ref>.</p><p>In recent years, these scientific applications have become increasingly data-intensive, necessitating the efficient utilization of HPC I/O subsystems to meet performance requirements. However, effectively leveraging the complex I/O stacks in HPC systems, which include high-level parallel I/O libraries (e.g., HDF5 <ref type="bibr">[13]</ref>, PnetCDF <ref type="bibr">[24]</ref>), API interfaces (e.g., MPI-IO, POSIX, STDIO), and file systems (e.g., Lustre <ref type="bibr">[34]</ref>, BurstBuffer <ref type="bibr">[14]</ref>), remains a challenging task for most domain scientists. Large-scale scientific applications often experience slow I/O performance due to inefficient use of data access APIs, misconfigurations, or the failure to apply key optimizations available in the storage systems <ref type="bibr">[36,</ref><ref type="bibr">40]</ref>.</p><p>One proven approach to assist scientists in optimizing their I/O performance is to record the I/O traces of their applications for post-hoc analysis <ref type="bibr">[10,</ref><ref type="bibr">52]</ref>. Real-time I/O profiling tools, such as Darshan <ref type="bibr">[7]</ref> and Recorder <ref type="bibr">[44]</ref>, have been developed and deployed in modern HPC facilities to collect detailed I/O traces. These traces provide comprehensive information about the I/O behavior of applications, including access patterns, I/O sizes, and timing data to help scientists understand their application's I/O behaviors. 
Additional tools, such as PyDarshan <ref type="bibr">[31]</ref> and DXT-Explorer <ref type="bibr">[4]</ref>, have been introduced to interpret these traces, identify potential I/O issues, and even offer optimization suggestions <ref type="bibr">[11,</ref><ref type="bibr">51]</ref>.</p><p>Despite these advancements, diagnosing I/O issues from application traces often requires the involvement of human I/O experts, who possess the specialized knowledge needed to interpret trace data given the inherent complexity of modern HPC storage systems. With the increasing number of scientists from various domains developing HPC applications, the shortage of readily available I/O expertise presents a significant barrier. Our interviews with I/O experts from NERSC also confirm that a substantial backlog of applications with I/O performance issues is frequently awaiting analysis, with no readily available means to resolve them.</p><p>This gap is further exacerbated by the growing complexity of HPC systems and the vast amounts of data generated by modern applications. As a result, the difficulty of identifying and resolving I/O performance bottlenecks limits the potential of scientists and leads to inefficient use of computational resources. Therefore, there is an urgent need for an automated tool that can democratize access to I/O optimization expertise.</p><p>Recent developments in large language models (LLMs) like ChatGPT and Claude have shown promise in addressing such challenges. Pre-trained on large datasets, these models demonstrate impressive abilities in understanding and generating human-like text, making them highly accessible. Their in-context learning ability <ref type="bibr">[38]</ref>, combined with prompt engineering techniques <ref type="bibr">[6]</ref> and Chain-of-Thought (CoT) reasoning <ref type="bibr">[47]</ref>, enables them to follow instructions and incorporate new information. 
Real-world examples span areas such as medical questioning <ref type="bibr">[28]</ref>, military simulation and strategy <ref type="bibr">[37]</ref>, log-based anomaly detection <ref type="bibr">[11]</ref>, and educational guidance <ref type="bibr">[48]</ref>. Furthermore, recent advancements in Retrieval-Augmented Generation (RAG) allow LLMs to integrate external information from knowledge bases during the generation process <ref type="bibr">[23]</ref>, significantly enhancing their ability to produce accurate and contextually appropriate responses by incorporating information beyond their pre-training. The ability to interact continuously with users is another unique capability, making LLMs extremely useful as assistants in many tasks.</p><p>These distinctive and compelling capabilities of LLMs illuminate a path to democratizing domain scientists' access to HPC I/O optimization. Given an I/O trace, LLMs may automate the I/O performance diagnostic process, making HPC I/O optimization expertise more accessible to domain scientists and removing the necessity for a human expert. This can ultimately accelerate scientific discovery and enhance computational efficiency.</p><p>However, realizing such an ambitious goal is not straightforward, as traditional LLMs face multiple significant challenges that inhibit their ability to interpret, analyze, and process I/O traces. First, although traces such as those generated by Darshan can be parsed into a human-readable format and hence directly interpreted by LLMs, their lengths often surpass millions of lines, exceeding typical LLMs' context window sizes. Consequently, directly querying LLMs with such traces leads to a multitude of problems and, in many cases, an inability to leverage the model at all. 
For example, LLMs are known to encounter issues with long context windows, such as lost-in-the-middle truncation, where LLMs truncate the input, causing much of the information in the center of the context to be ignored in favor of text at the extremities of the document <ref type="bibr">[26]</ref>, and losing long-range dependencies, where LLMs struggle to maintain coherence across distant parts of the input, preventing them from effectively modeling dependencies across lengthy documents <ref type="bibr">[5]</ref>. Unfortunately, in practice, critical information about I/O issues can be located anywhere in the trace, such as inefficient I/O behavior for a subset of files in the middle of the application execution. Additionally, many I/O issues can only be identified by correlating multiple parts of the I/O trace, such as the amount of MPI-IO versus the amount of POSIX I/O, making it challenging, even for state-of-the-art LLMs, to accurately diagnose I/O problems, as exemplified in the next section.</p><p>Second, diagnosing I/O performance issues from I/O traces requires extensive domain knowledge, as these issues may be deeply rooted in specific I/O libraries or embedded within the nuances of storage systems. However, such domain-specific knowledge might not be available to LLMs during their training stage, especially considering that new findings are continuously being published. Even if domain knowledge is available to LLMs during training, it may not be effectively utilized due to the vast corpus these models are exposed to, which typically encompasses an extensive amount of general information, diluting the specificity necessary to effectively tackle domain-specific tasks (as demonstrated in an example later). 
While this generality is typically a prominent strength of modern LLMs, it significantly limits their ability to act with expert-level proficiency in a particular domain.</p><p>Third, LLMs are known to produce hallucinations, plausible but factually incorrect or nonsensical outputs, which are unacceptable in a diagnosis tool for domain scientists <ref type="bibr">[18]</ref>. Although opting for more advanced models such as GPT-4o instead of simpler versions such as GPT-4, or open-source models like Llama <ref type="bibr">[42]</ref>, can help reduce hallucinations <ref type="bibr">[55]</ref>, larger models consequently introduce higher costs <ref type="bibr">[33]</ref>. For a tool targeting general scientists working at HPC scale, relying on expensive models is impractical. Therefore, addressing the hallucination challenge independently of any individual model is crucial to enabling an LLM-based tool at the scale common in HPC applications.</p><p>In this study, we explore the feasibility and strategy of using large language models to provide trustworthy, expert-level HPC I/O performance diagnosis, thereby effectively guiding domain scientists in optimizing their I/O performance. To this end, we introduce IOAgent, an LLM-based agent designed to analyze applications' I/O traces. IOAgent incorporates key new designs to address the three challenges discussed earlier. First, IOAgent implements a module-based trace preprocessing strategy that groups relevant I/O activities together, preventing LLMs from truncating vital context or missing key I/O modules. Additionally, IOAgent introduces a set of summarization methods that accurately extract needed metadata from the trace file, rather than relying solely on the limited capabilities of LLMs for metadata retrieval. Second, IOAgent utilizes Retrieval-Augmented Generation (RAG) to integrate expert-level I/O knowledge into the diagnosis process. 
This allows IOAgent to provide references for its diagnoses, helping to avoid popular but incorrect claims while also increasing the transparency of the diagnosis process. Third, IOAgent employs a tree-based merging mechanism with self-reflection to minimize hallucinations in its diagnosis. These mechanisms enable IOAgent to provide domain scientists with accurate diagnostic summaries of I/O issues based on Darshan I/O traces, akin to consulting with a human expert, even if the scientists use open-source LLMs such as Llama.</p><p>To evaluate IOAgent and similar tools that may be developed in the future, we created a new benchmark suite for I/O issue diagnosis, named TraceBench. TraceBench includes over 40 Darshan traces collected from both I/O benchmarks and real-world applications, with each trace annotated with issues identified by I/O experts, and each I/O issue confirmed by at least two experts. By evaluating diagnostic results against these standard labels, one can assess and compare the quality of diagnostic tools from a variety of sources. Using TraceBench, we demonstrated that IOAgent outperforms state-of-the-art I/O diagnostic tools, such as Drishti <ref type="bibr">[3]</ref> and ION <ref type="bibr">[9]</ref>, in terms of accuracy, clarity, and coverage.</p><p>The core value of IOAgent lies not only in diagnosing HPC I/O performance issues but also in offering valuable design insights on how to leverage powerful but difficult-to-manage LLMs in production-level systems, where both accuracy and cost are of utmost importance. We believe that similar tools can be developed to democratize other optimization capabilities in the HPC environment, such as computation and networking, in the near future.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND AND RELATED WORK</head><p>In this section, we provide an overview of HPC I/O profiling tools, focusing on Darshan and its extensions. We also discuss how researchers have utilized these tools to understand HPC I/O characteristics and typical workflows. Subsequently, we introduce Drishti as a state-of-the-art I/O issue diagnosis tool. Finally, we provide background on large language models, their reasoning capabilities, and the concept of Retrieval-Augmented Generation (RAG).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Darshan: HPC I/O Profiling Tool</head><p>Efficient I/O performance is critical for HPC applications, and as a result, understanding I/O behavior is essential for optimization. Multiple I/O profiling tools have been developed to facilitate this understanding, including STAT <ref type="bibr">[2]</ref>, mpiP <ref type="bibr">[43]</ref>, IOPin <ref type="bibr">[20]</ref>, Recorder <ref type="bibr">[44]</ref>, and Darshan <ref type="bibr">[7]</ref>. Among these, Darshan has gained widespread adoption in the HPC community due to its lightweight design and holistic descriptions of applications <ref type="bibr">[32]</ref>.</p><p>Darshan operates by instrumenting HPC applications to record key statistical metrics related to their I/O activities. In particular, it traces essential information for each file accessed across different I/O interfaces, including POSIX (Portable Operating System Interface) I/O, MPI (Message Passing Interface) I/O, and Standard I/O. The metrics collected encompass multiple aspects and can be mainly summarized as follows: 1) data volume, amount of data read from and written to each file, 2) operation counts, number of read, write, and metadata operations performed by the application, 3) temporal information, aggregate time spent on read, write and metadata operations, 4) rank information, identification of the MPI ranks issuing I/O requests, and 5) variability metrics, variance of I/O sizes and timing among different application ranks.</p><p>Darshan also collects file system-specific metrics, such as Lustre stripe widths and Object Storage Target (OST) IDs over which a file is striped. 
This comprehensive data provides valuable insights into the I/O patterns and performance characteristics of HPC applications.</p><p>Darshan eXtended Tracing (DXT) <ref type="bibr">[49]</ref> is a recent development based on Darshan to capture fine-grained records of an application's I/O operations, covering details such as specific files involved in I/O operations, each read/write operation, offset, and length of each I/O request, the start and end times of each I/O operation, and MPI rank IDs. Researchers have utilized Darshan DXT to analyze I/O behaviors across a variety of applications and systems <ref type="bibr">[35]</ref>. However, since Darshan DXT introduces more noticeable overheads to HPC applications, it is typically not enabled by default. Hence, in this study, we focus only on the original Darshan I/O traces and leave working with Darshan DXT traces as future work.</p></div>
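To make the five metric categories above concrete, the following minimal Python sketch aggregates hand-made, Darshan-style per-rank POSIX counters into those categories. The counter names mirror Darshan's POSIX module, but the records and the `summarize` helper are our own illustration, not part of Darshan or PyDarshan.

```python
# Illustrative per-file records with Darshan-style POSIX counters (made up values).
records = [
    {"rank": 0, "POSIX_BYTES_READ": 4096, "POSIX_BYTES_WRITTEN": 1048576,
     "POSIX_READS": 4, "POSIX_WRITES": 256,
     "POSIX_F_READ_TIME": 0.01, "POSIX_F_WRITE_TIME": 1.2, "POSIX_F_META_TIME": 0.3},
    {"rank": 1, "POSIX_BYTES_READ": 0, "POSIX_BYTES_WRITTEN": 524288,
     "POSIX_READS": 0, "POSIX_WRITES": 128,
     "POSIX_F_READ_TIME": 0.0, "POSIX_F_WRITE_TIME": 0.6, "POSIX_F_META_TIME": 0.1},
]

def summarize(records):
    """Aggregate counters into the five categories described in the text."""
    def total(key):
        return sum(r[key] for r in records)
    bytes_per_rank = [r["POSIX_BYTES_READ"] + r["POSIX_BYTES_WRITTEN"] for r in records]
    mean = sum(bytes_per_rank) / len(bytes_per_rank)
    return {
        # 1) data volume, 2) operation counts, 3) temporal information,
        # 4) rank information, 5) variability across ranks
        "data_volume": total("POSIX_BYTES_READ") + total("POSIX_BYTES_WRITTEN"),
        "op_counts": total("POSIX_READS") + total("POSIX_WRITES"),
        "io_time": total("POSIX_F_READ_TIME") + total("POSIX_F_WRITE_TIME"),
        "meta_time": total("POSIX_F_META_TIME"),
        "ranks": sorted({r["rank"] for r in records}),
        "volume_variance": sum((b - mean) ** 2 for b in bytes_per_rank) / len(bytes_per_rank),
    }

summary = summarize(records)
```

A real analysis would read these counters from a Darshan log rather than a hand-built list; the aggregation logic is the same.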
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Advanced I/O Issue Diagnosis: Drishti and ION</head><p>While tools like Darshan provide the raw data necessary for I/O analysis, interpreting this data to diagnose performance issues remains challenging. Several tools have been developed to address this gap, including IOMiner <ref type="bibr">[45]</ref>, UMAMI <ref type="bibr">[30]</ref>, TOKIO <ref type="bibr">[29]</ref>, DXT-Explorer <ref type="bibr">[4]</ref>, recorderviz <ref type="bibr">[44]</ref>, and Drishti <ref type="bibr">[3]</ref>. Of these, Drishti represents the most recent advancement in I/O issue diagnosis.</p><p>Drishti takes a Darshan trace as input and conducts an analysis to report various I/O performance issues based on a set of heuristic-based triggers. Currently, Drishti includes a set of 30 triggers corresponding to different application behaviors and can identify nine distinct types of I/O issues, such as small I/O operations (excessively small read/write requests), misaligned I/O (I/Os not aligned with the file system's block size), and imbalanced I/O (uneven distribution of I/O workload across ranks).</p><p>Due to its lightweight design of checking key triggers in Darshan traces, Drishti can be an effective tool for system administrators or scientists to quickly scan a large number of traces and identify applications with I/O issues. However, it has notable limitations when it comes to providing detailed I/O diagnoses for domain scientists working on their applications. First, Drishti relies on hard-coded threshold values (determined via expert knowledge and observations from past experience) for its triggers, which may not be accurate for all applications. For instance, it flags write requests smaller than 1MB as small writes and raises an issue if more than 10% of all requests are small I/Os. 
While 10% is a reasonable threshold for general applications, such a fixed cutoff can be perplexing to domain scientists whose applications are flagged even though the resulting performance degradation is negligible. Domain scientists would benefit far more from a tool that provides personalized, per-application explanations for each diagnosis. Second, Drishti's explanations and recommendations are predefined, hard-coded messages tied to their specific triggers. This approach lacks the nuanced, context-specific reasoning needed for domain scientists to fully understand and address performance problems. Lastly, Drishti does not offer an interactive interface for users to ask follow-up questions or further explore the analysis. This reduces its utility as a learning and exploratory tool, especially for domain scientists who may not have extensive I/O expertise.</p><p>These limitations motivated ION <ref type="bibr">[9]</ref>, a recent study that also uses large language models to diagnose HPC I/O issues. As a proof-of-concept work, ION directly queries the LLMs with engineered prompts to generate diagnoses. This strategy, however, means ION heavily relies on the capability of the selected LLMs and suffers from their shortcomings, which will be further discussed in Section III. </p></div>
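As an illustration of how such a heuristic trigger works, the sketch below implements the small-write check with the thresholds quoted above (writes under 1 MiB count as "small"; flag when they exceed 10% of requests). It is a simplified stand-in written for this discussion, not Drishti's actual implementation.

```python
SMALL_WRITE_BYTES = 1 * 1024 * 1024   # writes below 1 MiB count as "small"
SMALL_RATIO_THRESHOLD = 0.10          # flag if >10% of all requests are small

def small_write_trigger(write_sizes):
    """Return (flagged, ratio) for a list of per-request write sizes in bytes."""
    if not write_sizes:
        return False, 0.0
    small = sum(1 for s in write_sizes if s < SMALL_WRITE_BYTES)
    ratio = small / len(write_sizes)
    return ratio > SMALL_RATIO_THRESHOLD, ratio

# 3 small writes out of 20 requests -> 15% -> flagged, even though the
# absolute performance impact of three 4 KiB writes may be negligible.
flagged, ratio = small_write_trigger([4096] * 3 + [8 * 1024 * 1024] * 17)
```

The hard-coded constants are exactly what makes such triggers brittle: the same 15% ratio can be harmless in one application and severe in another.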
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Large Language Models and Their Capabilities</head><p>Recent advancements in artificial intelligence have led to the development of large language models (LLMs) like ChatGPT, Claude AI, and Gemini <ref type="bibr">[41]</ref>, which demonstrate remarkable capabilities in understanding and generating human-like text. One of the key strengths of LLMs is their ability to follow instructions and address various tasks using their in-context learning capability. They can perform complex tasks such as summarization <ref type="bibr">[53]</ref>, translation <ref type="bibr">[6]</ref>, and question answering <ref type="bibr">[8]</ref>, and even exhibit emergent abilities like mathematical and programming reasoning <ref type="bibr">[46]</ref>. This makes them well-suited for applications that require comprehension and synthesis of information across diverse domains.</p><p>Another significant strength of LLMs is their enhancement via Retrieval-Augmented Generation (RAG) <ref type="bibr">[23]</ref>. RAG combines LLMs with external knowledge bases or documents, enabling the model to retrieve relevant information during the generation process. This approach enhances the model's ability to provide accurate and up-to-date information, particularly in specialized domains where the model's pre-training data may be insufficient and advancements are made rapidly.</p><p>In the context of HPC I/O analysis, LLMs offer a promising opportunity to automate the diagnostic process. However, applying LLMs to interpret I/O trace data and pinpoint I/O issues requires addressing several critical challenges, as discussed earlier. Recognizing both the potential and the challenges of applying LLMs to HPC I/O analysis, our goal in this study is to develop a solution that harnesses the strengths of LLMs while mitigating their limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PRELIMINARY STUDY: PLAIN LLM DIAGNOSIS</head><p>In this section, we first report the diagnostic results of a real-world HPC application's Darshan trace using plain language models. For this, we collected the Darshan trace of AMReX running at the NERSC HPC center. AMReX <ref type="bibr">[54]</ref> is a widely used framework for highly concurrent, block-structured adaptive mesh refinement. This AMReX execution took around 722 seconds to finish, used 8 processes, and read/wrote 11 files on a Lustre file system mounted at /scratch.</p><p>Since the original Darshan trace is in binary format, we first used darshan-parser to convert it into a text-based, human-readable format. We then queried various language models using the prompt shown in Figure <ref type="figure">1</ref>. Due to space constraints, we omit the outputs of open-source models such as Llama <ref type="bibr">[42]</ref>, as the quality of their results is significantly lower compared to OpenAI's GPT models.</p><p>We could not present the results from the latest OpenAI o1-preview model due to its limited context window, which is too small to process the complete AMReX Darshan trace. Our evaluations using smaller Darshan traces show that o1-preview produces outputs of similar quality to the 4o model. However, the high cost of o1-preview makes it largely impractical for our large-scale use.</p><p>These issues highlight the limitations of naively applying large language models to the I/O performance diagnosis task. We propose IOAgent, an LLM-based I/O diagnosis tool designed with key features to overcome these deficiencies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. IOAGENT DESIGN AND IMPLEMENTATION</head><p>To address the aforementioned challenges that non-augmented LLMs face when analyzing I/O trace logs, IOAgent consists of three primary components. The first component, the module-based pre-processor, primarily handles the context-length challenge faced by LLMs and guides IOAgent towards effective downstream source retrieval. The second component, the Domain Knowledge Integrator, fills the non-augmented LLM's gaps in domain expertise with up-to-date, relevant information. The final component, the Tree-Merge (with self-reflection), enables complex summarization for both frontier and short-of-frontier language models to form comprehensive and interpretable I/O performance diagnoses.</p></div>
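The interplay of the three components can be sketched at a high level as follows. All function names are our own, and the string-building stand-ins replace real LLM calls; the pairwise tree merge appears as a simple level-by-level reduction, purely to illustrate the data flow.

```python
# Stage 1 stand-in: module-based pre-processing into summary fragments.
def preprocess(trace):
    return [f"summary of {m}" for m in trace["modules"]]

# Stage 2 stand-in: domain-knowledge integration (RAG) per fragment.
def enrich(fragment):
    return f"diagnosis({fragment})"

# Stage 3: pairwise tree merge; each level's merges are independent,
# so a real implementation can run them in parallel.
def tree_merge(diagnoses, merge):
    level = list(diagnoses)
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [merge(p[0], p[1]) if len(p) == 2 else p[0] for p in pairs]
    return level[0]

trace = {"modules": ["POSIX", "MPI-IO", "STDIO", "LUSTRE"]}
final = tree_merge([enrich(f) for f in preprocess(trace)],
                   merge=lambda a, b: f"merge({a}, {b})")
```

With four modules, the merge tree has two levels: two pairwise merges, then one merge of the results.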
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Module-based Pre-processor</head><p>To address the challenge of fitting HPC trace logs into the limited context window of LLMs, IOAgent uses a log pre-processor based on two key insights. First, effective and comprehensive reasoning over a trace log requires access to data from all key modules used by the application. To meet this need, IOAgent's pre-processor separates the Darshan log into a set of CSV files, with each file containing the counters and values from a single Darshan module.</p><p>Second, since module data may also be extensive, it must be reduced to a manageable length for the LLM to interpret. IOAgent accomplishes this through summary extraction functions for each module, which generate categorized summary fragments. The categories of these summaries are outlined by the column titles in Table <ref type="table">I</ref>. Because the counters Darshan accumulates vary across supported modules, each module extracts different categories of summary information; each module's summary category therefore uses its own extraction function based on the available counters, but the information follows consistent principles across categories. For example, the I/O volume function for the STDIO module extracts the total amount of bytes read and written, while for LUSTRE, it focuses on information about mount points, stripe settings, and server usage. </p></div>
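A minimal sketch of the module split and one summary-extraction function is shown below. The input imitates darshan-parser's text output in a heavily simplified form (module, rank, counter, value per line); real darshan-parser records carry more fields, and the function names are our own.

```python
import csv
import io
from collections import defaultdict

# Simplified stand-in for darshan-parser text output (fabricated values).
parsed = """\
POSIX 0 POSIX_BYTES_WRITTEN 1048576
POSIX 1 POSIX_BYTES_WRITTEN 524288
STDIO 0 STDIO_BYTES_READ 2048
"""

def split_by_module(text):
    """Group records per Darshan module and render each group as CSV text."""
    groups = defaultdict(list)
    for line in text.splitlines():
        module, rank, counter, value = line.split()
        groups[module].append((int(rank), counter, int(value)))
    csvs = {}
    for module, rows in groups.items():
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["rank", "counter", "value"])
        writer.writerows(rows)
        csvs[module] = buf.getvalue()
    return csvs

def io_volume_summary(rows):
    """One summary-extraction function: total bytes per direction."""
    return {
        "bytes_written": sum(v for _, c, v in rows if c.endswith("BYTES_WRITTEN")),
        "bytes_read": sum(v for _, c, v in rows if c.endswith("BYTES_READ")),
    }

csvs = split_by_module(parsed)
```

Each module would pair such a CSV with its own set of extraction functions, producing the categorized summary fragments described above.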
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Domain Knowledge Integrator</head><p>IOAgent uses Retrieval-Augmented Generation (RAG) to integrate the domain knowledge needed to provide trustworthy diagnoses for HPC I/O issues. There are three key questions when using RAG: 1) how to construct the queries, 2) how to build the vector database, and 3) how to set the key vector search parameters. We describe each of them in detail below.</p><p>1) RAG Query Construction: RAG provides a database for the LLM to query and retrieve useful, relevant information for its current task. Hence, the quality of the query to the RAG database matters significantly for retrieving the needed information.</p><p>After module-based pre-processing, we split module information into CSV files and compress it into manageable JSON summary fragments, which helps avoid overwhelming the limited context of LLMs. Each JSON fragment represents specific information from a Darshan module, making it easier to locate relevant information using techniques like cosine similarity between text embeddings. However, JSON summaries are still not in the same format as expert knowledge, which is often communicated through research papers. This makes direct embedding searches over domain-specific knowledge ineffective.</p><p>To address this, IOAgent transforms each JSON summary into a natural language format for better alignment with domain-knowledge retrieval via language-based embedding similarity. This transformation is done by prompting an LLM with the code used to generate the summary, the JSON summary values, and a broader application context (total runtime, number of processes, and I/O size proportions from Darshan modules). An example of this transformation is shown in Figure <ref type="figure">3</ref>, where the LLM provides a descriptive interpretation of the I/O size summary from the POSIX module. 
The natural language response aligns better with domain knowledge, improving the accuracy of embedding similarity searches. For instance, the JSON summary may contain: "read histogram: {'0-100': 1.0}", but the corresponding natural language would become: "...the value of 1.0 in the 0 to 100 bin indicates that 100% of the read operations fall within the 0 KB to 100 KB range". The latter representation makes it much easier to match relevant studies in publications.</p><p>2) Vector Database Creation: The quality of the RAG database profoundly impacts the accuracy of subsequent diagnoses. However, to the best of our knowledge, there is no existing collection of text embeddings specifically targeting up-to-date HPC I/O performance diagnosis. Creating such a resource poses challenges, including limits on how much data can feasibly be collected, embedded, and stored (which necessitates filtering) and uncertainty about the best data sources. To address these issues and gather relevant, current knowledge on HPC I/O performance, we surveyed the past five years of research using the queries 'HPC I/O Performance' and 'I/O Performance issue' in the ACM Digital Library and IEEE Xplore databases. From the top 50 results in each, we manually filtered for relevance, yielding 66 key works.</p><p>To process these works into a vector index for retrieval by IOAgent, we use LlamaIndex <ref type="bibr">[25]</ref>, a popular open-source framework for vector index creation. LlamaIndex allows for configuring common hyperparameters such as the embedding model, chunk size, and overlap between chunks. 
We found retrieval accuracy and relevance to be consistent across reasonable variations in these settings, so we used the default configuration: a chunk size of 512, an overlap of 20, cosine similarity as the distance measure between embeddings, and the text-embedding-3-large model from OpenAI.</p><p>3) Vector Database Search: With the vector index implemented as described, IOAgent conducts a vector search to match specific, related domain knowledge to each summary fragment, as shown in Figure <ref type="figure">2</ref>. Because the single best match for a given summary fragment may not contain the most pertinent information, and the additional context provided by other highly ranked matches can still be very useful for an accurate diagnosis, IOAgent retrieves the top 15 closest matches from the index. However, since this enlarges the context significantly for any subsequent query that must analyze the content of these retrieved sources together, we implemented a self-reflection step which uses a faster and cheaper language model (e.g., gpt-4o-mini) with less reasoning capability to rule out any sources found not to be relevant to the summary fragment used as the original query over the index. This source filtering runs in parallel over all retrieved sources and rules out nearly half of them based on a more nuanced understanding of relevance than what the vector retriever provides. Following this parallel filtering step, IOAgent conducts its first true diagnosis of any potential I/O performance issues based on the information in the descriptive summary fragment generated by the LLM, as exemplified in Figure <ref type="figure">3</ref>, and the filtered domain knowledge retrieved from the vector index.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table"><head>TABLE II: Table of I/O Issues and Descriptions</head><table><row><cell>Label</cell><cell>Description</cell></row><row><cell>High Metadata Load</cell><cell>The application spends a significant amount of time performing metadata operations (e.g., directory lookups, file system operations).</cell></row><row><cell>Misaligned [Read|Write] Requests</cell><cell>The application makes read or write requests that are not aligned with the file system's stripe boundaries.</cell></row><row><cell>Random Access Patterns on [Read|Write]</cell><cell>The application issues read or write requests in a random access pattern.</cell></row><row><cell>Shared File Access</cell><cell>The application has multiple processes or ranks accessing the same file.</cell></row><row><cell>Small [Read|Write] I/O Requests</cell><cell>The application is making frequent read or write requests with a small number of bytes.</cell></row><row><cell>Repetitive Data Access on Read</cell><cell>The application is making read requests to the same data repeatedly.</cell></row><row><cell>Server Load Imbalance</cell><cell>The application issues a disproportionate amount of I/O traffic to some servers compared to others or does not properly utilize the available storage resources.</cell></row><row><cell>Rank Load Imbalance</cell><cell>The application has MPI ranks issuing a disproportionate amount of I/O traffic compared to others.</cell></row><row><cell>Multi-Process Without MPI</cell><cell>The application has multiple processes but does not leverage MPI.</cell></row><row><cell>No Collective I/O on [Read|Write]</cell><cell>The application does not perform collective I/O on read or write operations.</cell></row><row><cell>Low-Level Library on [Read|Write]</cell><cell>The application relies on a low-level library like STDIO for a significant amount of read or write operations outside of loading/reading configuration or output files.</cell></row></table></figure>
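The retrieval-plus-filtering flow can be illustrated with toy embeddings. The document names, the hand-made 3-d vectors, and the `is_relevant` predicate below are all fabricated for this example; the predicate merely stands in for the cheap-model relevance pass, and a real system would embed text with an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "vector index": knowledge chunks with hand-made embeddings.
index = {
    "small I/O study":      [0.9, 0.1, 0.0],
    "collective I/O paper": [0.1, 0.9, 0.0],
    "stripe tuning guide":  [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of the summary fragment

def retrieve(query, index, k=2):
    """Return the k documents closest to the query by cosine similarity."""
    ranked = sorted(index, key=lambda doc: cosine(query, index[doc]), reverse=True)
    return ranked[:k]

def self_reflect(docs, is_relevant):
    """Stand-in for the cheap-LLM relevance filter over retrieved sources."""
    return [d for d in docs if is_relevant(d)]

top = retrieve(query, index)
kept = self_reflect(top, is_relevant=lambda d: "I/O" in d)
```

The two-stage shape, wide retrieval followed by a stricter relevance filter, is the point of the sketch; the real system uses 15 matches and an LLM judgment instead of a string test.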
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Tree-based Merge</head><p>Once IOAgent has generated an initial diagnosis for each summary category in the Darshan modules using the retrieved domain knowledge, these diagnoses, along with their relevant references, must be merged into a single, comprehensive I/O performance diagnosis for the entire application.</p><p>Intuitively, one could merge all of the fragmented diagnoses and their source content via a single LLM query, as all of this information could conceivably fit within the context window of some larger models. However, as later evaluations will show, the task of effectively and consistently merging more than two diagnosis summaries is beyond the capabilities of existing LLMs, even proprietary ones. This is primarily because, regardless of the size of an LLM's context window, the merging task itself requires diligent and precise reasoning. Effectively merging two diagnoses with their respective references involves removing redundant information, resolving contradictory details, and reflecting on both sets of provided sources to decide which should be retained in the new summary and where they should be cited. Additionally, adding any more summaries to the merging task beyond the minimal set of two introduces a more significant positional bias due to the increased cognitive load required in such a setting <ref type="bibr">[26,</ref><ref type="bibr">50]</ref>.</p><p>In light of the challenges associated with merging diagnosis summaries, IOAgent implements a pairwise merging strategy, in which two diagnosis fragments and their domain knowledge references are merged into a new, combined diagnosis and set of references. Since all pairs of diagnoses merged at the same level of the tree are independent of each other, all merging tasks for each level are conducted in parallel. The partial summaries are then further merged, effectively forming a tree structure, as shown in Figure <ref type="figure">2</ref>. 
This tree-based merging plays a critical role in delivering concise and accurate diagnoses, as the evaluation results later demonstrate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. BENCHMARK SUITE: TRACEBENCH</head><p>Accurate evaluation of HPC I/O trace analysis tools is challenging due to the lack of publicly available trace datasets with labeled I/O performance issues. To address this, in line with our key contributions, we constructed the TraceBench dataset, which includes Darshan traces from three different sources. Each trace has been manually labeled using a predefined set of labels covering common HPC I/O performance issues, as represented in Table <ref type="table">II</ref>.</p><p>The first data source consists of Darshan logs generated by simple C programs that intentionally introduce at least one of the I/O performance issues defined in the label set. We refer to this set as Simple-Bench. The second data source consists of Darshan logs generated by the IO500 benchmark <ref type="bibr">[21]</ref>, where each configuration is designed to induce one or more I/O performance issues defined by the labels. The final data source primarily comprises real application traces, all collected on production HPC systems.</p><p>1) Simple-Bench (SB). The Simple-Bench set consists of 10 labeled Darshan logs, generated using 10 rudimentary C programs that each target at least one specific I/O performance issue category from Table <ref type="table">II</ref>. However, as the per-category sample counts in Table <ref type="table">III</ref> show, some Simple-Bench traces contain more than one issue type. The simplicity of these programs nevertheless yields Darshan logs that are small in size, with very low aggregate I/O volume and highly uniform behavior; as a result, these traces should be the easiest for any tool to diagnose accurately.</p><p>2) IO500. 
The second set of labeled Darshan trace logs was collected using the IO500 benchmark, a synthetic performance benchmark for HPC systems that simulates a broad set of I/O patterns commonly observed in HPC applications <ref type="bibr">[21]</ref>. IO500 consists of many configurable workloads that can be tuned to simulate various sub-optimal I/O patterns. For example, IO500's ior-easy workload, which conducts intense sequential read and write phases, can be tuned to use 8 KB transfer sizes issued through independent POSIX operations across multiple ranks, resulting in dominant small I/O behavior that fails to leverage higher-level libraries such as MPI-IO for multi-rank I/O. In total, 21 traces were collected from 21 unique IO500 configurations, and as Table <ref type="table">III</ref> shows, a significant number of these traces exhibit multiple overlapping issues.</p><p>3) Real Applications (RA). The final set of labeled Darshan trace logs was generated by running real application workloads on large-scale production HPC systems. This collection consists of nine samples, each originating from a unique application, except for two samples representing recollected traces for the E2E and OpenPMD applications. 
In these cases, the original trace for each application exhibited a significant performance issue, and the recollected traces had their primary issues resolved.</p><figure type="table"><head>TABLE III: Summary of traces and labeled issues.</head><table><row role="label"><cell>Labeled Issue</cell><cell>SB</cell><cell>IO500</cell><cell>RA</cell><cell>Total</cell></row><row><cell>High Metadata Load</cell><cell>1</cell><cell>2</cell><cell>2</cell><cell>5</cell></row><row><cell>Misaligned Read requests</cell><cell>2</cell><cell>10</cell><cell>4</cell><cell>16</cell></row><row><cell>Misaligned Write requests</cell><cell>2</cell><cell>10</cell><cell>6</cell><cell>18</cell></row><row><cell>Random Access Patterns on Write</cell><cell>0</cell><cell>5</cell><cell>2</cell><cell>7</cell></row><row><cell>Random Access Patterns on Read</cell><cell>0</cell><cell>5</cell><cell>2</cell><cell>7</cell></row><row><cell>Shared File Access</cell><cell>1</cell><cell>14</cell><cell>4</cell><cell>19</cell></row><row><cell>Small Read I/O Requests</cell><cell>2</cell><cell>10</cell><cell>5</cell><cell>17</cell></row><row><cell>Small Write I/O Requests</cell><cell>2</cell><cell>10</cell><cell>6</cell><cell>18</cell></row><row><cell>Repetitive Data Access on Read</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell></row><row><cell>Server Load Imbalance</cell><cell>7</cell><cell>15</cell><cell>2</cell><cell>24</cell></row><row><cell>Rank Load Imbalance</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>2</cell></row><row><cell>Multi-Process W/O MPI</cell><cell>0</cell><cell>13</cell><cell>0</cell><cell>13</cell></row><row><cell>No Collective I/O on Read</cell><cell>6</cell><cell>8</cell><cell>4</cell><cell>18</cell></row><row><cell>No Collective I/O on Write</cell><cell>5</cell><cell>8</cell><cell>2</cell><cell>15</cell></row><row><cell>Low-Level Library on Read</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell></row><row><cell>Low-Level Library on Write</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell></row></table></figure><p>Table <ref type="table">III</ref> summarizes all the I/O issues included in TraceBench and how the different sources, SB (Simple-Bench), IO500, and RA (Real-Applications), contribute to them. In total, 182 labeled issues are reported across the 40 traces.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EVALUATION</head><p>Summary. We conducted comprehensive evaluations of IOAgent using the TraceBench test suite. The diagnosis results were compared to those from the LLM-based diagnosis tool ION <ref type="bibr">[9]</ref> and the expert-guided I/O performance diagnosis tool Drishti <ref type="bibr">[3]</ref>. Additionally, we evaluated examples of advanced user interactions enabled by IOAgent's LLM-centric design and validated our tree-based merge design, as outlined in Section IV.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Evaluation Metrics</head><p>To evaluate IOAgent, an accuracy metric, which measures how well the diagnosis matches the ground truth, is certainly the most important criterion to consider. With this objective in mind, we expect IOAgent to perform on par with state-of-the-art expert-based diagnosis tools, such as Drishti. However, accuracy should not be the only critical metric. Since IOAgent is intended to assist domain scientists, the readability and understandability of the information provided in the diagnosis are also essential for users at any level of familiarity with HPC I/O. As an assistant, IOAgent should likewise be assessed on how useful its information is. Following practices from the NLP community for evaluating agents <ref type="bibr">[12,</ref><ref type="bibr">27,</ref><ref type="bibr">56]</ref>, we propose the following three metrics for our evaluations:</p><p>&#8226; Accuracy: how accurately each tool diagnoses the ground-truth labels. &#8226; Utility: how useful the information provided in each diagnosis is for understanding the overall I/O behavior of the application, identifying performance issues, and determining how to address each noted issue (regardless of the factuality of such statements). &#8226; Interpretability: how readable and understandable the provided information is for users at any level of familiarity with HPC I/O.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. LLM-based Rating System</head><p>With the metrics defined, the challenge shifts to providing a quantitative judgment for each metric. Among them, accuracy is relatively easy, as we can simply count the number of matched and mismatched issues. Utility and Interpretability, however, are subjective to individuals and their experience level with HPC I/O. In this study, we use an LLM as the judge and follow the common practice of using a ranking-based system to compare different solutions.</p><p>The intuition behind using a capable LLM to rate the diagnostic results is simple. Earlier results in Figure <ref type="figure">1</ref> have shown that a qualified LLM, such as GPT-4o, is very capable of understanding high-level I/O-relevant concepts (albeit with misunderstandings in many details), which closely emulates our target users: domain scientists. We believe that letting a capable LLM serve as the judge avoids personal bias and, more importantly, brings in the perspectives of domain users.</p><p>Even with a capable LLM, quantitatively assigning an absolute score for Utility or Interpretability is still not feasible. Instead, we use an anonymized rating system to compare different solutions and rank them. The ranking is done by the LLM itself via a prompt containing all diagnosis outputs from the different tools, a description of the specified evaluation criteria, and a description of how the ranking output should be formatted. The LLM, GPT-4o in this case, ranks the diagnosis outputs of the different tools on a scale of 1 to 4, with 1 being the best and 4 being the worst.</p><p>Due to the known potential for positional bias in LLM-based content ranking <ref type="bibr">[26,</ref><ref type="bibr">39]</ref>, we further augment the prompt in three key ways to ensure a fair and reliable ranking result. 
The first augmentation is to anonymize the names of the diagnosis tools to eliminate any potential bias towards or against specific analysis tools. The second augmentation is to rotate the rank assignment order imposed by the response formatting portion of the prompt. The final augmentation is to rotate the order in which the content of each diagnosis appears in the prompt. All of these measures are intended to eliminate positional bias for each diagnosis in the prompt context. These augmentations are outlined in Figure <ref type="figure">4</ref>, denoted by the letters A, B, and C. Additionally, for each sample, we rank the diagnoses four times, ensuring that every ranking prompt permutation appears at least once, further reducing potential bias.</p><figure type="table"><head>TABLE IV: Performance Results for Diagnosis Tools on TraceBench Subsets</head><table><row role="label"><cell>Metric</cell><cell>Diagnosis Tool</cell><cell>Simple-Bench</cell><cell>IO500</cell><cell>Real-Applications</cell><cell>Overall</cell></row><row><cell>Accuracy</cell><cell>Drishti</cell><cell>0.398</cell><cell>0.480</cell><cell>0.472</cell><cell>0.459</cell></row><row><cell/><cell>ION</cell><cell>0.343</cell><cell>0.381</cell><cell>0.417</cell><cell>0.380</cell></row><row><cell/><cell>IOAgent-gpt-4o</cell><cell>0.630</cell><cell>0.655</cell><cell>0.620</cell><cell>0.641</cell></row><row><cell/><cell>IOAgent-llama-3.1-70B</cell><cell>0.620</cell><cell>0.488</cell><cell>0.463</cell><cell>0.513</cell></row><row><cell>Utility</cell><cell>Drishti</cell><cell>0.426</cell><cell>0.417</cell><cell>0.491</cell><cell>0.436</cell></row><row><cell/><cell>ION</cell><cell>0.352</cell><cell>0.401</cell><cell>0.380</cell><cell>0.385</cell></row><row><cell/><cell>IOAgent-gpt-4o</cell><cell>0.565</cell><cell>0.615</cell><cell>0.639</cell><cell>0.609</cell></row><row><cell/><cell>IOAgent-llama-3.1-70B</cell><cell>0.694</cell><cell>0.587</cell><cell>0.565</cell><cell>0.607</cell></row><row><cell>Interpretability</cell><cell>Drishti</cell><cell>0.417</cell><cell>0.452</cell><cell>0.463</cell><cell>0.447</cell></row><row><cell/><cell>ION</cell><cell>0.343</cell><cell>0.417</cell><cell>0.352</cell><cell>0.385</cell></row><row><cell/><cell>IOAgent-gpt-4o</cell><cell>0.546</cell><cell>0.659</cell><cell>0.713</cell><cell>0.645</cell></row><row><cell/><cell>IOAgent-llama-3.1-70B</cell><cell>0.694</cell><cell>0.484</cell><cell>0.472</cell><cell>0.530</cell></row><row><cell>Average</cell><cell>Drishti</cell><cell>0.414</cell><cell>0.450</cell><cell>0.475</cell><cell>0.447</cell></row><row><cell/><cell>ION</cell><cell>0.346</cell><cell>0.399</cell><cell>0.383</cell><cell>0.383</cell></row><row><cell/><cell>IOAgent-gpt-4o</cell><cell>0.580</cell><cell>0.643</cell><cell>0.657</cell><cell>0.632</cell></row><row><cell/><cell>IOAgent-llama-3.1-70B</cell><cell>0.670</cell><cell>0.520</cell><cell>0.500</cell><cell>0.550</cell></row></table></figure></div>
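A minimal sketch of two of the prompt augmentations, anonymization (A) and content-order rotation (C), is shown below; the neutral `Tool-N` labels and the prompt layout are illustrative assumptions, and the rank-format rotation (B) is omitted for brevity.

```python
def anonymized_prompts(diagnoses: dict[str, str], rounds: int = 4) -> list[str]:
    """Build `rounds` ranking-prompt bodies: real tool names are replaced
    with neutral labels, and the order in which diagnosis content appears
    is rotated each round to counter positional bias."""
    names = list(diagnoses)
    prompts = []
    for r in range(rounds):
        shift = r % len(names)
        order = names[shift:] + names[:shift]       # rotate content order (C)
        body = "\n\n".join(
            f"Tool-{i + 1}:\n{diagnoses[name]}"     # anonymize tool names (A)
            for i, name in enumerate(order)
        )
        prompts.append(body)
    return prompts

ps = anonymized_prompts({"Drishti": "diag-d", "ION": "diag-i", "IOAgent": "diag-a"})
```

Each of the four ranking rounds per sample would use one of these bodies, so no diagnosis always occupies the same position in the judge's context.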
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Quantitative Score Calculation</head><p>Once the diagnoses have been ranked, we calculate the aggregated score using the following approach. For each trace log L, the diagnosis result from each diagnosis tool T is evaluated on each of the three diagnosis criteria C &#8712; {Accuracy, Utility, Interpretability} and assigned a Rank &#8712; [1, 4], where a lower numerical rank (e.g., Rank 1) indicates a better result. The score for each sample is calculated as <formula>S_{T,C,L} = 4 - Rank_{T,C,L}</formula>.</p><p>The total score over all samples for a given data source D &#8712; {Simple-Bench, IO500, Real-Applications} is therefore defined as the sum of the per-sample scores:</p><formula>Score_{T,C,D} = &#8721;_{L &#8712; D} S_{T,C,L}</formula><p>We then normalize this score into a value between 0 and 1 by dividing it by the maximal score one can obtain:</p><formula>NS_{T,C,D} = Score_{T,C,D} / ((4 - 1) &#183; |D|)</formula><p>Here (4 - 1) is the highest score attainable for each trace in D. Based on NS_{T,C,D}, we define the average score of each tool across all three metrics over all data sources, which is simply the average of NS_{T,C,D} over the three metrics, to show how each tool performs across metrics. Similarly, we define the average score of each tool across all data sources on each metric to show how each tool performs across different data sources. Note that, to minimize hallucination by the LLM during ranking, we require the model to justify the assigned positions by including a rank assignment explanation as part of the prompt. This provides significantly more insight into the assigned scores, as demonstrated by the following example of an explanation provided by the ranking LLM:</p><p>Tool-2 and Tool-3 both provided comprehensive diagnoses that accurately identified all five I/O performance issues: small read and write I/O requests, misaligned read and write requests, and the use of multiple processes without MPI. 
Tool-2 provided detailed recommendations and emphasized the need to align with the file system boundaries, making it a strong contender for the top rank.</p></div>
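The score normalization above can be computed directly. A minimal sketch, assuming one rank per trace for a given tool and criterion (the paper's repeated rankings per sample would first be aggregated into such a list):

```python
def normalized_score(ranks: list[int]) -> float:
    """NS for one tool and criterion over a data source D:
    per-trace score S = 4 - Rank (Rank in [1, 4], 1 is best),
    summed over traces and divided by the maximum (4 - 1) * |D|."""
    total = sum(4 - r for r in ranks)       # Score_{T,C,D}
    return total / ((4 - 1) * len(ranks))   # NS_{T,C,D} in [0, 1]

best = normalized_score([1, 1, 1])   # always ranked 1st -> 1.0
worst = normalized_score([4, 4])     # always ranked 4th -> 0.0
mixed = normalized_score([1, 2, 3])  # (3 + 2 + 1) / 9
```

This normalization makes scores comparable across data sources of different sizes, which is what allows the per-metric and per-source averages in Table IV to be reported on a common 0-to-1 scale.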
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Ranking Results</head><p>Table <ref type="table">IV</ref> summarizes all the results. The rows present the results for each metric (i.e., accuracy, utility, and interpretability) as well as the average across all metrics; the columns show the results for each type of job trace in TraceBench, along with the overall average. Each row represents one diagnosis tool, including the naive LLM-based tool (ION) using gpt-4o (gpt-4o-2024-05-13) as its backbone and the state-of-the-art heuristic tool (Drishti). For IOAgent, we include two instances, labeled "IOAgent-*": one uses the proprietary gpt-4o (gpt-4o-2024-05-13) model from OpenAI, and the other uses the open-source Llama-3.1-70B-Instruct model from Meta. By including results from both instances, we evaluate whether IOAgent relies on specific proprietary models.</p><p>The results in Table <ref type="table">IV</ref> clearly show that both IOAgent instances perform exceptionally well across all metrics and job traces. As expected, IOAgent-gpt-4o performs particularly well, given that it uses a frontier-level LLM; however, it is noteworthy that IOAgent-llama-3.1-70B also performs strongly. Its results are close to those of the gpt-4o instance and surpass those of the state-of-the-art heuristic tool (Drishti) and the naive LLM-based tool (ION). Interestingly, IOAgent-llama-3.1-70B seems to excel in more primitive cases like Simple-Bench. This may be because IOAgent-gpt-4o provides too many details in such basic cases, making its output less useful or understandable from the user's perspective. These results confirm that our carefully designed IOAgent workflow makes LLMs a stable and practical tool for assisting scientists in understanding their applications' I/O issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Continued User Interaction</head><p>A primary feature enabled by the LLM-centric design of IOAgent is the ability for users to gain a deeper understanding and receive application-specific recommendations with implementation guidance through continued chat-based interaction. To evaluate this feature, we present a real interaction case summarized in Figure <ref type="figure">5</ref>. In this example, IOAgent analyzed a log from the IO500 TraceBench subset, which performed a significant number of 4 MB reads and writes but used the default Lustre stripe count and stripe size of 1 and 1 MB, respectively. The final diagnosis highlighted these potentially suboptimal stripe settings.</p><p>Following the diagnosis, the user simply asked IOAgent how to fix the issue (highlighted in blue). IOAgent effectively utilized the context of the diagnosis and its referenced sources to provide detailed assistance, offering explanations of recommended actions and code samples, such as lfs setstripe -S 4M (highlighted in orange). These responses were not only accurate but also tailored to the specific issue identified in the application (highlighted in green). Such cases demonstrate IOAgent's unique potential to effectively assist domain scientists at scale.</p></div>
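The chat-continuation mechanism can be sketched as a growing message history that carries the diagnosis context into each follow-up turn; `continue_chat`, `stub_ask`, and the message format below are illustrative stand-ins, not IOAgent's actual interface.

```python
def continue_chat(history: list[dict], user_msg: str, ask) -> list[dict]:
    """Append the user's follow-up to the diagnosis context and return
    the extended history; `ask` stands in for the backing LLM call."""
    history = history + [{"role": "user", "content": user_msg}]
    reply = ask(history)
    return history + [{"role": "assistant", "content": reply}]

# Stub model: returns a canned recommendation instead of calling an API.
def stub_ask(history):
    return "Consider a larger stripe size, e.g. lfs setstripe -S 4M <dir>."

diagnosis_context = [{
    "role": "assistant",
    "content": "Diagnosis: default Lustre stripe settings look suboptimal for 4 MB I/O.",
}]
chat = continue_chat(diagnosis_context, "How do I fix this?", stub_ask)
```

Because every turn sees the accumulated diagnosis and source context, follow-up answers stay tied to the specific issues found in the user's trace rather than generic I/O advice.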
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Why Tree-based Merge?</head><p>The tree-based merge mechanism introduces additional time and monetary overhead, as merging is done in pairs rather than all at once through a single prompt. In this evaluation, we aim to demonstrate the necessity of this overhead. Specifically, we compare using the tree-based merge against not using it, as shown in Figure <ref type="figure">6</ref>. For this benchmark, we used a less capable open-source Llama-3-70B model to illustrate how direct merging can be problematic. Due to space constraints, we use the example of merging just four summary diagnoses: Size, Request Count, Metadata, and Request Order. At the bottom of Fig. <ref type="figure">6</ref>, we briefly list the issues identified in each category. We then compare the merged results of our tree-based approach with the naive one-step solution, using the same prompts.</p><p>The results of the one-step merge show that important information regarding the non-sequential I/O patterns, insight into the stride sizes, and the specific recommendation to use higher-level parallel I/O libraries (MPI-IO) are all lost, along with their respective domain reference sources. In contrast, the tree-based merge successfully maintains the key points of each individual summary diagnosis as well as the useful domain reference sources pertaining to each. Note that this is a simple case in which only four summaries need to be merged. In IOAgent, we typically need to merge 13 summary diagnoses, which is extremely challenging even for the latest gpt-4o model based on our experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSION AND FUTURE WORK</head><p>In this study, we propose and implement IOAgent, a model-agnostic LLM-based framework for HPC I/O trace log analysis that fills the gaps faced by non-augmented language models when analyzing logs. IOAgent implements a module-based log pre-processor, a vector index for domain information retrieval, and a tree-based summarization strategy to enable accurate, context-aware, and user-friendly I/O performance analysis, achieving results on par with or superior to existing expert-level tools across multiple criteria. Leveraging the conversational strengths of language models further democratizes access to interactive and accurate I/O performance diagnoses. Our future work includes expanding IOAgent's capabilities for in-depth user interaction, such as building on the continually advancing reasoning capabilities of LLMs and integrating continual analysis improvements through user interaction. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: UNIVERSITY OF DELAWARE LIBRARY. Downloaded on December 01,2025 at 20:52:00 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
