<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Task as Context Prompting for Accurate Medical Symptom Coding Using Large Language Models</title></titleStmt>
			<publicationStmt>
				<publisher>IEEEACM Conference on Connected Health Applications Systems and Engineering Technologies</publisher>
				<date>06/24/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10638874</idno>
					<idno type="doi"></idno>
					<title level='j'>IEEEACM Conference on Connected Health Applications Systems and Engineering Technologies</title>
<idno>2832-2967</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Chengyang He</author><author>Wenlong Zhang</author><author>Violet Xinying Chen</author><author>Yue Ning</author><author>Ping Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Accurate medical symptom coding from unstructured clinical text, such as vaccine safety reports, is a critical task with applications in pharmacovigilance and safety monitoring. Symptom coding, as tailored in this study, involves identifying and linking nuanced symptom mentions to standardized vocabularies like MedDRA, differentiating it from broader medical coding tasks. Traditional approaches to this task, which treat symptom extraction and linking as independent workflows, often fail to handle the variability and complexity of clinical narratives, especially for rare cases. Recent advancements in Large Language Models (LLMs) offer new opportunities but face challenges in achieving consistent performance. To address these issues, we propose Task as Context (TACO) Prompting, a novel framework that unifies extraction and linking tasks by embedding task-specific context into LLM prompts. Our study also introduces SYMPCODER, a human-annotated dataset derived from Vaccine Adverse Event Reporting System (VAERS) reports, and a two-stage evaluation framework to comprehensively assess both symptom linking and mention fidelity. Our comprehensive evaluation of multiple LLMs, including Llama2-chat, Jackalope-7b, GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o, demonstrates TACO's effectiveness in improving flexibility and accuracy for tailored tasks like symptom coding, paving the way for more specific coding tasks and advancing clinical text processing methodologies.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Accurate symptom coding from unstructured clinical text, particularly in the context of vaccine safety and pharmacovigilance, remains a critical yet complex task. Systems such as the Vaccine Adverse Event Reporting System (VAERS) <ref type="bibr">[8]</ref> play a vital role in monitoring potential adverse events following immunization on a global scale. However, the highly variable and informal nature of clinical narratives within these reports presents significant challenges for automatically extracting and linking symptoms to standardized medical vocabularies like Medical Dictionary for Regulatory Activities (MedDRA) <ref type="bibr">[7]</ref>.</p><p>In this work, we formulate symptom coding as a task of identifying and linking nuanced symptom mentions to standardized vocabularies, which is different from broader medical coding tasks such as International Classification of Diseases (ICD) coding. While ICD coding primarily addresses diagnoses and procedures, symptom coding emphasizes the extraction of subjective experiences (e.g., "dizziness" or "rash") and their precise mapping to a predefined set of codes. This tailored task is crucial for downstream applications such as pharmacovigilance, safety monitoring, and trend analysis, offering a more flexible and specific approach compared to traditional medical coding practices.</p><p>Traditional methods for medical symptom coding typically separate symptom extraction and linking into independent workflows. Earlier approaches relied on rule-based systems or contextual models, which often struggled to capture the full context of clinical narratives and were prone to errors, especially for rare or ambiguous symptoms <ref type="bibr">[15,</ref><ref type="bibr">19,</ref><ref type="bibr">27]</ref>. The disjointed nature of these processes introduced inefficiencies and inconsistencies, limiting their effectiveness in complex cases. 
These limitations are particularly pronounced in tasks like symptom coding, where ensuring reliable mappings of symptom mentions to standardized vocabularies is critical.</p><p>Recent advancements in Large Language Models (LLMs) have shown promise in addressing these challenges, offering enhanced contextual understanding and nuanced language processing capabilities <ref type="bibr">[28]</ref>. By integrating extraction and linking tasks into a single process, LLMs have the potential to overcome the limitations of traditional approaches. However, ensuring consistent accuracy across diverse cases, especially rare or complex adverse events, remains an open problem <ref type="bibr">[15]</ref>. The need for a unified framework that preserves the relationships between symptoms and their corresponding codes while maintaining efficiency is paramount.</p><p>To address these challenges, we propose Task as Context (TACO) Prompting, a novel approach leveraging LLMs to perform symptom extraction and linking as interdependent tasks. Unlike traditional methods that treat these processes in isolation <ref type="bibr">[19]</ref>, TACO embeds task-specific context directly within LLM prompts, enabling the model to understand and maintain relationships between symptoms and standardized codes throughout the process. By unifying extraction and linking, TACO reduces information loss and enhances the flexibility and accuracy of symptom coding. Additionally, to the best of our knowledge, there is no existing dataset specifically designed for symptom coding in the medical domain. To support the training and evaluation of models for medical symptom coding, we introduce SYMPCODER, a human-annotated dataset derived from VAERS reports. This dataset encompasses three variants, SYMPCODER-Full, SYMPCODER-Common-50, and SYMPCODER-Rare-50, providing a comprehensive benchmark for assessing model performance across diverse cases, including both frequently reported and rare adverse events. 
By offering detailed annotations for both symptom extraction and linking, SYMPCODER enables systematic evaluation of models and facilitates further advancements in medical symptom coding research.</p><p>Our study further introduces a two-stage evaluation framework. The Linking Integrity and Knowledge (LINK) stage evaluates the accuracy of linking extracted symptoms to standardized codes, while the Mention Accuracy and Textual Coherence (MATCH) stage focuses on the fidelity and coherence of the original mentions extracted from clinical narratives. This dual-phase evaluation framework provides a granular analysis of model performance across both common and rare cases, offering comprehensive insights into their capabilities. Figure <ref type="figure">1</ref> illustrates our TACO prompting framework, detailing the workflow from input processing to final evaluation. In summary, our study introduces the following contributions:</p><p>&#8226; SYMPCODER Dataset: A human-annotated dataset based on VAERS. This dataset includes three variants and provides a benchmark to evaluate the capabilities of various methods for symptom coding, including LLMs, across diverse cases, ranging from common to rare adverse events. &#8226; TACO Prompting Framework: A novel approach that unifies symptom extraction and linking by embedding task-specific context within prompts, improving flexibility and accuracy. &#8226; Comprehensive Two-Stage Evaluation: A dual-phase framework (LINK and MATCH) that assesses both the linking accuracy of extracted symptoms and the coherence of their original mentions, providing a detailed understanding of model capabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK 2.1 Entity Extraction and Linking</head><p>Entity extraction and linking are fundamental tasks in natural language processing (NLP) that facilitate the transformation of unstructured text into structured, actionable data. These tasks are widely used in domains such as general text processing, biomedical information retrieval, and clinical text mining <ref type="bibr">[17]</ref>. Entity extraction focuses on identifying relevant entities (e.g., people, places, symptoms) in text. Earlier approaches to this task relied on rule-based systems and statistical models, such as Hidden Markov Models (HMMs) <ref type="bibr">[3]</ref> and Conditional Random Fields (CRFs) <ref type="bibr">[33]</ref>. While these models were effective in specific scenarios, they struggled with generalizing across diverse contexts and handling the inherent variability of natural language <ref type="bibr">[21]</ref>. The introduction of deep learning, particularly Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, improved the ability to capture sequential dependencies and contextual nuances <ref type="bibr">[1]</ref>. Recently, transformer-based models such as BERT <ref type="bibr">[11]</ref> have revolutionized entity extraction by leveraging self-attention mechanisms, enabling the capture of long-range dependencies and complex contextual relationships <ref type="bibr">[34]</ref>.</p><p>Entity linking complements extraction by resolving ambiguities and connecting extracted entities to predefined ontologies or knowledge bases such as Wikipedia <ref type="bibr">[12]</ref>, DBpedia <ref type="bibr">[2]</ref>, or medical databases like SNOMED <ref type="bibr">[10]</ref> and UMLS <ref type="bibr">[4]</ref>. Traditional methods typically employed a two-step approach: extracting handcrafted features and using entity-ranking models for linking predictions. 
These methods were limited by their reliance on labor-intensive feature engineering and poor generalizability across different KBs and domains <ref type="bibr">[37]</ref>. Modern approaches such as embedding-based methods, powered by models like BERT, have significantly enhanced linking by mapping textual entities and knowledge base entries into a shared vector space <ref type="bibr">[6]</ref>. Despite these advancements, embedding-based methods still face challenges such as reliance on large annotated datasets, difficulty generalizing to unseen entities or domains, and computational inefficiency at scale, while often struggling with nuanced contextual relationships in specialized domains <ref type="bibr">[37]</ref>.</p><p>In the biomedical domain, extraction and linking tasks are often addressed independently, overlooking their inherent interdependencies and the potential benefits of integrating them. However, combining extraction and linking is critical for applications like ICD coding and adverse event detection, where seamless integration not only enhances accuracy but also improves overall efficiency, because each task supplies essential contextual information to the other <ref type="bibr">[13,</ref><ref type="bibr">20]</ref>. BioBERT-based Named Entity Recognition and Normalization (BERN) <ref type="bibr">[22]</ref> marked a milestone in this field by unifying entity extraction and linking into a single, streamlined framework. Leveraging a fine-tuned version of BioBERT <ref type="bibr">[23]</ref>, BERN integrates these interdependent tasks and eliminates the need for separate modules, achieving state-of-the-art performance at the time. This integrated approach laid the groundwork for further advancements in medical coding. However, significant gaps and challenges remain, highlighting the need for further investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Advances in Contextual Models and LLMs</head><p>Contextual language models like BERT and its domain-specific adaptations, such as BioBERT and ClinicalBERT <ref type="bibr">[16]</ref>, have significantly improved tasks involving complex biomedical text. By incorporating domain-specific pretraining on large biomedical corpora, these models enhanced the accuracy of tasks like symptom extraction, diagnosis identification, and treatment mapping <ref type="bibr">[18,</ref><ref type="bibr">28]</ref>. Despite these advances, they continued to treat extraction and linking as distinct tasks, requiring additional post-processing steps that often introduced errors, particularly for ambiguous or infrequent cases <ref type="bibr">[19]</ref>. Moving beyond these limitations, CNN-based architectures such as MultiResCNN <ref type="bibr">[24]</ref> further advanced medical coding by incorporating residual blocks and multi-resolution filters to capture complex patterns in clinical notes. However, these models required explicit alignment of input features with predefined codes, limiting their adaptability to unseen data or emerging clinical terms.</p><p>The emergence of large language models (LLMs) like GPT-3 and GPT-4 has transformed NLP, enabling the integration of extraction and linking tasks within a single framework. For example, LLM-Codex <ref type="bibr">[40]</ref> utilized a two-stage pipeline to improve the reliability of ICD coding predictions, with an LLM in the first stage and a verifier model in the second stage. Similarly, Boyle et al. <ref type="bibr">[5]</ref> proposed a zero-shot and few-shot ICD coding method using pre-trained LLMs, framing the task as an information extraction problem and leveraging the hierarchical structure of the ICD ontology for efficient code assignment. 
While these approaches demonstrated state-of-the-art performance, their reliance on predefined hierarchies and focus on ICD coding distinguishes them from our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">THE SYMPCODER DATASET CREATION 3.1 VAERS Dataset</head><p>The Vaccine Adverse Event Reporting System (VAERS) dataset, managed by the Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration (FDA), is a critical resource for monitoring vaccine safety. VAERS operates as a passive reporting system where individuals, including healthcare professionals and the general public, can submit reports of adverse events following immunization. Healthcare professionals must report certain adverse events, and vaccine manufacturers must report all adverse events they become aware of. VAERS does not determine causality but helps detect unusual or unexpected patterns in adverse event reports that may suggest safety issues, aiding the CDC and FDA in identifying areas that need more evaluation and assessment. Since 1990, VAERS has collected millions of reports, each containing detailed information such as demographics, vaccination details, descriptions of adverse events, standard symptom lists, and medical history <ref type="bibr">[8]</ref>. This extensive dataset serves as a valuable resource for medical symptom coding, facilitating ongoing research and analysis to enhance our understanding of vaccine safety and address emerging concerns.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Entity Mention Annotation</head><p>We randomly selected 500 VAERS reports (1990-2023) and assembled a team of three annotators plus one validator to ensure accuracy. Each report included suggested symptom terms (MedDRA-encoded), and our annotation process marked all symptom mentions related to vaccine adverse events. This approach captures the variability of symptom expression and yields a robust dataset for model training and evaluation. The annotation process consists of two key phases, annotation and validation, designed to ensure both thoroughness and accuracy. 3.2.1 Annotation Phase. The annotation phase was carried out by three graduate students selected for their academic background, annotation skills, English proficiency, and basic clinical knowledge, ensuring they could handle the linguistic and domain-specific nuances required for accurate annotation. Each annotator was responsible for approximately one-third of the 500 reports and followed a clearly defined set of guidelines to ensure consistency and reliability across all reports.</p><p>During the annotation, (1) the annotators used an in-house data annotation tool designed specifically for this task. This tool ensured a seamless workflow, focusing solely on adverse events while disregarding procedural details and negative test results to eliminate any irrelevant information. (2) Each report was annotated one label at a time, using a list of suggested terms for symptoms provided in the original VAERS data, which are encoded according to the MedDRA terminology. This step-by-step process allowed the annotators to focus on one potential adverse event at a time, ensuring accuracy and precision. (3) In cases where annotators encountered uncertainties or ambiguities in the text, they marked the annotation as "uncertain" for further validation. 
This precautionary measure allowed a second layer of scrutiny to improve the reliability of the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Validation Phase.</head><p>The validation phase was conducted by a senior annotator specializing in natural language processing for the medical domain, selected for extensive annotation experience.</p><p>(1) The validator meticulously reviewed all annotations, especially those marked as "uncertain", to ensure flagged annotations received additional scrutiny and maintained high data integrity.</p><p>(2) For each uncertain annotation, the validator engaged in discussions with the original annotator to understand their reasoning and reach a consensus. Persistent ambiguities were resolved collaboratively within the team to ensure alignment with the annotation guidelines. (3) When discrepancies or inconsistencies were identified, the validator suggested revisions, which were reviewed and discussed with the annotators. Final decisions on adjustments were made collaboratively to ensure consistency and transparency throughout the validation process. The resulting human-annotated dataset, SYMPCODER, stands as a fundamental benchmark for medical symptom coding tasks. Beyond its utility in our study, this dataset offers valuable insights for advancing research in this field and exploring the capabilities of various methods for symptom coding in future investigations. This rigorous annotation and validation process ensured high data integrity and reliability, making SYMPCODER a robust resource for evaluating the performance of LLMs in the symptom coding task presented in this study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3">SYMPCODER Dataset.</head><p>We constructed three subsets of the annotated dataset: SYMPCODER-Full, SYMPCODER-Common-50, and SYMPCODER-Rare-50 to facilitate focused analyses. These subsets were created to evaluate model performance across symptoms with different frequencies, targeting variations between common cases and rare or ambiguous cases. Additional details about the dataset and project can be found in <ref type="url">https://github.com/LEAF-Lab-Stevens/TACO-Prompting</ref>.</p><p>&#8226; SYMPCODER-Full: This comprehensive dataset includes all annotated symptom mentions from the VAERS reports, providing a holistic benchmark for symptom extraction and linking tasks. &#8226; SYMPCODER-Common-50: This subset focuses on the 50 most frequently mentioned symptoms in the dataset. Symptoms were ranked by their frequency of occurrence within SYMPCODER-Full. The top 50 symptoms were selected, and reports containing at least one of these symptoms were included in this subset. &#8226; SYMPCODER-Rare-50: This subset highlights the 50 least frequently mentioned symptoms. Following a similar process to the Common-50 subset, symptoms were ranked by frequency, and reports containing at least one of these rare symptoms were included in this subset.</p></div>
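<div xmlns="http://www.tei-c.org/ns/1.0"><p>The subset construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary layout of the reports, and the tie-breaking among equally frequent symptoms are all assumptions made for the example.</p>
<p><![CDATA[
```python
from collections import Counter


def build_frequency_subsets(reports, k=50):
    """Split reports into 'common' and 'rare' subsets by symptom frequency.

    `reports` is a hypothetical layout: {report_id: [symptom term, ...]}.
    Returns two dicts with the same layout.
    """
    # Count every occurrence of every symptom term across all reports.
    counts = Counter(term for terms in reports.values() for term in terms)
    # most_common() sorts by descending count; ties keep first-seen order
    # (an assumption, the paper does not specify tie-breaking).
    ranked = [term for term, _ in counts.most_common()]
    common_terms = set(ranked[:k])
    rare_terms = set(ranked[-k:])
    # Keep a report if it contains at least one term of the target group.
    common = {rid: terms for rid, terms in reports.items()
              if common_terms.intersection(terms)}
    rare = {rid: terms for rid, terms in reports.items()
            if rare_terms.intersection(terms)}
    return common, rare


toy_reports = {
    "r1": ["Pyrexia", "Headache"],
    "r2": ["Pyrexia"],
    "r3": ["Pyrexia", "Syncope", "Headache"],
}
common, rare = build_frequency_subsets(toy_reports, k=1)
print(sorted(common), sorted(rare))
```
]]></p>
<p>Note that a report can belong to both subsets when it mentions a frequent and an infrequent symptom, which follows directly from the "at least one" inclusion rule.</p></div>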
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Basic Statistics of SYMPCODER</head><p>Of the initial 500 annotations, 487 distinct reports remained after removing 13 uncertain results, which the annotators and validator could not verify as symptoms due to limited domain knowledge or complex contexts. The SYMPCODER datasets, comprising SYMPCODER-Full, SYMPCODER-Common-50, and SYMPCODER-Rare-50, provide a robust foundation for further studies. As shown in Table <ref type="table">1</ref>, the SYMPCODER-Full dataset offers a comprehensive resource, capturing diverse adverse event descriptions with varying lengths and complexities. Table <ref type="table">1</ref> provides additional details, such as the average and median lengths of clinical text, suggested symptoms, and extracted symptoms. These metrics reveal consistent gaps between human-labeled and model-generated symptoms across all subsets, highlighting the challenges of achieving complete extraction. By focusing on common and rare subsets, the SYMPCODER-Common-50 and SYMPCODER-Rare-50 datasets provide targeted benchmarks for evaluating model performance across both frequent and infrequent adverse events. Figure <ref type="figure">2a</ref> shows that most reports in the SYMPCODER-Full dataset contain between 0 and 10 symptoms, with a sharp decline for higher counts. A similar trend is observed in Figure <ref type="figure">2b</ref> for the SYMPCODER-Common-50 dataset. Figure <ref type="figure">2c</ref> highlights a steeper drop after 10 symptoms in the SYMPCODER-Rare-50 dataset, with a few outliers reaching up to 40 symptoms, reflecting the rarity and complexity of such cases.</p><p>[Figure 3: Structure of the TASI and TACO prompting methods.]</p><p>Clinical Text (TASI and TACO): {'Headaches for 2 weeks tired for 2 weeks sinus congestion for 2 weeks left side of throat sore month.',}</p><p>Prompt Header (TASI): First, extract a symptom list including all symptoms from the clinical text above. Then, extract the symptoms that indicating each of the suggested terms below from the symptom list in previous step.</p><p>Prompt Header (TACO): Extract the terms mentioned in the clinical text above that indicating each of the following terms</p><p>Prompt Body (TASI and TACO): Include the terms in the output even if the terms are not explicitly mentioned in the provided report, just provide 'none' as the result. Please follow the order of this list: {suggested symptom list} and generate output by following the requirements below: Requirements: 1. (Define adverse event symptom) 2. (Return "none" for non-symptom terms) 3. (Use JSON as shown in the example)</p><p>Output Instruction (TASI): {{"Symptom1": ["redness", "redness"], "Symptom2": ["fever"], "Symptom3": ["sore", "arm", "soreness"], "Symptom4": ["heartfailure"]} {"Erythema": ["redness", "redness"], "Pain in extremity": ["sore", "arm", "soreness"], "Pruritus": ["none"]}}</p><p>Output Instruction (TACO): {{"Erythema": ["redness", "redness"], "Pain in extremity": ["sore", "arm", "soreness"], "Pruritus": ["none"]}}</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">THE PROPOSED TACO PROMPTING</head><p>This paper evaluates the efficacy of various LLMs in extracting vaccine adverse event symptoms from VAERS reports and linking them to formal MedDRA codes in a predefined list. To address this task, we designed Task as Context (TACO) prompting, a novel prompting method that integrates symptom extraction and linking into a unified framework. By embedding the interrelated tasks within a single prompt, TACO leverages the broader task context to enhance model performance and adaptability for automated symptom coding. To benchmark its effectiveness, we also designed Task as Sequential (TASI), which separates the two tasks into distinct phases, allowing for a comparative analysis of the two approaches. Through this comparison, our study highlights how embedding interrelated tasks, as proposed by the Task as Context strategy, enhances the performance and adaptability of LLMs for automated medical symptom coding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Task as Context Prompting</head><p>4.1.1 Inspiration. The Task as Context concept, originally designed for annotation workflows with non-experts involving interdependent tasks, emphasizes leveraging contextual relationships to improve accuracy and adaptability. This strategy integrates related tasks into a single process, reducing inefficiencies and error propagation <ref type="bibr">[25]</ref>. In the context of symptom coding, two highly interdependent tasks, symptom extraction and linking symptoms to standardized codes, are traditionally solved independently. However, this modular approach suffers from limitations, including the loss of contextual cues between tasks, error propagation from one task to another, and scalability challenges due to the need for separate training processes.</p><p>Inspired by the Task as Context strategy, we address these limitations by designing TACO prompting, which explicitly embeds the interdependence of extraction and linking into a unified prompting framework. TACO integrates the tasks within the prompt, enabling simultaneous learning and execution. By treating the two tasks holistically, our approach improves contextual understanding, minimizes error propagation, and enhances both accuracy and scalability, providing a streamlined and adaptive solution for symptom coding challenges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">TACO Prompting Design.</head><p>To clearly present the structure of TACO prompting, we outline its four core components in Figure <ref type="figure">3</ref>.</p><p>&#8226; Clinical Text: The clinical text serves as the raw, unstructured input data from the VAERS reports. It mirrors real-world scenarios where models must process free-text narratives to extract and link relevant symptoms. This component provides the context required for understanding the input data and serves as the foundation for subsequent tasks. &#8226; Prompt Header: The prompt header provides explicit instructions for task execution. In TACO, this component integrates both symptom extraction and linking tasks into a single step. It instructs the model to directly extract symptoms from the clinical text and map them to the provided suggested terms, leveraging the broader task context. In contrast, the benchmark prompt TASI separates these tasks into two distinct phases, requiring the model to first extract all symptoms and subsequently link them to suggested terms. &#8226; Prompt Body: The prompt body contains detailed guidelines for task execution and ensures that both tasks are carried out as intended. For TACO, the instructions guide the model to simultaneously address symptom extraction and linking, thereby reducing redundancy and improving efficiency. TASI, however, employs sequential instructions, where symptom extraction is followed by linking. This distinction reflects the core advantage of TACO, which embeds the broader task context into a cohesive framework to improve task understanding and execution. &#8226; Output Instruction: The output instruction specifies the expected output format to ensure consistency and clarity. Both TACO and TASI utilize a structured JSON format; however, their approaches differ. 
In TACO, the format unifies the output by directly mapping symptoms to suggested terms, aligning with its integrated task-solving approach. In TASI, the output separates extraction and linking results, reflecting its sequential methodology. The inclusion of illustrative examples further clarifies the expected output structure for the model.</p><p>To comprehensively evaluate the effectiveness of TACO, we compare it against the benchmark prompt TASI, which adheres to traditional sequential task-solving methodologies. While TASI serves as a baseline, TACO's integrated and context-rich design seeks to overcome the limitations of traditional approaches by streamlining task execution and enhancing contextual understanding.</p></div>
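<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the four components concrete, the following sketch assembles a TACO or TASI prompt from the header, body, and example strings shown in Figure 3. It is an illustration under stated assumptions: the function name <hi rend="italic">build_prompt</hi>, the blank-line separators, and the "Example output:" label are hypothetical, and the three requirements appear only in the abbreviated form given in the figure.</p>
<p><![CDATA[
```python
import json

# Header strings quoted from Figure 3 (wording kept as in the figure).
HEADERS = {
    "TACO": ("Extract the terms mentioned in the clinical text above that "
             "indicating each of the following terms"),
    "TASI": ("First, extract a symptom list including all symptoms from the "
             "clinical text above. Then, extract the symptoms that indicating "
             "each of the suggested terms below from the symptom list in "
             "previous step."),
}

# Shared body; the requirements are abbreviated placeholders from the figure.
PROMPT_BODY = (
    "Include the terms in the output even if the terms are not explicitly "
    "mentioned in the provided report, just provide 'none' as the result.\n"
    "Please follow the order of this list: {suggested} and generate output "
    "by following the requirements below:\n"
    "Requirements: 1. (Define adverse event symptom) "
    '2. (Return "none" for non-symptom terms) '
    "3. (Use JSON as shown in the example)"
)

LINKED_EXAMPLE = {
    "Erythema": ["redness", "redness"],
    "Pain in extremity": ["sore", "arm", "soreness"],
    "Pruritus": ["none"],
}
EXTRACTED_EXAMPLE = {
    "Symptom1": ["redness", "redness"],
    "Symptom2": ["fever"],
    "Symptom3": ["sore", "arm", "soreness"],
    "Symptom4": ["heartfailure"],
}
# TACO shows only the linked mapping; TASI shows extraction, then linking.
EXAMPLES = {
    "TACO": [LINKED_EXAMPLE],
    "TASI": [EXTRACTED_EXAMPLE, LINKED_EXAMPLE],
}


def build_prompt(clinical_text, suggested_terms, method="TACO"):
    """Join clinical text, header, body, and output example into one prompt."""
    if method not in HEADERS:
        raise ValueError("method must be 'TACO' or 'TASI'")
    body = PROMPT_BODY.format(suggested=", ".join(suggested_terms))
    example = " ".join(json.dumps(obj) for obj in EXAMPLES[method])
    # Separator and label below are assumptions, not taken from the paper.
    return "\n\n".join(
        [clinical_text, HEADERS[method], body, "Example output: " + example]
    )


prompt = build_prompt("Redness and sore arm.", ["Erythema", "Pruritus"])
print(prompt)
```
]]></p>
<p>Only the header and the output example differ between the two methods in this sketch, which mirrors the description above: the clinical text and prompt body are shared, while TASI asks for an intermediate symptom list before linking.</p></div>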
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">Model Output Distillation and Evaluation Tasks.</head><p>To address inconsistencies in model outputs, we incorporate a postprocessing pipeline during the model output distillation phase, leveraging Regular Expression (regex) <ref type="bibr">[38]</ref> syntax to systematically capture the information needed. These inconsistencies include incomplete model outputs, unnecessary descriptive text at the beginning of responses, and nonsensical outputs that deviate from task requirements. The distillation process ensures uniformity and precise extraction of relevant information from the varied responses generated by LLMs. As shown in Figure <ref type="figure">1</ref>, the TACO outputs undergo this refinement step, eliminating irrelevant details and preparing them for structured evaluation.</p><p>The evaluation phase comprises two stages. In the first stage, known as Linking Integrity and Knowledge (LINK), we assess the models based on how accurately they link extracted symptoms to the corresponding terms from the Suggested List. In this stage, as Figure <ref type="figure">1</ref> shows, the model must first identify the correct symptoms (e.g., "injection site erythema," "vomiting") from the clinical text using the distilled output from the previous phase. The focus here is on ensuring that each relevant symptom has been accurately identified and linked to the correct terminology. This stage evaluates the model's ability to extract and link adverse events, providing insights into how well it handles clinical information within the context of symptom extraction.</p><p>In the second stage, termed Mention Accuracy and Textual Coherence (MATCH), the focus shifts to the quality of the original mentions generated by the model for each linked symptom. 
As depicted in Figure <ref type="figure">1</ref>, this involves comparing the generated model results to the human-annotated gold standard in terms of specificity and precision. For example, while LINK verifies if the model has correctly linked "injection site erythema," MATCH evaluates whether the model-generated mention, such as "redness at the injection site," is semantically similar to the mention in the original clinical report, which is annotated simply as "redness." This two-tiered evaluation provides a deeper analysis of how well the models not only identify but also retain the original context and details from the clinical text.</p></div>
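<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distillation step can be illustrated with a short sketch. The exact regular expressions used in the study are not given in the text, so the pattern below, the function name <hi rend="italic">distill</hi>, and the fallback to "none" for unparseable responses are assumptions chosen to cover the three failure modes named above (leading descriptive text, incomplete output, and nonsensical output).</p>
<p><![CDATA[
```python
import json
import re

# Greedy match from the first opening brace to the last closing brace,
# which skips any chatty text before or after the JSON object.
JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def distill(raw_output, suggested_terms):
    """Return {term: [mentions]} for every suggested term.

    Terms that are missing, or responses that cannot be parsed,
    fall back to ["none"] (an assumed convention for this sketch).
    """
    result = {term: ["none"] for term in suggested_terms}
    match = JSON_OBJECT.search(raw_output)
    if match is None:
        return result  # no JSON object at all, e.g. truncated output
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result  # braces present but content is not valid JSON
    for term in suggested_terms:
        mentions = parsed.get(term)
        if isinstance(mentions, list) and mentions:
            result[term] = [str(m) for m in mentions]
    return result


raw = (
    "Sure, here is the result:\n"
    '{"Erythema": ["redness"], "Pruritus": ["none"]}\n'
    "Let me know if you need anything else."
)
clean = distill(raw, ["Erythema", "Pruritus", "Vomiting"])
print(clean)
```
]]></p>
<p>Restricting the result to the suggested terms also discards any extra keys the model invents, so the downstream LINK and MATCH stages always receive the same set of keys for a given report.</p></div>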
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Benchmark LLMs</head><p>We aim to harness the capabilities of several cutting-edge Large Language Models (LLMs) for the extraction and linking of adverse events on the SYMPCODER dataset. Each model offers unique strengths and characteristics, making them suitable for various aspects of this complex task.</p><p>&#8226; Jackalope-7b <ref type="bibr">[26]</ref>: The smallest open-source model in our investigation, Jackalope-7b is fine-tuned by SlimOrca using several open datasets on top of Mistral 7B. Despite its smaller size, Jackalope-7b is optimized for specific NLP tasks and demonstrates a balanced trade-off between performance and computational efficiency. Its architecture allows for quicker adjustments and fine-tuning, making it particularly effective in scenarios requiring high precision within a smaller model footprint. &#8226; Llama2-13b-chat <ref type="bibr">[39]</ref>: Developed by Meta AI, Llama2-13b-chat is a medium-sized model based on the widely recognized Llama2. Known for its advanced natural language understanding capabilities, Llama2-13b-chat is designed to handle complex language tasks with a focus on generating human-like responses. Its medium size strikes a balance between computational demand and performance, making it a versatile choice for a variety of NLP applications. &#8226; GPT-3.5-Turbo <ref type="bibr">[29]</ref>: An efficient variant of OpenAI's GPT-3, GPT-3.5-Turbo is designed to excel in natural language processing tasks with optimized performance. It leverages the strengths of GPT-3 while incorporating enhancements that improve response coherence, accuracy, and efficiency. This model is well-suited for applications requiring robust language comprehension and generation capabilities. 
&#8226; GPT-4-Turbo <ref type="bibr">[30]</ref>: An enhanced version of GPT-3.5-Turbo, GPT-4-Turbo offers comprehensive improvements in terms of model architecture, training data diversity, and computational efficiency. It is designed to handle more complex language tasks with greater accuracy and faster response times. GPT-4-Turbo's advancements make it a powerful tool for sophisticated NLP applications, including detailed extraction and linking tasks. &#8226; GPT-4o <ref type="bibr">[31]</ref>: A faster and more cost-effective variant of GPT-4-Turbo, GPT-4o provides efficient performance at a reduced cost. It retains many of the advanced features of GPT-4-Turbo while optimizing for speed and resource utilization. GPT-4o is particularly advantageous in environments where computational resources are limited, but high performance is still required. Its ability to balance cost and efficiency makes it a practical choice for extensive NLP tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>Using the SYMPCODER dataset, we study three research questions (RQs): (1) RQ1: How do the TACO and TASI prompting strategies impact the performance of LLMs in the symptom coding task? (2) RQ2: How do different LLMs perform on the symptom coding task when evaluated with the same prompt design? (3) RQ3: How do LLMs perform on the top 50 common and rare symptom datasets, and how does performance vary across these cases?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental Setting</head><p>5.1.1 Dataset Description. In the evaluation, we used the SYMPCODER dataset, comprising SYMPCODER-Full, SYMPCODER-Common-50, and SYMPCODER-Rare-50. These subsets enable comprehensive evaluation of LLM performance across both frequently and infrequently observed symptoms, providing insights into model robustness and adaptability. More details on the dataset and its creation can be found in Section 3.</p></div>
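<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the idea of frequency-based subsets concrete, the following sketch shows one plausible way to derive a "common" and a "rare" split from annotated reports. It is an assumption-laden illustration, not the construction procedure of SYMPCODER-Common-50 or SYMPCODER-Rare-50 (see Section 3 for that): the report schema with a "terms" list, the function name frequency_subsets, the per-report counting, and the alphabetical tie-breaking are all hypothetical.</p>
<eg><![CDATA[
```python
from collections import Counter


def frequency_subsets(reports, k=50):
    """Split annotated reports into common and rare subsets by term frequency.

    `reports` is a list of dicts with a "terms" list (hypothetical schema).
    Returns (common_terms, rare_terms, common_reports, rare_reports).
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Count each term once per report so long reports do not dominate.
    counts = Counter(term for report in reports for term in set(report["terms"]))

    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    common_terms, rare_terms = set(ranked[:k]), set(ranked[-k:])

    # Keep every report that mentions at least one term from the subset.
    common_reports = [r for r in reports if common_terms.intersection(r["terms"])]
    rare_reports = [r for r in reports if rare_terms.intersection(r["terms"])]
    return common_terms, rare_terms, common_reports, rare_reports


toy_reports = [
    {"id": 1, "terms": ["Pyrexia", "Fatigue"]},
    {"id": 2, "terms": ["Pyrexia"]},
    {"id": 3, "terms": ["Pyrexia", "Facial spasm"]},
    {"id": 4, "terms": ["Fatigue"]},
]
common, rare, common_reports, rare_reports = frequency_subsets(toy_reports, k=1)
print(common, rare)
```
]]></eg>
<p>Note that when the vocabulary holds fewer than 2k distinct terms, the two term sets in this sketch overlap; a real split would need to handle that case explicitly.</p></div>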
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Evaluation Metrics.</head><p>The evaluation metrics used to assess model performance are structured according to the two stages of the evaluation process defined in Section 4.1.3. Across both stages, we employ the following metrics: Exact Match (EM) <ref type="bibr">[36]</ref>, Fuzzy Match <ref type="bibr">[9]</ref>, Precision <ref type="bibr">[14]</ref>, Recall <ref type="bibr">[14]</ref>, BLEU Score <ref type="bibr">[32]</ref>, and Cosine Similarity <ref type="bibr">[35]</ref>. Each metric provides a unique perspective on the models' performance in extracting and linking adverse event symptoms and their original mentions.</p><p>In the first stage (LINK), the evaluation focuses on term linking, with Exact Match (EM), Fuzzy Match, Precision, and Recall serving as the primary metrics. EM measures accuracy by verifying whether the extracted terms exactly match the human-annotated terms. Fuzzy Match accounts for minor discrepancies by allowing approximate matches between terms. In the final evaluation the two are combined: Exact Match is applied first, and Fuzzy Match is then applied to the terms that remain unmatched, ensuring a more comprehensive and accurate evaluation. Precision and Recall quantify the accuracy and completeness of term linking, providing insights into the models' ability to capture and associate relevant terms with their respective medical codes.</p><p>In the second stage (MATCH), the evaluation centers on the quality of the original mentions generated by the models. We assess mention quality using BLEU, Fuzzy Match, and Cosine Similarity, which respectively measure n-gram precision, tolerate minor wording differences, and evaluate semantic coherence via vector embeddings. The embeddings for cosine similarity are derived from OpenAI embedding models, ensuring robust semantic comparisons.</p></div>
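<div xmlns="http://www.tei-c.org/ns/1.0"><p>The LINK-stage cascade described above (Exact Match first, Fuzzy Match only for the leftovers, then Precision and Recall over the matched pairs) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluation code used in this study: the function names, the use of difflib for fuzzy matching, the greedy one-to-one pairing, and the 0.85 threshold are all hypothetical choices.</p>
<eg><![CDATA[
```python
from difflib import SequenceMatcher


def normalize(term):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(term.lower().split())


def link_matches(predicted, gold, fuzzy_threshold=0.85):
    """Pair predicted terms with gold terms: Exact Match first, then Fuzzy Match.

    Returns a dict mapping each matched predicted term to its gold term.
    The 0.85 threshold is an illustrative assumption, not a value from the paper.
    """
    remaining_gold = {normalize(g): g for g in gold}
    matches = {}
    unmatched = []

    # Pass 1: Exact Match on normalized strings.
    for pred in predicted:
        key = normalize(pred)
        if key in remaining_gold:
            matches[pred] = remaining_gold.pop(key)
        else:
            unmatched.append(pred)

    # Pass 2: Fuzzy Match, only for predictions without an exact match.
    for pred in unmatched:
        best_key, best_score = None, 0.0
        for key in remaining_gold:
            score = SequenceMatcher(None, normalize(pred), key).ratio()
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= fuzzy_threshold:
            matches[pred] = remaining_gold.pop(best_key)

    return matches


def precision_recall(predicted, gold, fuzzy_threshold=0.85):
    """Precision = matched / predicted terms; Recall = matched / gold terms."""
    matched = len(link_matches(predicted, gold, fuzzy_threshold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# "Injection site erythma" is a deliberate misspelling to exercise the fuzzy pass.
predicted_terms = ["Pyrexia", "Injection site erythma", "Headache"]
gold_terms = ["Pyrexia", "Injection site erythema", "Rash"]
print(link_matches(predicted_terms, gold_terms))
print(precision_recall(predicted_terms, gold_terms))
```
]]></eg>
<p>In this toy case the misspelled term is recovered only in the fuzzy pass, while "Headache" stays unmatched, so two of three predicted terms and two of three gold terms are counted as linked.</p></div>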
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3">Implementation Details.</head><p>For the open-source models, we used Jackalope-7B and Llama2-13B, with models sourced from TheBloke via Hugging Face. For the GPT models, we accessed the OpenAI APIs to obtain embeddings and inference results. To regulate the output length, we set the parameter max_new_tokens to 256, and we adjusted the temperature within the range of 0.3 to 0.5 to optimize performance. Our experiments were conducted on hardware comprising two Nvidia RTX A5000 GPUs with 298.5 GB of disk space and 104 GB of RAM. The processing time for each model, spanning from obtaining inference results to generating evaluations, remained within 2 hours.</p><p>An interesting observation is the relative performance of Jackalope-7b, which surpasses Llama2-13b in the LINK stage under TASI. This unexpected trend could stem from Jackalope-7b's simpler architecture and focused training, which might align better with the structured nature of TASI prompts. However, this advantage diminishes in the TACO scenario, where the added contextual complexity benefits more advanced models like GPT-4-Turbo and GPT-4o. This suggests that larger and more advanced LLMs are better equipped to handle the challenges of contextual integration inherent in TACO prompting.</p><p>Overall, the results demonstrate that GPT-4-Turbo and GPT-4o are the most effective models for vaccine adverse symptom coding. Their advanced architectures and contextual capabilities enable them to achieve consistently high performance across both evaluation stages, making them the most reliable choices for addressing the complexities of this task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">Common Rare Symptoms Analysis (RQ3).</head><p>To evaluate the performance of LLMs on datasets containing the top 50 most common and bottom 50 least common symptoms, we conducted analyses using both TASI and TACO prompting methods. Figures <ref type="figure">4</ref> and <ref type="figure">5</ref> provide insights into how models handle frequent versus infrequent adverse symptoms, with a focus on model performance under each prompting strategy before comparing the methods.</p><p>Under TACO prompting, as shown in Figure <ref type="figure">5</ref>, models like GPT-4o and GPT-4-Turbo consistently exhibit higher precision and recall for the top 50 most common symptoms (Figure <ref type="figure">4a</ref>). This demonstrates their ability to accurately and comprehensively capture frequent symptoms due to their advanced architectures and extensive training. However, for the bottom 50 least common symptoms (Figure <ref type="figure">4b</ref>), performance varies more significantly. GPT-4-Turbo maintains relatively robust recall and precision, while GPT-4o and Llama2-13b-chat experience a noticeable decline in both metrics. This decrease suggests that rare symptoms, despite their distinctiveness, are challenging to extract consistently under TACO prompting, likely due to their sparse representation in training data. The similarity scores (Figure <ref type="figure">4c</ref>) further support these observations, with GPT-4o and GPT-4-Turbo achieving the highest semantic accuracy for both common and rare symptoms, while smaller models like Llama2-13b-chat perform less effectively.</p><p>Under TASI prompting, as depicted in Figure <ref type="figure">4</ref>, similar trends emerge, but the recall values for rare symptoms exhibit a sharper decline compared to TACO prompting (Figure <ref type="figure">5b</ref>). 
For the top 50 most common symptoms (Figure <ref type="figure">5a</ref>), all models maintain high precision, with GPT-4o and GPT-4-Turbo again leading in recall. However, for the bottom 50 rare symptoms, recall values decrease more dramatically for models like Llama2-13b-chat and GPT-4o, indicating that TASI's sequential structure is less effective at capturing rare terms. Precision scores for rare symptoms also decline under TASI, although not as significantly as recall. The similarity scores (Figure <ref type="figure">5c</ref>) show consistent results, with GPT-4o and GPT-4-Turbo outperforming smaller models, particularly in the rare case analysis.</p><p>When comparing the two prompting methods, TACO consistently demonstrates advantages over TASI for capturing common symptoms, with higher recall and precision for most models. For rare symptoms, TACO prompting provides better consistency in recall and precision across advanced models like GPT-4-Turbo and GPT-4o, although challenges remain for smaller models. The integrated contextual design of TACO prompting likely allows models to handle both frequent and infrequent patterns more effectively, while TASI's sequential structure introduces limitations, particularly for rare cases.</p><p>Overall, the comparison reveals that advanced models such as GPT-4o and GPT-4-Turbo consistently outperform smaller models under both prompting methods. However, TACO prompting offers clear advantages in handling the variability of both common and rare adverse symptoms, highlighting its effectiveness as a strategy for vaccine adverse symptom coding tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.4">Case Study: Common vs Rare Mention.</head><p>Table <ref type="table">4</ref> evaluates the ability of GPT-4 Turbo to capture common and rare terms in clinical text. The selected terms ("Eye Irritation," "Facial Spasm," "Blister," "Fatigue," "Pyrexia," and "Dizziness") are drawn from the top 50 (common) and bottom 50 (rare) terms in the dataset. This selection provides a balanced analysis of the model's performance across varying levels of term frequency.</p><p>For common terms like "Fatigue" and "Dizziness," GPT-4 Turbo enriched its outputs with additional contextual details, such as "wiped out" for "Fatigue" and "wobbly legs" for "Dizziness." While this demonstrates the model's ability to infer related terms, the overgeneralization occasionally introduced extraneous phrases that deviated from the original clinical mention. In the case of "Pyrexia," the model generated descriptors such as "low-grade temp" and "temp of 103 degrees," which, while adding specificity, risked fragmenting data due to overly detailed outputs.</p><p>Rare terms, on the other hand, posed greater challenges. For example, "Eye Irritation" was simplified to "burning eyes," reflecting a preference for informal phrasing over formal clinical terminology. Similarly, terms like "Facial Spasm" and "Blister" were extracted correctly but lacked variation, suggesting limited diversity in the model's outputs for rare terms. These issues may stem from the sparse representation of rare cases in the training data, causing the model to favor familiar informal phrases and rely on heuristics that prioritize common terms over rare, specific ones.</p><p>In summary, GPT-4 Turbo performs well with common terms but struggles to preserve formal phrasing for rare ones. This weakness, likely rooted in sparse training data, underscores the need to refine prompts and add post-processing to better handle underrepresented terms.</p>
<table><head>Table 5: Illustrative Example of Model Prediction Comparison. This table compares the outputs of various models against the gold labels for a selected VAERS report, illustrating the models' abilities to identify and link symptoms from the given text description to standardized terms. Text in the same color refers to the same symptom, while red text indicates incorrect predictions compared to the gold labels.</head>
<row><cell role="label">Text Descriptions</cell><cell>24 hours after vaccines had temp of 103.2 degrees. Gave Tylenol. On Tues., lack of appetite. On Wed, blotchy rash on face &amp; body. On Thurs, rash lessened. On Fri, no rash. Fever lasted about 24 hours. Has had cold for approx. 3 weeks/runny nose, occasional fever). Directed to inform physician of vaccines and events.</cell></row>
<row><cell role="label">Gold Labels</cell><cell>Pyrexia: [Fever], Decreased appetite: [lack, of, appetite], Rhinorrhoea: [weeks/runny, nose], Rash: [blotchy, rash, on, face, &amp; body], Rash macular: [blotchy, rash]</cell></row>
<row role="label"><cell cols="2">Model Output</cell></row>
<row><cell role="label">Jackalope-7b</cell><cell>Pyrexia: [fever, temp of 103.2 degrees, lack of appetite, on thursday, no rash, on friday], Decreased appetite: [on tuesday, on wednesday, on friday], Rhinorrhoea: [cold for approx. 3 weeks/runny nose], Rash: [blotchy rash on face &amp; body, on wednesday, on thursday, on friday], Rash macular: [rash on face &amp; body, rash on wednesday, rash on thursday, rash on friday]</cell></row>
<row><cell role="label">Llama2-chat-13b</cell><cell>Pyrexia: [fever], Decreased appetite: [lack of appetite], Rhinorrhoea: [none], Rash: [blotchy rash, rash lessened, no rash], Rash macular: [none]</cell></row>
<row><cell role="label">GPT-3.5-Turbo</cell><cell>Pyrexia: [temp, fever], Decreased appetite: [lack of appetite], Rhinorrhoea: [cold, runny nose], Rash: [blotchy rash], Rash macular: [none]</cell></row>
<row><cell role="label">GPT-4-Turbo</cell><cell>Pyrexia: [temp of 103.2 degrees, Fever lasted about 24 hours], Decreased appetite: [lack of appetite], Rhinorrhoea: [runny nose], Rash: [blotchy rash on face &amp; body], Rash macular: [none]</cell></row>
<row><cell role="label">GPT-4o</cell><cell>Pyrexia: [temp of 103.2 degrees, fever lasted about 24 hours, occasional fever], Decreased appetite: [lack of appetite], Rhinorrhoea: [runny nose], Rash: [blotchy rash on face &amp; body, rash lessened, no rash], Rash macular: [none]</cell></row>
</table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.5">Case Study: Mention Discrepancy.</head><p>Table <ref type="table">5</ref> compares outputs from five models (Jackalope-7b, Llama2-chat-13b, GPT-3.5-Turbo, GPT-4 Turbo, and GPT-4o) with gold-standard annotations, revealing significant differences in capturing nuanced and rare mentions.</p><p>Advanced models like GPT-4 Turbo and GPT-4o consistently provided detailed and contextually enriched outputs. For instance, GPT-4o added specific details such as "occasional fever," closely aligning with the gold standard while offering richer context. Jackalope-7b, on the other hand, tended to over-annotate by introducing temporal phrases like "on Wednesday" and "on Friday," which, while detailed, often introduced unnecessary noise. Llama2-chat-13b struggled significantly, failing to identify certain mentions, particularly less frequent ones like "Rash Macular," highlighting its limitations in capturing rare or complex terms.</p><p>For common terms like "Decreased Appetite," all models except Jackalope-7b closely matched the gold standard. However, performance on rare terms like "Rash Macular" varied: Jackalope-7b offered multiple details, while Llama2-chat-13b and GPT-3.5-Turbo often returned "none."</p><p>This analysis reinforces a clear trend: higher-capacity models generally excel in capturing nuanced terms, whereas smaller models struggle with rare and complex ones. Even the most advanced models miss infrequent terms occasionally, highlighting an ongoing need for refined prompting strategies and post-processing pipelines to ensure clinical relevance and consistency.</p></div>
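<div xmlns="http://www.tei-c.org/ns/1.0"><p>The comparison in this case study can be reproduced mechanically. The sketch below parses outputs written in the "Term: [mention, mention]" layout of Table 5 and lists the gold terms a model failed to link. The parser, its regular expression, the function names, and the decision to treat a "[none]" entry as "term not linked" are assumptions made for illustration; the strings are copied from the GPT-4-Turbo and Llama2-chat-13b rows of Table 5.</p>
<eg><![CDATA[
```python
import re

# One "Term: [mention, mention, ...]" pair; term names are letters and spaces.
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)\s*:\s*\[([^\]]*)\]")


def parse_output(text):
    """Parse 'Term: [mention, ...], Term: [...]' into {term: [mentions]}."""
    parsed = {}
    for term, body in PAIR.findall(text):
        parsed[term.strip()] = [m.strip() for m in body.split(",") if m.strip()]
    return parsed


def linked_terms(parsed):
    """Terms with at least one real mention; '[none]' counts as not linked (assumption)."""
    return {
        term
        for term, mentions in parsed.items()
        if any(m.lower() != "none" for m in mentions)
    }


GOLD_TERMS = {"Pyrexia", "Decreased appetite", "Rhinorrhoea", "Rash", "Rash macular"}

MODEL_OUTPUTS = {
    "GPT-4-Turbo": (
        "Pyrexia: [temp of 103.2 degrees, Fever lasted about 24 hours], "
        "Decreased appetite: [lack of appetite], Rhinorrhoea: [runny nose], "
        "Rash: [blotchy rash on face & body], Rash macular: [none]"
    ),
    "Llama2-chat-13b": (
        "Pyrexia: [fever], Decreased appetite: [lack of appetite], "
        "Rhinorrhoea: [none], Rash: [blotchy rash, rash lessened, no rash], "
        "Rash macular: [none]"
    ),
}

for model, output in MODEL_OUTPUTS.items():
    missed = sorted(GOLD_TERMS - linked_terms(parse_output(output)))
    print(model, "missed:", missed)
```
]]></eg>
<p>Under these assumptions, GPT-4-Turbo misses only "Rash macular" for this report, while Llama2-chat-13b also misses "Rhinorrhoea".</p></div>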
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSIONS</head><p>In this paper, we addressed the challenges of accurate symptom coding from unstructured clinical data, particularly focusing on the nuanced extraction and linking of symptoms in vaccine safety reports such as those in VAERS. To benchmark the performance of LLMs on this tailored task, we introduced SYMPCODER, a human-annotated dataset that captures both common and rare symptoms, providing a robust foundation for evaluating adverse event extraction and linking tasks. Our proposed TACO prompting framework unifies symptom extraction and linking into a single process, significantly improving contextual accuracy and reducing information loss typically associated with traditional separate workflows. In addition, we introduced a two-stage evaluation framework, comprising the LINK and MATCH phases, which enables a detailed assessment of model performance by focusing on both symptom linking and original mention matching. This framework revealed key differences in how models handle common versus rare symptoms, offering a nuanced understanding of LLM capabilities. Through extensive experiments with several state-of-the-art LLMs, we demonstrated the superior performance of TACO, emphasizing the impact of task-specific prompt design on model accuracy. Our findings not only highlight the effectiveness of TACO prompting in enhancing model performance but also underscore its potential as a flexible framework for developing tailored coding tasks in clinical natural language processing. Future work will explore broader applications of TACO, aiming to develop more innovative and impactful solutions in the medical field.</p></div></body>
		</text>
</TEI>
