Creators/Authors contains: "Sun, Jimeng"


  1. Abstract

    In the U.S. inpatient payment system, the Diagnosis-Related Group (DRG) is pivotal, but its assignment process is inefficient. This study introduces an advanced large language model (LLM) fine-tuned on clinical notes to improve DRG assignment. Using LLaMA as the foundation model and optimizing it through Low-Rank Adaptation (LoRA) on 236,192 MIMIC-IV discharge summaries, our 7B model achieved a macro-averaged F1 score of 0.327, a top-1 prediction accuracy of 52.0%, and a macro-averaged Area Under the Curve (AUC) of 0.986, with a maximum input token length of 512. The model outperformed prior leading models in DRG prediction, with relative improvements in macro-averaged F1 score of 40.3% over ClinicalBERT and 35.7% over CAML. Applied to base DRG and complication or comorbidity (CC)/major complication or comorbidity (MCC) prediction, the model achieved top-1 prediction accuracies of 67.8% and 67.5%, respectively. Additionally, our findings indicate that the model's performance correlates with increased model parameters and input context lengths.
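The macro-averaged F1 and "relative improvement" figures quoted in this abstract can be illustrated with a minimal sketch. The function names, the toy labels, and the back-calculated baseline scores below are assumptions for illustration only; the abstract does not report the baseline F1 values directly, and this is not the authors' evaluation code.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`, e.g. 0.403 for 40.3%."""
    return (new - baseline) / baseline

# Toy labels (hypothetical class codes, not MIMIC-IV data).
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
toy_score = macro_f1(y_true, y_pred)

# Baseline F1 values implied by the stated percentages (back-calculated
# assumptions, not numbers quoted from the paper).
implied_clinicalbert_f1 = 0.327 / 1.403
implied_caml_f1 = 0.327 / 1.357
```

Under these assumptions, a 40.3% relative gain at F1 = 0.327 implies a ClinicalBERT baseline near 0.233, and a 35.7% gain implies a CAML baseline near 0.241.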

  2. Abstract Objectives

    Respiratory syncytial virus (RSV) is a significant cause of pediatric hospitalizations. This article aims to utilize multisource data and leverage the tensor methods to uncover distinct RSV geographic clusters and develop an accurate RSV prediction model for future seasons.

    Materials and Methods

    This study uses 5 years of RSV data from multiple sources, including medical claims, CDC surveillance data, and Google search trends. We conduct spatiotemporal tensor analysis and prediction for pediatric RSV in the United States by designing (i) a nonnegative tensor factorization model for pediatric RSV disease and location clustering and (ii) a recurrent neural network tensor regression model for county-level trend prediction using the disease and location features.
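As a rough illustration of step (i), the sketch below fits a generic nonnegative CP (PARAFAC) factorization with multiplicative updates in NumPy. The abstract does not specify the exact factorization or solver, so the update rule, the disease x county x week axis layout, and all function names here are assumptions rather than the authors' method.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    R = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, R)

def unfold(X, mode):
    """Mode-n unfolding with the remaining modes flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the 3-way tensor from its CP factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def nonneg_cp(X, rank, n_iter=300, eps=1e-9, seed=0):
    """Nonnegative CP of a 3-way tensor via multiplicative updates."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            factors[mode] *= numer / denom  # ratio is >= 0, so factors stay >= 0
    return factors

def relative_error(X, factors):
    return np.linalg.norm(X - reconstruct(factors)) / np.linalg.norm(X)

# Synthetic tensor (hypothetical sizes: 6 diseases x 8 counties x 10 weeks).
rng = np.random.default_rng(42)
true_factors = [rng.random((n, 2)) for n in (6, 8, 10)]
X = reconstruct(true_factors)
fitted = nonneg_cp(X, rank=2)
```

Rows of the county factor matrix can then be clustered (for example, by the component with the largest loading) to group locations, which is one plausible reading of "location clustering" in the abstract.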


    Results

    We identify a clustering hierarchy of pediatric diseases. Three common geographic clusters of RSV outbreaks were identified from independent sources, showing an annual RSV trend that shifts across US regions, from the South and Southeast to the Central and Northeast and then to the West and Northwest. Precipitation and temperature were each found to be correlative factors, with a coefficient of determination R² ≈ 0.5. Our regression model accurately predicted the 2022-2023 RSV season at the county level, achieving R² ≈ 0.3, a mean absolute error (MAE) < 0.4, and a Pearson correlation greater than 0.75, significantly outperforming the baselines (P < .05).


    Conclusions

    Our proposed framework provides a thorough analysis of RSV disease in the United States, which enables healthcare providers to better prepare for potential outbreaks, anticipate increased demand for services and supplies, and save more lives with timely interventions.

  3. Free, publicly-accessible full text available August 4, 2024
  4. Free, publicly-accessible full text available August 1, 2024
  5. Abstract

    In this work, we aim to accurately predict the number of hospitalizations during the COVID-19 pandemic by developing a spatiotemporal prediction model. We propose HOIST, an Ising dynamics-based deep learning model for spatiotemporal COVID-19 hospitalization prediction. By drawing the analogy between locations and lattice sites in statistical mechanics, we use the Ising dynamics to guide the model to extract and utilize spatial relationships across locations and model the complex influence of granular information from real-world clinical evidence. By leveraging rich linked databases, including insurance claims, census information, and hospital resource usage data across the U.S., we evaluate the HOIST model on the large-scale spatiotemporal COVID-19 hospitalization prediction task for 2299 counties in the U.S. In the 4-week hospitalization prediction task, HOIST achieves a mean absolute error of 368.7, an R² of 0.6, and a concordance correlation coefficient of 0.89 on average. Our detailed number needed to treat (NNT) and cost analysis suggest that future COVID-19 vaccination efforts may be most impactful in rural areas. This model may serve as a resource for future county and state-level vaccination efforts.
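The three evaluation metrics named in this abstract can be sketched as follows. The abstract does not define them, so this assumes the standard mean absolute error, the usual coefficient of determination, and Lin's concordance correlation coefficient computed with population moments; the function names and toy values are hypothetical.

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def concordance_cc(y, y_hat):
    """Lin's CCC (assumed definition): penalizes both scatter and systematic bias."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    cov = np.mean((y - y.mean()) * (y_hat - y_hat.mean()))
    return float(2.0 * cov / (y.var() + y_hat.var() + (y.mean() - y_hat.mean()) ** 2))

# Toy counts (not real hospitalization data): predictions biased upward by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
```

In this toy case the Pearson correlation is exactly 1, yet the concordance coefficient is lower because the predictions are uniformly shifted, which is why a concordance score is informative alongside R² for count forecasts.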

  6. Deep generative models have enabled the automated synthesis of high-quality data for diverse applications. However, the most effective generative models are specialized to data from a single domain (e.g., images or text). Real-world applications such as healthcare require multi-modal data from multiple domains (e.g., both images and corresponding text), which are difficult to acquire due to limited availability and privacy concerns and are much harder to synthesize. To tackle this joint synthesis challenge, we propose an End-to-end MultImodal X-ray genERative model (EMIXER) for jointly synthesizing X-ray images and corresponding free-text reports, all conditional on diagnosis labels. EMIXER is a conditional generative adversarial model that works by 1) generating an image based on a label, 2) encoding the image to a hidden embedding, 3) producing the corresponding text via a hierarchical decoder from the image embedding, and 4) assessing both the image and the corresponding text with a joint discriminator. EMIXER also enables self-supervision to leverage vast amounts of unlabeled data. Extensive experiments with real X-ray report data illustrate how data augmentation using synthesized multimodal samples can improve the performance of a variety of supervised tasks, including COVID-19 X-ray classification with very limited samples. The quality of the generated images and reports is also confirmed by radiologists. We quantitatively show that EMIXER-generated synthetic datasets can augment X-ray image classification and report generation models, achieving 5.94% and 6.9% improvements, respectively, over models trained only on real data samples. Taken together, our results highlight the promise of state-of-the-art generative models to advance clinical machine learning.
  7. Existing tensor completion formulations mostly rely on partial observations from a single tensor. However, tensors extracted from real-world data are often more complex due to: (i) Partial observation: Only a small subset of tensor elements is available. (ii) Coarse observation: Some tensor modes only present coarse and aggregated patterns (e.g., monthly summaries instead of daily reports). In this paper, we are given a subset of the tensor and some aggregated/coarse observations (along one or more modes) and seek to recover the original fine-granular tensor with low-rank factorization. We formulate a coupled tensor completion problem and propose an efficient Multi-resolution Tensor Completion model (MTC) to solve it. Our MTC model explores tensor mode properties, leverages the hierarchy of resolutions to recursively initialize an optimization setup, and optimizes on the coupled system using alternating least squares. MTC ensures low computational and space complexity. We evaluate our model on two COVID-19 related spatio-temporal tensors. The experiments show that MTC achieves a percentage of fitness (PoF) of 65.20% and 75.79% in tensor completion with only 5% fine-granular observations, a 27.96% relative improvement over the best baseline. To evaluate the learned low-rank factors, we also design a tensor prediction task for daily and cumulative disease case predictions, where MTC achieves 50% in PoF and a 30% relative improvement over the best baseline.
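The two kinds of observation described here, and the percentage-of-fitness score, can be sketched in NumPy. This is an illustrative assumption, not the MTC implementation: coarse observations are modeled as sums over consecutive slices along one mode, and PoF is taken to be one minus the relative Frobenius-norm error, a definition the abstract does not state. All names are hypothetical.

```python
import numpy as np

def aggregate_mode(X, mode, group_size):
    """Sum consecutive blocks of `group_size` slices along `mode` (e.g. days -> weeks)."""
    moved = np.moveaxis(X, mode, 0)
    n = moved.shape[0]
    if n % group_size:
        raise ValueError("mode length must be divisible by group_size")
    coarse = moved.reshape(n // group_size, group_size, *moved.shape[1:]).sum(axis=1)
    return np.moveaxis(coarse, 0, mode)

def sample_mask(shape, fraction, seed=0):
    """Boolean mask marking roughly `fraction` of entries as observed at fine resolution."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < fraction

def percentage_of_fitness(X, X_hat):
    """Assumed definition: 1 - ||X - X_hat||_F / ||X||_F."""
    return float(1.0 - np.linalg.norm(X - X_hat) / np.linalg.norm(X))

# Toy location x disease x day tensor (synthetic values, not COVID-19 data).
X = np.arange(24, dtype=float).reshape(2, 3, 4)
coarse = aggregate_mode(X, mode=2, group_size=2)   # 4 days -> 2 two-day totals
mask = sample_mask(X.shape, fraction=0.05)          # ~5% fine-granular entries
```

A completion method in this setting would fit low-rank factors so that the reconstruction matches `X[mask]` at fine resolution and matches `coarse` after the same aggregation, then report `percentage_of_fitness` against the held-out full tensor.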
  8. Real-world spatio-temporal data is often incomplete or inaccurate due to various data loading delays. For example, a location-disease-time tensor of case counts can have multiple delayed updates of recent temporal slices for some locations or diseases. Recovering such missing or noisy (under-reported) elements of the input tensor can be viewed as a generalized tensor completion problem. Existing tensor completion methods usually assume that i) missing elements are randomly distributed and ii) noise for each tensor element is i.i.d. zero-mean. Both assumptions can be violated for spatio-temporal tensor data. We often observe multiple versions of the input tensor with different under-reporting noise levels. The amount of noise can be time- or location-dependent as more updates are progressively introduced to the tensor. We model such dynamic data as a multi-version tensor with an extra tensor mode capturing the data updates. We propose a low-rank tensor model to predict the updates over time. We demonstrate that our method can accurately predict the ground-truth values of many real-world tensors. We obtain up to 27.2% lower root mean-squared-error compared to the best baseline method. Finally, we extend our method to track the tensor data over time, leading to significant computational savings.
