

Search for: machine learning


  1. Abstract

    Coarse-graining techniques play an essential role in accelerating molecular simulations of systems with large length and time scales. Theoretically grounded bottom-up models are appealing due to their thermodynamic consistency with the underlying all-atom models. In this direction, machine learning approaches hold great promise for fitting complex many-body data. However, training such models may require the collection of large amounts of expensive data. Moreover, quantifying the accuracy of trained models is challenging, especially for non-trivial free energy configurations where training data may be sparse. We demonstrate a path towards uncertainty-aware models of coarse-grained free energy surfaces. Specifically, we show that principled Bayesian model uncertainty allows for efficient data collection through an on-the-fly active learning framework and opens the possibility of adaptive transfer of models across different chemical systems. Uncertainties also characterize the accuracy of the models' free energy predictions, even when training is performed only on forces. This work helps pave the way towards efficient autonomous training of reliable, uncertainty-aware, many-body machine-learned coarse-grained models.
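The on-the-fly active learning loop described above can be sketched as follows, with a bootstrap ensemble of linear models standing in for the Bayesian model uncertainty; the data, feature design, and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_bootstrap_ensemble(X, y, n_models=20):
    """Fit least-squares models on bootstrap resamples of the training set;
    disagreement among members stands in for Bayesian model uncertainty."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        models.append(w)
    return models

def predict_with_uncertainty(models, X):
    """Ensemble mean and spread at each candidate point."""
    preds = np.stack([X @ w for w in models])   # (n_models, n_points)
    return preds.mean(axis=0), preds.std(axis=0)

def select_next(models, X_pool):
    """Active-learning query: label the pool point the ensemble is least sure about."""
    _, std = predict_with_uncertainty(models, X_pool)
    return int(np.argmax(std))
```

In a full loop, the selected configuration would be labeled (e.g. with an all-atom force calculation) and folded back into the training set before refitting.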

     
    Free, publicly-accessible full text available December 1, 2025
  2. Abstract

    The quantification of storm updrafts remains unavailable for operational forecasting despite their inherent importance to convection and its associated severe weather hazards. Updraft proxies, like overshooting-top area from satellite images, have been linked to severe weather hazards but relate to only a limited portion of the total storm updraft. This study investigates whether a machine learning model, namely a U-Net, can skillfully retrieve maximum vertical velocity and its areal extent from three-dimensional gridded radar reflectivity alone. The machine learning model is trained using simulated radar reflectivity and vertical velocity from the National Severe Storms Laboratory's convection-permitting Warn-on-Forecast System (WoFS). A parametric regression technique using the sinh–arcsinh–normal distribution is adapted to run with U-Nets, allowing for both deterministic and probabilistic predictions of maximum vertical velocity. The best models after hyperparameter search provided less than 50% root-mean-squared error, a coefficient of determination greater than 0.65, and an intersection over union (IoU) of more than 0.45 on the independent test set composed of WoFS data. Beyond the WoFS analysis, a case study was conducted using real radar data and corresponding dual-Doppler analyses of vertical velocity within a supercell. The U-Net consistently underestimates the dual-Doppler updraft speed estimates by 50%. Meanwhile, the area of the 5 and 10 m s⁻¹ updraft cores shows an IoU of 0.25. While the above statistics are not exceptional, the machine learning model enables quick distillation of 3D radar data related to the maximum vertical velocity, which could be useful in assessing a storm's severe potential.

    Significance Statement

    All convective storm hazards (tornadoes, hail, heavy rain, straight-line winds) can be related to a storm's updraft. Yet no direct measurement of updraft speed or area is available for forecasters to base their warning decisions on. This paper addresses the lack of observational data by providing a machine learning solution that skillfully estimates the maximum updraft speed within storms from only the 3D structure of radar reflectivity. After further vetting of the machine learning solutions on additional real-world examples, the estimated storm updrafts will hopefully provide forecasters with an added tool to help diagnose a storm's hazard potential more accurately.
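Two evaluation ingredients mentioned above can be sketched in a few lines: a per-pixel negative log-likelihood loss, here a plain Gaussian as a simplified stand-in for the sinh–arcsinh–normal head (which adds skewness and tail-weight parameters), and the IoU of thresholded updraft cores:

```python
import numpy as np

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of a normal distribution with predicted
    location mu and log-scale log_sigma. The study's sinh-arcsinh-normal
    head extends this location/scale pair, but the training recipe is the same."""
    sigma = np.exp(log_sigma)
    return 0.5 * np.log(2.0 * np.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

def core_iou(pred_w, true_w, threshold=10.0):
    """Intersection over union of updraft-core masks (e.g. w >= 10 m/s)."""
    pred = np.asarray(pred_w) >= threshold
    true = np.asarray(true_w) >= threshold
    union = np.logical_or(pred, true).sum()
    return np.logical_and(pred, true).sum() / union if union else 0.0
```

Minimizing the NLL over a training set yields both a deterministic estimate (the predicted location) and a probabilistic one (the full distribution).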

     
  3. The growing popularity of Machine Learning (ML) has led to its deployment in various sensitive domains, which has resulted in significant research focused on ML security and privacy. However, in some applications, such as Augmented/Virtual Reality, integrity verification of the outsourced ML tasks is more critical, an aspect that has not received much attention. Existing solutions, such as multi-party computation and proof-based systems, impose significant computation overhead, which makes them unfit for real-time applications. We propose Fides, a novel framework for real-time integrity validation of ML-as-a-Service (MLaaS) inference. Fides features a novel and efficient distillation technique, Greedy Distillation Transfer Learning, that dynamically distills and fine-tunes a space- and compute-efficient verification model for verifying the corresponding service model while running inside a trusted execution environment. Fides also features a client-side attack-detection model that uses statistical analysis and divergence measurements to identify, with high likelihood, whether the service model is under attack, as well as a re-classification functionality that predicts the original class whenever an attack is identified. We devised a generative adversarial network framework for training the attack-detection and re-classification models. The evaluation shows that Fides achieves an accuracy of up to 98% for attack detection and 94% for re-classification.
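The divergence-based disagreement check described above can be sketched as follows; the Jensen-Shannon measure, the threshold, and the example distributions are illustrative assumptions, not Fides' actual calibration:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two softmax output vectors."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flag_attack(service_probs, verifier_probs, threshold=0.1):
    """Flag the response as suspect when the service model and the trusted
    verification model disagree more than the chosen tolerance."""
    return js_divergence(service_probs, verifier_probs) > threshold
```

In the framework described above, the verification model runs inside a trusted execution environment, so its outputs serve as the reference distribution.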
    Free, publicly-accessible full text available April 16, 2025
  4. Graph-theoretic algorithms and graph machine learning models are essential tools for addressing many real-life problems, such as social network analysis and bioinformatics. To support large-scale graph analytics, graph-parallel systems have been actively developed for over a decade, such as Google's Pregel and Spark's GraphX, which (i) promote a think-like-a-vertex computing model, (ii) target iterative algorithms, and (iii) address problems that output a value for each vertex. However, this model is too restricted to support the rich set of heterogeneous operations for graph analytics and machine learning that many real applications demand. In recent years, two new trends have emerged in graph-parallel systems research: (1) a novel think-like-a-task computing model that can efficiently support the various computationally expensive problems of subgraph search; and (2) scalable systems for learning graph neural networks. These systems effectively complement the diverse needs of graph-parallel tools that can flexibly work together in a comprehensive graph processing pipeline for real applications, with the capability of capturing structural features. This tutorial will provide an effective categorization of the recent systems in these two directions based on their computing models and adopted techniques, and will review the key design ideas of these systems.
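A minimal think-like-a-vertex sketch, here Pregel-style connected components in plain Python, illustrates the per-vertex compute/message model the systems above promote; the sequential loop is only a stand-in for distributed supersteps:

```python
def connected_components(edges, n):
    """Each vertex starts with its own id as a component label and repeatedly
    adopts the smallest label seen among its neighbors, mimicking the
    message-passing rounds of a vertex-centric system."""
    label = list(range(n))
    changed = True
    while changed:          # one pass = one superstep
        changed = False
        for u, v in edges:  # "messages" exchanged along each edge
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label
```

Note how the computation fits the "output a value for each vertex" shape described above, which is exactly what think-like-a-task models generalize beyond.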
  5. Abstract

    In complex systems with multiple variables monitored at high frequency, the variables are not only temporally autocorrelated but may also be nonlinearly related or exhibit nonstationarity as the inputs or operation change. One approach to handling such variables is to detrend them prior to monitoring and then apply control charts that assume independence and stationarity to the residuals. Monitoring controlled systems is even more challenging because the control strategy seeks to maintain variables at prespecified mean levels, and to compensate, correlations among variables may change, making monitoring the covariance essential. In this paper, a vector autoregressive model (VAR) is compared with a multivariate random forest (MRF) and a neural network (NN) for detrending multivariate time series prior to monitoring the covariance of the residuals using a multivariate exponentially weighted moving average (MEWMA) control chart. Machine learning models have an advantage when the data's structure is unknown or may change. We design a novel simulation study with nonlinear, nonstationary, and autocorrelated data to compare the different detrending models and subsequent covariance monitoring. The machine learning models have superior performance for nonlinear and strongly autocorrelated data and similar performance for linear data. An illustration with data from a reverse osmosis process is given.
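The monitoring step described above can be sketched as an MEWMA T² statistic computed on detrended residuals; this sketch uses the asymptotic EWMA covariance and a known in-control covariance as simplifying assumptions:

```python
import numpy as np

def mewma_t2(residuals, sigma, lam=0.2):
    """T^2 statistics of a multivariate EWMA of residual vectors.
    Z_t = lam * e_t + (1 - lam) * Z_{t-1};
    T^2_t = Z_t' inv(Sigma_Z) Z_t with the asymptotic
    covariance Sigma_Z = lam / (2 - lam) * Sigma."""
    p = residuals.shape[1]
    sigma_z_inv = np.linalg.inv(lam / (2.0 - lam) * sigma)
    z = np.zeros(p)
    out = []
    for e in residuals:
        z = lam * e + (1.0 - lam) * z
        out.append(float(z @ sigma_z_inv @ z))
    return np.array(out)
```

A signal is raised when T² exceeds a control limit chosen for the desired in-control average run length; the detrending model (VAR, MRF, or NN) only supplies the residuals fed into this chart.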

     
  6. Abstract Background

    Parent-of-origin allele-specific gene expression (ASE) can be detected in interspecies hybrids by virtue of RNA sequence variants between the parental haplotypes. ASE is detectable by differential expression analysis (DEA) applied to the counts of RNA-seq read pairs aligned to parental references, but aligners do not always choose the correct parental reference.

    Results

    We used public data for species that are known to hybridize. We measured our ability to assign RNA-seq read pairs to their proper transcriptome or genome references. We tested software packages that assign each read pair to a reference position and found that they often favored the incorrect species reference. To address this problem, we introduce a post process that extracts alignment features and trains a random forest classifier to choose the better alignment. On each simulated hybrid dataset tested, our machine-learning post-processor achieved higher accuracy than the aligner by itself at choosing the correct parent-of-origin per RNA-seq read pair.

    Conclusions

    For the parent-of-origin classification of RNA-seq, machine learning can improve the accuracy of alignment-based methods. This approach could be useful for enhancing ASE detection in interspecies hybrids, though RNA-seq from real hybrids may present challenges not captured by our simulations. We believe this is the first application of machine learning to this problem domain.
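The post-processing idea above, training a random forest on alignment features to pick the better parental reference, might look like the following sketch; the three difference features and the synthetic data generator are illustrative assumptions, not the paper's feature set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_toy_alignments(n):
    """Hypothetical per-read-pair features comparing the two parental
    alignments: [score_delta, edit_distance_delta, mapq_delta].
    Label 1 means parent A is the correct reference."""
    y = rng.integers(0, 2, n)
    signs = np.where(y == 1, 1.0, -1.0)[:, None]
    X = signs * rng.uniform(0.5, 2.0, (n, 3)) + rng.normal(0, 0.3, (n, 3))
    return X, y

X_train, y_train = make_toy_alignments(500)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
```

In practice the labels come from simulated hybrid reads whose parent of origin is known, exactly as in the simulations described above.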

     
  7. Lithium-ion batteries (LIBs) are ubiquitous in everyday applications. However, lithium (Li) is a limited resource on the planet and, therefore, not sustainable. As alternatives to lithium, earth-abundant and cheaper multivalent metals such as aluminum (Al) and calcium (Ca) have been actively researched in battery systems, but suitable intercalation hosts for multivalent-ion batteries are urgently needed. Open-tunnel oxides are a category of microparticles distinguished by integrated one-dimensional channels or nanopores. This work focuses on two promising open-tunnel oxides: niobium tungsten oxide (NTO) and molybdenum vanadium oxide (MoVO). The MoVO structure can accommodate a larger number of multivalent ions than NTO due to its larger surface area and different shapes. Specifically, MoVO can adsorb Ca, Li, and Al ions with adsorption potentials ranging from around 4 to 5 eV, although the adsorption potential of Al ions in hexagonal channels drops to 1.73 eV due to the limited channel area. The NTO structure exhibits insertion/adsorption potentials of 4.4 eV, 3.4 eV, and 0.9 eV for a single Li, Ca, and Al ion, respectively. Generally, Ca ions are more readily adsorbed than Al ions in both MoVO and NTO structures. Bader charge analysis and charge density plots reveal the roles of charge transfer and ion size in the insertion of multivalent ions such as Ca and Al into MoVO and NTO systems. Exploring open-tunnel oxide materials for battery applications is hindered by the vast space of compositional possibilities: exhaustive experimental trials and quantum-based simulations are not viable for locating specific candidates within such a large and complex search space. It is therefore imperative to conduct structural stability testing to identify viable combinations with sufficient pore topologies.
    Data mining and machine learning techniques are employed to discover innovative transition-metal oxide materials. This study compares two machine learning algorithms, one utilizing descriptors and the other employing graphs, to predict the synthesizability of new materials in a laboratory setting. The outcomes offer valuable insights into the exploration of alternative naturally occurring multiscale particles.
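A toy example of the descriptor route mentioned above: turning a chemical formula into a fixed-length element-fraction vector that a descriptor-based model could consume. The element vocabulary and the formula parser here are simplified assumptions, not the study's actual featurization:

```python
import re
from collections import Counter

# Fixed vocabulary covering the elements in the oxides discussed above.
ELEMENTS = ["Mo", "V", "Nb", "W", "O"]

def descriptor(formula):
    """Parse a simple formula like 'Mo3VO11' into element fractions over
    the fixed vocabulary, yielding one row of a descriptor feature matrix."""
    counts = Counter()
    for sym, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[sym] += int(num) if num else 1
    total = sum(counts.values())
    return [counts[e] / total for e in ELEMENTS]
```

A graph-based model would instead encode the crystal structure as nodes (atoms) and edges (bonds or neighbor relations), trading hand-crafted descriptors for learned structural features.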
    Free, publicly-accessible full text available March 25, 2025
  8. Abstract

    Ensuring the long-term chemical durability of glasses is critical for nuclear waste immobilization operations. Durable glasses usually undergo qualification for disposal based on their response to standardized tests such as the product consistency test or the vapor hydration test (VHT). The VHT uses elevated temperature and water vapor to accelerate glass alteration and the formation of secondary phases. Understanding the relationship between glass composition and VHT response is of fundamental and practical interest. However, this relationship is complex, non-linear, and sometimes fairly variable, posing challenges in identifying the distinct effect of individual oxides on VHT response. Here, we leverage a dataset comprising 654 Hanford low-activity waste (LAW) glasses across a wide compositional envelope and employ various machine learning techniques to explore this relationship. We find that Gaussian process regression (GPR), a nonparametric regression method, yields the highest predictive accuracy. By utilizing the trained model, we discern the influence of each oxide on the glasses’ VHT response. Moreover, we discuss the trade-off between underfitting and overfitting for extrapolating the material performance in the context of sparse and heterogeneous datasets.
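A minimal sketch of GPR with predictive uncertainty, of the kind described above, on synthetic composition-like inputs rather than the actual LAW dataset (the feature dimensions, kernel choice, and toy response are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)

# Hypothetical composition features (e.g. oxide mole fractions) and a smooth
# stand-in "alteration response" for illustration only.
X = rng.uniform(0, 1, (40, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 - 0.5 * X[:, 2]

gpr = GaussianProcessRegressor(
    kernel=ConstantKernel(1.0) * RBF(length_scale=0.5,
                                     length_scale_bounds=(1e-2, 2.0)),
    alpha=1e-6,           # small nugget: near-interpolation of the data
    normalize_y=True,
    random_state=0,
).fit(X, y)

# Predictions carry a standard deviation: small near the training envelope,
# large when extrapolating outside it.
mean, std = gpr.predict(X, return_std=True)
far_mean, far_std = gpr.predict([[5.0, 5.0, 5.0]], return_std=True)
```

The growing predictive standard deviation outside the training envelope is what flags the underfitting/overfitting trade-off for extrapolation that the abstract discusses.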

     
  9. Power flow computations are fundamental to many power system studies. Obtaining a converged power flow case is not a trivial task, especially in large power grids, due to the non-linear nature of the power flow equations. One key challenge is that the widely used Newton-based power flow methods are sensitive to the initial voltage magnitude and angle estimates, and a bad initial estimate can lead to non-convergence. This paper addresses this challenge by developing a random-forest (RF) machine learning model that provides better initial voltage magnitude and angle estimates towards achieving power flow convergence. The method was implemented on a real ERCOT 6102-bus system under various operating conditions. By providing better Newton-Raphson initialization, the RF model enabled the solution of 2,106 of 3,899 non-converging dispatch cases that could not be solved from a flat start or by initialization with the voltage solution of a reference case. The RF initializer also outperformed DC power flow initialization, linear regression, and decision trees.
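The initialization pipeline above, learning a map from dispatch conditions to the solved voltage state and using the prediction as the Newton-Raphson starting point, can be sketched on a toy stand-in system; the tanh response and bus count are assumptions, not ERCOT data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the grid: a smooth nonlinear map from bus injections P
# to solved voltage magnitudes |V| (per unit). Illustrative only.
n_bus = 4
W = rng.normal(0, 1, (n_bus, n_bus))

def solved_voltages(P):
    return 1.0 + 0.08 * np.tanh(P @ W)

# Train on "historical dispatches" whose power flow already converged.
P_hist = rng.uniform(-1, 1, (1000, n_bus))
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    P_hist, solved_voltages(P_hist))

# For a new dispatch, rf.predict(P_new) would seed Newton-Raphson instead
# of the flat start (all |V| = 1.0 p.u., all angles = 0).
```

The closer the initial estimate sits to the true solution, the more likely the Newton iteration stays inside its basin of attraction, which is why a learned initializer can recover otherwise non-converging cases.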
    Free, publicly-accessible full text available March 22, 2025
  10. Abstract

    Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but remains pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects the existing literature. Here, we investigate the effects of five forms of leakage, involving feature selection, covariate correction, and dependence between subjects, on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
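The feature-selection form of leakage discussed above can be demonstrated in a few lines: screening features on the full dataset before cross-validation inflates accuracy even on pure noise, whereas screening inside each training fold does not. The data here are synthetic noise, not the neuroimaging datasets of the study:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure-noise features
y = rng.integers(0, 2, 100)        # labels unrelated to X

# Leaky: screen features on ALL data, then cross-validate the classifier.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Correct: feature screening happens inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
clean_acc = cross_val_score(pipe, X, y, cv=5).mean()
```

The leaky pipeline reports well-above-chance accuracy on data with no signal at all, mirroring the inflated performance the study attributes to leakage via feature selection.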

     