NSF PAR Search | NSF Public Access Repository

Combining self-supervision and privileged information for representation learning from tabular data

https://doi.org/10.1007/s10115-025-02418-1

Yang, Haoyu; Steinbach, Michael; Melton, Genevieve; Kumar, Vipin; Simon, Gyorgy (April 2025, Knowledge and Information Systems)

When building predictive models for real-world applications, many data are discarded because conventional learning algorithms cannot utilize them, although such data could be very informative. This paper focuses on representation learning using two types of additional data: privileged information (PI) and unlabeled data. PI refers to data available only during training but not at test time. Existing methods transfer the knowledge embedded in PI via supervised mechanisms, making them unable to use unlabeled data. In contrast, self-supervised learning methods can use unlabeled data but cannot learn from PI. While these techniques appear complementary, as we demonstrate, combining them is non-trivial. This paper introduces the privileged information regularized (PIReg) self-supervised learning framework, which utilizes both PI and unlabeled data to learn better representations.
Towards Entity-Aware Conditional Variational Inference for Heterogeneous Time-Series Prediction: An application to Hydrology

https://doi.org/10.1137/1.9781611978032.38

Ghosh, Rahul; Renganathan, Arvind; McAliley, Wallace; Steinbach, Michael; Duffy, Christopher; Kumar, Vipin (April 2024, SIAM International Conference on Data Mining (SDM24))

Many environmental systems (e.g., hydrology basins) can be modeled as an entity whose response (e.g., streamflow) depends on drivers (e.g., weather) conditioned on their characteristics (e.g., soil properties). We introduce Entity-aware Conditional Variational Inference (EA-CVI), a novel probabilistic inverse modeling approach, to deduce entity characteristics from observed driver-response data. EA-CVI infers probabilistic latent representations that can accurately predict responses for diverse entities, particularly in out-of-sample few-shot settings. EA-CVI's latent embeddings encapsulate diverse entity characteristics within compact, low-dimensional representations. EA-CVI proficiently identifies dominant modes of variation in responses and offers the opportunity to infer a physical interpretation of the underlying attributes that shape these responses. EA-CVI can also generate new data samples by sampling from the learned distribution, making it useful in zero-shot scenarios. EA-CVI addresses the need for uncertainty estimation, particularly during extreme events, rendering it essential for data-driven decision-making in real-world applications. Extensive evaluations on a renowned hydrology benchmark dataset, CAMELS-GB, validate EA-CVI's abilities.
Full Text Available
Prescribed Fire Modeling using Knowledge-Guided Machine Learning for Land Management

https://doi.org/10.1137/1.9781611978032.68

Chatterjee, Somya Sharma; Lindsay, Kelly; Chatterjee, Neel; Patil, Rohan; Altintas, Ilkay; Steinbach, Michael; Giron, Daniel; Nguyen, Mai H; Kumar, Vipin (April 2024, SIAM International Conference on Data Mining (SDM24))

In recent years, the increasing threat of devastating wildfires has underscored the need for effective prescribed fire management. Process-based computer simulations have traditionally been employed to plan prescribed fires for wildfire prevention. However, even simplified process models are too compute-intensive to be used for real-time decision-making. Traditional ML methods used for fire modeling offer computational speedup but struggle with physically inconsistent predictions, biased predictions due to class imbalance, biased estimates for fire spread metrics (e.g., burned area, rate of spread), and limited generalizability in out-of-distribution wind conditions. This paper introduces a novel machine learning (ML) framework that enables rapid emulation of prescribed fires while addressing these concerns. To overcome these challenges, the framework incorporates domain knowledge in the form of physical constraints, a hierarchical modeling structure to capture the interdependence among variables of interest, and also leverages pre-existing source domain data to augment training data and learn the spread of fire more effectively. Notably, improvement in fire metric (e.g., burned area) estimates offered by our framework makes it useful for fire managers, who often rely on these estimates to make decisions about prescribed burn management. Furthermore, our framework exhibits better generalization capabilities than the other ML-based fire modeling methods across diverse wind conditions and ignition patterns.
Full Text Available
Mini-Batch Learning Strategies for modeling long term temporal dependencies: A study in environmental applications

https://doi.org/10.1137/1.9781611977653.ch73

Xu, Shaoming; Khandelwal, Ankush; Li, Xiang; Jia, Xiaowei; Liu, Licheng; Willard, Jared; Ghosh, Rahul; Cutler, Kelly; Steinbach, Michael; Duffy, Christopher; et al (April 2023, Proceedings of the 2023 SIAM International Conference on Data Mining (SDM))
Shekhar, Shashi; Zhou, Zhi-Hua; Chiang, Yao-Yi; Stiglic, Gregor (Ed.)
In many environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies. However, due to minibatch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered, which can lead to limited performance. Stateful RNNs aim to address this issue by passing hidden states between batches. Since Stateful RNNs ignore intra-batch temporal dependency, there exists a trade-off between training stability and capturing temporal dependency. In this paper, we provide a quantitative comparison of different Stateful RNN modeling strategies, and propose two strategies to enforce both intra- and inter-batch temporal dependency. First, we extend Stateful RNNs by defining a batch as a temporally ordered set of training segments, which enables intra-batch sharing of temporal information. While this approach significantly improves the performance, it leads to much longer training times due to highly sequential training. To address this issue, we further propose a new strategy which augments a training segment with an initial value of the target variable from the timestep right before the start of the training segment. In other words, we provide an initial value of the target variable as additional input so that the network can focus on learning changes relative to that initial value. By using this strategy, samples can be passed in any order (mini-batch training), which significantly reduces the training time while maintaining the performance. In demonstrating the utility of our approach in hydrological modeling, we observe that the most significant gains in predictive accuracy occur when these methods are applied to state variables whose values change more slowly, such as soil water and snowpack, rather than continuously moving flux variables such as streamflow.
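The second strategy in this abstract (augmenting each training segment with the target value from the timestep just before it) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `make_augmented_segments`, the synthetic data, and the choice to feed the initial value as a constant extra input channel are all assumptions made for illustration.

```python
import numpy as np

def make_augmented_segments(drivers, target, seg_len):
    """Cut a series into non-overlapping segments and append, as an extra
    input channel, the target value from the timestep right before each
    segment. (Hypothetical helper; repeating the value along the segment
    is one possible way to supply it as additional input.)"""
    n_steps, _ = drivers.shape
    inputs, outputs = [], []
    # Start at t=1 so that every segment has a preceding timestep.
    for start in range(1, n_steps - seg_len + 1, seg_len):
        end = start + seg_len
        init = np.full((seg_len, 1), target[start - 1])
        inputs.append(np.concatenate([drivers[start:end], init], axis=1))
        outputs.append(target[start:end])
    return np.stack(inputs), np.stack(outputs)

# Synthetic stand-ins for driver variables and a slowly varying target.
rng = np.random.default_rng(0)
drivers = rng.normal(size=(101, 3))
target = np.cumsum(rng.normal(size=101))

X, y = make_augmented_segments(drivers, target, seg_len=20)

# Each segment now carries its own starting context, so segments can be
# shuffled freely for ordinary mini-batch training.
order = rng.permutation(len(X))
X, y = X[order], y[order]
```

Because the starting context travels with each segment, no hidden state has to be passed between batches in temporal order, which is what the abstract credits for the reduced training time.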
Full Text Available
Integrating Scientific Knowledge with Machine Learning for Engineering and Environmental Systems

https://doi.org/10.1145/3514228

Willard, Jared; Jia, Xiaowei; Xu, Shaoming; Steinbach, Michael; Kumar, Vipin (January 2022, ACM Computing Surveys)

There is a growing consensus that solutions to complex science and engineering problems require novel methodologies that are able to integrate traditional physics-based modeling approaches with state-of-the-art machine learning (ML) techniques. This paper provides a structured overview of such techniques. Application-centric objective areas for which these approaches have been applied are summarized, and then classes of methodologies used to construct physics-guided ML models and hybrid physics-ML frameworks are described. We then provide a taxonomy of these existing techniques, which uncovers knowledge gaps and potential crossovers of methods between disciplines that can serve as ideas for future research.
Full Text Available
Physics-Guided Machine Learning for Scientific Discovery: An Application in Simulating Lake Temperature Profiles

https://doi.org/10.1145/3447814

Jia, Xiaowei; Willard, Jared; Karpatne, Anuj; Read, Jordan S.; Zwart, Jacob A.; Steinbach, Michael; Kumar, Vipin (May 2021, ACM/IMS Transactions on Data Science)
Physics-based models are often used to study engineering and environmental systems. The ability to model these systems is the key to achieving our future environmental sustainability and improving the quality of human life. This article focuses on simulating lake water temperature, which is critical for understanding the impact of changing climate on aquatic ecosystems and assisting in aquatic resource management decisions. The General Lake Model (GLM) is a state-of-the-art physics-based model used for addressing such problems. However, like other physics-based models used for studying scientific and engineering systems, it has several well-known limitations due to simplified representations of the physical processes being modeled or challenges in selecting appropriate parameters. While state-of-the-art machine learning models can sometimes outperform physics-based models given an ample amount of training data, they can produce results that are physically inconsistent. This article proposes a physics-guided recurrent neural network model (PGRNN) that combines RNNs and physics-based models to leverage their complementary strengths and improve the modeling of physical processes. Specifically, we show that a PGRNN can improve prediction accuracy over that of physics-based models (by over 20% even with very little training data), while generating outputs consistent with physical laws. An important aspect of our PGRNN approach lies in its ability to incorporate the knowledge encoded in physics-based models. This allows training the PGRNN model using very few true observed data while also ensuring high prediction accuracy. Although we present and evaluate this methodology in the context of modeling the dynamics of temperature in lakes, it is applicable more widely to a range of scientific and engineering disciplines where physics-based (also known as mechanistic) models are used.
Full Text Available
Predicting diabetes clinical outcomes using longitudinal risk factor trajectories

https://doi.org/10.1186/s12911-019-1009-3

Simon, Gyorgy J.; Peterson, Kevin A.; Castro, M. Regina; Steinbach, Michael S.; Kumar, Vipin; Caraballo, Pedro J. (December 2020, BMC Medical Informatics and Decision Making)

Full Text Available
A new representation of disease conditions and treatment pathways accurately predicts mortality and chronic diseases

Ngufor, Che; Caraballo, Pedro; Byrne, Thomas J.; Chen, David; Shah, Nilay D.; Steinbach, Michael; Simon, Gyorgy (November 2019, AMIA 2019 Annual Symposium)

In this study, we introduce a novel representation of patient data called Disease Severity Hierarchy (DSH) that explores specific diseases and their known treatment pathways in a nested fashion to create subpopulations in a clinically meaningful way. As the DSH tree is traversed from the root towards the leaves, we encounter subpopulations that share increasingly richer amounts of clinical details such as similar disease severity, illness trajectories, and time to event that are discriminative, and suitable for learning risk stratification models. The proposed DSH risk scores effectively and accurately predict the age at which a patient may be at risk of dying or developing MCE significantly better than a traditional representation of disease conditions. DSH utilizes known relationships among various entities in EHR data to capture disease severity in a natural way and has the additional benefit of being expressive and interpretable. This novel patient representation can help support critical decision making, development of smart EBP guidelines, and enhance health care and disease management by helping to identify and reduce disease burden among high-risk patients.
Full Text Available
Evaluating the Impact of Data Representation on EHR-Based Analytic Tasks

Oh, Wonsuk; Steinbach, Michael S.; Castro, M. Regina; Peterson, Kevin A.; Kumar, Vipin; Caraballo, Pedro J.; Simon, Gyorgy J. (August 2019, Medinfo 2019)

Different analytic techniques operate optimally with different types of data. As the use of EHR-based analytics expands to newer tasks, data will have to be transformed into different representations, so the tasks can be optimally solved. We classified representations into broad categories based on their characteristics and proposed a new knowledge-driven representation for clinical data mining as well as trajectory mining, called Severity Encoding Variables (SEVs). Additionally, we studied which characteristics make representations most suitable for particular clinical analytics tasks including trajectory mining. Our evaluation shows that, for regression, most data representations performed similarly, with SEV achieving a slight (albeit statistically significant) advantage. For patients at high risk of diabetes, it outperformed the competing representation by (relative) 20%. For association mining, SEV achieved the highest performance. Its ability to constrain the search space of patterns through clinical knowledge was key to its success.
Full Text Available
Process‐Guided Deep Learning Predictions of Lake Water Temperature

https://doi.org/10.1029/2019WR024922

Read, Jordan S.; Jia, Xiaowei; Willard, Jared; Appling, Alison P.; Zwart, Jacob A.; Oliver, Samantha K.; Karpatne, Anuj; Hansen, Gretchen J.; Hanson, Paul C.; Watkins, William; et al (November 2019, Water Resources Research)

Full Text Available