As the volume of data and the technical complexity of large-scale analysis increase, many domain experts want a powerful computing back end paired with a familiar analysis interface, so that they can participate fully in the analysis workflow by focusing on individual datasets and leaving the large-scale computation to the system. Towards this goal, we investigate and benchmark a family of Divide-and-Conquer strategies that help domain experts perform large-scale simulations by scaling up their analysis code written in R, one of the most popular languages for data science and interactive analysis. Our implementation uses R as both the analysis and computing language, allowing advanced users to supply custom R scripts and variables that are fully embedded into the large-scale analysis workflow. The process divides large-scale simulation tasks and conquers them with Slurm array jobs and R: simulations and the final aggregation are scheduled as parallel array jobs to accelerate knowledge discovery. The objective is a new analytics workflow for large-scale analysis loops in which expert users only need to contribute the domain knowledge for the Divide-and-Conquer tasks.
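The abstract does not include the implementation, but a minimal sketch of the conquer step under this pattern is shown below: each Slurm array task runs an R script that selects its chunk via SLURM_ARRAY_TASK_ID. The chunk file naming, the summarise_chunk() analysis, and the aggregation step are illustrative assumptions, not the authors' code.

```r
# conquer_chunk.R -- one Slurm array task analyzes one pre-divided data chunk.
# Submitted, for example, with:
#   sbatch --array=1-100 run_chunk.sh     # run_chunk.sh calls: Rscript conquer_chunk.R

task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

chunk <- readRDS(sprintf("data/chunk_%03d.rds", task_id))   # divide step: pre-split input

# The domain expert supplies the per-chunk analysis; this summary is a placeholder.
summarise_chunk <- function(df) {
  data.frame(task = task_id, n = nrow(df), mean_y = mean(df$y, na.rm = TRUE))
}

saveRDS(summarise_chunk(chunk), sprintf("results/part_%03d.rds", task_id))

# A separate aggregation job then conquers the partial results:
#   parts <- lapply(list.files("results", full.names = TRUE), readRDS)
#   final <- do.call(rbind, parts)
```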
Machine Learning-assisted Computational Steering of Large-scale Scientific Simulations
Next-generation scientific applications in various fields are experiencing a rapid transition from traditional experiment-based methodologies to large-scale, computation-intensive simulations featuring complex numerical modeling with a large number of tunable parameters. Such model-based simulations generate colossal amounts of data, which are then processed and analyzed against experimental or observational data for parameter calibration and model validation. The sheer volume and complexity of such data, the large model-parameter space, and the intensive computation make it practically infeasible for domain experts to manually configure and tune hyperparameters for accurate modeling in complex and distributed computing environments. This calls for an online computational steering service that enables real-time multi-user interaction and automatic parameter tuning. Towards this goal, we design and develop a generic steering framework based on Bayesian Optimization (BO) and conduct a theoretical performance analysis of the steering service. We present a case study with the Weather Research and Forecasting (WRF) model, which illustrates, through regret analysis, the performance superiority of BO-based tuning over other heuristic methods and the manual settings of domain experts.
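The abstract does not give implementation details; the sketch below is a minimal, self-contained illustration of the kind of BO loop it describes, written in base R: a Gaussian-process surrogate with an expected-improvement acquisition over a single tunable parameter. Here run_simulation() is a toy stand-in for launching a simulation and scoring it against observations, and the kernel and noise settings are assumptions.

```r
# Minimal GP + expected-improvement loop (1-D parameter, minimization).
run_simulation <- function(theta) (theta - 0.7)^2 + rnorm(1, sd = 0.01)  # toy objective

rbf <- function(a, b, ell = 0.15) exp(-0.5 * outer(a, b, "-")^2 / ell^2)

gp_posterior <- function(X, y, Xs, noise = 1e-4) {
  K  <- rbf(X, X) + diag(noise, length(X))
  Ks <- rbf(Xs, X)
  Ki <- solve(K)
  mu <- as.vector(Ks %*% Ki %*% y)
  s2 <- pmax(1 - diag(Ks %*% Ki %*% t(Ks)), 1e-12)   # unit prior variance
  list(mu = mu, sd = sqrt(s2))
}

expected_improvement <- function(mu, sd, best) {
  imp <- best - mu
  z   <- imp / sd
  imp * pnorm(z) + sd * dnorm(z)
}

set.seed(1)
X <- runif(4); y <- sapply(X, run_simulation)   # initial design
cand <- seq(0, 1, length.out = 200)             # candidate parameter grid

for (iter in 1:20) {
  post   <- gp_posterior(X, y, cand)
  ei     <- expected_improvement(post$mu, post$sd, min(y))
  x_next <- cand[which.max(ei)]                 # steer the next simulation run
  y_next <- run_simulation(x_next)
  X <- c(X, x_next); y <- c(y, y_next)
}
cat("best parameter:", X[which.min(y)], "score:", min(y), "\n")
```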
- Award ID(s): 1828123
- PAR ID: 10299052
- Date Published:
- Journal Name: Proceedings of the 19th IEEE International Symposium on Parallel and Distributed Processing with Applications
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Current research practice for optimizing bioink involves exhaustive experimentation with multi-material compositions to determine printability, shape fidelity, and biocompatibility. Predicting bioink properties would benefit the research community, but it is a challenging task because of the non-Newtonian behavior of complex compositions. Existing models such as the Cross model are inadequate for predicting the viscosity of heterogeneous bioink compositions. In this paper, we utilize a machine learning framework to accurately predict the viscosity of heterogeneous bioink compositions, aiming to enhance extrusion-based bioprinting techniques. Our strategy uses Bayesian optimization (BO) to leverage a limited dataset, a technique especially useful for the typically sparse data in this domain. We have also developed a mask technique that handles complex constraints, informed by domain expertise, to define the feasible parameter space for the bioink components and their interactions. The proposed method focuses on predicting an intrinsic factor (e.g., viscosity) of the bioink precursor, which is tied to extrinsic properties (e.g., cell viability) through the mask function. By tuning the hyperparameters, we strike a balance between exploration of new possibilities and exploitation of known data, a balance crucial for refining the acquisition function. The acquisition function then guides the selection of subsequent sampling points within the defined viable space, and the process continues until convergence, indicating that the model has sufficiently explored the parameter space and identified optimal or near-optimal solutions. Employing this AI-guided BO framework, we developed, tested, and validated a surrogate model for determining the viscosity of heterogeneous bioink compositions. This data-driven approach significantly reduces the experimental workload required to identify bioink compositions conducive to functional tissue growth: it streamlines the search for optimal compositions among a vast array of heterogeneous options and offers a promising avenue for accelerating advancements in tissue engineering by minimizing the need for extensive experimental trials.
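The paper's mask function encodes domain constraints that the abstract does not enumerate; the R sketch below shows one plausible form, with hypothetical component names, bounds, and viscosity window, used to filter candidate compositions before the acquisition function is evaluated.

```r
# Illustrative feasibility mask for a three-component bioink precursor.
# Component names, bounds, and the viscosity window are hypothetical assumptions,
# not the constraints used in the paper.

feasible <- function(alginate, gelatin, cells_frac, predicted_viscosity = NA) {
  composition_ok <- abs(alginate + gelatin + cells_frac - 1) < 1e-6   # fractions sum to 1
  bounds_ok      <- alginate >= 0.02 & alginate <= 0.10 &
                    gelatin  >= 0.00 & gelatin  <= 0.15 &
                    cells_frac >= 0.75
  # Optional extrinsic constraint, e.g. a printable viscosity window (Pa.s)
  viscosity_ok   <- is.na(predicted_viscosity) |
                    (predicted_viscosity >= 30 & predicted_viscosity <= 300)
  composition_ok & bounds_ok & viscosity_ok
}

# Candidate compositions are filtered by the mask before the acquisition
# function is evaluated, so the optimizer never proposes infeasible inks.
grid <- expand.grid(alginate = seq(0.02, 0.10, by = 0.01),
                    gelatin  = seq(0.00, 0.15, by = 0.01))
grid$cells_frac <- 1 - grid$alginate - grid$gelatin
grid <- grid[feasible(grid$alginate, grid$gelatin, grid$cells_frac), ]
```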
-
Recent years have seen a proliferation of ML frameworks. Such systems make ML accessible to non-experts, especially when combined with powerful parameter tuning and AutoML techniques. Modern, applied ML extends beyond direct learning on clean data, however, and needs an expressive language for the construction of complex ML workflows beyond simple pre- and post-processing. We present mlr3pipelines, an R framework which can be used to define linear and complex non-linear ML workflows as directed acyclic graphs. The framework is part of the mlr3 ecosystem, leveraging convenient resampling, benchmarking, and tuning components.
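A minimal example of the kind of pipeline mlr3pipelines expresses is sketched below; the specific preprocessing steps, learner, and task are illustrative choices, not taken from the paper.

```r
library(mlr3)
library(mlr3pipelines)

# A small linear preprocessing-plus-learner graph: impute, scale, then a tree.
graph <- po("imputemedian") %>>% po("scale") %>>% lrn("classif.rpart")

# The composed graph behaves like any other mlr3 learner, so it plugs
# directly into the ecosystem's resampling, benchmarking, and tuning.
glrn <- as_learner(graph)
rr   <- resample(tsk("pima"), glrn, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))
```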
-
Abstract Developing applicable clinical machine learning models is a difficult task when the data includes spatial information, for example, radiation dose distributions across adjacent organs at risk. We describe the co‐design of a modeling system, DASS, to support the hybrid human‐machine development and validation of predictive models for estimating long‐term toxicities related to radiotherapy doses in head and neck cancer patients. Developed in collaboration with domain experts in oncology and data mining, DASS incorporates human‐in‐the‐loop visual steering, spatial data, and explainable AI to augment domain knowledge with automatic data mining. We demonstrate DASS with the development of two practical clinical stratification models and report feedback from domain experts. Finally, we describe the design lessons learned from this collaborative experience.
-
Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.
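The base-R sketch below is a simplified illustration of the ensemble idea, not the authors' implementation: each ensemble member is an exact 1-D GP with a different lengthscale, member weights follow the GP marginal likelihood, and the next query is chosen by Thompson sampling from the randomly selected member. The paper's actual method uses random-feature kernel approximations for scalable sampling.

```r
# Simplified ensemble-GP Thompson sampling on a 1-D candidate grid (maximization).
f <- function(x) sin(6 * x) + 0.1 * rnorm(length(x))      # toy black-box objective

rbf <- function(a, b, ell) exp(-0.5 * outer(a, b, "-")^2 / ell^2)

log_marginal <- function(X, y, ell, noise = 1e-2) {
  K <- rbf(X, X, ell) + diag(noise, length(X))
  L <- chol(K)                                             # K = t(L) %*% L
  alpha <- backsolve(L, forwardsolve(t(L), y))             # K^{-1} y
  -0.5 * sum(y * alpha) - sum(log(diag(L))) - 0.5 * length(y) * log(2 * pi)
}

posterior_sample <- function(X, y, Xs, ell, noise = 1e-2) {
  K  <- rbf(X, X, ell) + diag(noise, length(X))
  Ks <- rbf(Xs, X, ell)
  Ki <- solve(K)
  mu <- as.vector(Ks %*% Ki %*% y)
  S  <- rbf(Xs, Xs, ell) - Ks %*% Ki %*% t(Ks) + diag(1e-6, length(Xs))
  mu + t(chol(S)) %*% rnorm(length(Xs))                    # one posterior draw
}

set.seed(2)
lengthscales <- c(0.05, 0.2, 0.5)                          # the GP ensemble
X <- runif(5); y <- f(X)
cand <- seq(0, 1, length.out = 200)

for (iter in 1:15) {
  w <- sapply(lengthscales, function(ell) log_marginal(X, y, ell))
  w <- exp(w - max(w)); w <- w / sum(w)                    # ensemble weights
  ell  <- sample(lengthscales, 1, prob = w)                # pick a member ...
  draw <- posterior_sample(X, y, cand, ell)                # ... then Thompson-sample
  x_next <- cand[which.max(draw)]
  X <- c(X, x_next); y <- c(y, f(x_next))
}
cat("best observed x:", X[which.max(y)], "\n")
```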

