Title: Resource Efficient Profiling of Spatial Variability in Performance of Regression Models
Scientists design models to understand phenomena, make predictions, and/or inform decision-making. This study targets models that encapsulate spatially evolving phenomena. Given a model, our objective is to identify the accuracy of the model across all geospatial extents. A scientist may expect these validations to occur at varying spatial resolutions (e.g., states, counties, towns, and census tracts). Assessing a model with all available ground-truth data is infeasible due to the data volumes involved. We propose a framework to assess the performance of models at scale over diverse spatial data collections. Our methodology ensures orchestration of validation workloads while reducing memory strain, alleviating contention, enabling concurrency, and ensuring high throughput. We introduce the notion of a validation budget that represents an upper-bound on the total number of observations that are used to assess the performance of models across spatial extents. The validation budget attempts to capture the distribution characteristics of observations and is informed by multiple sampling strategies. Our design allows us to decouple the validation from the underlying model-fitting libraries to interoperate with models constructed using different libraries and analytical engines; our advanced research prototype currently supports Scikit-learn, PyTorch, and TensorFlow.
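The validation budget described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the proportional split across extents, the simple random sampling, the RMSE metric, and the names `allocate_budget` and `profile_rmse` are all assumptions made for illustration.

```python
import math
import random


def allocate_budget(counts, budget):
    """Split a total validation budget across spatial extents.

    Hypothetical helper: each extent receives the floor of its proportional
    share, so the quotas never sum to more than `budget`. Very small extents
    can receive a quota of zero under this simple rule.
    """
    total = sum(counts.values())
    if total <= budget:
        # The budget covers every observation, so no sampling is needed.
        return dict(counts)
    return {extent: (n * budget) // total for extent, n in counts.items()}


def profile_rmse(observations, predict, budget, seed=0):
    """Estimate per-extent RMSE using at most `budget` observations in total.

    `observations` maps an extent name to a list of (features, target) pairs.
    `predict` is any callable that maps features to a prediction.
    """
    rng = random.Random(seed)
    counts = {extent: len(obs) for extent, obs in observations.items()}
    quotas = allocate_budget(counts, budget)

    report = {}
    for extent, obs in observations.items():
        k = quotas[extent]
        if k == 0:
            continue  # no budget left for this extent under the floor rule
        sample = rng.sample(obs, k)  # simple random sample without replacement
        squared_error = sum((predict(x) - y) ** 2 for x, y in sample)
        report[extent] = math.sqrt(squared_error / k)
    return report
```

Because the sketch only needs a `predict` callable, the same loop could in principle wrap a model built with any fitting library, which mirrors the decoupling the abstract mentions.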
Award ID(s):
1931363
PAR ID:
10448824
Journal Name:
2022 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
437 to 444
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Hawaiian Islands have been employed as a model system to reconstruct agroecological extents of traditional Polynesian agricultural production systems. However, the reliability of previously modeled agricultural extents is unknown due to limitations in empirical evidence to assess accuracy. Utilizing a geospatial database of 8,561 archaeological sites compiled by the Hawaiʻi State Historic Preservation Department (SHPD), this research assessed the accuracy and reliability of three spatial models that estimate the extents of traditional Hawaiian agricultural systems. The results of the model sensitivity assessment indicate the three geospatial models capture the spatial patterns and relative extents of intensive agricultural systems with substantial infrastructure, while additional work is needed to assess reliability of modeled agricultural systems with more indefinite infrastructure. 
  2. Amphiphilic copolymers (AP) represent a class of novel antibiofouling materials whose chemistry and composition can be tuned to optimize their performance. However, the enormous chemistry‐composition design space associated with AP makes their performance optimization laborious; it is not experimentally feasible to assess and validate all possible AP compositions even with the use of rapid screening methodologies. To address this constraint, a robust model development paradigm is reported, yielding a versatile machine learning approach that accurately predicts biofilm formation by Pseudomonas aeruginosa on a library of AP. The model excels in extracting underlying patterns in a “pooled” dataset from various experimental sources, thereby expanding the design space accessible to the model to a much larger selection of AP chemistries and compositions. The model is used to screen virtual libraries of AP for identification of best‐performing candidates for experimental validation. Initiated chemical vapor deposition is used for the precision synthesis of the model‐selected AP chemistries and compositions for validation at solid–liquid interface (often used in conventional antifouling studies) as well as the air–liquid–solid triple interface. Despite the vastly different growth conditions, the model successfully identifies the best‐performing AP for biofilm inhibition at the triple interface.
  3. Geospatial data collections are now available in a multiplicity of domains. The accompanying data volumes, variety, and diversity of encoding formats within these collections have all continued to grow. These data offer opportunities to extract patterns, understand phenomena, and inform decision making by fitting models to the data. To ensure accuracy and effectiveness, these models need to be constructed at geospatial extents/scopes that are aligned with the nature of decision-making — administrative boundaries such as census tracts, towns, counties, states etc. This entails construction of a large number of models and orchestrating their accompanying resource requirements (CPU, RAM and I/O) within shared computing clusters. In this study, we describe our methodology to facilitate model construction at scale by substantively alleviating resource requirements while preserving accuracy. Our benchmarks demonstrate the suitability of our methodology. 
  4. Achieving sustainable development requires understanding how human behavior and the environment interact across spatial scales. In particular, knowing how to manage tradeoffs between the environment and the economy, or between one spatial scale and another, necessitates a modeling approach that allows these different components to interact. Existing integrated local and global analyses provide key insights, but often fail to capture ‘meso-scale’ phenomena that operate at scales between the local and the global, leading to erroneous predictions and a constrained scope of analysis. Meso-scale phenomena are difficult to model because of their complexity and computational challenges, where adding additional scales can increase model run-time exponentially. These additions, however, are necessary to make models that include sufficient detail for policy-makers to assess tradeoffs. Here, we synthesize research that explicitly includes meso-scale phenomena and assess where further efforts might be fruitful in improving our predictions and expanding the scope of questions that sustainability science can answer. We emphasize five categories of models relevant to sustainability science, including biophysical models, integrated assessment models, land-use change models, earth-economy models and spatial downscaling models. We outline the technical and methodological challenges present in these areas of research and discuss seven directions for future research that will improve coverage of meso-scale effects. Additionally, we provide a specific worked example that shows the challenges present, and possible solutions, for modeling meso-scale phenomena in integrated earth-economy models.
  5. Dar, Kamran Shaukat (Ed.)
    Species distribution models (SDMs) are increasingly popular tools for profiling disease risk in ecology, particularly for infectious diseases of public health importance that include an obligate non-human host in their transmission cycle. SDMs can create high-resolution maps of host distribution across geographical scales, reflecting baseline risk of disease. However, as SDM computational methods have rapidly expanded, there are many outstanding methodological questions. Here we address key questions about SDM application, using schistosomiasis risk in Brazil as a case study. Schistosomiasis is transmitted to humans through contact with the free-living infectious stage of Schistosoma spp. parasites released from freshwater snails, the parasite’s obligate intermediate hosts. In this study, we compared snail SDM performance across machine learning (ML) approaches (MaxEnt, Random Forest, and Boosted Regression Trees), geographic extents (national, regional, and state), types of presence data (expert-collected and publicly-available), and snail species (Biomphalaria glabrata, B. straminea, and B. tenagophila). We used high-resolution (1 km) climate, hydrology, land-use/land-cover (LULC), and soil property data to describe the snails’ ecological niche and evaluated models on multiple criteria. Although all ML approaches produced comparable spatially cross-validated performance metrics, their suitability maps showed major qualitative differences that required validation based on local expert knowledge. Additionally, our findings revealed varying importance of LULC and bioclimatic variables for different snail species at different spatial scales. Finally, we found that models using publicly-available data predicted snail distribution with comparable AUC values to models using expert-collected data. This work serves as an instructional guide to SDM methods that can be applied to a range of vector-borne and zoonotic diseases. In addition, it advances our understanding of the relevant environmental and bioclimatic determinants of schistosomiasis risk in Brazil.